Model performance on the IND (in-domain) test set:

| Group | Model | Accuracy | Recall | Precision | F1 | Specificity | NPR | #failed |
|---|---|---|---|---|---|---|---|---|
| General-purpose LLMs | GPT-4 Turbo | 0.826 | 0.604 | 1.000 | 0.753 | 1.000 | 0.763 | 0 |
| | Gemini Pro | 0.897 | 0.766 | 1.000 | 0.867 | 1.000 | 0.845 | 0 |
| | Claude 2.1 | 0.711 | 0.360 | 0.952 | 0.523 | 0.986 | 0.664 | 1 |
| | Llama-2 13B-chat | 0.474 | 0.459 | 0.411 | 0.434 | 0.486 | 0.535 | 0 |
| | Mistral-7B-Instruct-v0.2 | 0.755 | 0.441 | 1.000 | 0.613 | 1.000 | 0.696 | 0 |
| E-commerce LLM | EcomGPT | 0.648 | 0.739 | 0.577 | 0.648 | 0.577 | 0.739 | 0 |
| SoTA task-specific model | BERT | 0.996 | 0.991 | 1.000 | 0.995 | 1.000 | 0.993 | 0 |
| | DeBERTaV3 | 0.577 | 1.000 | 0.509 | 0.675 | 0.246 | 1.000 | 0 |
| eCeLLM (Task-specific) | Flan-T5 XXL | 0.996 | 0.991 | 1.000 | 0.995 | 1.000 | 0.993 | 0 |
| | Llama-2 13B-chat | 0.996 | 0.991 | 1.000 | 0.995 | 1.000 | 0.993 | 0 |
| | Llama-2 7B-chat | 0.992 | 0.991 | 0.991 | 0.991 | 0.993 | 0.993 | 0 |
| | Mistral-7B Instruct-v0.2 | 0.988 | 0.991 | 0.982 | 0.987 | 0.986 | 0.993 | 0 |
| | Flan-T5 XL | 0.960 | 0.910 | 1.000 | 0.953 | 1.000 | 0.934 | 0 |
| | Phi-2 | 0.992 | 0.991 | 0.991 | 0.991 | 0.993 | 0.993 | 0 |
| eCeLLM (Generalist) | Flan-T5 XXL | 0.996 | 0.991 | 1.000 | 0.995 | 1.000 | 0.993 | 0 |
| | Llama-2 13B-chat | 0.996 | 0.991 | 1.000 | 0.995 | 1.000 | 0.993 | 0 |
| | Llama-2 7B-chat | 0.996 | 0.991 | 1.000 | 0.995 | 1.000 | 0.993 | 0 |
| | Mistral-7B Instruct-v0.2 | 0.996 | 0.991 | 1.000 | 0.995 | 1.000 | 0.993 | 0 |
| | Flan-T5 XL | 0.996 | 0.991 | 1.000 | 0.995 | 1.000 | 0.993 | 0 |
| | Phi-2 | 0.992 | 0.991 | 0.991 | 0.991 | 0.993 | 0.993 | 0 |
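For reference, every column except #failed is a standard binary-classification metric derived from the confusion matrix. The sketch below is illustrative only: the confusion-matrix counts are hypothetical (chosen so the output lands near the GPT-4 Turbo row, not taken from the benchmark), and NPR is read here as the negative-class counterpart of precision, TN / (TN + FN), a reading that is consistent with the reported numbers.

```python
# Minimal sketch: the table's metric columns as functions of a binary
# confusion matrix. Counts are hypothetical; division-by-zero guards omitted.

def metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute the table's metric columns from confusion-matrix counts."""
    recall = tp / (tp + fn)        # true-positive rate (sensitivity)
    precision = tp / (tp + fp)
    return {
        "Accuracy": (tp + tn) / (tp + fp + tn + fn),
        "Recall": recall,
        "Precision": precision,
        "F1": 2 * precision * recall / (precision + recall),
        "Specificity": tn / (tn + fp),  # true-negative rate
        # Assumption: NPR denotes the negative predictive value,
        # the negative-class analogue of precision.
        "NPR": tn / (tn + fn),
    }

if __name__ == "__main__":
    # Hypothetical counts that roughly reproduce the GPT-4 Turbo row
    # (accuracy 0.826, recall 0.604, precision 1.000, NPR 0.763).
    print(metrics(tp=67, fp=0, tn=141, fn=44))
```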