IND (in-domain) evaluation:

| Category | Model | Accuracy | Recall | Precision | F1 | Specificity | NPV | #failed |
|---|---|---|---|---|---|---|---|---|
| General-purpose LLMs | GPT-4 Turbo | 0.826 | 0.604 | 1.000 | 0.753 | 1.000 | 0.763 | 0 |
| | Gemini Pro | 0.897 | 0.766 | 1.000 | 0.867 | 1.000 | 0.845 | 0 |
| | Claude 2.1 | 0.711 | 0.360 | 0.952 | 0.523 | 0.986 | 0.664 | 1 |
| | Llama-2 13B-chat | 0.474 | 0.459 | 0.411 | 0.434 | 0.486 | 0.535 | 0 |
| | Mistral-7B-Instruct-v0.2 | 0.755 | 0.441 | 1.000 | 0.613 | 1.000 | 0.696 | 0 |
| E-commerce LLM | EcomGPT | 0.648 | 0.739 | 0.577 | 0.648 | 0.577 | 0.739 | 0 |
| SoTA task-specific model | BERT | 0.996 | 0.991 | 1.000 | 0.995 | 1.000 | 0.993 | 0 |
| | DeBERTaV3 | 0.577 | 1.000 | 0.509 | 0.675 | 0.246 | 1.000 | 0 |
| eCeLLM (task-specific) | Flan-T5 XXL | 0.996 | 0.991 | 1.000 | 0.995 | 1.000 | 0.993 | 0 |
| | Llama-2 13B-chat | 0.996 | 0.991 | 1.000 | 0.995 | 1.000 | 0.993 | 0 |
| | Llama-2 7B-chat | 0.992 | 0.991 | 0.991 | 0.991 | 0.993 | 0.993 | 0 |
| | Mistral-7B-Instruct-v0.2 | 0.988 | 0.991 | 0.982 | 0.987 | 0.986 | 0.993 | 0 |
| | Flan-T5 XL | 0.960 | 0.910 | 1.000 | 0.953 | 1.000 | 0.934 | 0 |
| | Phi-2 | 0.992 | 0.991 | 0.991 | 0.991 | 0.993 | 0.993 | 0 |
| eCeLLM (generalist) | Flan-T5 XXL | 0.996 | 0.991 | 1.000 | 0.995 | 1.000 | 0.993 | 0 |
| | Llama-2 13B-chat | 0.996 | 0.991 | 1.000 | 0.995 | 1.000 | 0.993 | 0 |
| | Llama-2 7B-chat | 0.996 | 0.991 | 1.000 | 0.995 | 1.000 | 0.993 | 0 |
| | Mistral-7B-Instruct-v0.2 | 0.996 | 0.991 | 1.000 | 0.995 | 1.000 | 0.993 | 0 |
| | Flan-T5 XL | 0.996 | 0.991 | 1.000 | 0.995 | 1.000 | 0.993 | 0 |
| | Phi-2 | 0.992 | 0.991 | 0.991 | 0.991 | 0.993 | 0.993 | 0 |

NPV: negative predictive value. #failed: number of failed model responses.
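As a reference for the column definitions, the sketch below computes all six reported metrics from a binary confusion matrix. It is illustrative only: the function name is hypothetical, and the confusion-matrix counts are back-solved so the printed output happens to match the GPT-4 Turbo row after rounding; they are not the actual test-set counts.

```python
# Minimal sketch of the metrics reported in the table, derived from a
# binary confusion matrix. All counts below are hypothetical, not source data.

def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute the table's six metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)  # positive predictive value
    recall = tp / (tp + fn)     # true positive rate (sensitivity)
    return {
        "Accuracy": (tp + tn) / (tp + fp + tn + fn),
        "Recall": recall,
        "Precision": precision,
        "F1": 2 * precision * recall / (precision + recall),
        "Specificity": tn / (tn + fp),  # true negative rate
        "NPV": tn / (tn + fn),          # negative predictive value
    }

# Hypothetical counts chosen to reproduce the GPT-4 Turbo row
# (0.826 / 0.604 / 1.000 / 0.753 / 1.000 / 0.763) at three decimals.
for name, value in classification_metrics(tp=67, fp=0, tn=142, fn=44).items():
    print(f"{name}: {value:.3f}")
```

Several rows share the pattern of perfect Precision and Specificity alongside lower Recall and NPV, which corresponds to a classifier that produces no false positives but misses some true positives; Specificity and NPV are the negative-class counterparts of Recall and Precision.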