Results on the IND test set:

| Category | Model | Accuracy | Recall | Precision | F1 | Specificity | NPR | #failed |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| General-purpose LLMs | GPT-4 Turbo | 0.289 | 0.374 | 0.132 | 0.195 | 0.264 | 0.585 | 0 |
| General-purpose LLMs | Gemini Pro | 0.296 | 0.504 | 0.164 | 0.248 | 0.234 | 0.612 | 0 |
| General-purpose LLMs | Claude 2.1 | 0.291 | 0.578 | 0.179 | 0.273 | 0.205 | 0.620 | 1 |
| General-purpose LLMs | Llama-2 13B-chat | 0.649 | 0.257 | 0.247 | 0.252 | 0.766 | 0.775 | 0 |
| General-purpose LLMs | Mistral-7B-Instruct-v0.2 | 0.361 | 0.609 | 0.203 | 0.305 | 0.287 | 0.711 | 0 |
| E-commerce LLM | EcomGPT | 0.630 | 0.165 | 0.176 | 0.170 | 0.769 | 0.755 | 13 |
| SoTA task-specific model | BERT | 0.761 | 0.330 | 0.472 | 0.389 | 0.890 | 0.816 | 0 |
| SoTA task-specific model | DeBERTaV3 | 0.769 | 0.000 | 0.000 | 0.000 | 0.999 | 0.770 | 0 |
| eCeLLM (Task-specific) | Flan-T5 XXL | 0.766 | 0.013 | 0.300 | 0.025 | 0.991 | 0.771 | 0 |
| eCeLLM (Task-specific) | Llama-2 13B-chat | 0.770 | 0.000 | 0.000 | 0.000 | 1.000 | 0.770 | 0 |
| eCeLLM (Task-specific) | Llama-2 7B-chat | 0.770 | 0.017 | 0.500 | 0.034 | 0.995 | 0.772 | 0 |
| eCeLLM (Task-specific) | Mistral-7B-Instruct-v0.2 | 0.770 | 0.000 | 0.000 | 0.000 | 1.000 | 0.770 | 0 |
| eCeLLM (Task-specific) | Flan-T5 XL | 0.768 | 0.000 | 0.000 | 0.000 | 0.997 | 0.770 | 0 |
| eCeLLM (Task-specific) | Phi-2 | 0.770 | 0.000 | 0.000 | 0.000 | 1.000 | 0.770 | 0 |
| eCeLLM (Generalist) | Flan-T5 XXL | 0.771 | 0.300 | 0.504 | 0.376 | 0.912 | 0.813 | 0 |
| eCeLLM (Generalist) | Llama-2 13B-chat | 0.795 | 0.448 | 0.569 | 0.501 | 0.899 | 0.845 | 0 |
| eCeLLM (Generalist) | Llama-2 7B-chat | 0.781 | 0.283 | 0.546 | 0.372 | 0.930 | 0.813 | 0 |
| eCeLLM (Generalist) | Mistral-7B-Instruct-v0.2 | 0.790 | 0.200 | 0.639 | 0.305 | 0.966 | 0.802 | 0 |
| eCeLLM (Generalist) | Flan-T5 XL | 0.773 | 0.022 | 0.714 | 0.042 | 0.997 | 0.773 | 0 |
| eCeLLM (Generalist) | Phi-2 | 0.777 | 0.313 | 0.526 | 0.392 | 0.916 | 0.817 | 0 |
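For reference, the reported columns map onto the standard binary confusion-matrix quantities, and the table's numbers are consistent with NPR being the negative-class analogue of precision (negative predictive value, TN/(TN+FN)): e.g., with a roughly 23% positive prevalence, the GPT-4 Turbo row's recall, precision, specificity, and NPR values all cohere under that reading. Below is a minimal sketch of these definitions; the `BinaryConfusion` class and the 770/230 counts are illustrative assumptions, not from the source.

```python
from dataclasses import dataclass


@dataclass
class BinaryConfusion:
    """Counts from a binary evaluation.

    Assumption: the positive class is the minority (~23%) class,
    inferred from the table's accuracy/specificity pattern.
    """
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives

    def accuracy(self) -> float:
        return (self.tp + self.tn) / (self.tp + self.fp + self.tn + self.fn)

    def recall(self) -> float:  # sensitivity / true-positive rate
        return self.tp / (self.tp + self.fn) if (self.tp + self.fn) else 0.0

    def precision(self) -> float:
        return self.tp / (self.tp + self.fp) if (self.tp + self.fp) else 0.0

    def f1(self) -> float:
        p, r = self.precision(), self.recall()
        return 2 * p * r / (p + r) if (p + r) else 0.0

    def specificity(self) -> float:  # true-negative rate
        return self.tn / (self.tn + self.fp) if (self.tn + self.fp) else 0.0

    def npv(self) -> float:  # negative predictive value; the reading of NPR assumed here
        return self.tn / (self.tn + self.fn) if (self.tn + self.fn) else 0.0


# Sanity check: an always-negative predictor on a test set that is 77%
# negative reproduces the pattern of several eCeLLM (Task-specific) rows:
# accuracy 0.770, recall 0.000, specificity 1.000, NPR 0.770.
always_negative = BinaryConfusion(tp=0, fp=0, tn=770, fn=230)
assert round(always_negative.accuracy(), 3) == 0.770
assert always_negative.recall() == 0.0
assert always_negative.specificity() == 1.0
assert round(always_negative.npv(), 3) == 0.770
```

The sanity check also explains why those rows look the way they do: recall 0.000 with specificity 1.000 and accuracy equal to NPR is exactly the signature of a model that collapses to predicting the majority (negative) class.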