Results on the IND (in-domain) test set:

| Group | Model | Accuracy | Recall | Precision | F1 | Specificity | NPR | #failed |
|---|---|---|---|---|---|---|---|---|
| General-purpose LLMs | GPT-4 Turbo | 0.289 | 0.374 | 0.132 | 0.195 | 0.264 | 0.585 | 0 |
| | Gemini Pro | 0.296 | 0.504 | 0.164 | 0.248 | 0.234 | 0.612 | 0 |
| | Claude 2.1 | 0.291 | 0.578 | 0.179 | 0.273 | 0.205 | 0.620 | 1 |
| | Llama-2 13B-chat | 0.649 | 0.257 | 0.247 | 0.252 | 0.766 | 0.775 | 0 |
| | Mistral-7B-Instruct-v0.2 | 0.361 | 0.609 | 0.203 | 0.305 | 0.287 | 0.711 | 0 |
| E-commerce LLM | EcomGPT | 0.630 | 0.165 | 0.176 | 0.170 | 0.769 | 0.755 | 13 |
| SoTA task-specific model | BERT | 0.761 | 0.330 | 0.472 | 0.389 | 0.890 | 0.816 | 0 |
| | DeBERTaV3 | 0.769 | 0.000 | 0.000 | 0.000 | 0.999 | 0.770 | 0 |
| eCeLLM (task-specific) | Flan-T5 XXL | 0.766 | 0.013 | 0.300 | 0.025 | 0.991 | 0.771 | 0 |
| | Llama-2 13B-chat | 0.770 | 0.000 | 0.000 | 0.000 | 1.000 | 0.770 | 0 |
| | Llama-2 7B-chat | 0.770 | 0.017 | 0.500 | 0.034 | 0.995 | 0.772 | 0 |
| | Mistral-7B Instruct-v0.2 | 0.770 | 0.000 | 0.000 | 0.000 | 1.000 | 0.770 | 0 |
| | Flan-T5 XL | 0.768 | 0.000 | 0.000 | 0.000 | 0.997 | 0.770 | 0 |
| | Phi-2 | 0.770 | 0.000 | 0.000 | 0.000 | 1.000 | 0.770 | 0 |
| eCeLLM (generalist) | Flan-T5 XXL | 0.771 | 0.300 | 0.504 | 0.376 | 0.912 | 0.813 | 0 |
| | Llama-2 13B-chat | 0.795 | 0.448 | 0.569 | 0.501 | 0.899 | 0.845 | 0 |
| | Llama-2 7B-chat | 0.781 | 0.283 | 0.546 | 0.372 | 0.930 | 0.813 | 0 |
| | Mistral-7B Instruct-v0.2 | 0.790 | 0.200 | 0.639 | 0.305 | 0.966 | 0.802 | 0 |
| | Flan-T5 XL | 0.773 | 0.022 | 0.714 | 0.042 | 0.997 | 0.773 | 0 |
| | Phi-2 | 0.777 | 0.313 | 0.526 | 0.392 | 0.916 | 0.817 | 0 |
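All six quality metrics in the table are functions of the binary confusion matrix, which also makes the degenerate rows easy to read: recall and precision of 0.000 alongside specificity at or near 1.000 means the model predicted the negative class for essentially every example. The sketch below shows the standard computations; interpreting "NPR" as the negative predictive rate, TN / (TN + FN), is an assumption on our part, and the helper name `binary_metrics` is purely illustrative.

```python
def binary_metrics(y_true, y_pred):
    """Compute the table's metrics from 0/1 labels and predictions.

    "NPR" is assumed here to be the negative predictive rate,
    TN / (TN + FN); the source does not define the abbreviation.
    """
    # Tally the four cells of the confusion matrix.
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

    def div(a, b):  # guard against empty denominators
        return a / b if b else 0.0

    recall = div(tp, tp + fn)      # TP / (TP + FN)
    precision = div(tp, tp + fp)   # TP / (TP + FP)
    return {
        "accuracy": div(tp + tn, tp + fp + tn + fn),
        "recall": recall,
        "precision": precision,
        "f1": div(2 * precision * recall, precision + recall),
        "specificity": div(tn, tn + fp),  # TN / (TN + FP)
        "npr": div(tn, tn + fn),          # assumed: negative predictive rate
    }
```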