Results on the IND (in-domain) test set:

| Group | Model | Accuracy | Recall | Precision | F1 | Specificity | NPR | #failed |
|---|---|---|---|---|---|---|---|---|
| General-purpose LLMs | GPT-4 Turbo | 0.826 | 0.604 | 1.000 | 0.753 | 1.000 | 0.763 | 0 |
| General-purpose LLMs | Gemini Pro | 0.897 | 0.766 | 1.000 | 0.867 | 1.000 | 0.845 | 0 |
| General-purpose LLMs | Claude 2.1 | 0.711 | 0.360 | 0.952 | 0.523 | 0.986 | 0.664 | 1 |
| General-purpose LLMs | Llama-2 13B-chat | 0.474 | 0.459 | 0.411 | 0.434 | 0.486 | 0.535 | 0 |
| General-purpose LLMs | Mistral-7B-Instruct-v0.2 | 0.755 | 0.441 | 1.000 | 0.613 | 1.000 | 0.696 | 0 |
| E-commerce LLM | EcomGPT | 0.648 | 0.739 | 0.577 | 0.648 | 0.577 | 0.739 | 0 |
| SoTA task-specific model | BERT | 0.996 | 0.991 | 1.000 | 0.995 | 1.000 | 0.993 | 0 |
| SoTA task-specific model | DeBERTaV3 | 0.577 | 1.000 | 0.509 | 0.675 | 0.246 | 1.000 | 0 |
| eCeLLM (Task-specific) | Flan-T5 XXL | 0.996 | 0.991 | 1.000 | 0.995 | 1.000 | 0.993 | 0 |
| eCeLLM (Task-specific) | Llama-2 13B-chat | 0.996 | 0.991 | 1.000 | 0.995 | 1.000 | 0.993 | 0 |
| eCeLLM (Task-specific) | Llama-2 7B-chat | 0.992 | 0.991 | 0.991 | 0.991 | 0.993 | 0.993 | 0 |
| eCeLLM (Task-specific) | Mistral-7B Instruct-v0.2 | 0.988 | 0.991 | 0.982 | 0.987 | 0.986 | 0.993 | 0 |
| eCeLLM (Task-specific) | Flan-T5 XL | 0.960 | 0.910 | 1.000 | 0.953 | 1.000 | 0.934 | 0 |
| eCeLLM (Task-specific) | Phi-2 | 0.992 | 0.991 | 0.991 | 0.991 | 0.993 | 0.993 | 0 |
| eCeLLM (Generalist) | Flan-T5 XXL | 0.996 | 0.991 | 1.000 | 0.995 | 1.000 | 0.993 | 0 |
| eCeLLM (Generalist) | Llama-2 13B-chat | 0.996 | 0.991 | 1.000 | 0.995 | 1.000 | 0.993 | 0 |
| eCeLLM (Generalist) | Llama-2 7B-chat | 0.996 | 0.991 | 1.000 | 0.995 | 1.000 | 0.993 | 0 |
| eCeLLM (Generalist) | Mistral-7B Instruct-v0.2 | 0.996 | 0.991 | 1.000 | 0.995 | 1.000 | 0.993 | 0 |
| eCeLLM (Generalist) | Flan-T5 XL | 0.996 | 0.991 | 1.000 | 0.995 | 1.000 | 0.993 | 0 |
| eCeLLM (Generalist) | Phi-2 | 0.992 | 0.991 | 0.991 | 0.991 | 0.993 | 0.993 | 0 |
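For reference, the column metrics can all be derived from a binary confusion matrix. The sketch below assumes the standard definitions and additionally assumes that NPR denotes the negative predictive rate TN / (TN + FN); the table's exact computation, including how failed (unparseable) generations are handled, is not specified here.

```python
# Minimal sketch of the metrics reported in the table, under standard
# binary-classification definitions. "NPR" is ASSUMED to be the negative
# predictive rate TN / (TN + FN); #failed is assumed to count outputs
# that could not be parsed into a valid label.

def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute the table's metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0           # true positive rate
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    specificity = tn / (tn + fp) if (tn + fp) else 0.0       # true negative rate
    npr = tn / (tn + fn) if (tn + fn) else 0.0               # negative predictive rate (assumed)
    return {
        "Accuracy": accuracy, "Recall": recall, "Precision": precision,
        "F1": f1, "Specificity": specificity, "NPR": npr,
    }


if __name__ == "__main__":
    # Hypothetical confusion-matrix counts, not taken from the table above.
    print(classification_metrics(tp=110, fp=0, tn=140, fn=1))
```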