Table: IND (in-domain) results. M-Rec, M-Pre, and M-F1 denote macro-averaged recall, precision, and F1; #failed is the number of test examples for which the model failed to produce a usable answer.

| Group | Model | Accuracy | M-Rec | M-Pre | M-F1 | #failed |
|---|---|---|---|---|---|---|
| General-purpose LLMs | GPT-4 Turbo | 0.611 | 0.527 | 0.540 | 0.487 | 0 |
| | Gemini Pro | 0.584 | 0.471 | 0.414 | 0.425 | 2 |
| | Claude 2.1 | 0.655 | 0.464 | 0.419 | 0.435 | 13 |
| | Llama-2 13B-chat | 0.504 | 0.250 | 0.251 | 0.250 | 0 |
| | Mistral-7B-Instruct-v0.2 | 0.529 | 0.395 | 0.384 | 0.365 | 0 |
| E-commerce LLM | EcomGPT | 0.540 | 0.265 | 0.218 | 0.223 | 2 |
| SoTA task-specific models | BERT | 0.661 | 0.381 | 0.423 | 0.393 | 0 |
| | DeBERTaV3 | 0.703 | 0.436 | 0.472 | 0.448 | 0 |
| eCeLLM (task-specific) | Flan-T5 XXL | 0.666 | 0.438 | 0.412 | 0.346 | 0 |
| | Llama-2 13B-chat | 0.655 | 0.399 | 0.410 | 0.349 | 0 |
| | Llama-2 7B-chat | 0.659 | 0.399 | 0.531 | 0.330 | 0 |
| | Mistral-7B Instruct-v0.2 | 0.681 | 0.406 | 0.423 | 0.387 | 0 |
| | Flan-T5 XL | 0.648 | 0.425 | 0.361 | 0.327 | 0 |
| | Phi-2 | 0.646 | 0.387 | 0.316 | 0.321 | 0 |
| eCeLLM (generalist) | Flan-T5 XXL | 0.680 | 0.431 | 0.416 | 0.364 | 0 |
| | Llama-2 13B-chat | 0.684 | 0.440 | 0.435 | 0.414 | 0 |
| | Llama-2 7B-chat | 0.679 | 0.427 | 0.434 | 0.398 | 0 |
| | Mistral-7B Instruct-v0.2 | 0.696 | 0.450 | 0.456 | 0.443 | 0 |
| | Flan-T5 XL | 0.663 | 0.395 | 0.533 | 0.332 | 0 |
| | Phi-2 | 0.650 | 0.397 | 0.410 | 0.335 | 0 |
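For reference, a minimal sketch of how the metric columns above are typically computed, assuming standard macro averaging over classes (e.g., scikit-learn's implementation; the labels below are hypothetical and not drawn from the evaluation data):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical gold labels and model predictions for a 3-class task.
y_true = ["home", "sports", "toys", "home", "sports"]
y_pred = ["home", "toys", "toys", "home", "sports"]

# Accuracy is computed over all examples; the macro metrics average
# per-class precision/recall/F1, weighting each class equally.
# (Handling of unparseable responses, counted as #failed in the table,
# is omitted here.)
accuracy = accuracy_score(y_true, y_pred)
m_pre, m_rec, m_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"Accuracy={accuracy:.3f}  M-Rec={m_rec:.3f}  "
      f"M-Pre={m_pre:.3f}  M-F1={m_f1:.3f}")
```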