| Model | Accuracy (IND) | M-Rec (IND) | M-Pre (IND) | M-F1 (IND) | #failed (IND) | Accuracy (OOD) | M-Rec (OOD) | M-Pre (OOD) | M-F1 (OOD) | #failed (OOD) |
|---|---|---|---|---|---|---|---|---|---|---|
| **General-purpose LLMs** | | | | | | | | | | |
| GPT-4 Turbo | 0.384 | 0.487 | 0.381 | 0.326 | 0 | 0.488 | 0.496 | 0.392 | 0.392 | 0 |
| Gemini Pro | 0.128 | 0.385 | 0.352 | 0.136 | 1 | 0.147 | 0.359 | 0.390 | 0.123 | 0 |
| Claude 2.1 | 0.508 | 0.347 | 0.344 | 0.275 | 10 | 0.362 | 0.394 | 0.400 | 0.277 | 4 |
| Llama-2 13B-chat | 0.473 | 0.333 | 0.333 | 0.333 | 0 | 0.419 | 0.338 | 0.339 | 0.324 | 0 |
| Mistral-7B-Instruct-v0.2 | 0.442 | 0.323 | 0.325 | 0.324 | 0 | 0.422 | 0.338 | 0.351 | 0.327 | 0 |
| **E-commerce LLM** | | | | | | | | | | |
| EcomGPT | 0.147 | 0.101 | 0.101 | 0.091 | 444 | 0.125 | 0.125 | 0.092 | 0.096 | 455 |
| **SoTA task-specific models** | | | | | | | | | | |
| DeBERTaV3 | 0.762 | 0.575 | 0.620 | 0.588 | 0 | 0.658 | 0.514 | 0.570 | 0.507 | 0 |
| RGCN | 0.615 | 0.665 | 0.637 | 0.506 | 0 | 0.576 | 0.373 | 0.372 | 0.356 | 0 |
| **eCeLLM (Task-specific)** | | | | | | | | | | |
| Flan-T5 XXL | 0.754 | 0.516 | 0.511 | 0.508 | 0 | 0.663 | 0.506 | 0.468 | 0.466 | 0 |
| Llama-2 13B-chat | 0.769 | 0.530 | 0.517 | 0.521 | 0 | 0.690 | 0.520 | 0.472 | 0.483 | 0 |
| Llama-2 7B-chat | 0.774 | 0.541 | 0.628 | 0.537 | 0 | 0.695 | 0.526 | 0.803 | 0.498 | 0 |
| Mistral-7B Instruct-v0.2 | 0.782 | 0.547 | 0.689 | 0.543 | 0 | 0.711 | 0.532 | 0.808 | 0.502 | 0 |
| Flan-T5 XL | 0.704 | 0.467 | 0.496 | 0.460 | 0 | 0.592 | 0.471 | 0.625 | 0.427 | 0 |
| Phi-2 | 0.584 | 0.372 | 0.379 | 0.348 | 0 | 0.406 | 0.349 | 0.334 | 0.251 | 0 |
| **eCeLLM (Generalist)** | | | | | | | | | | |
| Flan-T5 XXL | 0.769 | 0.531 | 0.517 | 0.522 | 0 | 0.703 | 0.533 | 0.648 | 0.499 | 0 |
| Llama-2 13B-chat | 0.775 | 0.599 | 0.635 | 0.611 | 0 | 0.726 | 0.564 | 0.611 | 0.558 | 0 |
| Llama-2 7B-chat | 0.797 | 0.586 | 0.661 | 0.595 | 0 | 0.703 | 0.533 | 0.648 | 0.499 | 0 |
| Mistral-7B Instruct-v0.2 | 0.788 | 0.555 | 0.644 | 0.558 | 0 | 0.707 | 0.537 | 0.596 | 0.502 | 0 |
| Flan-T5 XL | 0.757 | 0.517 | 0.515 | 0.511 | 0 | 0.678 | 0.521 | 0.587 | 0.489 | 0 |
| Phi-2 | 0.747 | 0.524 | 0.552 | 0.518 | 0 | 0.710 | 0.541 | 0.611 | 0.520 | 0 |
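The M-Rec, M-Pre, and M-F1 columns are macro-averaged metrics: the unweighted mean of the per-class recall, precision, and F1 scores, so rare classes count as much as frequent ones. The sketch below illustrates this computation; it is a minimal illustration under standard definitions, not the paper's evaluation code, and the `macro_metrics` helper and toy labels are hypothetical. (The #failed column is reported separately in the table and is not reproduced here.)

```python
def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged recall, precision, and F1.

    Macro averaging takes the unweighted mean of the per-class scores.
    """
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        n_pred = sum(1 for p in y_pred if p == c)  # items predicted as class c
        n_true = sum(1 for t in y_true if t == c)  # gold items of class c
        prec = tp / n_pred if n_pred else 0.0
        rec = tp / n_true if n_true else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    k = len(classes)
    return accuracy, sum(recalls) / k, sum(precisions) / k, sum(f1s) / k


# Toy usage on a hypothetical binary classification task.
acc, m_rec, m_pre, m_f1 = macro_metrics(
    ["yes", "no", "no", "yes"],
    ["yes", "no", "yes", "yes"],
)
print(f"Accuracy={acc:.3f} M-Rec={m_rec:.3f} M-Pre={m_pre:.3f} M-F1={m_f1:.3f}")
```

This matches scikit-learn's `precision_recall_fscore_support(y_true, y_pred, average='macro')`, assuming the averaging is taken over all classes observed in either the gold labels or the predictions.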