| Group | Model | Recall* (IND) | Precision* (IND) | F1* (IND) | #failed (IND) | Recall* (OOD) | Precision* (OOD) | F1* (OOD) | #failed (OOD) |
|---|---|---|---|---|---|---|---|---|---|
| General-purpose LLMs | GPT-4 Turbo | 0.422 | 0.598 | 0.495 | 6 | 0.317 | 0.529 | 0.397 | 1 |
| General-purpose LLMs | Gemini Pro | 0.318 | 0.523 | 0.396 | 4 | 0.203 | 0.426 | 0.275 | 6 |
| General-purpose LLMs | Claude 2.1 | 0.310 | 0.494 | 0.381 | 59 | 0.312 | 0.600 | 0.410 | 66 |
| General-purpose LLMs | Llama-2 13B-chat | 0.002 | 0.002 | 0.002 | 0 | 0.000 | 0.000 | 0.000 | 0 |
| General-purpose LLMs | Mistral-7B-Instruct-v0.2 | 0.321 | 0.435 | 0.369 | 69 | 0.217 | 0.337 | 0.264 | 52 |
| E-commerce LLM | EcomGPT | 0.000 | 0.000 | 0.000 | 905 | 0.001 | 0.042 | 0.001 | 869 |
| SoTA task-specific model | SUOpenTag | 0.603 | 0.500 | 0.546 | 0 | 0.124 | 0.173 | 0.144 | 0 |
| SoTA task-specific model | AVEQA | 0.425 | 0.491 | 0.456 | 0 | 0.283 | 0.257 | 0.269 | 0 |
| eCeLLM (Task-specific) | Flan-T5 XXL | 0.298 | 0.519 | 0.378 | 7 | 0.362 | 0.701 | 0.477 | 0 |
| eCeLLM (Task-specific) | Llama-2 13B-chat | 0.544 | 0.666 | 0.599 | 3 | 0.448 | 0.613 | 0.518 | 2 |
| eCeLLM (Task-specific) | Llama-2 7B-chat | 0.531 | 0.660 | 0.588 | 1 | 0.323 | 0.499 | 0.392 | 0 |
| eCeLLM (Task-specific) | Mistral-7B-Instruct-v0.2 | 0.720 | 0.799 | 0.757 | 5 | 0.374 | 0.544 | 0.443 | 0 |
| eCeLLM (Task-specific) | Flan-T5 XL | 0.258 | 0.449 | 0.328 | 7 | 0.276 | 0.538 | 0.365 | 3 |
| eCeLLM (Task-specific) | Phi-2 | 0.304 | 0.570 | 0.397 | 265 | 0.288 | 0.488 | 0.362 | 0 |
| eCeLLM (Generalist) | Flan-T5 XXL | 0.353 | 0.611 | 0.447 | 2 | 0.360 | 0.699 | 0.476 | 0 |
| eCeLLM (Generalist) | Llama-2 13B-chat | 0.530 | 0.646 | 0.582 | 1 | 0.276 | 0.425 | 0.335 | 0 |
| eCeLLM (Generalist) | Llama-2 7B-chat | 0.514 | 0.641 | 0.571 | 4 | 0.236 | 0.392 | 0.294 | 0 |
| eCeLLM (Generalist) | Mistral-7B-Instruct-v0.2 | 0.612 | 0.722 | 0.662 | 0 | 0.304 | 0.463 | 0.367 | 0 |
| eCeLLM (Generalist) | Flan-T5 XL | 0.297 | 0.514 | 0.376 | 1 | 0.267 | 0.518 | 0.352 | 1 |
| eCeLLM (Generalist) | Phi-2 | 0.455 | 0.578 | 0.509 | 0 | 0.237 | 0.417 | 0.302 | 0 |
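The starred F1 values in the table are consistent with the standard harmonic mean of the listed Precision* and Recall* (e.g., for GPT-4 Turbo on IND, 2 · 0.598 · 0.422 / (0.598 + 0.422) ≈ 0.495). Below is a minimal sketch that checks this relation; the `f1_score` helper is illustrative, not part of any evaluation code referenced by the table.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0.0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Sanity checks against two rows of the table above (IND split).
assert round(f1_score(0.598, 0.422), 3) == 0.495  # GPT-4 Turbo
assert round(f1_score(0.799, 0.720), 3) == 0.757  # eCeLLM Mistral-7B (Task-specific)
```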