| Model | IND Recall* | IND Precision* | IND F1* | IND #failed | OOD Recall* | OOD Precision* | OOD F1* | OOD #failed |
|---|---|---|---|---|---|---|---|---|
| **General-purpose LLMs** | | | | | | | | |
| GPT-4 Turbo | 0.422 | 0.598 | 0.495 | 6 | 0.317 | 0.529 | 0.397 | 1 |
| Gemini Pro | 0.318 | 0.523 | 0.396 | 4 | 0.203 | 0.426 | 0.275 | 6 |
| Claude 2.1 | 0.310 | 0.494 | 0.381 | 59 | 0.312 | 0.600 | 0.410 | 66 |
| Llama-2 13B-chat | 0.002 | 0.002 | 0.002 | 0 | 0.000 | 0.000 | 0.000 | 0 |
| Mistral-7B-Instruct-v0.2 | 0.321 | 0.435 | 0.369 | 69 | 0.217 | 0.337 | 0.264 | 52 |
| **E-commerce LLM** | | | | | | | | |
| EcomGPT | 0.000 | 0.000 | 0.000 | 905 | 0.001 | 0.042 | 0.001 | 869 |
| **SoTA task-specific models** | | | | | | | | |
| SUOpenTag | 0.603 | 0.500 | 0.546 | 0 | 0.124 | 0.173 | 0.144 | 0 |
| AVEQA | 0.425 | 0.491 | 0.456 | 0 | 0.283 | 0.257 | 0.269 | 0 |
| **eCeLLM (task-specific)** | | | | | | | | |
| Flan-T5 XXL | 0.298 | 0.519 | 0.378 | 7 | 0.362 | 0.701 | 0.477 | 0 |
| Llama-2 13B-chat | 0.544 | 0.666 | 0.599 | 3 | 0.448 | 0.613 | 0.518 | 2 |
| Llama-2 7B-chat | 0.531 | 0.660 | 0.588 | 1 | 0.323 | 0.499 | 0.392 | 0 |
| Mistral-7B-Instruct-v0.2 | 0.720 | 0.799 | 0.757 | 5 | 0.374 | 0.544 | 0.443 | 0 |
| Flan-T5 XL | 0.258 | 0.449 | 0.328 | 7 | 0.276 | 0.538 | 0.365 | 3 |
| Phi-2 | 0.304 | 0.570 | 0.397 | 265 | 0.288 | 0.488 | 0.362 | 0 |
| **eCeLLM (generalist)** | | | | | | | | |
| Flan-T5 XXL | 0.353 | 0.611 | 0.447 | 2 | 0.360 | 0.699 | 0.476 | 0 |
| Llama-2 13B-chat | 0.530 | 0.646 | 0.582 | 1 | 0.276 | 0.425 | 0.335 | 0 |
| Llama-2 7B-chat | 0.514 | 0.641 | 0.571 | 4 | 0.236 | 0.392 | 0.294 | 0 |
| Mistral-7B-Instruct-v0.2 | 0.612 | 0.722 | 0.662 | 0 | 0.304 | 0.463 | 0.367 | 0 |
| Flan-T5 XL | 0.297 | 0.514 | 0.376 | 1 | 0.267 | 0.518 | 0.352 | 1 |
| Phi-2 | 0.455 | 0.578 | 0.509 | 0 | 0.237 | 0.417 | 0.302 | 0 |
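The starred scores are consistent with standard F1: spot-checking rows against the harmonic mean of the rounded Recall* and Precision* values reproduces the reported F1* to within rounding error. A minimal sanity-check sketch, with the triples copied from the IND columns above (the `f1` helper and `rows` dict are ours for illustration, not the paper's evaluation code):

```python
def f1(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision; 0.0 when both are 0."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

# (Recall*, Precision*, reported F1*) from the IND columns of the table above.
rows = {
    "GPT-4 Turbo": (0.422, 0.598, 0.495),
    "Mistral-7B-Instruct-v0.2 (task-specific)": (0.720, 0.799, 0.757),
    "Phi-2 (generalist)": (0.455, 0.578, 0.509),
}

for model, (r, p, reported) in rows.items():
    # Allow slack of one unit in the third decimal, since r and p are rounded.
    assert abs(f1(r, p) - reported) < 1e-3, model
    print(f"{model}: recomputed F1 = {f1(r, p):.3f} (reported {reported})")
```

Rows with near-zero scores (e.g. EcomGPT) can deviate in the last decimal because the rounding error in Recall* and Precision* dominates at that scale.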