IND: in-domain test set; OOD: out-of-domain test set.

Model                       IND                                    OOD
                            Acc    M-Rec  M-Pre  M-F1   #failed    Acc    M-Rec  M-Pre  M-F1   #failed
General-purpose LLMs
  GPT-4 Turbo               0.595  0.575  0.544  0.516  0          0.556  0.586  0.544  0.510  0
  Gemini Pro                0.609  0.521  0.453  0.470  2          0.572  0.511  0.444  0.454  1
  Claude 2.1                0.375  0.510  0.474  0.415  2          0.328  0.466  0.447  0.369  1
  Llama-2 13B-chat          0.406  0.188  0.191  0.188  0          0.384  0.179  0.180  0.178  0
  Mistral-7B-Instruct-v0.2  0.633  0.532  0.551  0.470  0          0.594  0.531  0.494  0.438  0

E-commerce LLM
  EcomGPT                   0.191  0.362  0.341  0.188  6          0.196  0.375  0.336  0.178  13

SoTA task-specific model
  BERTweet                  0.733  0.503  0.530  0.511  0          0.729  0.507  0.524  0.513  0
  DeBERTaV3                 0.768  0.567  0.607  0.573  0          0.764  0.565  0.591  0.567  0
  P5                        0.611  0.199  0.157  0.156  0          0.620  0.200  0.124  0.153  0

eCeLLM (task-specific)
  Flan-T5 XXL               0.783  0.619  0.618  0.612  0          0.770  0.604  0.601  0.600  0
  Llama-2 13B-chat          0.791  0.616  0.641  0.616  0          0.781  0.627  0.645  0.629  0
  Llama-2 7B-chat           0.790  0.620  0.652  0.634  0          0.769  0.583  0.599  0.589  0
  Mistral-7B Instruct-v0.2  0.801  0.643  0.676  0.655  0          0.789  0.619  0.650  0.632  0
  Flan-T5 XL                0.771  0.645  0.638  0.620  0          0.743  0.594  0.592  0.582  0
  Phi-2                     0.779  0.611  0.618  0.608  0          0.754  0.576  0.594  0.583  0

eCeLLM (generalist)
  Flan-T5 XXL               0.797  0.629  0.646  0.628  0          0.787  0.619  0.624  0.619  0
  Llama-2 13B-chat          0.796  0.641  0.661  0.648  0          0.785  0.621  0.638  0.629  0
  Llama-2 7B-chat           0.768  0.579  0.589  0.580  0          0.776  0.599  0.626  0.606  0
  Mistral-7B Instruct-v0.2  0.781  0.630  0.654  0.639  0          0.784  0.630  0.653  0.640  0
  Flan-T5 XL                0.782  0.654  0.655  0.648  0          0.753  0.604  0.598  0.598  0
  Phi-2                     0.780  0.588  0.619  0.596  0          0.758  0.552  0.590  0.565  0
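
For readers reproducing numbers of this kind, the sketch below shows how accuracy and macro-averaged recall, precision, and F1 are conventionally computed for a classification task with scikit-learn. It is an illustration only: the labels and predictions are invented, and the reading of M-Rec/M-Pre/M-F1 as macro-averaged metrics is our assumption, not something stated in the table.

    # Minimal sketch (assumption: "M-" denotes macro-averaging over classes).
    # The label set and predictions below are hypothetical, for illustration only.
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    y_true = ["positive", "negative", "neutral", "positive", "negative"]  # hypothetical gold labels
    y_pred = ["positive", "neutral", "neutral", "positive", "negative"]   # hypothetical model outputs

    acc = accuracy_score(y_true, y_pred)                                        # Acc
    m_rec = recall_score(y_true, y_pred, average="macro", zero_division=0)      # M-Rec
    m_pre = precision_score(y_true, y_pred, average="macro", zero_division=0)   # M-Pre
    m_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)           # M-F1 (mean of per-class F1)

    print(f"Acc={acc:.3f}  M-Rec={m_rec:.3f}  M-Pre={m_pre:.3f}  M-F1={m_f1:.3f}")

Note that macro-F1 here is the mean of per-class F1 scores (what average="macro" computes), not the harmonic mean of macro precision and macro recall; the two can differ on imbalanced label sets.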