IND denotes the in-domain test set and OOD the out-of-domain test set. R_BERT, P_BERT, and F_BERT are BERTScore recall, precision, and F1; #failed is the number of test examples for which the model failed to produce an answer.

| Model | R_BERT (IND) | P_BERT (IND) | F_BERT (IND) | BLEURT (IND) | #failed (IND) | R_BERT (OOD) | P_BERT (OOD) | F_BERT (OOD) | BLEURT (OOD) | #failed (OOD) |
|---|---|---|---|---|---|---|---|---|---|---|
| **General-purpose LLMs** | | | | | | | | | | |
| GPT-4 Turbo | 0.847 | 0.869 | 0.858 | 0.280 | 0 | 0.852 | 0.868 | 0.860 | 0.283 | 0 |
| Gemini Pro | 0.844 | 0.867 | 0.855 | 0.269 | 0 | 0.847 | 0.866 | 0.856 | 0.264 | 0 |
| Claude 2.1 | 0.848 | 0.835 | 0.841 | 0.314 | 0 | 0.851 | 0.833 | 0.842 | 0.325 | 0 |
| Llama-2 13B-chat | 0.845 | 0.780 | 0.811 | 0.261 | 0 | 0.845 | 0.775 | 0.808 | 0.260 | 0 |
| Mistral-7B-Instruct-v0.2 | 0.850 | 0.856 | 0.853 | 0.288 | 0 | 0.852 | 0.851 | 0.851 | 0.290 | 0 |
| **E-commerce LLM** | | | | | | | | | | |
| EcomGPT | 0.675 | 0.665 | 0.669 | 0.290 | 0 | 0.729 | 0.716 | 0.722 | 0.296 | 0 |
| **SoTA task-specific model** | | | | | | | | | | |
| GPT-4 Turbo | 0.847 | 0.869 | 0.858 | 0.280 | 0 | 0.852 | 0.868 | 0.860 | 0.283 | 0 |
| **Task-specific eCeLLM** | | | | | | | | | | |
| Flan-T5 XXL | 0.822 | 0.864 | 0.842 | 0.310 | 0 | 0.824 | 0.865 | 0.843 | 0.302 | 0 |
| Llama-2 13B-chat | 0.824 | 0.861 | 0.841 | 0.309 | 0 | 0.821 | 0.860 | 0.840 | 0.289 | 0 |
| Llama-2 7B-chat | 0.822 | 0.861 | 0.841 | 0.301 | 0 | 0.820 | 0.861 | 0.840 | 0.289 | 0 |
| Mistral-7B-Instruct-v0.2 | 0.823 | 0.860 | 0.841 | 0.310 | 0 | 0.823 | 0.861 | 0.842 | 0.298 | 0 |
| Flan-T5 XL | 0.823 | 0.864 | 0.843 | 0.320 | 0 | 0.824 | 0.864 | 0.843 | 0.307 | 0 |
| Phi-2 | 0.817 | 0.855 | 0.835 | 0.283 | 0 | 0.817 | 0.856 | 0.835 | 0.270 | 0 |
| **Generalist eCeLLM** | | | | | | | | | | |
| Flan-T5 XXL | 0.824 | 0.865 | 0.844 | 0.224 | 0 | 0.823 | 0.864 | 0.843 | 0.206 | 0 |
| Llama-2 13B-chat | 0.823 | 0.861 | 0.841 | 0.215 | 0 | 0.822 | 0.861 | 0.841 | 0.195 | 0 |
| Llama-2 7B-chat | 0.822 | 0.860 | 0.840 | 0.208 | 0 | 0.819 | 0.859 | 0.838 | 0.188 | 0 |
| Mistral-7B-Instruct-v0.2 | 0.822 | 0.864 | 0.842 | 0.213 | 0 | 0.821 | 0.862 | 0.840 | 0.194 | 0 |
| Flan-T5 XL | 0.823 | 0.864 | 0.843 | 0.227 | 0 | 0.824 | 0.865 | 0.844 | 0.211 | 0 |
| Phi-2 | 0.823 | 0.861 | 0.842 | 0.222 | 0 | 0.821 | 0.859 | 0.840 | 0.198 | 0 |
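For reference, the metric columns above can be computed from model outputs with off-the-shelf scorers. The sketch below is a minimal illustration, not the paper's exact evaluation code: it assumes the `bert-score` package with its default English model and Google's `bleurt` package with a locally downloaded BLEURT-20 checkpoint, neither of which is specified by the table itself.

```python
# Minimal sketch: compute the table's metrics for one model's outputs.
# Assumptions (not stated in the table): bert-score's default English model,
# and the BLEURT-20 checkpoint; the paper's scorer configuration may differ.
from bert_score import score as bert_score
from bleurt import score as bleurt_score

candidates = ["The charger works with all USB-C laptops."]     # model answers
references = ["Yes, this charger supports any USB-C laptop."]  # gold answers

# BERTScore returns per-example precision, recall, and F1 tensors;
# their means correspond to the P_BERT, R_BERT, and F_BERT columns.
P, R, F1 = bert_score(candidates, references, lang="en")
print(f"P_BERT={P.mean().item():.3f} "
      f"R_BERT={R.mean().item():.3f} "
      f"F_BERT={F1.mean().item():.3f}")

# BLEURT scores each candidate against its reference; the column reports the mean.
scorer = bleurt_score.BleurtScorer("BLEURT-20")  # path to the downloaded checkpoint
bleurt_vals = scorer.score(references=references, candidates=candidates)
print(f"BLEURT={sum(bleurt_vals) / len(bleurt_vals):.3f}")

# #failed presumably counts examples where a model returned no usable answer;
# such examples cannot be scored and are tallied separately from the means above.
```

All rows in this table report #failed = 0, so every reported mean is taken over the full test set.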