IND and OOD denote the in-domain and out-of-domain evaluation sets; R_BERT, P_BERT, and F_BERT are BERTScore recall, precision, and F1; #failed is the number of failed generations.

| Model group | Model | IND R_BERT | IND P_BERT | IND F_BERT | IND BLEURT | IND #failed | OOD R_BERT | OOD P_BERT | OOD F_BERT | OOD BLEURT | OOD #failed |
|---|---|---|---|---|---|---|---|---|---|---|---|
| General-purpose LLMs | GPT-4 Turbo | 0.847 | 0.869 | 0.858 | 0.280 | 0 | 0.852 | 0.868 | 0.860 | 0.283 | 0 |
| General-purpose LLMs | Gemini Pro | 0.844 | 0.867 | 0.855 | 0.269 | 0 | 0.847 | 0.866 | 0.856 | 0.264 | 0 |
| General-purpose LLMs | Claude 2.1 | 0.848 | 0.835 | 0.841 | 0.314 | 0 | 0.851 | 0.833 | 0.842 | 0.325 | 0 |
| General-purpose LLMs | Llama-2 13B-chat | 0.845 | 0.780 | 0.811 | 0.261 | 0 | 0.845 | 0.775 | 0.808 | 0.260 | 0 |
| General-purpose LLMs | Mistral-7B-Instruct-v0.2 | 0.850 | 0.856 | 0.853 | 0.288 | 0 | 0.852 | 0.851 | 0.851 | 0.290 | 0 |
| E-commerce LLM | EcomGPT | 0.675 | 0.665 | 0.669 | 0.290 | 0 | 0.729 | 0.716 | 0.722 | 0.296 | 0 |
| SoTA task-specific model | GPT-4 Turbo | 0.847 | 0.869 | 0.858 | 0.280 | 0 | 0.852 | 0.868 | 0.860 | 0.283 | 0 |
| eCeLLM (Task-specific) | Flan-T5 XXL | 0.822 | 0.864 | 0.842 | 0.310 | 0 | 0.824 | 0.865 | 0.843 | 0.302 | 0 |
| eCeLLM (Task-specific) | Llama-2 13B-chat | 0.824 | 0.861 | 0.841 | 0.309 | 0 | 0.821 | 0.860 | 0.840 | 0.289 | 0 |
| eCeLLM (Task-specific) | Llama-2 7B-chat | 0.822 | 0.861 | 0.841 | 0.301 | 0 | 0.820 | 0.861 | 0.840 | 0.289 | 0 |
| eCeLLM (Task-specific) | Mistral-7B Instruct-v0.2 | 0.823 | 0.860 | 0.841 | 0.310 | 0 | 0.823 | 0.861 | 0.842 | 0.298 | 0 |
| eCeLLM (Task-specific) | Flan-T5 XL | 0.823 | 0.864 | 0.843 | 0.320 | 0 | 0.824 | 0.864 | 0.843 | 0.307 | 0 |
| eCeLLM (Task-specific) | Phi-2 | 0.817 | 0.855 | 0.835 | 0.283 | 0 | 0.817 | 0.856 | 0.835 | 0.270 | 0 |
| eCeLLM (Generalist) | Flan-T5 XXL | 0.824 | 0.865 | 0.844 | 0.224 | 0 | 0.823 | 0.864 | 0.843 | 0.206 | 0 |
| eCeLLM (Generalist) | Llama-2 13B-chat | 0.823 | 0.861 | 0.841 | 0.215 | 0 | 0.822 | 0.861 | 0.841 | 0.195 | 0 |
| eCeLLM (Generalist) | Llama-2 7B-chat | 0.822 | 0.860 | 0.840 | 0.208 | 0 | 0.819 | 0.859 | 0.838 | 0.188 | 0 |
| eCeLLM (Generalist) | Mistral-7B Instruct-v0.2 | 0.822 | 0.864 | 0.842 | 0.213 | 0 | 0.821 | 0.862 | 0.840 | 0.194 | 0 |
| eCeLLM (Generalist) | Flan-T5 XL | 0.823 | 0.864 | 0.843 | 0.227 | 0 | 0.824 | 0.865 | 0.844 | 0.211 | 0 |
| eCeLLM (Generalist) | Phi-2 | 0.823 | 0.861 | 0.842 | 0.222 | 0 | 0.821 | 0.859 | 0.840 | 0.198 | 0 |
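For reference, metrics of the kind reported above can be computed with the open-source `bert-score` and `bleurt` packages. The sketch below is a minimal, assumption-laden example: the generation and reference strings are placeholders, and the BLEURT checkpoint path (`BLEURT-20`) is an assumed choice rather than the one used for this table.

```python
# Minimal sketch: compute BERTScore (P_BERT, R_BERT, F_BERT) and BLEURT over a
# list of model generations and matching reference answers.
from bert_score import score as bert_score      # pip install bert-score
from bleurt import score as bleurt_score        # pip install from google-research/bleurt

candidates = ["model-generated answer ..."]     # one generation per test example (placeholder)
references = ["ground-truth answer ..."]        # matching reference answers (placeholder)

# BERTScore returns precision, recall, and F1 tensors over the whole batch.
P, R, F1 = bert_score(candidates, references, lang="en")
print(f"P_BERT={P.mean().item():.3f}  R_BERT={R.mean().item():.3f}  F_BERT={F1.mean().item():.3f}")

# BLEURT scores each candidate against its reference with a learned checkpoint.
scorer = bleurt_score.BleurtScorer("BLEURT-20")  # path to a downloaded checkpoint (assumed)
bleurt_vals = scorer.score(references=references, candidates=candidates)
print(f"BLEURT={sum(bleurt_vals) / len(bleurt_vals):.3f}")
```

Scores are averaged over the test set, which matches how a single per-model number per column would typically be obtained.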