Results in the in-domain (IND) and out-of-domain (OOD) settings.

| Model | IND Accuracy | IND Recall | IND Precision | IND F1 | IND Specificity | IND NPR | IND #failed | OOD Accuracy | OOD Recall | OOD Precision | OOD F1 | OOD Specificity | OOD NPR | OOD #failed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **General-purpose LLMs** | | | | | | | | | | | | | | |
| GPT-4 Turbo | 0.623 | 0.550 | 0.791 | 0.649 | 0.749 | 0.491 | 5 | 0.641 | 0.555 | 0.876 | 0.680 | 0.828 | 0.460 | 2 |
| Gemini Pro | 0.542 | 0.371 | 0.797 | 0.506 | 0.837 | 0.435 | 13 | 0.543 | 0.410 | 0.844 | 0.552 | 0.834 | 0.393 | 6 |
| Claude 2.1 | 0.424 | 0.177 | 0.671 | 0.280 | 0.850 | 0.375 | 90 | 0.384 | 0.146 | 0.769 | 0.245 | 0.904 | 0.326 | 73 |
| Llama-2 13B-chat | 0.534 | 0.608 | 0.638 | 0.623 | 0.406 | 0.375 | 0 | 0.541 | 0.605 | 0.688 | 0.644 | 0.401 | 0.317 | 0 |
| Mistral-7B-Instruct-v0.2 | 0.522 | 0.539 | 0.647 | 0.588 | 0.493 | 0.383 | 120 | 0.537 | 0.523 | 0.725 | 0.608 | 0.567 | 0.352 | 126 |
| **E-commerce LLM** | | | | | | | | | | | | | | |
| EcomGPT | 0.318 | 0.051 | 0.283 | 0.086 | 0.779 | 0.322 | 254 | 0.286 | 0.085 | 0.403 | 0.140 | 0.726 | 0.266 | 245 |
| **SoTA task-specific model** | | | | | | | | | | | | | | |
| BERT | 0.749 | 0.970 | 0.726 | 0.830 | 0.368 | 0.877 | 0 | 0.803 | 0.831 | 0.876 | 0.853 | 0.742 | 0.668 | 0 |
| DeBERTaV3 | 0.504 | 0.310 | 0.769 | 0.441 | 0.839 | 0.413 | 0 | 0.501 | 0.345 | 0.826 | 0.487 | 0.841 | 0.370 | 0 |
| **eCeLLM (Task-specific)** | | | | | | | | | | | | | | |
| Flan-T5 XXL | 0.749 | 0.875 | 0.763 | 0.815 | 0.531 | 0.712 | 0 | 0.799 | 0.889 | 0.830 | 0.859 | 0.602 | 0.713 | 0 |
| Llama-2 13B-chat | 0.801 | 0.919 | 0.797 | 0.854 | 0.597 | 0.811 | 0 | 0.838 | 0.924 | 0.852 | 0.887 | 0.650 | 0.797 | 0 |
| Llama-2 7B-chat | 0.741 | 0.889 | 0.749 | 0.813 | 0.485 | 0.718 | 0 | 0.761 | 0.864 | 0.802 | 0.832 | 0.535 | 0.644 | 0 |
| Mistral-7B-Instruct-v0.2 | 0.821 | 0.896 | 0.834 | 0.864 | 0.692 | 0.794 | 0 | 0.835 | 0.891 | 0.872 | 0.881 | 0.713 | 0.749 | 0 |
| Flan-T5 XL | 0.693 | 0.684 | 0.802 | 0.738 | 0.708 | 0.565 | 0 | 0.707 | 0.682 | 0.862 | 0.762 | 0.761 | 0.523 | 0 |
| Phi-2 | 0.765 | 0.942 | 0.751 | 0.835 | 0.460 | 0.820 | 0 | 0.781 | 0.950 | 0.779 | 0.856 | 0.411 | 0.791 | 0 |
| **eCeLLM (Generalist)** | | | | | | | | | | | | | | |
| Flan-T5 XXL | 0.774 | 0.910 | 0.773 | 0.836 | 0.540 | 0.776 | 0 | 0.814 | 0.914 | 0.832 | 0.871 | 0.596 | 0.760 | 0 |
| Llama-2 13B-chat | 0.808 | 0.864 | 0.838 | 0.851 | 0.711 | 0.752 | 0 | 0.818 | 0.862 | 0.872 | 0.867 | 0.723 | 0.705 | 0 |
| Llama-2 7B-chat | 0.817 | 0.921 | 0.814 | 0.864 | 0.638 | 0.824 | 0 | 0.836 | 0.931 | 0.845 | 0.886 | 0.627 | 0.807 | 0 |
| Mistral-7B-Instruct-v0.2 | 0.797 | 0.880 | 0.814 | 0.846 | 0.654 | 0.759 | 0 | 0.832 | 0.885 | 0.872 | 0.878 | 0.717 | 0.740 | 0 |
| Flan-T5 XL | 0.765 | 0.888 | 0.774 | 0.827 | 0.553 | 0.741 | 0 | 0.819 | 0.891 | 0.852 | 0.871 | 0.662 | 0.735 | 0 |
| Phi-2 | 0.794 | 0.897 | 0.801 | 0.846 | 0.616 | 0.777 | 0 | 0.823 | 0.937 | 0.828 | 0.879 | 0.573 | 0.807 | 0 |
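For reference, the columns above are standard binary-classification statistics. The sketch below shows one way to compute them from raw predictions; it assumes NPR denotes the negative predictive rate TN/(TN+FN) and that #failed counts model responses that could not be parsed into a valid label, and it excludes such responses from the other metrics — all three points are assumptions on our part, since the table itself does not define them.

```python
def classification_metrics(y_true, y_pred):
    """Compute the table's metrics from binary labels.

    y_true: sequence of gold labels (0 or 1).
    y_pred: sequence of predicted labels; None marks a response that
            failed to parse (counted in "#failed" and skipped elsewhere).
    """
    tp = fp = tn = fn = failed = 0
    for gold, pred in zip(y_true, y_pred):
        if pred is None:          # unparseable model output
            failed += 1
        elif pred == 1:
            tp += gold == 1       # predicted positive, actually positive
            fp += gold == 0       # predicted positive, actually negative
        else:
            tn += gold == 0       # predicted negative, actually negative
            fn += gold == 1       # predicted negative, actually positive

    def safe(num, den):
        return num / den if den else 0.0

    precision = safe(tp, tp + fp)
    recall = safe(tp, tp + fn)
    return {
        "accuracy": safe(tp + tn, tp + fp + tn + fn),
        "recall": recall,
        "precision": precision,
        "f1": safe(2 * precision * recall, precision + recall),
        "specificity": safe(tn, tn + fp),          # TN / (TN + FP)
        "npr": safe(tn, tn + fn),                  # assumed: TN / (TN + FN)
        "#failed": failed,
    }
```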