Metrics are reported separately for in-domain (IND) and out-of-domain (OOD) evaluation.

| Group | Model | IND Accuracy | IND Recall | IND Precision | IND F1 | IND Specificity | IND NPR | IND #failed | OOD Accuracy | OOD Recall | OOD Precision | OOD F1 | OOD Specificity | OOD NPR | OOD #failed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| General-purpose LLMs | GPT-4 Turbo | 0.623 | 0.550 | 0.791 | 0.649 | 0.749 | 0.491 | 5 | 0.641 | 0.555 | 0.876 | 0.680 | 0.828 | 0.460 | 2 |
| General-purpose LLMs | Gemini Pro | 0.542 | 0.371 | 0.797 | 0.506 | 0.837 | 0.435 | 13 | 0.543 | 0.410 | 0.844 | 0.552 | 0.834 | 0.393 | 6 |
| General-purpose LLMs | Claude 2.1 | 0.424 | 0.177 | 0.671 | 0.280 | 0.850 | 0.375 | 90 | 0.384 | 0.146 | 0.769 | 0.245 | 0.904 | 0.326 | 73 |
| General-purpose LLMs | Llama-2 13B-chat | 0.534 | 0.608 | 0.638 | 0.623 | 0.406 | 0.375 | 0 | 0.541 | 0.605 | 0.688 | 0.644 | 0.401 | 0.317 | 0 |
| General-purpose LLMs | Mistral-7B-Instruct-v0.2 | 0.522 | 0.539 | 0.647 | 0.588 | 0.493 | 0.383 | 120 | 0.537 | 0.523 | 0.725 | 0.608 | 0.567 | 0.352 | 126 |
| E-commerce LLM | EcomGPT | 0.318 | 0.051 | 0.283 | 0.086 | 0.779 | 0.322 | 254 | 0.286 | 0.085 | 0.403 | 0.140 | 0.726 | 0.266 | 245 |
| SoTA task-specific model | BERT | 0.749 | 0.970 | 0.726 | 0.830 | 0.368 | 0.877 | 0 | 0.803 | 0.831 | 0.876 | 0.853 | 0.742 | 0.668 | 0 |
| SoTA task-specific model | DeBERTaV3 | 0.504 | 0.310 | 0.769 | 0.441 | 0.839 | 0.413 | 0 | 0.501 | 0.345 | 0.826 | 0.487 | 0.841 | 0.370 | 0 |
| eCeLLM (Task-specific) | Flan-T5 XXL | 0.749 | 0.875 | 0.763 | 0.815 | 0.531 | 0.712 | 0 | 0.799 | 0.889 | 0.830 | 0.859 | 0.602 | 0.713 | 0 |
| eCeLLM (Task-specific) | Llama-2 13B-chat | 0.801 | 0.919 | 0.797 | 0.854 | 0.597 | 0.811 | 0 | 0.838 | 0.924 | 0.852 | 0.887 | 0.650 | 0.797 | 0 |
| eCeLLM (Task-specific) | Llama-2 7B-chat | 0.741 | 0.889 | 0.749 | 0.813 | 0.485 | 0.718 | 0 | 0.761 | 0.864 | 0.802 | 0.832 | 0.535 | 0.644 | 0 |
| eCeLLM (Task-specific) | Mistral-7B Instruct-v0.2 | 0.821 | 0.896 | 0.834 | 0.864 | 0.692 | 0.794 | 0 | 0.835 | 0.891 | 0.872 | 0.881 | 0.713 | 0.749 | 0 |
| eCeLLM (Task-specific) | Flan-T5 XL | 0.693 | 0.684 | 0.802 | 0.738 | 0.708 | 0.565 | 0 | 0.707 | 0.682 | 0.862 | 0.762 | 0.761 | 0.523 | 0 |
| eCeLLM (Task-specific) | Phi-2 | 0.765 | 0.942 | 0.751 | 0.835 | 0.460 | 0.820 | 0 | 0.781 | 0.950 | 0.779 | 0.856 | 0.411 | 0.791 | 0 |
| eCeLLM (Generalist) | Flan-T5 XXL | 0.774 | 0.910 | 0.773 | 0.836 | 0.540 | 0.776 | 0 | 0.814 | 0.914 | 0.832 | 0.871 | 0.596 | 0.760 | 0 |
| eCeLLM (Generalist) | Llama-2 13B-chat | 0.808 | 0.864 | 0.838 | 0.851 | 0.711 | 0.752 | 0 | 0.818 | 0.862 | 0.872 | 0.867 | 0.723 | 0.705 | 0 |
| eCeLLM (Generalist) | Llama-2 7B-chat | 0.817 | 0.921 | 0.814 | 0.864 | 0.638 | 0.824 | 0 | 0.836 | 0.931 | 0.845 | 0.886 | 0.627 | 0.807 | 0 |
| eCeLLM (Generalist) | Mistral-7B Instruct-v0.2 | 0.797 | 0.880 | 0.814 | 0.846 | 0.654 | 0.759 | 0 | 0.832 | 0.885 | 0.872 | 0.878 | 0.717 | 0.740 | 0 |
| eCeLLM (Generalist) | Flan-T5 XL | 0.765 | 0.888 | 0.774 | 0.827 | 0.553 | 0.741 | 0 | 0.819 | 0.891 | 0.852 | 0.871 | 0.662 | 0.735 | 0 |
| eCeLLM (Generalist) | Phi-2 | 0.794 | 0.897 | 0.801 | 0.846 | 0.616 | 0.777 | 0 | 0.823 | 0.937 | 0.828 | 0.879 | 0.573 | 0.807 | 0 |
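As a reading aid, the sketch below shows how the per-column metrics are conventionally computed from binary predictions. It assumes the standard textbook definitions; in particular, it treats NPR as the negative-class analogue of precision (negative predictive value), which is an assumption rather than something stated in the table, and it does not reproduce the #failed column. The function name and label encoding are illustrative.

```python
# Minimal sketch of the binary-classification metrics reported above,
# assuming standard definitions and 1 = positive, 0 = negative labels.
from typing import Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> dict:
    """Compute accuracy, recall, precision, F1, specificity, and NPR (assumed
    to mean negative predictive value) from binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    total = tp + fp + tn + fn

    def safe_div(num: float, den: float) -> float:
        # Avoid division by zero for degenerate confusion matrices.
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)  # a.k.a. sensitivity / true-positive rate
    return {
        "accuracy": safe_div(tp + tn, total),
        "recall": recall,
        "precision": precision,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "specificity": safe_div(tn, tn + fp),  # true-negative rate
        "npr": safe_div(tn, tn + fn),  # assumption: negative predictive value
    }


if __name__ == "__main__":
    # Toy example: six labeled items and one model's predictions.
    y_true = [1, 1, 1, 0, 0, 0]
    y_pred = [1, 1, 0, 0, 1, 0]
    print(classification_metrics(y_true, y_pred))
```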