Model | AVE (F1*) | PRP (Macro F1) | PM (F1) | SA (Macro F1) | SR (HR@1) | MPC (Accuracy) | PSI (F1) | QPR (NDCG) | AP (F1) | AG (F_BERT) |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
GPT-4 Turbo | 0.495 | 0.326 | 0.753 | 0.516 | 0.387 | 0.611 | 0.195 | 0.875 | 0.649 | 0.858 |
Gemini Pro | 0.396 | 0.136 | 0.867 | 0.470 | 0.269 | 0.584 | 0.248 | 0.821 | 0.506 | 0.855 |
Claude 2.1 | 0.381 | 0.275 | 0.523 | 0.415 | 0.066 | 0.655 | 0.273 | 0.821 | 0.280 | 0.841 |
Llama-2 13B-chat | 0.002 | 0.333 | 0.434 | 0.188 | 0.056 | 0.504 | 0.252 | 0.815 | 0.623 | 0.811 |
Mistral-7B Instruct-v0.2 | 0.369 | 0.324 | 0.613 | 0.470 | 0.164 | 0.529 | 0.305 | 0.842 | 0.588 | 0.853 |
EcomGPT | 0.000 | 0.091 | 0.648 | 0.188 | 0.042 | 0.540 | 0.170 | 0.000 | 0.086 | 0.669 |
SoTA task-specific model | 0.546 | 0.588 | 0.995 | 0.573 | 0.265 | 0.703 | 0.389 | 0.859 | 0.830 | 0.858 |
eCeLLM-L | 0.582 | 0.611 | 0.995 | 0.648 | 0.526 | 0.684 | 0.501 | 0.870 | 0.851 | 0.841 |
eCeLLM-M | 0.662 | 0.558 | 0.995 | 0.639 | 0.542 | 0.696 | 0.305 | 0.876 | 0.846 | 0.842 |
eCeLLM-S | 0.509 | 0.518 | 0.991 | 0.596 | 0.479 | 0.650 | 0.392 | 0.870 | 0.846 | 0.842 |
Improvement of best eCeLLM over best baseline (%, avg: 10.7) | 21.2 | 3.9 | 0.0 | 13.1 | 40.1 | -1.0 | 28.8 | 0.1 | 2.5 | -1.9 |
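The improvement row can be reproduced from the table itself. A minimal sketch, assuming each entry is the relative gain (in percent) of the best eCeLLM variant over the best baseline on that task, with the SoTA task-specific model counted among the baselines:

```python
# Per-task scores copied from the table above.
tasks = ["AVE", "PRP", "PM", "SA", "SR", "MPC", "PSI", "QPR", "AP", "AG"]

# Baseline rows: GPT-4 Turbo, Gemini Pro, Claude 2.1, Llama-2 13B-chat,
# Mistral-7B Instruct-v0.2, EcomGPT, SoTA task-specific model.
baselines = {
    "AVE": [0.495, 0.396, 0.381, 0.002, 0.369, 0.000, 0.546],
    "PRP": [0.326, 0.136, 0.275, 0.333, 0.324, 0.091, 0.588],
    "PM":  [0.753, 0.867, 0.523, 0.434, 0.613, 0.648, 0.995],
    "SA":  [0.516, 0.470, 0.415, 0.188, 0.470, 0.188, 0.573],
    "SR":  [0.387, 0.269, 0.066, 0.056, 0.164, 0.042, 0.265],
    "MPC": [0.611, 0.584, 0.655, 0.504, 0.529, 0.540, 0.703],
    "PSI": [0.195, 0.248, 0.273, 0.252, 0.305, 0.170, 0.389],
    "QPR": [0.875, 0.821, 0.821, 0.815, 0.842, 0.000, 0.859],
    "AP":  [0.649, 0.506, 0.280, 0.623, 0.588, 0.086, 0.830],
    "AG":  [0.858, 0.855, 0.841, 0.811, 0.853, 0.669, 0.858],
}

# eCeLLM rows: eCeLLM-L, eCeLLM-M, eCeLLM-S.
ecellm = {
    "AVE": [0.582, 0.662, 0.509],
    "PRP": [0.611, 0.558, 0.518],
    "PM":  [0.995, 0.995, 0.991],
    "SA":  [0.648, 0.639, 0.596],
    "SR":  [0.526, 0.542, 0.479],
    "MPC": [0.684, 0.696, 0.650],
    "PSI": [0.501, 0.305, 0.392],
    "QPR": [0.870, 0.876, 0.870],
    "AP":  [0.851, 0.846, 0.846],
    "AG":  [0.841, 0.842, 0.842],
}

# Relative gain of the best eCeLLM variant over the best baseline, in percent.
improvements = {
    t: round(100 * (max(ecellm[t]) - max(baselines[t])) / max(baselines[t]), 1)
    for t in tasks
}
avg = round(sum(improvements.values()) / len(improvements), 1)
```

Under this reading, the computed values match the printed row exactly (e.g. AVE: (0.662 − 0.546)/0.546 → 21.2%; SR: (0.542 − 0.387)/0.387 → 40.1%), and the average comes out to 10.7%.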