Leveraging multimodal data to drive breakthroughs in e-commerce applications through Multimodal Foundation Models (MFMs) is gaining increasing attention from the research community. However, significant challenges hinder the optimal use of multimodal e-commerce data by foundation models: (1) the scarcity of large-scale, high-quality multimodal benchmark datasets; and (2) the lack of effective multimodal information integration methods. To address these challenges, in this paper, we introduce MMECInstruct, the first-ever, large-scale, and high-quality multimodal instruction dataset for developing and evaluating foundation models for e-commerce. We also develop CASLIE, a simple, lightweight, yet effective framework for integrating multimodal information for e-commerce. Leveraging MMECInstruct, we fine-tune a series of e-commerce MFMs within CASLIE, denoted as CASLIE models. Our comprehensive evaluation demonstrates that CASLIE models substantially outperform 5 categories of advanced baseline models in the in-domain evaluation. Moreover, CASLIE models show strong generalizability to out-of-domain settings.
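As a rough illustration of the caption-then-fuse idea that the CASLIE name points to ("Captions Speak Louder than Images"), the sketch below captions a product image, applies a simple caption-usefulness check, and feeds the fused text to an instruction-tuned LLM. The model identifiers, the heuristic quality check, and the prompt layout are placeholder assumptions for illustration only, not the released CASLIE pipeline or checkpoints.

```python
# Minimal sketch of a caption-then-fuse pipeline, assuming off-the-shelf
# Hugging Face models as placeholders (not the CASLIE release).
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
llm = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")

def answer_ecommerce_query(image_path: str, item_text: str, instruction: str) -> str:
    # 1. Convert the product image into a textual caption.
    caption = captioner(image_path)[0]["generated_text"]

    # 2. Hypothetical caption-usefulness check: keep the caption only if it adds
    #    information beyond the item text (CASLIE itself uses a learned evaluator).
    use_caption = caption.lower() not in item_text.lower()

    # 3. Fuse the modalities as plain text and query the instruction-tuned LLM.
    context = f"{item_text}\nImage caption: {caption}" if use_caption else item_text
    prompt = f"{instruction}\n\n{context}\n\nAnswer:"
    return llm(prompt, max_new_tokens=64)[0]["generated_text"]
```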
In-domain (IND) and out-of-domain (OOD) evaluation results (the metric for each task is shown in parentheses; higher is better):

| Model | IND: AP (F1) | IND: CC (R@1) | IND: PRP (M-F1) | IND: PSI (F1) | IND: MPC (Acc) | IND: SA (M-F1) | IND: SR (R@1) | OOD: AP (F1) | OOD: CC (R@1) | OOD: PRP (M-F1) | OOD: SA (M-F1) | OOD: SR (R@1) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FashionCLIP | 0.759 | 0.863 | 0.497 | 0.201 | 0.605 | 0.323 | 0.145 | 0.600 | 0.903 | 0.453 | 0.376 | 0.087 |
| Llama-2-13B | 0.866 | 0.969 | 0.468 | 0.235 | 0.700 | 0.628 | 0.184 | 0.831 | 0.959 | 0.523 | 0.595 | 0.285 |
| Mistral-7B-v0 | 0.876 | 0.971 | 0.533 | 0.312 | 0.725 | 0.617 | 0.218 | 0.847 | 0.965 | 0.530 | 0.659 | 0.312 |
| Llama-3 | 0.866 | 0.951 | 0.493 | 0.270 | 0.699 | 0.565 | 0.191 | 0.838 | 0.962 | 0.511 | 0.614 | 0.305 |
| eCeLLM-L | 0.872 | 0.870 | 0.519 | 0.178 | 0.706 | 0.613 | 0.188 | 0.860 | 0.916 | 0.531 | 0.584 | 0.304 |
| eCeLLM-M | 0.864 | 0.890 | 0.492 | 0.131 | 0.719 | 0.632 | 0.182 | 0.841 | 0.942 | 0.564 | 0.624 | 0.302 |
| LLaVA-NExT-Interleave | 0.791 | 0.964 | 0.568 | 0.340 | 0.721 | 0.561 | 0.053 | 0.579 | 0.043 | 0.334 | 0.206 | 0.000 |
| SoTA | 0.868 | 0.671 | 0.531 | 0.316 | 0.702 | 0.495 | 0.163 | 0.849 | 0.658 | 0.447 | 0.510 | 0.210 |
| CASLIE-L | 0.868 | 0.969 | 0.473 | 0.268 | 0.706 | 0.651 | 0.190 | 0.840 | 0.968 | 0.531 | 0.607 | 0.297 |
| CASLIE-M | 0.891 | 0.979 | 0.566 | 0.398 | 0.731 | 0.656 | 0.223 | 0.855 | 0.977 | 0.585 | 0.625 | 0.330 |
| CASLIE-S | 0.871 | 0.963 | 0.504 | 0.336 | 0.707 | 0.601 | 0.196 | 0.857 | 0.959 | 0.580 | 0.647 | 0.297 |
@article{ling2024captions,
title={Captions Speak Louder than Images (CASLIE): Generalizing Foundation Models for E-commerce from High-quality Multimodal Instruction Data},
author={Ling, Xinyi and Peng, Bo and Du, Hanwen and Zhu, Zhihui and Ning, Xia},
journal={arXiv preprint arXiv:2410.17337},
year={2024}
}