CASLIE

Captions Speak Louder than Images:

Generalizing Foundation Models for E-commerce from High-quality Multimodal Instruction Data

The Ohio State University

Abstract

Leveraging multimodal data to drive breakthroughs in e-commerce applications through Multimodal Foundation Models (MFMs) is gaining increasing attention from the research community. However, two significant challenges hinder the optimal use of multimodal e-commerce data by foundation models: (1) the scarcity of large-scale, high-quality multimodal benchmark datasets; and (2) the lack of effective multimodal information integration methods. To address these challenges, in this paper, we introduce MMECInstruct, the first-ever large-scale, high-quality multimodal instruction dataset for developing and evaluating foundation models for e-commerce. We also develop CASLIE, a simple, lightweight, yet effective framework for integrating multimodal information for e-commerce. Leveraging MMECInstruct, we fine-tune a series of e-commerce MFMs within CASLIE, denoted as CASLIE models. Our comprehensive evaluation demonstrates that CASLIE models substantially outperform five categories of advanced baseline models in the in-domain evaluation. Moreover, CASLIE models show strong generalizability to out-of-domain settings.
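As the title suggests, CASLIE integrates multimodal information by turning images into text: image captions are merged with the item's textual context before a text-only model consumes them. The sketch below illustrates this caption-then-fuse idea in minimal form; the function names, the optional caption-quality filter, and the prompt layout are illustrative assumptions, not the framework's actual interface.

```python
def fuse_modalities(item_text, image_captions, caption_is_useful=None):
    """Build a single text prompt from item text plus image captions.

    caption_is_useful: optional predicate that drops low-quality captions,
    a stand-in for a caption-quality-evaluation step (hypothetical).
    """
    captions = list(image_captions)
    if caption_is_useful is not None:
        captions = [c for c in captions if caption_is_useful(c)]
    if not captions:
        # Fall back to text-only input when no usable caption remains.
        return item_text
    return item_text + "\nImage description: " + " ".join(captions)
```

The resulting string can be fed to any text-only LLM, which is what makes this style of integration lightweight: no architectural changes to the language model are needed.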

Overview

[Figures: Overview; Dataset Overview]

Table: Performance comparison on in-domain (IND) and out-of-domain (OOD) tasks.

Model IND OOD
AP CC PRP PSI MPC SA SR AP CC PRP SA SR
F1 R@1 M-F1 F1 Acc M-F1 R@1 F1 R@1 M-F1 M-F1 R@1
FashionCLIP 0.759 0.863 0.497 0.201 0.605 0.323 0.145 0.600 0.903 0.453 0.376 0.087
Llama-2-13B 0.866 0.969 0.468 0.235 0.700 0.628 0.184 0.831 0.959 0.523 0.595 0.285
Mistral-7B-v0 0.876 0.971 0.533 0.312 0.725 0.617 0.218 0.847 0.965 0.530 0.659 0.312
Llama-3 0.866 0.951 0.493 0.270 0.699 0.565 0.191 0.838 0.962 0.511 0.614 0.305
eCeLLM-L 0.872 0.870 0.519 0.178 0.706 0.613 0.188 0.860 0.916 0.531 0.584 0.304
eCeLLM-M 0.864 0.890 0.492 0.131 0.719 0.632 0.182 0.841 0.942 0.564 0.624 0.302
LLaVA-NeXT-Interleave 0.791 0.964 0.568 0.340 0.721 0.561 0.053 0.579 0.043 0.334 0.206 0.000
SoTA 0.868 0.671 0.531 0.316 0.702 0.495 0.163 0.849 0.658 0.447 0.510 0.210
CASLIE-L 0.868 0.969 0.473 0.268 0.706 0.651 0.190 0.840 0.968 0.531 0.607 0.297
CASLIE-M 0.891 0.979 0.566 0.398 0.731 0.656 0.223 0.855 0.977 0.585 0.625 0.330
CASLIE-S 0.871 0.963 0.504 0.336 0.707 0.601 0.196 0.857 0.959 0.580 0.647 0.297
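The table reports F1, macro-F1 (M-F1), accuracy (Acc), and Recall@1 (R@1), depending on the task. For reference, the following minimal sketch shows how macro-F1 and Recall@1 are conventionally computed; the label and ranking formats here are illustrative, not taken from the paper's evaluation code.

```python
from collections import defaultdict

def macro_f1(y_true, y_pred):
    """Macro-averaged F1: per-class F1 scores averaged with equal class weight."""
    classes = set(y_true) | set(y_pred)
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    f1s = []
    for c in classes:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

def recall_at_1(gold, ranked_lists):
    """Fraction of queries whose top-ranked candidate is the gold item."""
    hits = sum(1 for g, ranked in zip(gold, ranked_lists)
               if ranked and ranked[0] == g)
    return hits / len(gold)
```

Macro-F1 weights every class equally, so it penalizes models that do well only on frequent classes; Recall@1 is the natural metric for ranking-style tasks such as sequential recommendation (SR).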

BibTeX

@article{ling2024captions,
  title={Captions Speak Louder than Images (CASLIE): Generalizing Foundation Models for E-commerce from High-quality Multimodal Instruction Data},
  author={Ling, Xinyi and Peng, Bo and Du, Hanwen and Zhu, Zhihui and Ning, Xia},
  journal={arXiv preprint arXiv:2410.17337},
  year={2024}
}