Title: e-CLIP: Large-Scale Vision-Language Representation Learning in E-commerce
Rating: 4/5
Summary
The article proposes a way to train a computer to understand the language and visual properties of products in e-commerce. Contrastive learning is applied to raw product text and images to train large-scale models. The authors show methods for overcoming domain-specific (e-commerce) challenges. The pre-trained model outperforms other methods on several tasks, both unimodal and multimodal.
Strengths
1. The authors exploited real-world data from Naver Shopping, which includes numerous pictures and text uploaded by many different sellers. Therefore, it is likely that the dataset includes some noisy data. Nonetheless, the use of this vast and realistic dataset enhances the credibility of the research’s results by reflecting the real-world conditions of e-commerce.
2. While CLIP forces the diagonal entries of the label matrix to 1 and all other entries to 0, e-CLIP accounts for the same product appearing multiple times in the dataset and checks for duplicate products within the batch by catalog ID. It is interesting to see how CLIP was modified into e-CLIP to reflect the properties of e-commerce data.
3. Section 3.3.3 is interesting because the paper suggests practical solutions, such as multi-stream accumulation and a batch size scheduler, to mitigate GPU constraints when processing real-world data at scale, even though Naver Shopping presumably has abundant GPU resources.
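The label-matrix modification in point 2 can be illustrated with a small sketch. This is my reading of the idea, not the paper's implementation: where CLIP uses an identity target matrix, a duplicate-aware variant marks every pair of in-batch items sharing a catalog ID as a positive and normalizes each row into a target distribution. The function names and the row normalization are my own assumptions.

```python
def clip_targets(n):
    # Plain CLIP: identity matrix — each image matches only its own text.
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def duplicate_aware_targets(catalog_ids):
    # Duplicate-aware variant (my interpretation of e-CLIP's idea):
    # any two batch entries with the same catalog ID are positives.
    # Each row is normalized to sum to 1 so it can serve as a soft
    # target distribution for a cross-entropy-style contrastive loss.
    n = len(catalog_ids)
    raw = [[1.0 if catalog_ids[i] == catalog_ids[j] else 0.0
            for j in range(n)] for i in range(n)]
    return [[v / sum(row) for v in row] for row in raw]

# Example: items 0 and 2 are the same product (catalog "A").
targets = duplicate_aware_targets(["A", "B", "A"])
# Row 0 spreads its probability mass over both copies of product "A".
```

With the identity targets, the duplicate at index 2 would wrongly be treated as a negative for index 0; the soft targets avoid penalizing the model for matching duplicates.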
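The paper's multi-stream accumulation and batch size scheduler are not detailed in this review, but the generic arithmetic behind gradient accumulation, which these techniques build on, is simple: run several small forward/backward passes and take one optimizer step, so memory usage stays at the micro-batch size while the effective batch size is much larger. The function and parameter names below are hypothetical.

```python
def accumulation_steps(target_batch, micro_batch):
    # Number of micro-batches to accumulate before one optimizer step,
    # so that micro_batch * accumulation_steps >= target_batch while
    # per-step memory stays bounded by micro_batch.
    return max(1, -(-target_batch // micro_batch))  # ceiling division
```

For instance, reaching an effective batch of 1024 with micro-batches of 256 requires accumulating gradients over 4 passes per optimizer step.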
Weaknesses
1. The e-CLIP model’s large size and long training time can make the results hard to reproduce, as they require significant computational resources.
2. Although the e-CLIP model was trained on large datasets, it depends on product images and text data from NAVER. This implies that the model’s performance may not be assured on other e-commerce services.
3. This paper presents how well the e-CLIP model performs on different tasks. However, some tasks lacked sufficient evaluation metrics, which made it difficult to judge how well they were done. For example, in Task 2, product clustering was evaluated experimentally, but the metrics used were insufficient. Therefore, additional evaluation metrics are needed for these tasks.
Questions
1. The ViT architecture used here is commonly seen in other papers, and I would like to read more papers related to ViT.
2. How many resources does the e-CLIP framework save in NAVER Shopping?
3. How should the e-CLIP model be updated as large numbers of new product images and texts are uploaded almost continuously?
Discussions: After reading the e-CLIP paper, I was impressed with how it integrates image and text representations and achieves strong performance on different downstream tasks related to e-commerce. I gained useful insights about pre-processing techniques, and I appreciate the authors’ effort to share their know-how.