arXiv:2003.13198

InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining

Published on Mar 30, 2020

Abstract

Multi-modal pretraining for learning high-level multi-modal representations is a further step towards deep learning and artificial intelligence. In this work, we propose a novel model, InterBERT (BERT for Interaction), the first model in our series of multi-modal pretraining methods, M6 (MultiModality-to-MultiModality Multitask Mega-transformer). The model has a strong capability for modeling interaction between the information flows of different modalities. Its single-stream interaction module effectively processes information from multiple modalities, while the two-stream module on top preserves the independence of each modality to avoid performance degradation on single-modal tasks. We pretrain the model with three pretraining tasks: masked segment modeling (MSM), masked region modeling (MRM), and image-text matching (ITM), and finetune it on a series of vision-and-language downstream tasks. Experimental results demonstrate that InterBERT outperforms a series of strong baselines, including the most recent multi-modal pretraining methods, and our analysis shows that MSM and MRM are effective for pretraining and that our method achieves performance comparable to BERT on single-modal tasks. In addition, we propose a large-scale dataset for multi-modal pretraining in Chinese and develop Chinese InterBERT, the first Chinese multi-modal pretrained model. We pretrain Chinese InterBERT on our proposed dataset of 3.1M image-text pairs from mobile Taobao, the largest Chinese e-commerce platform. We finetune the model for text-based image retrieval, and we have recently deployed it online for topic-based recommendation.
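
The abstract describes the architecture concretely enough for a toy rendering: a single-stream interaction module over concatenated text and image tokens, followed by a two-stream module that refines each modality independently, plus an image-text matching (ITM) head. Below is a minimal PyTorch sketch of that layout, assuming standard transformer encoder layers. The class name InterBERTSketch, the layer counts, the hidden size, and the mean-pooled ITM head are all illustrative assumptions, not the paper's actual configuration; the MSM and MRM objectives would add per-token prediction heads on the two stream outputs.

import torch
import torch.nn as nn

class InterBERTSketch(nn.Module):
    """Sketch of the layout from the abstract: single-stream interaction,
    then a two-stream module on top. Sizes and depths are hypothetical."""

    def __init__(self, hidden=768, heads=12,
                 interaction_layers=6, stream_layers=3):
        super().__init__()

        def block(n_layers):
            layer = nn.TransformerEncoderLayer(
                d_model=hidden, nhead=heads, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=n_layers)

        # Single-stream interaction module: text tokens and image regions
        # attend to each other within one shared transformer, modeling
        # cross-modal interaction.
        self.interaction = block(interaction_layers)
        # Two-stream module on top: each modality is refined independently,
        # which the abstract credits with preserving single-modal performance.
        self.text_stream = block(stream_layers)
        self.image_stream = block(stream_layers)
        # ITM head over pooled representations (the pooling choice here is
        # an assumption, not taken from the paper).
        self.itm_head = nn.Linear(2 * hidden, 2)

    def forward(self, text_emb, image_emb):
        # text_emb: (batch, num_tokens, hidden), already-embedded text
        # image_emb: (batch, num_regions, hidden), projected region features
        fused = self.interaction(torch.cat([text_emb, image_emb], dim=1))
        t = text_emb.size(1)
        text_out = self.text_stream(fused[:, :t])
        image_out = self.image_stream(fused[:, t:])
        # ITM logits from mean-pooled streams (illustrative pooling).
        pooled = torch.cat(
            [text_out.mean(dim=1), image_out.mean(dim=1)], dim=-1)
        return text_out, image_out, self.itm_head(pooled)

# Example usage with random embeddings: 16 text tokens, 36 image regions.
model = InterBERTSketch()
text = torch.randn(2, 16, 768)
image = torch.randn(2, 36, 768)
text_out, image_out, itm_logits = model(text, image)

The design point this sketch illustrates is the split the abstract emphasizes: cross-modal attention happens only in the shared interaction block, so each top stream still produces a purely single-modality representation that can be used on single-modal tasks.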
