
Multi-Modal Experience Inspired AI Creation


Paper | Data | GitHub

This repository contains the source code and datasets for the ACM MM 2022 paper Multi-Modal Experience Inspired AI Creation by Cao et al.

Abstract

AI creation, such as poem or lyrics generation, has attracted increasing attention from both industry and academic communities, with many promising models proposed in the past few years. Existing methods usually estimate the outputs based on single and independent visual or textual information. However, in reality, humans usually make creations according to their experiences, which may involve different modalities and be sequentially correlated. To model such human capabilities, in this paper, we define and solve a novel AI creation problem based on human experiences.

More specifically, we study how to generate texts based on sequential multi-modal information. Compared with the previous works, this task is much more difficult because the designed model has to well understand and adapt the semantics among different modalities and effectively convert them into the output in a sequential manner. To alleviate these difficulties, we firstly design a multi-channel sequence-to-sequence architecture equipped with a multi-modal attention network. For more effective optimization, we then propose a curriculum negative sampling strategy tailored for the sequential inputs. To benchmark this problem and demonstrate the effectiveness of our model, we manually labeled a new multi-modal experience dataset. With this dataset, we conduct extensive experiments by comparing our model with a series of representative baselines, where we can demonstrate significant improvements in our model based on both automatic and human-centered metrics.

Before You Start

  • Please note that this work targets AI creation in Chinese, so the dataset and model checkpoints below are all in Chinese. However, we have also trained our model on English data, constructed from English poems with the same pipeline, and obtained generation results of comparable quality. You can construct your own English data (based on English corpora such as poems and English text-image datasets such as MovieNet) and adapt it to your domain if necessary.
  • Some parts of our work are based on the large-scale Chinese multimodal pre-trained model WenLan (a.k.a. BriVL). Please refer to that repo for usage details. For an English version, you can replace WenLan with OpenAI CLIP or another multimodal representation model (see our paper for details); a minimal sketch is given below.
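
For reference, here is a minimal sketch of extracting joint text/image embeddings with OpenAI CLIP via the Hugging Face transformers library. The model name, text, and image path are illustrative; this is not part of our released pipeline:

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative only: a CLIP-based replacement for WenLan when building English data.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder image path
inputs = processor(text=["a line of an English poem"], images=image,
                   return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

text_emb = outputs.text_embeds    # (1, 512) projected text embedding
image_emb = outputs.image_embeds  # (1, 512) projected image embedding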

Setup

Clone the repository and create a new conda environment:

$ git clone https://github.com/Aman-4-Real/MMTG.git
$ cd MMTG/
$ conda create -n mmtg python=3.7
$ conda activate mmtg

Install the Python packages. Change the cudatoolkit version according to your environment if necessary.

$ conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 cudatoolkit=11.3 -c pytorch -c conda-forge
$ pip install -r requirements.txt

Download

The following resources can be downloaded from Hugging Face, Google Drive, or Baidu Netdisk (extraction code: 0dwq):

File name | Description | Path
*_data_*.pkl | Train, validation, and test data. | sharing_link/data/
mmtg_ckpt.pth | The MMTG checkpoint for reproducing our results. | sharing_link/ckpts/
GPT2_lyrics_ckpt_epoch00.ckpt | The pre-trained decoder checkpoint, based on GPT-2 and fine-tuned on our lyrics corpus. | sharing_link/ckpts/
token_id2emb_dict.pkl | A dict mapping each token in the vocabulary to its WenLan embedding. | sharing_link/
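
If the files are hosted on Hugging Face, they can also be fetched programmatically. The repo id and in-repo file paths below are placeholders; replace them with the actual ones from the sharing links above:

from huggingface_hub import hf_hub_download

# Placeholders: replace with the actual repo id and file paths from the sharing links.
REPO_ID = "user/MMTG"

ckpt_path = hf_hub_download(repo_id=REPO_ID, filename="ckpts/mmtg_ckpt.pth")
dict_path = hf_hub_download(repo_id=REPO_ID, filename="token_id2emb_dict.pkl")
print(ckpt_path, dict_path)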

Data

The dataset used in our paper is released as follows. Due to copyright issues, we only release the visual features of the images used in our dataset. Each .pkl file is a Python list, and each item of the list is a dict in the following format:

{
  'topic': STRING,     # the topic words
  'topic_emb': LIST,   # embeddings of the topic words
  'lyrics': LIST,      # list of lyrics sentences
  'img_0_emb': LIST,   # embedding of the 1st image
  'r_0': STRING,       # the 1st text
  'r_0_emb': LIST,     # embedding of the 1st text
  'img_1_emb': LIST,   # embedding of the 2nd image
  'r_1': STRING,       # the 2nd text
  'r_1_emb': LIST,     # embedding of the 2nd text
  ...,
  'img_4_emb': LIST,   # embedding of the 5th image
  'r_4': STRING,       # the 5th text
  'r_4_emb': LIST,     # embedding of the 5th text
  'rating': INT        # the sample-level rating, from 1 (least positive) to 5 (most positive)
}
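
As a quick sanity check, the files can be loaded with Python's pickle module. The path below assumes the layout described in the Usage section:

import pickle

# Load one of the released data files and inspect a sample.
with open("./data/test_data.pkl", "rb") as f:
    samples = pickle.load(f)  # a list of dicts in the format above

sample = samples[0]
print(sample["topic"], sample["rating"])
print(len(sample["lyrics"]), "lyrics sentences")
print(len(sample["img_0_emb"]), "dimensions in the first image embedding")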

For the test data, there are additional keys:

{
  'score_0': {
    'img_rel': [2, 2],  # relevance of the 1st image to the 1st & 2nd lyrics sentences (1 to 5)
    'r_rel': [1, 1],    # relevance of the 1st text to the 1st & 2nd lyrics sentences (1 to 5)
    'cmp_rel': [0, 0]   # whether the image or the text is more relevant to the lyrics: 0 means the image, 2 means the text, 1 means a tie
  },  # each list above is [rater1_score, rater2_score]
  ...,
  'score_4': ...
}

You can use this additional labeled information to analyze model behavior (e.g., attention weights) and results.
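
For example, here is a rough sketch of averaging the two annotators' scores for the first experience step of a test sample, following the format above:

import pickle

with open("./data/test_data.pkl", "rb") as f:
    test_samples = pickle.load(f)

# Average the two annotators' relevance scores for step 0 of the first sample.
score = test_samples[0]["score_0"]
img_rel = sum(score["img_rel"]) / len(score["img_rel"])  # image vs. lyrics relevance
r_rel = sum(score["r_rel"]) / len(score["r_rel"])        # text vs. lyrics relevance
print(f"image relevance: {img_rel:.1f}, text relevance: {r_rel:.1f}")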

Checkpoints

mmtg_ckpt.pth: The MMTG checkpoint for reproducing our results. It is trained on the released dataset. You can load it to generate on your own data or to run the demo.

GPT2_lyrics_ckpt_epoch00.ckpt: The pre-trained decoder checkpoint. As mentioned in our paper, we initialize our decoder with a pre-trained GPT-2 and fine-tune it on our lyrics corpus (phase 1). The full training (phase 2) then starts from this fine-tuned checkpoint.
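
A minimal sketch of inspecting the released MMTG checkpoint with PyTorch. The local path is illustrative, and the stored keys depend on how the checkpoint was saved:

import torch

# Load the checkpoint on CPU and peek at its contents (path is illustrative).
state = torch.load("./ckpts/mmtg_ckpt.pth", map_location="cpu")
if isinstance(state, dict):
    print(list(state.keys())[:10])  # parameter names, or wrapper keys such as 'state_dict'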

Other

token_id2emb_dict.pkl: A dict mapping each token in the vocabulary to its WenLan embedding. It is used to convert token ids to the corresponding embeddings in phases 1 and 2, which adapts the text embedding space to the image embedding space. You can also replace WenLan with another pre-trained multimodal representation model (such as OpenAI CLIP) to construct an English version.
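
A sketch of turning this dict into an embedding matrix; it assumes contiguous integer token ids, so check the code in ./src/ for the exact usage:

import pickle
import torch

# Build an embedding matrix from the token-id-to-WenLan-embedding dict.
with open("./src/vocab/token_id2emb_dict.pkl", "rb") as f:
    token_id2emb = pickle.load(f)

vocab_size = max(token_id2emb.keys()) + 1
emb_dim = len(next(iter(token_id2emb.values())))
weight = torch.zeros(vocab_size, emb_dim)
for token_id, emb in token_id2emb.items():
    weight[token_id] = torch.tensor(emb)

embedding = torch.nn.Embedding.from_pretrained(weight, freeze=True)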

Usage

  1. Download the data files, pre-trained GPT2 checkpoint, and token_id2emb_dict.pkl.
  2. Put them in ./data/, ./src/pretrained/, and ./src/vocab/, respectively, and change the paths in ./src/configs.py accordingly (a quick check is sketched below).
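
A quick check that the files are where the code expects them; the pretrained checkpoint path may differ if you changed it in ./src/configs.py:

import os

# Verify the downloaded files are in the expected locations.
expected = [
    "./data/test_data.pkl",
    "./src/pretrained/GPT2_lyrics_ckpt_epoch00.ckpt",
    "./src/vocab/token_id2emb_dict.pkl",
]
for path in expected:
    print(path, "OK" if os.path.exists(path) else "MISSING")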

Training

Change your configs and run:

$ cd src/
$ bash train.sh

Generate

Change your configs and run:

$ cd src/
$ bash generate.sh

This will generate results on the test data and save them to your save_samples_path. You can also use the released checkpoint to generate on your own data; the data format is the same as the test data (without the scores and ratings). Refer to ./data/test_data.pkl for details.
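
If you want to generate on your own inputs, the following sketch builds a single sample in that format. The embedding dimension and placeholder values are assumptions; real inputs are Chinese and the embeddings should come from WenLan (or CLIP for English), so check ./data/test_data.pkl for the exact keys and sizes:

import pickle

# Sketch: one input sample in the test-data format, without scores and ratings.
EMB_DIM = 2048  # assumption: match the dimension of the released embeddings

sample = {
    "topic": "topic words here",
    "topic_emb": [0.0] * EMB_DIM,
    "lyrics": [],  # may be left empty at generation time (check the repo code)
}
for i in range(5):
    sample[f"img_{i}_emb"] = [0.0] * EMB_DIM
    sample[f"r_{i}"] = f"text of experience step {i + 1}"
    sample[f"r_{i}_emb"] = [0.0] * EMB_DIM

with open("./data/my_data.pkl", "wb") as f:
    pickle.dump([sample], f)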

Citation

If you find this paper and repo useful, please cite us in your work:

@inproceedings{10.1145/3503161.3548189,
  author = {Cao, Qian and Chen, Xu and Song, Ruihua and Jiang, Hao and Yang, Guang and Cao, Zhao},
  title = {Multi-Modal Experience Inspired AI Creation},
  year = {2022},
  isbn = {9781450392037},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3503161.3548189},
  doi = {10.1145/3503161.3548189},
  booktitle = {Proceedings of the 30th ACM International Conference on Multimedia},
  pages = {1445–1454},
  numpages = {10},
  keywords = {AI creation, multi-modal, experience},
  location = {Lisboa, Portugal},
  series = {MM '22}
}

For any questions, please feel free to reach me at [email protected].
