# Abstract
AI creation, such as poem or lyrics generation, has attracted increasing attention from both industry and academia, with many promising models proposed in the past few years. Existing methods usually generate outputs from a single, independent piece of visual or textual information. In reality, however, humans create according to their experiences, which may involve different modalities and are sequentially correlated. To model such human capabilities, in this paper we define and solve a novel AI creation problem based on human experiences.
<details> <summary> More (Click me) </summary> More specifically, we study how to generate texts based on sequential multi-modal information. Compared with previous works, this task is much more difficult because the model has to understand and align the semantics of different modalities and convert them into the output in a sequential manner. To alleviate these difficulties, we first design a multi-channel sequence-to-sequence architecture equipped with a multi-modal attention network. For more effective optimization, we then propose a curriculum negative sampling strategy tailored for sequential inputs. To benchmark this problem and demonstrate the effectiveness of our model, we manually labeled a new multi-modal experience dataset. With this dataset, we conduct extensive experiments comparing our model with a series of representative baselines, and we demonstrate significant improvements based on both automatic and human-centered metrics.
</details> <br>
# Before You Start
- Please note that this work targets AI creation in **Chinese**, so the dataset and model checkpoints below are all in Chinese. However, we have also trained our model on English data, constructed from English poems with the same pipeline, and obtained similarly good generation results. You can construct your own English data (from English corpora such as poems and English text-image datasets such as [MovieNet](https://movienet.github.io/)) and adapt it to your own domain if necessary.

Here are the resources, which you can download at Hugging Face (a scripted download sketch follows the table):

| File | Description | Path |
| --- | --- | --- |
| mmtg_ckpt.pth | The MMTG checkpoint for reproducing our results. | _sharing_link_/ckpts/ |
| GPT2_lyrics_ckpt_epoch00.ckpt | The pre-trained decoder checkpoint, based on GPT2 and fine-tuned on our lyrics corpus. | _sharing_link_/ckpts/ |
| token_id2emb_dict.pkl | The dictionary mapping each vocabulary token to its WenLan embedding. | _sharing_link_/ |
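
If you prefer to script the download, here is a minimal sketch using `huggingface_hub`; the repository id and file paths are placeholders, since `_sharing_link_` above stands in for the actual Hugging Face repo.

```python
from huggingface_hub import hf_hub_download

# Placeholder repo id -- replace with the actual Hugging Face repo behind _sharing_link_.
REPO_ID = "<user>/<mmtg-resources>"

# File paths mirror the table above; adjust them to the real repo layout.
mmtg_ckpt = hf_hub_download(repo_id=REPO_ID, filename="ckpts/mmtg_ckpt.pth")
gpt2_ckpt = hf_hub_download(repo_id=REPO_ID, filename="ckpts/GPT2_lyrics_ckpt_epoch00.ckpt")
token_dict = hf_hub_download(repo_id=REPO_ID, filename="token_id2emb_dict.pkl")

print(mmtg_ckpt, gpt2_ckpt, token_dict, sep="\n")
```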
## Data
The dataset used in our paper is released as follows. Due to copyright issues, we only release the visual features of the images used in our dataset. All the `.pkl` files are Python lists, and each item has the following format:
```
...
```

For the test data, there are additional keys:

```
...
```
You can use this additional labeled information to analyze your parameters (like attention weights) and results.
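
As a quick sanity check, you can load one of the released `.pkl` files like this. The file name is a placeholder for whichever split you downloaded into `./data/`, and each item is assumed to be a dict with the keys listed in the format above.

```python
import pickle

# Placeholder file name -- use the actual .pkl you downloaded into ./data/.
with open("./data/test_data.pkl", "rb") as f:
    data = pickle.load(f)

# Each released file is a Python list; every item follows the format above.
print(f"{len(data)} items")
print("keys of the first item:", list(data[0].keys()))
```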
## Checkpoints
`mmtg_ckpt.pth`: The MMTG checkpoint for reproducing our results. It is trained on the dataset we released. You can simply load it and use it to generate from your own data or for the demo.
`GPT2_lyrics_ckpt_epoch00.ckpt`: The pre-trained decoder checkpoint. As mentioned in our paper, we initialize our decoder with a pre-trained GPT2 and fine-tune it on our lyrics corpus (phase 1). For the full training (phase 2), we start from this fine-tuned checkpoint.
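
A minimal sketch for inspecting the released checkpoints with PyTorch; the path and the guess that the file stores a plain state dict (possibly wrapped under a `state_dict` key) are assumptions, and the training/generation scripts contain the authoritative loading code.

```python
import torch

# Path is a placeholder -- point it at wherever you saved the downloaded checkpoint.
ckpt = torch.load("mmtg_ckpt.pth", map_location="cpu")

# Unwrap a possible {"state_dict": ...} wrapper; whether one exists is an assumption.
state_dict = ckpt["state_dict"] if isinstance(ckpt, dict) and "state_dict" in ckpt else ckpt

print(f"{len(state_dict)} entries")
for name, value in list(state_dict.items())[:5]:
    shape = tuple(value.shape) if hasattr(value, "shape") else type(value).__name__
    print(name, shape)
```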
## Other
`token_id2emb_dict.pkl`: The dictionary mapping each token in the vocabulary to its WenLan embedding. It is used to convert token ids to the corresponding embeddings in phase 1 and phase 2, which adapts the text embedding space to the image embedding space. You can also replace WenLan with other pre-trained multimodal representation models (like OpenAI CLIP) to construct an English counterpart.
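
As an illustration, the dict can be turned into an embedding lookup table as sketched below; the key type, value type, and the assumption that token ids run contiguously from 0 are ours, and the released file defines the actual format.

```python
import pickle

import numpy as np

# Path follows the setup steps below (the dict goes into ./src/vocab/).
with open("./src/vocab/token_id2emb_dict.pkl", "rb") as f:
    token_id2emb = pickle.load(f)

# Assumes integer token ids 0..N-1 and fixed-size embedding vectors as values.
emb_dim = len(next(iter(token_id2emb.values())))
emb_table = np.zeros((len(token_id2emb), emb_dim), dtype=np.float32)
for token_id, emb in token_id2emb.items():
    emb_table[token_id] = emb

print("WenLan embedding table shape:", emb_table.shape)
```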
1. Download the `data files`, `pre-trained GPT2 checkpoint`, and `token_id2emb_dict.pkl`.
2. Put them in `./data/`, `./src/pretrained/` (and change the path in `./src/configs.py` correspondingly), and `./src/vocab/`, respectively; see the layout sketch below.
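
After these two steps, the repository is expected to look roughly like this (data file names are placeholders):

```
.
├── data/
│   └── <released .pkl data files>
└── src/
    ├── configs.py
    ├── train.sh
    ├── pretrained/
    │   └── GPT2_lyrics_ckpt_epoch00.ckpt
    └── vocab/
        └── token_id2emb_dict.pkl
```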
## Training
Change your configs and run:
```
$ cd src/
$ bash train.sh
```
## Generate
Change your configs and run:
```