# VATLM
<!--**Pre-trained models for speech related tasks**-->

 [**VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning**](https://arxiv.org/abs/2211.11275)


- (Done) Nov. 2022: released the code and models
- Nov. 2022: released the preprint on [arXiv](https://arxiv.org/abs/2211.11275)

## Pre-Trained and Fine-tuned Models

|    Model    |            Pre-training Dataset            |  Fine-tuning Dataset  |                           Download                           |
| :---------: | :----------------------------------------: | :-------------------: | :----------------------------------------------------------: |
| VatLM Base  |       LRS3 + paired audio+text+audio       |           -           | [Google drive](https://drive.google.com/file/d/121ITJc22prpbd4sCy9bPWpdkKgGikkgm/view?usp=share_link) |
| VatLM Base  |       LRS3 + paired audio+text+audio       | LRS-30h  audio-visual | [Google drive](https://drive.google.com/file/d/1Bfbq0G-tASw3YrI3rzdpYgTE-UV-YaN0/view?usp=share_link) |
| VatLM Base  |       LRS3 + paired audio+text+audio       |    LRS-30h  visual    | [Google drive](https://drive.google.com/file/d/1qALD9obym0zCDoszVn2CzW0U3EUl-4v7/view?usp=share_link) |
| VatLM Base  | VoxCeleb2 + LRS3 + paired audio+text+audio |           -           | [Google drive](https://drive.google.com/file/d/1piae9Row25OEfAekVz5Bxb9YnIVyEP0A/view?usp=share_link) |
| VatLM Base  | VoxCeleb2 + LRS3 + paired audio+text+audio | LRS-30h audio-visual  | [Google drive](https://drive.google.com/file/d/13JVuUi9gIIoUM888XcAOzvN7ioazn-cv/view?usp=share_link) |
| VatLM Base  | VoxCeleb2 + LRS3 + paired audio+text+audio |    LRS-30h  visual    | [Google drive](https://drive.google.com/file/d/1pAQHf60HgqDORGzyqEjdGTIywLKO3Ko5/view?usp=share_link) |
| VatLM Base  | VoxCeleb2 + LRS3 + paired audio+text+audio | LRS-433h audio-visual | [Google drive](https://drive.google.com/file/d/1u9oMnivBelxznQcMDoM_u5EOfJuxnSuL/view?usp=share_link) |
| VatLM Base  | VoxCeleb2 + LRS3 + paired audio+text+audio |    LRS-433h visual    | [Google drive](https://drive.google.com/file/d/1g107k5tL3XyvevSe0BzMqYOQFyFQG7jf/view?usp=share_link) |
| VatLM Large | VoxCeleb2 + LRS3 + paired audio+text+audio |           -           | [Google drive](https://drive.google.com/file/d/1_vbVFpKcaaPcCx2FtI-GyzVvxAhppg_b/view?usp=share_link) |
| VatLM Large | VoxCeleb2 + LRS3 + paired audio+text+audio | LRS-30h  audio-visual | [Google drive](https://drive.google.com/file/d/1LyTCxceTZIqjVdMY6hlJjWolaIAZ0Mhs/view?usp=share_link) |
| VatLM Large | VoxCeleb2 + LRS3 + paired audio+text+audio |    LRS-30h  visual    | [Google drive](https://drive.google.com/file/d/1CuyGg5O14F9Y_WCwpCVoKYbDKVtjBRQU/view?usp=share_link) |
| VatLM Large | VoxCeleb2 + LRS3 + paired audio+text+audio | LRS-433h audio-visual | [Google drive](https://drive.google.com/file/d/12orvO3xBuzdUDrBOqjW0mdGhV2Kmsy0Q/view?usp=share_link) |
| VatLM Large | VoxCeleb2 + LRS3 + paired audio+text+audio |    LRS-433h visual    | [Google drive](https://drive.google.com/file/d/17DDTUPs0BkaJtSUTiJHLBbymt2LCGo6e/view?usp=share_link) |
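
The released checkpoints are fairseq-style PyTorch `.pt` files. As a quick sanity check after downloading, you can inspect one locally. The snippet below is only a sketch: the filename `vatlm_base_lrs3.pt` is a placeholder for whichever checkpoint you downloaded, and the key layout is the usual one for fairseq checkpoints.

```python
# Illustrative only: peek inside a downloaded checkpoint. The filename is a
# placeholder; fairseq-style checkpoints are torch-pickled dicts that usually
# hold the model weights under "model" plus the training configuration.
import torch

# weights_only=False is needed on recent PyTorch because the checkpoint also
# stores config objects, not just tensors.
ckpt = torch.load("vatlm_base_lrs3.pt", map_location="cpu", weights_only=False)
print(list(ckpt.keys()))
state_dict = ckpt.get("model", {})
print(f"{len(state_dict)} parameter tensors")
```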



## Setup

To fine-tune or pre-train models, please follow the instructions below.

```bash
git clone https://github.com/microsoft/SpeechT5.git
cd SpeechT5/VATLM
git submodule init && git submodule update

cd fairseq && pip install --editable ./
cd ../vat_hubert && pip install -r requirements.txt
```
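
As an optional sanity check (not part of the official setup), you can confirm that the editable fairseq install resolves before moving on:

```python
# Optional sanity check: confirm the editable fairseq install is importable.
import fairseq

print(fairseq.__version__)
```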

## Data preparation

1. For audio or visual data, please follow AV-HuBERT's preparation [script](https://github.com/facebookresearch/av_hubert/tree/main/avhubert/preparation) to pre-process the data and obtain the corresponding `train.tsv` and `train.km` files.

2. For unimodal audio data, the visual modality is replaced with a zero vector. Features are extracted with the same [script](https://github.com/facebookresearch/av_hubert/tree/main/avhubert/preparation), and k-means [clustering](https://github.com/facebookresearch/av_hubert/tree/main/avhubert/clustering) is then run on them to obtain the corresponding frame-level labels (see the sketch after this list).

3. For unimodal text data, we use a small amount of paired text-audio data to obtain paired phone-unit data: the phoneme sequences are obtained by looking up the [lexicon](https://drive.google.com/file/d/1dh9NEx_cCF9_Aa0UcKyl9j00GXs6LmLQ/view?usp=sharing), and the unit sequences are obtained by extracting audio features and running k-means [clustering](https://github.com/facebookresearch/av_hubert/tree/main/avhubert/clustering). Then follow this [script](https://github.com/microsoft/SpeechT5/tree/main/SpeechLM#hidden-unit-tokenizer-for-text) to train the phone2unit model.
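
As a rough illustration of the unit labelling in steps 2 and 3 (this is not the AV-HuBERT clustering script; the file names, shapes, and codebook size below are assumptions), frame-level units come from fitting k-means on the dumped features and writing one cluster id per frame to a `.km` file:

```python
# Minimal sketch of k-means unit labelling (illustrative; not the repo's
# clustering script). Assumes features were dumped as a single
# (num_frames, feat_dim) array plus per-utterance frame counts.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

n_clusters = 500                        # assumed codebook size
feats = np.load("train_feats.npy")      # hypothetical feature dump
lengths = np.load("train_lengths.npy")  # hypothetical per-utterance frame counts

km = MiniBatchKMeans(n_clusters=n_clusters, batch_size=10000)
km.fit(feats)
labels = km.predict(feats)

# Write one line per utterance: space-separated cluster ids, one per frame.
offsets = np.cumsum(lengths)[:-1]
with open("train.km", "w") as f:
    for utt_labels in np.split(labels, offsets):
        f.write(" ".join(map(str, utt_labels)) + "\n")
```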

## Pre-train

- VatLM Base model (LRS3 + paired audio+text+audio)

  ```shell
  cd VATLM/vat_hubert/vathubert/scripts/pretrain
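  # ngpu sets the number of GPUs and updatefreq the gradient-accumulation
  # steps passed to the script; if fewer GPUs are available, increasing
  # updatefreq keeps the effective batch size comparable (standard fairseq
  # behaviour; the exact handling is defined inside the script).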
  ngpu=32
  updatefreq=1
  save_path=/path/to/save_path
  
  bash base_lsr3_pretrain_iter5.sh ${ngpu} ${updatefreq} ${save_path}
  ```

- VatLM Base model (VoxCeleb2 + paired audio+text+audio)

  ```shell
  cd VATLM/vat_hubert/vathubert/scripts/pretrain
  ngpu=32
  updatefreq=1
  save_path=/path/to/save_path
  
  bash base_vox_pretrain_iter5.sh ${ngpu} ${updatefreq} ${save_path}
  ```

- VatLM Large model (VoxCeleb2 + paired audio+text+audio)

  ```shell
  cd VATLM/vat_hubert/vathubert/scripts/pretrain
  ngpu=32
  updatefreq=2
  save_path=/path/to/save_path
  
  bash large_vox_pretrain_iter5.sh ${ngpu} ${updatefreq} ${save_path}
  ```

## Fine-tune AVSR/VSR

For example, an AVSR model can be obtained by fine-tuning the pre-trained VatLM model on 30 hours of labeled data:

```shell
cd VATLM/vat_hubert/vathubert/scripts/finetune_avsr
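# As in pre-training, ngpu and updatefreq set the GPU count and gradient-accumulation steps.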
ngpu=8
updatefreq=1
save_path=/path/to/save_path

bash base_lrs3_finetune30_av.sh ${ngpu} ${updatefreq} ${save_path}
```

## Decode

For example, to decode the fine-tuned AVSR model:

```sh
cd VATLM/vat_hubert/vathubert/
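# "data" names the split to decode (an assumption based on the variable name).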
data="test"
bash decode_avhubert_lrs3.sh ${data}
```

## License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree.
Portions of the source code are based on the [FAIRSEQ](https://github.com/pytorch/fairseq) and [av_hubert](https://github.com/facebookresearch/av_hubert) projects.

[Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct)

## Reference

If you find our work useful in your research, please cite the following paper:

```bibtex
@article{zhu2022vatlm,
      title={VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning}, 
      author={Qiushi Zhu and Long Zhou and Ziqiang Zhang and Shujie Liu and Binxing Jiao and Jie Zhang and Lirong Dai and Daxin Jiang and Jinyu Li and Furu Wei},
      year={2022},
      eprint={2211.11275},
      archivePrefix={arXiv},
}
```

### Contact Information

For help or issues using VatLM models, please submit a GitHub issue.

For other communications related to VatLM, please contact Long Zhou (`[email protected]`).