# VATLM
<!--**Pre-trained models for speech related tasks**-->

 [**VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning**](https://arxiv.org/abs/2211.11275)


- (Done) Nov. 2022: released the code and models
- Nov. 2022: released the preprint on [arXiv](https://arxiv.org/abs/2211.11275)

## Pre-Trained and Fine-tuned Models

|    Model    |            Pre-training Dataset            |  Fine-tuning Dataset  |                           Download                           |
| :---------: | :----------------------------------------: | :-------------------: | :----------------------------------------------------------: |
| VatLM Base  |       LRS3 + paired audio+text+audio       |           -           | [Google drive](https://drive.google.com/file/d/121ITJc22prpbd4sCy9bPWpdkKgGikkgm/view?usp=share_link) |
| VatLM Base  |       LRS3 + paired audio+text+audio       | LRS-30h  audio-visual | [Google drive](https://drive.google.com/file/d/1Bfbq0G-tASw3YrI3rzdpYgTE-UV-YaN0/view?usp=share_link) |
| VatLM Base  |       LRS3 + paired audio+text+audio       |    LRS-30h  visual    | [Google drive](https://drive.google.com/file/d/1qALD9obym0zCDoszVn2CzW0U3EUl-4v7/view?usp=share_link) |
| VatLM Base  | VoxCeleb2 + LRS3 + paired audio+text+audio |           -           | [Google drive](https://drive.google.com/file/d/1piae9Row25OEfAekVz5Bxb9YnIVyEP0A/view?usp=share_link) |
| VatLM Base  | VoxCeleb2 + LRS3 + paired audio+text+audio | LRS-30h audio-visual  | [Google drive](https://drive.google.com/file/d/13JVuUi9gIIoUM888XcAOzvN7ioazn-cv/view?usp=share_link) |
| VatLM Base  | VoxCeleb2 + LRS3 + paired audio+text+audio |    LRS-30h  visual    | [Google drive](https://drive.google.com/file/d/1pAQHf60HgqDORGzyqEjdGTIywLKO3Ko5/view?usp=share_link) |
| VatLM Base  | VoxCeleb2 + LRS3 + paired audio+text+audio | LRS-433h audio-visual | [Google drive](https://drive.google.com/file/d/1u9oMnivBelxznQcMDoM_u5EOfJuxnSuL/view?usp=share_link) |
| VatLM Base  | VoxCeleb2 + LRS3 + paired audio+text+audio |    LRS-433h visual    | [Google drive](https://drive.google.com/file/d/1g107k5tL3XyvevSe0BzMqYOQFyFQG7jf/view?usp=share_link) |
| VatLM Large | VoxCeleb2 + LRS3 + paired audio+text+audio |           -           | [Google drive](https://drive.google.com/file/d/1_vbVFpKcaaPcCx2FtI-GyzVvxAhppg_b/view?usp=share_link) |
| VatLM Large | VoxCeleb2 + LRS3 + paired audio+text+audio | LRS-30h  audio-visual | [Google drive](https://drive.google.com/file/d/1LyTCxceTZIqjVdMY6hlJjWolaIAZ0Mhs/view?usp=share_link) |
| VatLM Large | VoxCeleb2 + LRS3 + paired audio+text+audio |    LRS-30h  visual    | [Google drive](https://drive.google.com/file/d/1CuyGg5O14F9Y_WCwpCVoKYbDKVtjBRQU/view?usp=share_link) |
| VatLM Large | VoxCeleb2 + LRS3 + paired audio+text+audio | LRS-433h audio-visual | [Google drive](https://drive.google.com/file/d/12orvO3xBuzdUDrBOqjW0mdGhV2Kmsy0Q/view?usp=share_link) |
| VatLM Large | VoxCeleb2 + LRS3 + paired audio+text+audio |    LRS-433h visual    | [Google drive](https://drive.google.com/file/d/17DDTUPs0BkaJtSUTiJHLBbymt2LCGo6e/view?usp=share_link) |
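
The released checkpoints are fairseq-style PyTorch `.pt` files. As a quick sanity check after downloading, you can inspect one locally. The snippet below is only a sketch: the filename `vatlm_base_lrs3.pt` is a placeholder for whichever checkpoint you downloaded, and the key layout is the usual one for fairseq checkpoints.

```python
# Illustrative only: peek inside a downloaded checkpoint. The filename is a
# placeholder; fairseq-style checkpoints are torch-pickled dicts that usually
# hold the model weights under "model" plus the training configuration.
import torch

# weights_only=False is needed on recent PyTorch because the checkpoint also
# stores config objects, not just tensors.
ckpt = torch.load("vatlm_base_lrs3.pt", map_location="cpu", weights_only=False)
print(list(ckpt.keys()))
state_dict = ckpt.get("model", {})
print(f"{len(state_dict)} parameter tensors")
```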



## Setup

To fine-tune or pre-train models, please follow the instructions below.

```bash
git clone https://github.com/microsoft/SpeechT5.git
cd SpeechT5/VATLM
git submodule init && git submodule update

cd fairseq && pip install --editable ./
cd ../vat_hubert && pip install -r requirements.txt
```
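
As an optional sanity check (not part of the official setup), you can confirm that the editable fairseq install resolves before moving on:

```python
# Optional sanity check: confirm the editable fairseq install is importable.
import fairseq

print(fairseq.__version__)
```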

## Data preparation

1. For audio or visual data, please follow AV-HuBERT's preparation [script](https://github.com/facebookresearch/av_hubert/tree/main/avhubert/preparation) to pre-process the data and obtain the corresponding `train.tsv` and `train.km` files.

2. For unimodal audio data, the visual modality is replaced with a zero vector. Features are extracted with the same [script](https://github.com/facebookresearch/av_hubert/tree/main/avhubert/preparation), and k-means [clustering](https://github.com/facebookresearch/av_hubert/tree/main/avhubert/clustering) is then run on them to obtain the corresponding frame-level labels (see the sketch after this list).

3. For unimodal text data, we use a small amount of paired text-audio data to obtain paired phone-unit data: the phoneme sequences are obtained by looking up the [lexicon](https://drive.google.com/file/d/1dh9NEx_cCF9_Aa0UcKyl9j00GXs6LmLQ/view?usp=sharing), and the unit sequences are obtained by extracting audio features and running k-means [clustering](https://github.com/facebookresearch/av_hubert/tree/main/avhubert/clustering). Then follow this [script](https://github.com/microsoft/SpeechT5/tree/main/SpeechLM#hidden-unit-tokenizer-for-text) to train the phone2unit model.
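
As a rough illustration of the unit labelling in steps 2 and 3 (this is not the AV-HuBERT clustering script; the file names, shapes, and codebook size below are assumptions), frame-level units come from fitting k-means on the dumped features and writing one cluster id per frame to a `.km` file:

```python
# Minimal sketch of k-means unit labelling (illustrative; not the repo's
# clustering script). Assumes features were dumped as a single
# (num_frames, feat_dim) array plus per-utterance frame counts.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

n_clusters = 500                        # assumed codebook size
feats = np.load("train_feats.npy")      # hypothetical feature dump
lengths = np.load("train_lengths.npy")  # hypothetical per-utterance frame counts

km = MiniBatchKMeans(n_clusters=n_clusters, batch_size=10000)
km.fit(feats)
labels = km.predict(feats)

# Write one line per utterance: space-separated cluster ids, one per frame.
offsets = np.cumsum(lengths)[:-1]
with open("train.km", "w") as f:
    for utt_labels in np.split(labels, offsets):
        f.write(" ".join(map(str, utt_labels)) + "\n")
```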

## Pre-train

- VatLM Base model (LRS3 + paired audio+text+audio)

  ```shell
  cd VATLM/vat_hubert/vathubert/scripts/pretrain
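  # ngpu sets the number of GPUs and updatefreq the gradient-accumulation
  # steps passed to the script; if fewer GPUs are available, increasing
  # updatefreq keeps the effective batch size comparable (standard fairseq
  # behaviour; the exact handling is defined inside the script).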
  ngpu=32
  updatefreq=1
  save_path=/path/to/save_path
  
  bash base_lsr3_pretrain_iter5.sh ${ngpu} ${updatefreq} ${save_path}
  ```

- VatLM Base model (VoxCeleb2 + paired audio+text+audio)

  ```shell
  cd VATLM/vat_hubert/vathubert/scripts/pretrain
  ngpu=32
  updatefreq=1
  save_path=/path/to/save_path
  
  bash base_vox_pretrain_iter5.sh ${ngpu} ${updatefreq} ${save_path}
  ```

- VatLM Large model (VoxCeleb2 + paired audio+text+audio)

  ```shell
  cd VATLM/vat_hubert/vathubert/scripts/pretrain
  ngpu=32
  updatefreq=2
  save_path=/path/to/save_path
  
  bash large_vox_pretrain_iter5.sh ${ngpu} ${updatefreq} ${save_path}
  ```

## Fine-tune AVSR/VSR

For example, an AVSR model can be obtained by fine-tuning the pre-trained VatLM model on 30 hours of labeled data:

```shell
cd VATLM/vat_hubert/vathubert/scripts/finetune_avsr
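# As in pre-training, ngpu and updatefreq set the GPU count and gradient-accumulation steps.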
ngpu=8
updatefreq=1
save_path=/path/to/save_path

bash base_lrs3_finetune30_av.sh ${ngpu} ${updatefreq} ${save_path}
```

## Decode

For example, to decode the fine-tuned AVSR model:

```sh
cd VATLM/vat_hubert/vathubert/
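# "data" names the split to decode (an assumption based on the variable name).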
data="test"
bash decode_avhubert_lrs3.sh ${data}
```

## License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree.
Portions of the source code are based on the [FAIRSEQ](https://github.com/pytorch/fairseq) and [av_hubert](https://github.com/facebookresearch/av_hubert) projects.

[Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct)

## Reference

If you find our work useful in your research, please cite the following paper:

```bibtex
@article{zhu2022vatlm,
      title={VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning}, 
      author={Qiushi Zhu and Long Zhou and Ziqiang Zhang and Shujie Liu and Binxing Jiao and Jie Zhang and Lirong Dai and Daxin Jiang and Jinyu Li and Furu Wei},
      year={2022},
      eprint={2211.11275},
      archivePrefix={arXiv},
}
```

### Contact Information

For help or issues using VatLM models, please submit a GitHub issue.

For other communications related to VatLM, please contact Long Zhou (`[email protected]`).