---
library_name: transformers
tags: []
---

# Hugging Face Implementation of AV-HuBERT on the MuAViC Dataset

This repository contains a Hugging Face implementation of the AV-HuBERT (Audio-Visual Hidden Unit BERT) model, trained and tested on the MuAViC (Multilingual Audio-Visual Corpus) dataset. AV-HuBERT is a self-supervised model designed for audio-visual speech recognition. It uses both the audio and visual modalities to achieve robust performance, especially in noisy environments.

Key features of this repository include:

- Pre-trained models: Access pre-trained AV-HuBERT models fine-tuned on the MuAViC dataset. The pre-trained models were exported from the [MuAViC](https://github.com/facebookresearch/muavic) repository.
- Inference scripts: Run inference easily through Hugging Face's interface.
- Data preprocessing scripts: Normalize the frame rate and extract the lip region and audio.

### Inference code

```sh
git clone https://github.com/nguyenvulebinh/AV-HuBERT-S2S.git
cd AV-HuBERT-S2S
conda create -n avhuberts2s python=3.9
conda activate avhuberts2s
pip install -r requirements.txt
python run_example.py
```

```python
from src.model.avhubert2text import AV2TextForConditionalGeneration
from src.dataset.load_data import load_feature
from transformers import Speech2TextTokenizer
import torch

if __name__ == "__main__":
    # Load the pretrained English model and its tokenizer
    model = AV2TextForConditionalGeneration.from_pretrained('nguyenvulebinh/AV-HuBERT')
    tokenizer = Speech2TextTokenizer.from_pretrained('nguyenvulebinh/AV-HuBERT')

    # Move the model to the GPU and switch to evaluation mode
    model = model.cuda().eval()

    # Load normalized input data (lip video and matching audio)
    sample = load_feature(
        './example/lip_movement.mp4',
        './example/noisy_audio.wav'
    )

    # Move the input features to the GPU
    audio_feats = sample['audio_source'].cuda()
    video_feats = sample['video_source'].cuda()
    # All-False boolean mask with shape (batch, time)
    attention_mask = torch.zeros(
        audio_feats.size(0), audio_feats.size(-1),
        dtype=torch.bool, device=audio_feats.device
    )

    # Generate the output sequence using the HF interface
    with torch.no_grad():
        output = model.generate(
            audio_feats,
            attention_mask=attention_mask,
            video=video_feats,
        )

    # Decode the output sequence
    print(tokenizer.batch_decode(output, skip_special_tokens=True))

    # Check the output against the expected token IDs
    assert output.detach().cpu().numpy().tolist() == [[2, 16, 130, 516, 8, 339, 541, 808, 210, 195, 541, 79, 130, 317, 269, 4, 2]]
    print("Example run successfully")
```

### Data preprocessing scripts

```sh
# Download the mean-face and face-landmark files into model-bin/
mkdir model-bin
cd model-bin
wget https://huggingface.co/nguyenvulebinh/AV-HuBERT/resolve/main/20words_mean_face.npy
wget https://huggingface.co/nguyenvulebinh/AV-HuBERT/resolve/main/shape_predictor_68_face_landmarks.dat
# Return to the repository root before running the script
cd ..

# Raw video currently supports only a 4:3 aspect ratio
cp raw_video.mp4 ./example/

python src/dataset/video_to_audio_lips.py
```

### Pretrained model

| Task | Languages | Hugging Face |
|---|---|---|
| AVSR | ar | TODO |
| AVSR | de | TODO |
| AVSR | el | TODO |
| AVSR | en | English checkpoint |
| AVSR | es | TODO |
| AVSR | fr | TODO |
| AVSR | it | TODO |
| AVSR | pt | TODO |
| AVSR | ru | TODO |
| AVSR | ar,de,el,es,fr,it,pt,ru | TODO |
| AVST | en-el | TODO |
| AVST | en-es | TODO |
| AVST | en-fr | TODO |
| AVST | en-it | TODO |
| AVST | en-pt | TODO |
| AVST | en-ru | TODO |
| AVST | el-en | TODO |
| AVST | es-en | TODO |
| AVST | fr-en | TODO |
| AVST | it-en | TODO |
| AVST | pt-en | TODO |
| AVST | ru-en | TODO |
| AVST | {el,es,fr,it,pt,ru}-en | TODO |
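
Since most entries in the table are still marked TODO, a script that accepts a task and language from the user may want to fail early when no checkpoint has been released. The sketch below is one possible way to do that. The names `RELEASED_CHECKPOINTS` and `resolve_checkpoint` are hypothetical and not part of this repository, and the mapping assumes that the English AVSR checkpoint in the table is the `nguyenvulebinh/AV-HuBERT` model loaded in the inference example.

```python
from typing import Dict, Tuple

# Hypothetical registry (not part of this repository): (task, language) -> Hub model ID.
# Only the English AVSR entry is filled in; every other row in the table is TODO.
# Assumption: the English checkpoint is the model ID used in the inference example.
RELEASED_CHECKPOINTS: Dict[Tuple[str, str], str] = {
    ("AVSR", "en"): "nguyenvulebinh/AV-HuBERT",
}


def resolve_checkpoint(task: str, language: str) -> str:
    """Return the Hub model ID for a task/language pair, or raise if unreleased."""
    key = (task.upper(), language.lower())
    try:
        return RELEASED_CHECKPOINTS[key]
    except KeyError:
        raise ValueError(
            f"No released checkpoint for task={task!r}, language={language!r}"
        ) from None


if __name__ == "__main__":
    # Look up the only pair that is currently released
    print(resolve_checkpoint("AVSR", "en"))
```

The returned ID could then be passed to both `from_pretrained` calls in the inference example. New rows would be added to the dictionary as further checkpoints are published.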