ViSPer is a model for audio visual speech recognition (VSR/AVSR). Trained on 550
# Training details:

We use our proposed dataset to train an encoder-decoder model in a fully-supervised manner under a multi-lingual setting. The encoder has 12 layers and the decoder has 6. The hidden size, MLP size, and number of heads are set to 768, 3072, and 12, respectively. A unigram tokenizer is learned over all languages combined, with a vocabulary size of 21k.
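From the dimensions above, a rough weight count per transformer layer can be worked out; this is an illustrative back-of-the-envelope sketch (biases, layer norms, embeddings, and the tokenizer's output head are omitted), not the exact ViSPer parameter count:

```python
# Rough parameter estimate for the transformer described above
# (hidden 768, MLP 3072, 12 heads). Per layer: attention uses four
# d x d projections (Q, K, V, O), the MLP uses two d x d_ff matrices.
# Biases and layer norms are omitted for simplicity.
d, d_ff = 768, 3072
attn = 4 * d * d          # self-attention projections
mlp = 2 * d * d_ff        # feed-forward block

encoder = 12 * (attn + mlp)           # 12 encoder layers
decoder = 6 * (2 * attn + mlp)        # 6 decoder layers, each adding cross-attention
print(f"encoder ~ {encoder / 1e6:.1f}M, decoder ~ {decoder / 1e6:.1f}M params")
# -> encoder ~ 84.9M, decoder ~ 56.6M params
```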
The models are trained for 150 epochs on 64 Nvidia A100 GPUs (40GB) using the AdamW optimizer with a max LR of 1e-3 and a weight decay of 0.1. A cosine scheduler with a 5-epoch warm-up is used for training. The maximum batch size per GPU is set to 1800 video frames.

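The schedule above (linear warm-up to the max LR, then cosine decay) can be sketched as a small function; per-epoch granularity and decay to zero are assumptions here, as the actual training code may step the scheduler per iteration:

```python
import math

# Minimal sketch of the LR schedule described above: 5 warm-up epochs
# ramping linearly to the max LR of 1e-3, then cosine decay over the
# remaining epochs of the 150-epoch run. Decaying to exactly zero is
# an assumption; many setups use a small floor LR instead.
MAX_LR, WARMUP, EPOCHS = 1e-3, 5, 150

def lr_at(epoch: int) -> float:
    """Learning rate at the start of the given (0-indexed) epoch."""
    if epoch < WARMUP:
        return MAX_LR * (epoch + 1) / WARMUP
    progress = (epoch - WARMUP) / (EPOCHS - WARMUP)
    return 0.5 * MAX_LR * (1 + math.cos(math.pi * progress))

print(lr_at(0), lr_at(4), lr_at(149))  # ramp start, peak, near-zero tail
```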
# Performance: