ViSPer is a model for audio visual speech recognition (VSR/AVSR). Trained on 550
# Training details:

We use our proposed dataset to train an encoder-decoder model in a fully-supervised manner under a multi-lingual setting. The encoder has 12 layers and the decoder has 6. The hidden size, MLP size, and number of heads are set to 768, 3072, and 12, respectively. A unigram tokenizer is learned over all languages combined, with a vocabulary size of 21k.
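From the dimensions above, a rough weight count per transformer layer can be worked out; this is an illustrative back-of-the-envelope sketch (biases, layer norms, embeddings, and the tokenizer's output head are omitted), not the exact ViSPer parameter count:

```python
# Rough parameter estimate for the transformer described above
# (hidden 768, MLP 3072, 12 heads). Per layer: attention uses four
# d x d projections (Q, K, V, O), the MLP uses two d x d_ff matrices.
# Biases and layer norms are omitted for simplicity.
d, d_ff = 768, 3072
attn = 4 * d * d          # self-attention projections
mlp = 2 * d * d_ff        # feed-forward block

encoder = 12 * (attn + mlp)           # 12 encoder layers
decoder = 6 * (2 * attn + mlp)        # 6 decoder layers, each adding cross-attention
print(f"encoder ~ {encoder / 1e6:.1f}M, decoder ~ {decoder / 1e6:.1f}M params")
# -> encoder ~ 84.9M, decoder ~ 56.6M params
```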
The models are trained for 150 epochs on 64 Nvidia A100 GPUs (40GB) using the AdamW optimizer with a max LR of 1e-3 and a weight decay of 0.1. A cosine scheduler with a 5-epoch warm-up is used for training. The maximum batch size per GPU is set to 1800 video frames.

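The schedule above (linear warm-up to the max LR, then cosine decay) can be sketched as a small function; per-epoch granularity and decay to zero are assumptions here, as the actual training code may step the scheduler per iteration:

```python
import math

# Minimal sketch of the LR schedule described above: 5 warm-up epochs
# ramping linearly to the max LR of 1e-3, then cosine decay over the
# remaining epochs of the 150-epoch run. Decaying to exactly zero is
# an assumption; many setups use a small floor LR instead.
MAX_LR, WARMUP, EPOCHS = 1e-3, 5, 150

def lr_at(epoch: int) -> float:
    """Learning rate at the start of the given (0-indexed) epoch."""
    if epoch < WARMUP:
        return MAX_LR * (epoch + 1) / WARMUP
    progress = (epoch - WARMUP) / (EPOCHS - WARMUP)
    return 0.5 * MAX_LR * (1 + math.cos(math.pi * progress))

print(lr_at(0), lr_at(4), lr_at(149))  # ramp start, peak, near-zero tail
```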
# Performance: