2021 Update: We are merging this example into the S2T framework, which supports more generic speech-to-text tasks (e.g. speech translation) and more flexible data processing pipelines. Please stay tuned.
Speech Recognition
examples/speech_recognition
is implementing ASR task in Fairseq, along with needed features, datasets, models and loss functions to train and infer model described in Transformers with convolutional context for ASR (Abdelrahman Mohamed et al., 2019).
Additional dependencies
On top of main fairseq dependencies there are couple more additional requirements.
- Please follow the instructions to install torchaudio. This is required to compute audio fbank features.
- Sclite is used to measure WER. Sclite can be downloaded and installed from source from sctk package here. Training and inference doesn't require Sclite dependency.
- sentencepiece is required in order to create dataset with word-piece targets.
Preparing librispeech data
./examples/speech_recognition/datasets/prepare-librispeech.sh $DIR_TO_SAVE_RAW_DATA $DIR_FOR_PREPROCESSED_DATA
Training librispeech data
python train.py $DIR_FOR_PREPROCESSED_DATA --save-dir $MODEL_PATH --max-epoch 80 --task speech_recognition --arch vggtransformer_2 --optimizer adadelta --lr 1.0 --adadelta-eps 1e-8 --adadelta-rho 0.95 --clip-norm 10.0 --max-tokens 5000 --log-format json --log-interval 1 --criterion cross_entropy_acc --user-dir examples/speech_recognition/
Inference for librispeech
$SET
can be test_clean
or test_other
Any checkpoint in $MODEL_PATH
can be selected. In this example we are working with checkpoint_last.pt
python examples/speech_recognition/infer.py $DIR_FOR_PREPROCESSED_DATA --task speech_recognition --max-tokens 25000 --nbest 1 --path $MODEL_PATH/checkpoint_last.pt --beam 20 --results-path $RES_DIR --batch-size 40 --gen-subset $SET --user-dir examples/speech_recognition/
Inference for librispeech
sclite -r ${RES_DIR}/ref.word-checkpoint_last.pt-${SET}.txt -h ${RES_DIR}/hypo.word-checkpoint_last.pt-${SET}.txt -i rm -o all stdout > $RES_REPORT
Sum/Avg
row from first table of the report has WER
Using flashlight (previously called wav2letter) components
flashlight now has integration with fairseq. Currently this includes:
- AutoSegmentationCriterion (ASG)
- flashlight-style Conv/GLU model
- flashlight's beam search decoder
To use these, follow the instructions on this page to install python bindings.
Training librispeech data (flashlight style, Conv/GLU + ASG loss)
Training command:
python train.py $DIR_FOR_PREPROCESSED_DATA --save-dir $MODEL_PATH --max-epoch 100 --task speech_recognition --arch w2l_conv_glu_enc --batch-size 4 --optimizer sgd --lr 0.3,0.8 --momentum 0.8 --clip-norm 0.2 --max-tokens 50000 --log-format json --log-interval 100 --num-workers 0 --sentence-avg --criterion asg_loss --asg-transitions-init 5 --max-replabel 2 --linseg-updates 8789 --user-dir examples/speech_recognition
Note that ASG loss currently doesn't do well with word-pieces. You should prepare a dataset with character targets by setting nbpe=31
in prepare-librispeech.sh
.
Inference for librispeech (flashlight decoder, n-gram LM)
Inference command:
python examples/speech_recognition/infer.py $DIR_FOR_PREPROCESSED_DATA --task speech_recognition --seed 1 --nbest 1 --path $MODEL_PATH/checkpoint_last.pt --gen-subset $SET --results-path $RES_DIR --w2l-decoder kenlm --kenlm-model $KENLM_MODEL_PATH --lexicon $LEXICON_PATH --beam 200 --beam-threshold 15 --lm-weight 1.5 --word-score 1.5 --sil-weight -0.3 --criterion asg_loss --max-replabel 2 --user-dir examples/speech_recognition
$KENLM_MODEL_PATH
should be a standard n-gram language model file. $LEXICON_PATH
should be a flashlight-style lexicon (list of known words and their spellings). For ASG inference, a lexicon line should look like this (note the repetition labels):
doorbell D O 1 R B E L 1 ▁
For CTC inference with word-pieces, repetition labels are not used and the lexicon should have most common spellings for each word (one can use sentencepiece's NBestEncodeAsPieces
for this):
doorbell ▁DOOR BE LL
doorbell ▁DOOR B E LL
doorbell ▁DO OR BE LL
doorbell ▁DOOR B EL L
doorbell ▁DOOR BE L L
doorbell ▁DO OR B E LL
doorbell ▁DOOR B E L L
doorbell ▁DO OR B EL L
doorbell ▁DO O R BE LL
doorbell ▁DO OR BE L L
Lowercase vs. uppercase matters: the word should match the case of the n-gram language model (i.e. $KENLM_MODEL_PATH
), while the spelling should match the case of the token dictionary (i.e. $DIR_FOR_PREPROCESSED_DATA/dict.txt
).
Inference for librispeech (flashlight decoder, viterbi only)
Inference command:
python examples/speech_recognition/infer.py $DIR_FOR_PREPROCESSED_DATA --task speech_recognition --seed 1 --nbest 1 --path $MODEL_PATH/checkpoint_last.pt --gen-subset $SET --results-path $RES_DIR --w2l-decoder viterbi --criterion asg_loss --max-replabel 2 --user-dir examples/speech_recognition