ASR-based evaluation
Overall, the life cycle of the ASR-based evaluation for a ULM consists of the following steps:
- Training a ULM and sampling from it [description]
- Running UTS on the sampled unit sequences [description]
- Pre-processing for the ASR (down-sampling to 16 kHz, aligning the length of the generated audio with the ground-truth utterances)
- Running ASR
- Calculation of the post-ASR evaluation metrics
Here we assume that you have already gone through the first two steps and focus on the rest.
Preprocessing
Down-sampling to 16 kHz
The bulk conversion can be done by running
python $FAIRSEQ_ROOT/examples/textless_nlp/gslm/unit2speech/convert_to_16k.py $UTS_OUTPUT $UTS_OUTPUT_DOWNSAMPLE
where $UTS_OUTPUT specifies the directory with the generated audio and $UTS_OUTPUT_DOWNSAMPLE is the directory where the downsampled audio will be saved.
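For reference, the per-file operation is a plain resample to 16 kHz. A minimal sketch, assuming torchaudio is installed and using illustrative file names (the script above applies this in bulk over the whole directory):
import torchaudio

wav, sr = torchaudio.load("uts_output/sample_0.wav")        # audio at the original UTS sample rate
wav_16k = torchaudio.functional.resample(wav, sr, 16000)    # resample to 16 kHz
torchaudio.save("uts_output_16k/sample_0.wav", wav_16k, 16000)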
Matching by length
This step is optional. However, if you want to compare the fluency and diversity of a generated speech utterance to those of the ground-truth speech with the same prefix, it is a good idea to force them to have the same length.
python $FAIRSEQ_ROOT/examples/textless_nlp/asr_metrics/cut_as.py \
--samples_dir=$UTS_OUTPUT_DOWNSAMPLE --out_dir=$UTS_OUTPUT_DOWNSAMPLE_CUT \
--prompts_description=data/ground_truth_continuation_dev.json
Here ground_truth_continuation_dev.json is a JSON file with the ground-truth text from LibriSpeech dev-clean, associated with some metadata (assuming the evaluation is done on dev-clean). This file can be downloaded [here]. A similar file for test-clean is [here]. These files are used for the evaluation and contain texts for audio sequences that are at least 6s long.
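The core idea is simply to trim each generated continuation so that it spans the same duration as the corresponding ground-truth utterance. A minimal per-file sketch with illustrative paths (cut_as.py does the matching through the JSON description above):
import torchaudio

gen, sr = torchaudio.load("uts_output_16k/sample_0.wav")
ref, ref_sr = torchaudio.load("ground_truth/sample_0.wav")
# keep at most as many samples as the ground-truth utterance lasts
target = min(gen.shape[1], int(ref.shape[1] * sr / ref_sr))
torchaudio.save("uts_output_16k_cut/sample_0.wav", gen[:, :target], sr)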
Running ASR
We use a pre-trained wav2vec model to run the ASR step. We first need to prepare manifest files which, roughly, tell the ASR system which files we want to transcribe. You can find more details and download the 960h_scratch.pt checkpoint [here]. To run ASR, you will also need to install KenLM and the Flashlight decoder, and download the KenLM 4-gram English language model.
python $FAIRSEQ_ROOT/examples/wav2vec/wav2vec_manifest.py \
$UTS_OUTPUT_DOWNSAMPLE_CUT --valid-percent 0.0 --dest $MANIFEST_DIR --ext wav
where $UTS_OUTPUT_DOWNSAMPLE_CUT specifies the directory with the preprocessed UTS outputs and $MANIFEST_DIR is the output directory.
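The resulting train.tsv follows the standard wav2vec manifest layout: the first line is the root directory, followed by one tab-separated "relative path / number of samples" row per file. A small sketch that writes such a manifest by hand (soundfile assumed; normally the script above generates it for you):
import soundfile as sf
from pathlib import Path

root = Path("uts_output_16k_cut")
with open("manifest/train.tsv", "w") as f:
    print(root.resolve(), file=f)                 # first line: root directory
    for wav in sorted(root.glob("*.wav")):
        print(f"{wav.name}\t{sf.info(str(wav)).frames}", file=f)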
We will be running an out-of-the-box evaluation script which requires ground-truth transcripts to measure quality metrics. We are only interested in the transcripts themselves (and there are no ground-truth transcripts for what our ULM generated!), hence we will just generate some dummy transcripts instead:
cp $FAIRSEQ_ROOT/examples/textless_nlp/gslm/asr_metrics/misc/dict.ltr.txt $MANIFEST_DIR
python $FAIRSEQ_ROOT/examples/textless_nlp/gslm/asr_metrics/misc/dummy_asr_data.py --tsv=$MANIFEST_DIR/train.tsv \
--output-dir=$MANIFEST_DIR
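For reference, the dummy labels only have to match the manifest row count and the usual fairseq label layout: train.wrd with one placeholder word sequence per utterance and train.ltr with the same content as space-separated letters, using "|" as the word boundary. A hypothetical sketch of that idea:
with open("manifest/train.tsv") as f:
    n_utts = sum(1 for _ in f) - 1                # first manifest line is the root directory
with open("manifest/train.wrd", "w") as wrd, open("manifest/train.ltr", "w") as ltr:
    for _ in range(n_utts):
        print("DUMMY", file=wrd)                  # placeholder word transcript
        print("D U M M Y |", file=ltr)            # same transcript as letters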
Now we are ready for running ASR:
mkdir -p asr
python $FAIRSEQ_ROOT/examples/speech_recognition/infer.py \
$MANIFEST_DIR \
--task audio_pretraining --nbest 1 --path 960h_scratch.pt \
--gen-subset=train --results-path $PATH_TO_ASR_OUTPUT \
--w2l-decoder kenlm --lm-model 4-gram.bin \
--lexicon librispeech/lexicon_ltr.lst --word-score -1 \
--sil-weight 0 --lm-weight 2 --criterion ctc --labels ltr --max-tokens 300000 --remove-bpe letter
where lexicon_ltr.lst is the LibriSpeech lexicon (it can be downloaded [here]) and $PATH_TO_ASR_OUTPUT is the output directory.
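The decoding run writes its word-level hypotheses to $PATH_TO_ASR_OUTPUT/hypo.word-960h_scratch.pt-train.txt, with one transcript per line and the sequence ID as the last token (this is what the --cut-tail flag below accounts for). A minimal, illustrative reader:
def load_hypotheses(path, cut_tail=True):
    transcripts = []
    with open(path) as f:
        for line in f:
            tokens = line.split()
            if cut_tail:
                tokens = tokens[:-1]              # drop the trailing sequence ID
            transcripts.append(" ".join(tokens))
    return transcripts

hypos = load_hypotheses("asr/hypo.word-960h_scratch.pt-train.txt")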
Evaluation metrics
We run evaluation on the 1,000 shortest sequences that are at least 6s long. To filter those from the ASR transcripts, we additionally provide each metric script with the paths to the manifest and the ground_truth_continuation_* files.
Perplexity (PPX)
To get a PPX metric estimate on an ASR transcript, you need to run the following command:
python ppx.py $PATH_TO_ASR_OUTPUT/hypo.word-960h_scratch.pt-train.txt --cut-tail\
--manifest=$MANIFEST_DIR/train.tsv --prompts-description=data/ground_truth_continuation_dev.json
where --cut-tail tells the script to ignore the last token on each line (ASR puts the sequence ID there).
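As an illustration of the metric itself (not of ppx.py, which uses its own language model and the filtering described above), perplexity is the exponentiated per-token negative log-likelihood of a transcript under a language model. A hedged sketch with an off-the-shelf causal LM from transformers:
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss           # mean per-token negative log-likelihood
    return math.exp(loss.item())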
Self- and Auto-BLEU
python self_bleu.py $PATH_TO_ASR_OUTPUT/hypo.word-960h_scratch.pt-train.txt --cut-tail \
--manifest=$MANIFEST_DIR/train.tsv --prompts-description=data/ground_truth_continuation_dev.json
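Conceptually, self-BLEU scores each transcript against all other transcripts as references (lower means more diverse generations), while auto-BLEU measures n-gram repetition inside a single transcript. A rough sketch of the self-BLEU part with NLTK (self_bleu.py implements the exact variants used for the metric):
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def self_bleu(transcripts, max_n=3):
    smooth = SmoothingFunction().method1
    weights = tuple(1.0 / max_n for _ in range(max_n))
    tokenized = [t.split() for t in transcripts]
    scores = []
    for i, hyp in enumerate(tokenized):
        refs = tokenized[:i] + tokenized[i + 1:]  # every other transcript is a reference
        scores.append(sentence_bleu(refs, hyp, weights=weights, smoothing_function=smooth))
    return sum(scores) / len(scores)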
Continuation-BLEU
python continuation_eval.py --asr-transcript $PATH_TO_ASR_OUTPUT/hypo.word-960h_scratch.pt-train.txt \
--manifest=$MANIFEST_DIR/train.tsv --prompts-description=data/ground_truth_continuation_dev.json
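The continuation metric compares each generated continuation against the ground-truth continuation of the same prompt. A rough corpus-level sketch (continuation_eval.py additionally aligns prompts through the manifest and the JSON description file):
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def continuation_bleu(generated, ground_truth):
    smooth = SmoothingFunction().method1
    references = [[ref.split()] for ref in ground_truth]    # one reference list per hypothesis
    hypotheses = [hyp.split() for hyp in generated]
    return corpus_bleu(references, hypotheses, smoothing_function=smooth)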
AUC
Based on the metrics calculated above, we can estimate the AUC of the perplexity/diversity trade-off. We provide an illustration in a Colab notebook.
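As a rough illustration of what the notebook computes, the trade-off curve is built from (perplexity, diversity) points obtained at different sampling temperatures and its area is estimated numerically; the values below are made up:
import numpy as np

ppx = np.array([120.0, 150.0, 200.0, 260.0])          # illustrative perplexities per temperature
self_bleu = np.array([0.45, 0.38, 0.33, 0.30])        # illustrative diversity scores
order = np.argsort(ppx)
auc = np.trapz(self_bleu[order], ppx[order])          # area under the trade-off curve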