# ASR-based evaluation
Overall, the life cycle of the ASR-based evaluation for a ULM contains the following steps:
1. Training a ULM and sampling from it [[description]](./../../ulm)
2. Running UTS on the sampled unit sequences [[description]](./../../unit2speech)
3. Pre-processing for the ASR (down-sampling to 16 kHz, aligning the length of the generated audio with that of the ground-truth utterances)
4. Running ASR
5. Calculation of the post-ASR evaluation metrics
Here we assume that you have already gone through the first two steps and focus on the rest.
## Preprocessing
### Down-sampling to 16 kHz
The bulk conversion can be done by running
```bash
python $FAIRSEQ_ROOT/examples/textless_nlp/gslm/unit2speech/convert_to_16k.py $UTS_OUTPUT $UTS_OUTPUT_DOWNSAMPLE
```
where `$UTS_OUTPUT` specifies the directory with the generated audio and `$UTS_OUTPUT_DOWNSAMPLE` is the directory where the down-sampled audio will be saved.
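For intuition, the per-file operation the bulk script performs amounts to resampling each waveform to 16 kHz; a minimal sketch with `torchaudio` (not the script's actual implementation) is:
```python
import torchaudio

def downsample_to_16k(in_path, out_path, target_sr=16_000):
    # Load the generated waveform at its original sampling rate.
    waveform, sr = torchaudio.load(in_path)
    if sr != target_sr:
        # Resample to 16 kHz, the rate expected by the wav2vec ASR model.
        resample = torchaudio.transforms.Resample(orig_freq=sr, new_freq=target_sr)
        waveform = resample(waveform)
    torchaudio.save(out_path, waveform, target_sr)
```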
### Matching by length
This step is optional; however, if you want to compare the fluency and diversity of a generated speech utterance to that of the ground-truth speech with the same prefix, it is a good idea to force them to be of the same length.
```bash
python $FAIRSEQ_ROOT/examples/textless_nlp/asr_metrics/cut_as.py \
--samples_dir=$UTS_OUTPUT_DOWNSAMPLE --out_dir=$UTS_OUTPUT_DOWNSAMPLE_CUT \
--prompts_description=data/ground_truth_continuation_dev.json
```
Here `ground_truth_continuation_dev.json` is a json file with ground-truth text from LibriSpeech dev-clean, associated with some meta-data (assuming the evaluation is done on dev-clean). This file can be downloaded [[here]](https://dl.fbaipublicfiles.com/textless_nlp/gslm/eval_data/ground_truth_continuation_dev.json). A similar file for the test-clean is [[here]](https://dl.fbaipublicfiles.com/textless_nlp/gslm/eval_data/ground_truth_continuation_test.json). These files are used for the evaluation and contain texts for audio sequences that are at least 6s long.
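For illustration, the core of the cutting step is simply trimming a generated waveform to the duration of the corresponding ground-truth continuation. A minimal sketch, assuming the target duration has already been looked up in the prompts file (the function and parameter names here are hypothetical, not the actual `cut_as.py` interface):
```python
import torchaudio

def cut_to_length(in_path, out_path, target_seconds, sr=16_000):
    # Keep only the first `target_seconds` of the generated utterance so that
    # generated and ground-truth audio cover the same time span.
    waveform, file_sr = torchaudio.load(in_path)
    assert file_sr == sr, "run the 16 kHz down-sampling step first"
    max_samples = int(target_seconds * sr)
    torchaudio.save(out_path, waveform[:, :max_samples], sr)
```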
## Running ASR
We use a pre-trained wav2vec model to run the ASR step. First, we need to prepare manifest files which, roughly, tell the ASR system which files we want to transcribe. You can find more details and download the `960h_scratch.pt` checkpoint
[[here]](https://github.com/pytorch/fairseq/blob/main/examples/wav2vec/README.md). To run ASR, you will also need to
install KenLM, the Flashlight decoder, and download the KenLM 4-gram English language model.
```bash
python $FAIRSEQ_ROOT/examples/wav2vec/wav2vec_manifest.py \
$UTS_OUTPUT_DOWNSAMPLE_CUT --valid-percent 0.0 --dest $MANIFEST_DIR --ext wav
```
where `$UTS_OUTPUT_DOWNSAMPLE_CUT` specifies the directory with the preprocessed UTS outputs and `$MANIFEST_DIR` is the output directory.
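If you ever need to build the manifest by hand, the format used by the wav2vec tooling is: the root audio directory on the first line, then one tab-separated `<relative path>` / `<number of samples>` entry per file. A rough sketch (using `soundfile` to count samples; this is a stand-in for `wav2vec_manifest.py`, not its actual code):
```python
import os
import soundfile as sf

def write_manifest(audio_dir, manifest_path):
    with open(manifest_path, "w") as out:
        # First line of the manifest is the root directory.
        out.write(audio_dir + "\n")
        for name in sorted(os.listdir(audio_dir)):
            if not name.endswith(".wav"):
                continue
            n_samples = sf.info(os.path.join(audio_dir, name)).frames
            # Each entry: <relative path>\t<number of samples>
            out.write(f"{name}\t{n_samples}\n")
```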
We will be running an out-of-the-box evaluation script which requires ground-truth transcripts to measure quality metrics. We are only
interested in the transcripts themselves (and there are no ground-truth transcripts for what our ULM generated!), hence we will just generate
some dummy transcripts instead:
```bash
cp $FAIRSEQ_ROOT/examples/textless_nlp/gslm/asr_metrics/misc/dict.ltr.txt $MANIFEST_DIR
python $FAIRSEQ_ROOT/examples/textless_nlp/gslm/asr_metrics/misc/dummy_asr_data.py --tsv=$MANIFEST_DIR/train.tsv \
--output-dir=$MANIFEST_DIR
```
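The dummy labels only have to be syntactically valid for the letter-level CTC criterion; a rough sketch of the kind of files such a helper produces (the placeholder text and exact file names are assumptions) is:
```python
def write_dummy_labels(tsv_path, output_dir, placeholder="DUMMY"):
    # One label line per audio entry in the manifest (its first line is the root dir).
    with open(tsv_path) as f:
        n_utterances = len(f.readlines()) - 1
    letters = " ".join(placeholder) + " |"  # e.g. "D U M M Y |"
    with open(f"{output_dir}/train.wrd", "w") as wrd, \
         open(f"{output_dir}/train.ltr", "w") as ltr:
        for _ in range(n_utterances):
            wrd.write(placeholder + "\n")   # word-level labels
            ltr.write(letters + "\n")       # letter-level labels, "|" marks word boundaries
```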
Now we are ready for running ASR:
```bash
mkdir -p asr
python $FAIRSEQ_ROOT/examples/speech_recognition/infer.py \
$MANIFEST_DIR \
--task audio_pretraining --nbest 1 --path 960h_scratch.pt \
--gen-subset=train --results-path $PATH_TO_ASR_OUTPUT \
--w2l-decoder kenlm --lm-model 4-gram.bin \
--lexicon librispeech/lexicon_ltr.lst --word-score -1 \
--sil-weight 0 --lm-weight 2 --criterion ctc --labels ltr --max-tokens 300000 --remove-bpe letter
```
where `lexicon_ltr.lst` is the LibriSpeech lexicon (can be downloaded [[here]](https://dl.fbaipublicfiles.com/textless_nlp/gslm/eval_data/lexicon_ltr.lst)) and `$PATH_TO_ASR_OUTPUT` is the output directory.
## Evaluation metrics
We run the evaluation on the 1,000 shortest sequences that are at least 6s long. To filter those from the ASR transcripts, we additionally provide each metric script with the paths to the manifest and the `ground_truth_continuation_*` files.
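Concretely, that filtering boils down to something like the sketch below, which derives durations from the sample counts stored in the manifest (a schematic illustration, not the metric scripts' actual code):
```python
def select_eval_utterances(manifest_path, min_seconds=6.0, n_keep=1_000, sr=16_000):
    durations = []
    with open(manifest_path) as f:
        next(f)  # skip the root-directory line
        for line in f:
            path, n_samples = line.rstrip("\n").split("\t")
            durations.append((int(n_samples) / sr, path))
    # Keep the n_keep shortest utterances that are still at least min_seconds long.
    long_enough = [d for d in durations if d[0] >= min_seconds]
    return sorted(long_enough)[:n_keep]
```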
### Perplexity (PPX)
To get a PPX metric estimate on an ASR transcript, you need to run the following command:
```bash
python ppx.py $PATH_TO_ASR_OUTPUT/hypo.word-960h_scratch.pt-train.txt --cut-tail \
--manifest=$MANIFEST_DIR/train.tsv --prompts-description=data/ground_truth_continuation_dev.json
```
where `--cut-tail` tells the script to ignore the last token on each line (ASR puts the sequence ID there).
### Self- and Auto-BLEU
```bash
python self_bleu.py $PATH_TO_ASR_OUTPUT/hypo.word-960h_scratch.pt-train.txt --cut-tail \
--manifest=$MANIFEST_DIR/train.tsv --prompts-description=data/ground_truth_continuation_dev.json
```
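For intuition: self-BLEU scores each transcript against all other transcripts (lower means the samples are more diverse across utterances), while auto-BLEU measures n-gram repetition within a single transcript. A toy sketch of the self-BLEU part with `nltk` (the actual script may use different n-gram weights and smoothing):
```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def self_bleu(transcripts, max_n=3):
    # Average BLEU of each transcript against all the others used as references.
    smooth = SmoothingFunction().method1
    weights = tuple(1.0 / max_n for _ in range(max_n))
    scores = []
    for i, hyp in enumerate(transcripts):
        refs = [t.split() for j, t in enumerate(transcripts) if j != i]
        scores.append(sentence_bleu(refs, hyp.split(), weights=weights,
                                    smoothing_function=smooth))
    return sum(scores) / len(scores)
```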
### Continuation-BLEU
```bash
python continuation_eval.py --asr-transcript $PATH_TO_ASR_OUTPUT/hypo.word-960h_scratch.pt-train.txt \
--manifest=$MANIFEST_DIR/train.tsv --prompts-description=data/ground_truth_continuation_dev.json
```
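Continuation-BLEU compares each generated continuation against the ground-truth continuation of the same prompt. A minimal sketch of that comparison (it assumes hypotheses and references have already been paired up by utterance; the real script handles this alignment itself):
```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def continuation_bleu(generated, ground_truth):
    # `generated` and `ground_truth` are aligned lists of transcripts for the same prompts.
    refs = [[gt.split()] for gt in ground_truth]
    hyps = [g.split() for g in generated]
    return corpus_bleu(refs, hyps, smoothing_function=SmoothingFunction().method1)
```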
### AUC
Based on the metrics calculated above, we can estimate the AUC of the perplexity/diversity trade-off. We provide an illustration in a [Colab notebook](https://colab.research.google.com/drive/1pVPfOVax_PU3MkYdHRSsa-SI8GBUldNt?usp=sharing).
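If you have swept a sampling temperature and collected (self-BLEU, perplexity) pairs, one simple estimate of that area is a trapezoidal integral over the resulting curve; the notebook does this more carefully, so treat the sketch below only as an illustration:
```python
import numpy as np

def ppx_diversity_auc(self_bleu_scores, perplexities):
    # Sort the operating points along the diversity axis and integrate the trade-off curve.
    order = np.argsort(self_bleu_scores)
    x = np.asarray(self_bleu_scores)[order]
    y = np.asarray(perplexities)[order]
    return np.trapz(y, x)
```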