|
# Self-Training with Kaldi HMM Models |
|
This folder contains recipes for self-training on pseudo phone transcripts and
decoding into phones or words with [kaldi](https://github.com/kaldi-asr/kaldi).
|
|
|
To start, download and install kaldi following its instructions, and place this
folder in `path/to/kaldi/egs`.
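
For reference, a minimal setup could look like the following sketch; the clone
location and recipe path are placeholders, and kaldi's own `INSTALL` files
remain the authoritative build guide:

```
git clone https://github.com/kaldi-asr/kaldi
cd kaldi/tools && make            # see tools/INSTALL for prerequisites
cd ../src && ./configure && make depend && make
cd ../..

# place this recipe folder under egs/
cp -r /path/to/this/folder kaldi/egs/
```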
|
|
|
## Training |
|
Assuming the following have been prepared:
- `w2v_dir`: contains features `{train,valid}.{npy,lengths}`, real transcripts `{train,valid}.${label}`, and dict `dict.${label}.txt`
- `lab_dir`: contains pseudo labels `{train,valid}.txt`
- `arpa_lm`: Arpa-format n-gram phone LM for decoding
- `arpa_lm_bin`: Arpa-format n-gram phone LM, used with KenLM, for unsupervised model selection
|
|
|
Set these variables in `train.sh`, along with `out_dir` (the output directory),
and then run it.
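
A sketch of the variable block at the top of `train.sh`; all paths are
placeholders for your own data, and the script may define additional settings:

```
w2v_dir=/path/to/w2v_features      # {train,valid}.{npy,lengths}, transcripts, dict
lab_dir=/path/to/pseudo_labels     # {train,valid}.txt
arpa_lm=/path/to/phone_lm.arpa     # phone LM for decoding
arpa_lm_bin=/path/to/phone_lm.bin  # phone LM for model selection (KenLM)
out_dir=/path/to/out               # output directory
```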
|
|
|
The output will look like:
|
```
==== WER w.r.t. real transcript (select based on unsupervised metric)
INFO:root:./out/exp/mono/decode_valid/scoring/14.0.0.tra.txt: score 0.9178 wer 28.71% lm_ppl 24.4500 gt_wer 25.57%
INFO:root:./out/exp/tri1/decode_valid/scoring/17.1.0.tra.txt: score 0.9257 wer 26.99% lm_ppl 30.8494 gt_wer 21.90%
INFO:root:./out/exp/tri2b/decode_valid/scoring/8.0.0.tra.txt: score 0.7506 wer 23.15% lm_ppl 25.5944 gt_wer 15.78%
```
|
where `wer` is the word error rate with respect to the pseudo labels, `gt_wer`
the word error rate with respect to the ground-truth labels, `lm_ppl` the
language-model perplexity of the HMM-predicted transcripts, and `score` the
unsupervised metric for model selection. We choose the model and LM parameter
with the lowest score. In the example above, that is `tri2b` with `8.0.0`.
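
Because selection is simply by lowest `score`, the winning entry can be picked
out mechanically. A minimal sketch, assuming the selection lines above were
saved to a hypothetical `train.log`:

```
# sort numerically by the third field (the "score" value) and keep the smallest
grep ' score ' train.log | sort -k3,3g | head -n 1
```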
|
|
|
|
|
## Decoding into Phones |
|
In `decode_phone.sh`, set `out_dir` to the same value used in `train.sh`, and
set `dec_exp` and `dec_lmparam` to the selected model and LM parameter (e.g.
`tri2b` and `8.0.0` in the example above). `dec_script` must be set
according to `dec_exp`: for mono/tri1/tri2b, use `decode.sh`; for tri3b, use
`decode_fmllr.sh`.
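
Continuing the running example, the settings in `decode_phone.sh` would look
like this (the output path is a placeholder):

```
out_dir=/path/to/out   # same as in train.sh
dec_exp=tri2b          # selected model
dec_lmparam=8.0.0      # selected LM parameter
dec_script=decode.sh   # would be decode_fmllr.sh if dec_exp were tri3b
```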
|
|
|
The output will be saved at `out_dir/dec_data`.
|
|
|
|
|
## Decoding into Words |
|
`decode_word_step1.sh` prepares WFSTs for word decoding. Besides the variables
mentioned above, set the following (a sketch follows the list):
- `wrd_arpa_lm`: Arpa-format n-gram word LM for decoding
- `wrd_arpa_lm_bin`: Arpa-format n-gram word LM for unsupervised model selection
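
A sketch of these two settings, with placeholder paths:

```
wrd_arpa_lm=/path/to/word_lm.arpa      # word LM for decoding
wrd_arpa_lm_bin=/path/to/word_lm.bin   # word LM for unsupervised model selection
```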
|
|
|
`decode_word_step1.sh` decodes the `train` and `valid` splits into words and
runs unsupervised model selection using the `valid` split. The output looks like:
|
```
INFO:root:./out/exp/tri2b/decodeword_valid/scoring/17.0.0.tra.txt: score 1.8693 wer 24.97% lm_ppl 1785.5333 gt_wer 31.45%
```
|
|
|
After determining the LM parameter (`17.0.0` in the example above), set it in
`decode_word_step2.sh` and run it. The output will be saved at
`out_dir/dec_data_word`.
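
For the example above, step 2 amounts to the following sketch (the variable
name `dec_lmparam` is an assumption, mirroring the phone-decoding step; check
the script for the actual name):

```
# in decode_word_step2.sh, set the selected LM parameter:
dec_lmparam=17.0.0

# then run it; results go to $out_dir/dec_data_word
./decode_word_step2.sh
```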
|
|