# MFA-based extraction for FastSpeech

## Prepare

Run every command below from the root of the repository (`TensorFlowTTS/`).

0. Optional: modify the MFA scripts to work with your language (see the pretrained models at https://montreal-forced-aligner.readthedocs.io/en/latest/pretrained_models.html).

1. Download the pretrained MFA model and the lexicon, then extract the TextGrids:

- ```
  bash examples/mfa_extraction/scripts/prepare_mfa.sh
  ```

- ```
  python examples/mfa_extraction/run_mfa.py \
    --corpus_directory ./libritts \
    --output_directory ./mfa/parsed \
    --jobs 8
  ```

After this step, the TextGrids are located in `./mfa/parsed`.

2. Extract durations from the TextGrid files:

- ```
  python examples/mfa_extraction/txt_grid_parser.py \
    --yaml_path examples/fastspeech2_libritts/conf/fastspeech2libritts.yaml \
    --dataset_path ./libritts \
    --text_grid_path ./mfa/parsed \
    --output_durations_path ./libritts/durations \
    --sample_rate 24000
  ```

- Dataset structure after finishing this step:

  ```
  |- TensorFlowTTS/
  |   |- LibriTTS/
  |   |-  |- train-clean-100/
  |   |-  |- SPEAKERS.txt
  |   |-  |- ...
  |   |- dataset/
  |   |-  |- 200/
  |   |-  |-  |- 200_124139_000001_000000.txt
  |   |-  |-  |- 200_124139_000001_000000.wav
  |   |-  |-  |- ...
  |   |-  |- 250/
  |   |-  |- ...
  |   |-  |- durations/
  |   |-  |- train.txt
  |   |- tensorflow_tts/
  |   |- models/
  |   |- ...
  ```

3. Optional: add your own dataset parser based on `tensorflow_tts/processor/experiment/example_dataset.py` (if the base dataset processor does not match your dataset).

4. Run preprocessing and normalization (steps 4 and 5 in `examples/fastspeech2_libritts/README.MD`).

5. Run `fix_mismatch.py` to fix the few-frame differences between the audio and duration files:

- ```
  python examples/mfa_extraction/fix_mismatch.py \
    --base_path ./dump \
    --trimmed_dur_path ./dataset/trimmed-durations \
    --dur_path ./dataset/durations
  ```

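Steps 2 and 5 above can be hard to picture from the commands alone. The following is a minimal sketch of the idea, not the code in `txt_grid_parser.py` or `fix_mismatch.py`: aligned intervals (in seconds) are converted to frame counts, and a small mismatch against the mel-spectrogram length is then absorbed at the end of the utterance. The function names, the `hop_size` of 300 samples, and the rule of adjusting the last phonemes are all assumptions made for illustration; take the real hop size from your YAML config.

```python
def intervals_to_durations(intervals, sample_rate=24000, hop_size=300):
    """Convert (start_sec, end_sec) intervals to per-phoneme frame counts.

    hop_size=300 is an assumed value, not read from the repository config.
    """
    frames_per_second = sample_rate / hop_size
    # Round the cumulative boundaries instead of each length, so rounding
    # errors do not accumulate across phonemes.
    boundaries = [intervals[0][0]] + [end for _, end in intervals]
    frame_boundaries = [round(b * frames_per_second) for b in boundaries]
    return [b - a for a, b in zip(frame_boundaries, frame_boundaries[1:])]


def fix_mismatch(durations, n_mel_frames):
    """Return a copy of durations that sums to n_mel_frames (hypothetical rule)."""
    durations = list(durations)
    diff = n_mel_frames - sum(durations)
    if diff >= 0:
        # Durations are too short: extend the last phoneme.
        durations[-1] += diff
        return durations
    # Durations are too long: remove frames from the end, never below zero.
    excess = -diff
    for i in range(len(durations) - 1, -1, -1):
        cut = min(durations[i], excess)
        durations[i] -= cut
        excess -= cut
        if excess == 0:
            break
    return durations
```

For example, `intervals_to_durations([(0.0, 0.1), (0.1, 0.25), (0.25, 0.5)])` works at 80 frames per second under these assumptions, and `fix_mismatch` can then reconcile the result with the number of mel frames produced by preprocessing.
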
## Problems with MFA extraction

MFA appears to have problems with trimmed files; in my experiments it works better with ~100 ms of silence at the start and end of each file.

Short files can produce many false positives, such as alignments that contain only silence (seen with LibriTTS), so I would keep only samples longer than 2 s.

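Both workarounds can be applied before running MFA. The sketch below is one possible way to do it, assuming signed PCM WAV input (16-bit or wider) and using only the standard-library `wave` module. `prepare_for_mfa` is a hypothetical helper, not part of this repository; the 2 s and 100 ms defaults simply mirror the numbers above. Padding with digital silence is also an assumption: trimming less aggressively in the first place may work just as well.

```python
import wave


def prepare_for_mfa(src_path, dst_path, min_seconds=2.0, pad_seconds=0.1):
    """Copy a PCM WAV file with silence added at both ends.

    Returns False (and writes nothing) when the clip is not longer than
    min_seconds, so the caller can skip it.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames = src.readframes(params.nframes)

    # Drop clips that are too short to align reliably.
    if params.nframes / params.framerate <= min_seconds:
        return False

    # For signed PCM (16-bit and wider), zero bytes are digital silence.
    bytes_per_frame = params.sampwidth * params.nchannels
    pad_frames = round(pad_seconds * params.framerate)
    silence = b"\x00" * (pad_frames * bytes_per_frame)

    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)
        # The wave module corrects the frame count in the header on close.
        dst.writeframes(silence + frames + silence)
    return True
```

A caller would loop over the corpus, write the padded copies into the directory passed as `--corpus_directory`, and leave out every file for which the function returns `False`.
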