# MFA based extraction for FastSpeech
## Prepare
Run every command below from the main repository folder, `TensorFlowTTS/`.
0. Optional* Modify the MFA scripts to work with your language (https://montreal-forced-aligner.readthedocs.io/en/latest/pretrained_models.html)
1. Download the pretrained MFA model and lexicon, then extract the TextGrids:
- ```
bash examples/mfa_extraction/scripts/prepare_mfa.sh
```
- ```
python examples/mfa_extraction/run_mfa.py \
--corpus_directory ./libritts \
--output_directory ./mfa/parsed \
--jobs 8
```
After this step, the TextGrids are located in `./mfa/parsed`.
2. Extract durations from the TextGrid files:
- ```
python examples/mfa_extraction/txt_grid_parser.py \
--yaml_path examples/fastspeech2_libritts/conf/fastspeech2libritts.yaml \
--dataset_path ./libritts \
--text_grid_path ./mfa/parsed \
--output_durations_path ./libritts/durations \
--sample_rate 24000
```
- Dataset structure after finishing this step:
```
|- TensorFlowTTS/
| |- LibriTTS/
| |- |- train-clean-100/
| |- |- SPEAKERS.txt
| |- |- ...
| |- dataset/
| |- |- 200/
| |- |- |- 200_124139_000001_000000.txt
| |- |- |- 200_124139_000001_000000.wav
| |- |- |- ...
| |- |- 250/
| |- |- ...
| |- |- durations/
| |- |- train.txt
| |- tensorflow_tts/
| |- models/
| |- ...
```
3. Optional* Add your own dataset parser based on `tensorflow_tts/processor/experiment/example_dataset.py` (if the base dataset processor doesn't match your data).
4. Run preprocessing and normalization (steps 4 and 5 in `examples/fastspeech2_libritts/README.MD`).
5. Run `fix_mismatch.py` to fix differences of a few frames between the audio and duration files:
- ```
python examples/mfa_extraction/fix_mismatch.py \
--base_path ./dump \
--trimmed_dur_path ./dataset/trimmed-durations \
--dur_path ./dataset/durations
```
## Problems with MFA extraction
MFA seems to have problems with trimmed files; in my experiments it works better with ~100 ms of silence at the start and end of each file.
Short files can produce a lot of false positives, such as alignments that contain only silence (LibriTTS example), so I would keep only samples longer than 2 s.
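
Both workarounds could be applied before running MFA, roughly as in the sketch below. It works on waveforms already loaded as NumPy arrays; the helper names are hypothetical and not part of this repo. Padding with digital zeros is an assumption: simply trimming less aggressively, so that natural silence remains, may work just as well.

```python
import numpy as np


def pad_silence(wav, sample_rate, pad_ms=100):
    """Add pad_ms milliseconds of zeros to the start and end of a mono waveform."""
    pad = int(sample_rate * pad_ms / 1000)
    silence = np.zeros(pad, dtype=wav.dtype)
    return np.concatenate([silence, wav, silence])


def is_long_enough(wav, sample_rate, min_seconds=2.0):
    """Keep only utterances strictly longer than min_seconds."""
    return len(wav) / sample_rate > min_seconds


# Example with a synthetic 3-second waveform at 24 kHz
sample_rate = 24000
wav = np.ones(3 * sample_rate, dtype=np.float32)
if is_long_enough(wav, sample_rate):
    wav = pad_silence(wav, sample_rate)
```

If you pad the audio before alignment, remember that the extracted durations then include the padded frames, so the features must be computed from the same padded audio.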