|
 |
|
|
|
IMS Toucan is a toolkit for teaching, training and using state-of-the-art Speech Synthesis models, developed at the **Institute for Natural Language Processing (IMS), University of Stuttgart, Germany**. Everything is pure Python and PyTorch based to keep it as simple and beginner-friendly as possible, while still being powerful.
|
|
|
The PyTorch modules of [Tacotron 2](https://arxiv.org/abs/1712.05884) and [FastSpeech 2](https://arxiv.org/abs/2006.04558) are taken from [ESPnet](https://github.com/espnet/espnet), and the PyTorch modules of [HiFiGAN](https://arxiv.org/abs/2010.05646) are taken from the [ParallelWaveGAN repository](https://github.com/kan-bayashi/ParallelWaveGAN), both of which are authored by the brilliant [Tomoki Hayashi](https://github.com/kan-bayashi).
|
|
|
For a version of the toolkit that includes TransformerTTS instead of Tacotron 2 and MelGAN instead of HiFiGAN, check out the TransformerTTS and MelGAN branches. They are kept separate to keep the code clean, simple and minimal.
|
|
|
--- |
|
|
|
## Contents |
|
|
|
- [New Features](#new-features)
- [Demonstration](#demonstration)
- [Installation](#installation)
  + [Basic Requirements](#basic-requirements)
  + [Speaker Embedding](#speaker-embedding)
  + [espeak-ng](#espeak-ng)
- [Creating a new Pipeline](#creating-a-new-pipeline)
  * [Build a HiFi-GAN Pipeline](#build-a-hifi-gan-pipeline)
  * [Build a FastSpeech 2 Pipeline](#build-a-fastspeech-2-pipeline)
- [Training a Model](#training-a-model)
- [Creating a new InferenceInterface](#creating-a-new-inferenceinterface)
- [Using a trained Model for Inference](#using-a-trained-model-for-inference)
- [FAQ](#faq)
- [Citation](#citation)
|
|
|
--- |
|
|
|
## New Features |
|
|
|
- [As shown in this paper](http://festvox.org/blizzard/bc2021/BC21_DelightfulTTS.pdf), vocoders can be used to perform super-resolution and spectrogram inversion simultaneously. We added this to our HiFi-GAN vocoder: it now takes 16kHz spectrograms as input, but produces 48kHz waveforms.
- We officially introduced IMS Toucan in [our contribution to the Blizzard Challenge 2021](http://festvox.org/blizzard/bc2021/BC21_IMS.pdf). Check out the bottom of the readme for a BibTeX entry.
- We now use articulatory representations of phonemes as the input for all models. This allows us to easily use multilingual data.
- We provide a checkpoint trained with [model-agnostic meta-learning](https://arxiv.org/abs/1703.03400) from which you should be able to fine-tune a model with very little data in almost any language.
- We now use a small self-contained Aligner that is trained with CTC, inspired by [this implementation](https://github.com/as-ideas/DeepForcedAligner). This removes the dependence on autoregressive models, so Tacotron 2 is no longer in this branch, but it is still present in other branches, just like TransformerTTS.
|
|
|
--- |
|
|
|
## Demonstration |
|
|
|
[Here are two sentences](https://drive.google.com/file/d/1ltAyR2EwAbmDo2hgkx1mvUny4FuxYmru/view?usp=sharing) |
|
produced by Tacotron 2 combined with HiFi-GAN, trained on |
|
[Nancy Krebs](https://www.cstr.ed.ac.uk/projects/blizzard/2011/lessac_blizzard2011/) using this toolkit. |
|
|
|
[Here is some speech](https://drive.google.com/file/d/1mZ1LvTlY6pJ5ZQ4UXZ9jbzB651mufBrB/view?usp=sharing) |
|
produced by FastSpeech 2 and MelGAN trained on [LJSpeech](https://keithito.com/LJ-Speech-Dataset/) |
|
using this toolkit. |
|
|
|
And [here is a sentence](https://drive.google.com/file/d/1FT49Jf0yyibwMDbsEJEO9mjwHkHRIGXc/view?usp=sharing) |
|
produced by TransformerTTS and MelGAN trained on [Thorsten](https://github.com/thorstenMueller/deep-learning-german-tts) |
|
using this toolkit. |
|
|
|
[Here is some speech](https://drive.google.com/file/d/14nPo2o1VKtWLPGF7e_0TxL8XGI3n7tAs/view?usp=sharing) |
|
produced by a multi-speaker FastSpeech 2 with MelGAN trained on |
|
[LibriTTS](https://research.google/tools/datasets/libri-tts/) using this toolkit. Fans of the videogame Portal may |
|
recognize who was used as the reference speaker for this utterance. |
|
|
|
[Interactive Demo of our entry to the Blizzard Challenge 2021.](https://colab.research.google.com/drive/1bRaySf8U55MRPaxqBr8huWrzCOzlxVqw) Note that this is based on an older version of the toolkit. It uses FastSpeech 2 with MelGAN as the vocoder and is trained on 5 hours of Spanish.
|
|
|
--- |
|
|
|
## Installation |
|
|
|
#### Basic Requirements |
|
|
|
To install this toolkit, clone it onto the machine you want to use it on (the machine should have at least one GPU if you intend to train models on it; for inference, you can get by without a GPU). Navigate into the cloned directory. We are going to create and activate a [conda virtual environment](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html) to install the basic requirements into. After creating the environment, the command you need to use to activate the virtual environment is displayed. The commands below show everything you need to do.
|
|
|
``` |
|
conda create --prefix ./toucan_conda_venv --no-default-packages python=3.8 |
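# activate the environment using the command conda prints after creation (typically: conda activate ./toucan_conda_venv)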
|
|
|
pip install --no-cache-dir -r requirements.txt |
|
|
|
pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html |
|
``` |
|
|
|
#### Speaker Embedding |
|
|
|
As [NVIDIA has shown](https://arxiv.org/pdf/2110.05798.pdf), you get better results by fine-tuning a pretrained model on a new speaker rather than by training a multispeaker model. We have thus dropped support for zero-shot multispeaker models that use speaker embeddings. However, we still use [Speechbrain's ECAPA-TDNN](https://huggingface.co/speechbrain/spkrec-ecapa-voxceleb) for a cycle consistency loss to make adapting to new speakers a bit faster.

In the current version of the toolkit, no further action should be required. Note, however, that the first time you use the multispeaker setup, an internet connection is required to download the pretrained models.
|
|
|
#### espeak-ng |
|
|
|
Finally, you need to have espeak-ng installed on your system, because it is used as the backend for the phonemizer. If you replace the phonemizer, you don't need it. On most Linux environments it will already be installed, and if it is not and you have sufficient rights, you can install it by simply running
|
|
|
``` |
|
apt-get install espeak-ng |
|
``` |
|
|
|
--- |
|
|
|
## Creating a new Pipeline |
|
|
|
To create a new pipeline to train a HiFiGAN vocoder, you only need a set of audio files. To create a new pipeline for FastSpeech 2, you need audio files, corresponding text labels, and an already trained Aligner model to estimate the duration information that FastSpeech 2 needs as input. Let's go through them in order of increasing complexity.
|
|
|
### Build a HiFi-GAN Pipeline |
|
|
|
In the directory called *Utility* there is a file called *file_lists.py*. In this file you should write a function that returns the absolute paths to all audio files in your dataset as a list of strings.
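For illustration, here is a minimal sketch of such a function; the dataset location and the `.wav` extension are placeholder assumptions. You can then import it in your pipeline copy as *get_file_list*, as described below.

```
# in Utility/file_lists.py -- a sketch, the dataset directory is a placeholder
from pathlib import Path


def get_file_list_my_dataset():
    # return the absolute path to every audio file in the dataset as a string
    return [str(wav.resolve()) for wav in sorted(Path("/data/my_dataset").glob("**/*.wav"))]
```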
|
|
|
Then go to the directory *TrainingInterfaces/TrainingPipelines*. In there, make a copy of any existing pipeline that has HiFiGAN in its name. We will use this copy as a reference and only make the changes necessary to use the new dataset. Import the function you have just written as *get_file_list*. Now look out for a variable called *model_save_dir*. This is the default directory that checkpoints will be saved into, unless you specify another one when calling the training script. Change it to whatever you like.
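A hypothetical example of the changed line (the directory name is just an example):

```
# in your copied HiFiGAN pipeline -- the directory name is a placeholder
import os

model_save_dir = os.path.join("Models", "HiFiGAN_MyDataset")
```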
|
|
|
Now you need to add your newly created pipeline to the pipeline dictionary in the file *run_training_pipeline.py* in the top level of the toolkit. In this file, import the *run* function from the pipeline you just created under a descriptive alias. Then, in the *pipeline_dict*, add your imported function as the value and use a shorthand that makes sense as the key. And just like that you're done.
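As a sketch, assuming the new pipeline file were called *HiFiGAN_MyDataset.py* (a placeholder name), the registration in *run_training_pipeline.py* could look roughly like this:

```
# in run_training_pipeline.py -- module name and shorthand are placeholders
from TrainingInterfaces.TrainingPipelines.HiFiGAN_MyDataset import run as hifigan_my_dataset

pipeline_dict = {
    # ... existing entries ...
    "hifi_mine": hifigan_my_dataset,
}
```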
|
|
|
### Build a FastSpeech 2 Pipeline |
|
|
|
In the directory called *Utility* there is a file called *path_to_transcript_dicts.py*. In this file you should write a function that returns a dictionary whose keys are the absolute paths to the audio files in your dataset (as strings) and whose values are the textual transcriptions of the corresponding audios.
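For illustration, here is a minimal sketch of such a function; the metadata format (a pipe-separated file in the style of LJSpeech) and all paths are assumptions you will need to adapt:

```
# in Utility/path_to_transcript_dicts.py -- a sketch, file layout and paths are placeholders
from pathlib import Path


def build_path_to_transcript_dict_my_dataset():
    root = Path("/data/my_dataset")
    path_to_transcript = dict()
    # assumes a metadata file with lines of the form "file_id|transcription"
    with open(root / "metadata.csv", encoding="utf8") as metadata:
        for line in metadata:
            file_id, transcription = line.strip().split("|", maxsplit=1)
            path_to_transcript[str((root / "wavs" / f"{file_id}.wav").resolve())] = transcription
    return path_to_transcript
```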
|
|
|
Then go to the directory *TrainingInterfaces/TrainingPipelines*. In there, make a copy of any existing pipeline that has FastSpeech 2 in its name. We will use this copy as a reference and only make the changes necessary to use the new dataset. Import the function you have just written as *build_path_to_transcript_dict*. Since the data will be processed a considerable amount, a cache will be built and saved as a file for quick and easy restarts. So find the variable *cache_dir* and adapt it to your needs. The same goes for the variable *save_dir*, which is where the checkpoints will be saved to. This is only a default value; you can overwrite it with a command line argument when calling the pipeline later, in case you want to fine-tune from a checkpoint and thus save into a different directory.
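For example, the adapted variables in your pipeline copy might look like this (the directory names are placeholders):

```
# in your copied FastSpeech 2 pipeline -- directory names are just examples
import os

cache_dir = os.path.join("Corpora", "my_dataset")            # where the preprocessed cache is stored
save_dir = os.path.join("Models", "FastSpeech2_MyDataset")   # default checkpoint directory
```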
|
|
|
In your new pipeline file, look out for the line in which the |
|
*acoustic_model* is loaded. Change the path to the checkpoint of an Aligner model. It can either be the one that is |
|
supplied with the toolkit in the download script, or one that you trained yourself. In the example pipelines, the one |
|
that we provide is finetuned to the dataset it is applied to before it is used to extract durations. |
|
|
|
Since we are using text here, we have to make sure that the text processing is adequate for the language. So check in *Preprocessing/TextFrontend* whether the TextFrontend already has a language ID (e.g. 'en' or 'de') for the language of your dataset. If not, you'll have to implement handling for it, but that should be fairly simple if you do it analogously to what is already there. Now, back in the pipeline, change the *lang* argument in the creation of the dataset and in the call to the train loop function to the language ID that matches your data.
|
|
|
Now navigate to the implementation of the |
|
*train_loop* that is called in the pipeline. In this file, find the function called |
|
*plot_progress_spec*. This function will produce spectrogram plots during training, which is the most important way to |
|
monitor the progress of the training. In there, you may need to add an example sentence for the language of the data you |
|
are using. It should all be pretty clear from looking at it. |
|
|
|
Once this is done, we are almost finished; we just need to make the new pipeline available to the *run_training_pipeline.py* file in the top level. In said file, import the *run* function from the pipeline you just created under a descriptive alias. Then, in the *pipeline_dict*, add your imported function as the value and use a shorthand that makes sense as the key, just as described for the HiFi-GAN pipeline above. And that's it.
|
|
|
--- |
|
|
|
## Training a Model |
|
|
|
Once you have a pipeline built, training is super easy. Just activate your virtual environment and run the command below. You might want to use something like nohup to keep it running after you log out from the server (then you should also add the -u option to python) and append an & to start it in the background. You might also want to redirect stdout and stderr into a file using >, but all of that is just standard shell usage and has nothing to do with the toolkit.
|
|
|
``` |
|
python run_training_pipeline.py <shorthand of the pipeline> |
|
``` |
|
|
|
You can supply any of the following arguments, but don't have to (although for training you should definitely specify at |
|
least a GPU ID). |
|
|
|
``` |
|
--gpu_id <ID of the GPU you wish to use, as displayed with nvidia-smi, default is cpu> |
|
|
|
--resume_checkpoint <path to a checkpoint to load> |
|
|
|
--resume (if this is present, the furthest checkpoint available will be loaded automatically) |
|
|
|
--finetune (if this is present, the provided checkpoint will be fine-tuned on the data from this pipeline) |
|
|
|
--model_save_dir <path to a directory where the checkpoints should be saved> |
|
``` |
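Putting the above together, a typical detached training run might look like this (the shorthand, GPU ID and save directory are placeholders):

```
nohup python -u run_training_pipeline.py <shorthand of the pipeline> --gpu_id 0 --model_save_dir Models/my_model > train.log 2>&1 &
```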
|
|
|
After every epoch, some logs will be written to the console. If the loss becomes NaN, you'll need to use a smaller |
|
learning rate or more warmup steps in the arguments of the call to the training_loop in the pipeline you are running. |
|
|
|
If you get CUDA out-of-memory errors, you need to decrease the batch size in the arguments of the call to the training_loop in the pipeline you are running. Try decreasing the batch size in small steps until you no longer run out of memory. Decreasing the batch size may also require you to use a smaller learning rate. The use of GroupNorm should make it so that training remains mostly stable nevertheless.
|
|
|
Speaking of plots: checkpoint files and self-explanatory visualization data will appear in the directory you specified for saving the model. Since the checkpoints are quite big, only the five most recent ones will be kept. Training will stop after 500,000 steps for FastSpeech 2, and after 2,500,000 steps for HiFiGAN. Depending on the machine and configuration you are using, this will take multiple days, so verify that everything works on small tests before running the big thing. If you want to stop earlier, just kill the process; since everything is daemonic, all the child processes should die with it. In case there are some ghost processes left behind, you can use the following command to find them and kill them manually.
|
|
|
``` |
|
fuser -v /dev/nvidia* |
|
``` |
|
|
|
After training is complete, it is recommended to run |
|
*run_weight_averaging.py*. If you made no changes to the architectures and stuck to the default directory layout, it |
|
will automatically load any models you produced with one pipeline, average their parameters to get a slightly more |
|
robust model and save the result as |
|
*best.pt* in the same directory where all the corresponding checkpoints lie. This also compresses the file size |
|
significantly, so you should do this and then use the |
|
*best.pt* model for inference. |
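Assuming the default directory layout, this is simply:

```
python run_weight_averaging.py
```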
|
|
|
--- |
|
|
|
## Creating a new InferenceInterface |
|
|
|
To build a new *InferenceInterface*, which you can then use for super simple inference, we're going to use an existing one as a template again. Make a copy of an existing *InferenceInterface*. Change the name of the class in the copy and change the paths to the models to use the trained models of your choice. Instantiate the model with the same hyperparameters that you used when you created it in the corresponding training pipeline. The last thing to check is the language that you supply to the text frontend: make sure it matches what you used during training.
|
|
|
With your newly created *InferenceInterface*, you can use your trained models pretty much anywhere, e.g. in other projects. All you need is the *Utility* directory, the *Layers* directory, the *Preprocessing* directory and the *InferenceInterfaces* directory (and of course your model checkpoint). That's all the code you need; it works standalone.
|
|
|
--- |
|
|
|
## Using a trained Model for Inference |
|
|
|
An |
|
*InferenceInterface* contains two useful methods. They are |
|
*read_to_file* and |
|
*read_aloud*. |
|
|
|
- *read_to_file* takes as input a list of strings and a filename. It will synthesize the sentences in the list, concatenate them with a short pause in between, and write the result to the filepath you supply as the other argument.
|
|
|
- *read_aloud* takes just a string, which it will then convert to speech and immediately play using the system's speakers. If you set the optional argument *view* to *True* when calling it, it will also show a plot of the phonemes it produced, the spectrogram it came up with, and the wave it created from that spectrogram. So all the representations can be seen: text to phoneme, phoneme to spectrogram, and finally spectrogram to wave. A short usage sketch of both methods follows this list.
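The sketch below assumes a hypothetical interface class named *MyModel_Interface*; check your own *InferenceInterface* copy for the exact class name and constructor arguments:

```
from InferenceInterfaces.MyModel_Interface import MyModel_Interface

tts = MyModel_Interface(device="cpu")  # hypothetical interface created as described above

# synthesize a list of sentences and write them to a file, with short pauses in between
tts.read_to_file(["Hello there!", "This is synthetic speech."], "test_read_to_file.wav")

# synthesize a single sentence, play it over the speakers and show the phoneme, spectrogram and wave plots
tts.read_aloud("Hello there!", view=True)
```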
|
|
|
Those methods are used in demo code in the toolkit. In *run_interactive_demo.py* and *run_text_to_file_reader.py*, you can import *InferenceInterfaces* that you created and add them to the dictionary in each of the files with a shorthand that makes sense. In the interactive demo, you can just call the python script, then type in the shorthand when prompted and immediately listen to your synthesis saying whatever you put in next (be wary of out-of-memory errors for very long inputs). In the text reader demo script you have to call the function that wraps around the *InferenceInterface* and supply the shorthand of your choice. It should be pretty clear from looking at it.
|
|
|
--- |
|
|
|
## FAQ |
|
|
|
Here are a few points that were brought up by users: |
|
|
|
- My error message shows GPU0, even though I specified a different GPU: The way GPU selection works is that the specified GPU is set as the only visible device, in order to avoid backend operations accidentally running on other GPUs. So internally the program will name the device GPU0, because it is the only GPU it can see. It is actually running on the GPU you specified.
|
|
|
--- |
|
|
|
This toolkit has been written by Florian Lux (except for the PyTorch modules taken from [ESPnet](https://github.com/espnet/espnet) and [ParallelWaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN), as mentioned above), so if you run into problems or have questions, feel free to [write a mail](mailto:[email protected]). Also let me know if you do something cool with it. Thank you for reading.
|
|
|
## Citation |
|
|
|
``` |
|
@inproceedings{lux2021toucan, |
|
title={{The IMS Toucan system for the Blizzard Challenge 2021}}, |
|
author={Florian Lux and Julia Koch and Antje Schweitzer and Ngoc Thang Vu}, |
|
year={2021}, |
|
booktitle={Proc. Blizzard Challenge Workshop}, |
|
volume={2021}, |
|
publisher={{Speech Synthesis SIG}} |
|
} |
|
``` |
|
|