|
 |
|
|
|
IMS Toucan is a toolkit for teaching, training and using state-of-the-art Speech Synthesis models, developed at the **Institute for Natural Language Processing (IMS), University of Stuttgart, Germany**. Everything is pure Python and PyTorch based to keep it as simple and beginner-friendly as possible, while still being powerful.
|
|
|
The PyTorch modules of [Tacotron 2](https://arxiv.org/abs/1712.05884) and [FastSpeech 2](https://arxiv.org/abs/2006.04558) are taken from [ESPnet](https://github.com/espnet/espnet), and the PyTorch modules of [HiFiGAN](https://arxiv.org/abs/2010.05646) are taken from the [ParallelWaveGAN repository](https://github.com/kan-bayashi/ParallelWaveGAN), both of which are authored by the brilliant [Tomoki Hayashi](https://github.com/kan-bayashi).
|
|
|
For a version of the toolkit that includes TransformerTTS instead of Tacotron 2 and MelGAN instead of HiFiGAN, check out the TransformerTTS and MelGAN branches. They are kept separate to keep the code clean, simple and minimal.
|
|
|
--- |
|
|
|
## Contents |
|
|
|
- [New Features](#new-features)
- [Demonstration](#demonstration)
- [Installation](#installation)
  + [Basic Requirements](#basic-requirements)
  + [Speaker Embedding](#speaker-embedding)
  + [espeak-ng](#espeak-ng)
- [Creating a new Pipeline](#creating-a-new-pipeline)
  * [Build a HiFi-GAN Pipeline](#build-a-hifi-gan-pipeline)
  * [Build a FastSpeech 2 Pipeline](#build-a-fastspeech-2-pipeline)
- [Training a Model](#training-a-model)
- [Creating a new InferenceInterface](#creating-a-new-inferenceinterface)
- [Using a trained Model for Inference](#using-a-trained-model-for-inference)
- [FAQ](#faq)
- [Citation](#citation)
|
|
|
--- |
|
|
|
## New Features |
|
|
|
- [As shown in this paper](http://festvox.org/blizzard/bc2021/BC21_DelightfulTTS.pdf), vocoders can be used to perform super-resolution and spectrogram inversion simultaneously. We added this to our HiFi-GAN vocoder: it now takes 16kHz spectrograms as input, but produces 48kHz waveforms.
- We officially introduced IMS Toucan in [our contribution to the Blizzard Challenge 2021](http://festvox.org/blizzard/bc2021/BC21_IMS.pdf). Check out the bottom of the readme for a BibTeX entry.
- We now use articulatory representations of phonemes as the input for all models. This allows us to easily use multilingual data.
- We provide a checkpoint trained with [model-agnostic meta-learning](https://arxiv.org/abs/1703.03400) from which you should be able to fine-tune a model with very little data in almost any language.
- We now use a small self-contained Aligner that is trained with CTC, inspired by [this implementation](https://github.com/as-ideas/DeepForcedAligner). This removes the dependence on autoregressive models, so Tacotron 2 is no longer in this branch, but it is still present in other branches, just like TransformerTTS.
|
|
|
--- |
|
|
|
## Demonstration |
|
|
|
[Here are two sentences](https://drive.google.com/file/d/1ltAyR2EwAbmDo2hgkx1mvUny4FuxYmru/view?usp=sharing) |
|
produced by Tacotron 2 combined with HiFi-GAN, trained on |
|
[Nancy Krebs](https://www.cstr.ed.ac.uk/projects/blizzard/2011/lessac_blizzard2011/) using this toolkit. |
|
|
|
[Here is some speech](https://drive.google.com/file/d/1mZ1LvTlY6pJ5ZQ4UXZ9jbzB651mufBrB/view?usp=sharing) |
|
produced by FastSpeech 2 and MelGAN trained on [LJSpeech](https://keithito.com/LJ-Speech-Dataset/) |
|
using this toolkit. |
|
|
|
And [here is a sentence](https://drive.google.com/file/d/1FT49Jf0yyibwMDbsEJEO9mjwHkHRIGXc/view?usp=sharing) |
|
produced by TransformerTTS and MelGAN trained on [Thorsten](https://github.com/thorstenMueller/deep-learning-german-tts) |
|
using this toolkit. |
|
|
|
[Here is some speech](https://drive.google.com/file/d/14nPo2o1VKtWLPGF7e_0TxL8XGI3n7tAs/view?usp=sharing) |
|
produced by a multi-speaker FastSpeech 2 with MelGAN trained on |
|
[LibriTTS](https://research.google/tools/datasets/libri-tts/) using this toolkit. Fans of the videogame Portal may |
|
recognize who was used as the reference speaker for this utterance. |
|
|
|
[Interactive Demo of our entry to the Blizzard Challenge 2021.](https://colab.research.google.com/drive/1bRaySf8U55MRPaxqBr8huWrzCOzlxVqw) Note that this is based on an older version of the toolkit. It uses FastSpeech 2 with MelGAN as the vocoder and is trained on 5 hours of Spanish.
|
|
|
--- |
|
|
|
## Installation |
|
|
|
#### Basic Requirements |
|
|
|
To install this toolkit, clone it onto the machine you want to use it on (the machine should have at least one GPU if you intend to train models on it; for inference, you can get by without a GPU). Navigate into the cloned directory. We are going to create and activate a [conda virtual environment](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html) to install the basic requirements into. After creating the environment, the command you need to use to activate the virtual environment is displayed. The commands below show everything you need to do.
|
|
|
``` |
|
conda create --prefix ./toucan_conda_venv --no-default-packages python=3.8 |
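# activate the environment using the command conda prints after creation (typically: conda activate ./toucan_conda_venv)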
|
|
|
pip install --no-cache-dir -r requirements.txt |
|
|
|
pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html |
|
``` |
|
|
|
#### Speaker Embedding |
|
|
|
As [NVIDIA has shown](https://arxiv.org/pdf/2110.05798.pdf), you get better results by fine-tuning a pretrained model on a new speaker rather than by training a multispeaker model. We have thus dropped support for zero-shot multispeaker models that use speaker embeddings. However, we still use [Speechbrain's ECAPA-TDNN](https://huggingface.co/speechbrain/spkrec-ecapa-voxceleb) for a cycle consistency loss to make adapting to new speakers a bit faster.

In the current version of the toolkit, no further action should be required. Note, however, that the first time you use the multispeaker setup, an internet connection is required to download the pretrained models.
|
|
|
#### espeak-ng |
|
|
|
Finally, you need to have espeak-ng installed on your system, because it is used as the backend for the phonemizer. If you replace the phonemizer, you don't need it. On most Linux environments it will already be installed, and if it is not and you have sufficient rights, you can install it by simply running
|
|
|
``` |
|
apt-get install espeak-ng |
|
``` |
|
|
|
--- |
|
|
|
## Creating a new Pipeline |
|
|
|
To create a new pipeline to train a HiFiGAN vocoder, you only need a set of audio files. To create a new pipeline for FastSpeech 2, you need audio files, corresponding text labels, and an already trained Aligner model to estimate the duration information that FastSpeech 2 needs as input. Let's go through them in order of increasing complexity.
|
|
|
### Build a HiFi-GAN Pipeline |
|
|
|
In the directory called *Utility* there is a file called *file_lists.py*. In this file you should write a function that returns the absolute paths to all audio files in your dataset as a list of strings.
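For illustration, here is a minimal sketch of such a function; the dataset location and the `.wav` extension are placeholder assumptions. You can then import it in your pipeline copy as *get_file_list*, as described below.

```
# in Utility/file_lists.py -- a sketch, the dataset directory is a placeholder
from pathlib import Path


def get_file_list_my_dataset():
    # return the absolute path to every audio file in the dataset as a string
    return [str(wav.resolve()) for wav in sorted(Path("/data/my_dataset").glob("**/*.wav"))]
```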
|
|
|
Then go to the directory *TrainingInterfaces/TrainingPipelines*. In there, make a copy of any existing pipeline that has HiFiGAN in its name. We will use this copy as a reference and only make the changes necessary to use the new dataset. Import the function you have just written as *get_file_list*. Now look out for a variable called *model_save_dir*. This is the default directory that checkpoints will be saved into, unless you specify another one when calling the training script. Change it to whatever you like.
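A hypothetical example of the changed line (the directory name is just an example):

```
# in your copied HiFiGAN pipeline -- the directory name is a placeholder
import os

model_save_dir = os.path.join("Models", "HiFiGAN_MyDataset")
```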
|
|
|
Now you need to add your newly created pipeline to the pipeline dictionary in the file *run_training_pipeline.py* in the top level of the toolkit. In this file, import the *run* function from the pipeline you just created under a descriptive alias. Then, in the *pipeline_dict*, add your imported function as the value and use a shorthand that makes sense as the key. And just like that you're done.
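As a sketch, assuming the new pipeline file were called *HiFiGAN_MyDataset.py* (a placeholder name), the registration in *run_training_pipeline.py* could look roughly like this:

```
# in run_training_pipeline.py -- module name and shorthand are placeholders
from TrainingInterfaces.TrainingPipelines.HiFiGAN_MyDataset import run as hifigan_my_dataset

pipeline_dict = {
    # ... existing entries ...
    "hifi_mine": hifigan_my_dataset,
}
```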
|
|
|
### Build a FastSpeech 2 Pipeline |
|
|
|
In the directory called *Utility* there is a file called *path_to_transcript_dicts.py*. In this file you should write a function that returns a dictionary whose keys are the absolute paths to the audio files in your dataset (as strings) and whose values are the textual transcriptions of the corresponding audios.
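For illustration, here is a minimal sketch of such a function; the metadata format (a pipe-separated file in the style of LJSpeech) and all paths are assumptions you will need to adapt:

```
# in Utility/path_to_transcript_dicts.py -- a sketch, file layout and paths are placeholders
from pathlib import Path


def build_path_to_transcript_dict_my_dataset():
    root = Path("/data/my_dataset")
    path_to_transcript = dict()
    # assumes a metadata file with lines of the form "file_id|transcription"
    with open(root / "metadata.csv", encoding="utf8") as metadata:
        for line in metadata:
            file_id, transcription = line.strip().split("|", maxsplit=1)
            path_to_transcript[str((root / "wavs" / f"{file_id}.wav").resolve())] = transcription
    return path_to_transcript
```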
|
|
|
Then go to the directory *TrainingInterfaces/TrainingPipelines*. In there, make a copy of any existing pipeline that has FastSpeech 2 in its name. We will use this copy as a reference and only make the changes necessary to use the new dataset. Import the function you have just written as *build_path_to_transcript_dict*. Since the data will be processed a considerable amount, a cache will be built and saved as a file for quick and easy restarts. So find the variable *cache_dir* and adapt it to your needs. The same goes for the variable *save_dir*, which is where the checkpoints will be saved to. This is only a default value; you can overwrite it with a command line argument when calling the pipeline later, in case you want to fine-tune from a checkpoint and thus save into a different directory.
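For example, the adapted variables in your pipeline copy might look like this (the directory names are placeholders):

```
# in your copied FastSpeech 2 pipeline -- directory names are just examples
import os

cache_dir = os.path.join("Corpora", "my_dataset")            # where the preprocessed cache is stored
save_dir = os.path.join("Models", "FastSpeech2_MyDataset")   # default checkpoint directory
```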
|
|
|
In your new pipeline file, look out for the line in which the |
|
*acoustic_model* is loaded. Change the path to the checkpoint of an Aligner model. It can either be the one that is |
|
supplied with the toolkit in the download script, or one that you trained yourself. In the example pipelines, the one |
|
that we provide is finetuned to the dataset it is applied to before it is used to extract durations. |
|
|
|
Since we are using text here, we have to make sure that the text processing is adequate for the language. So check in *Preprocessing/TextFrontend* whether the TextFrontend already has a language ID (e.g. 'en' or 'de') for the language of your dataset. If not, you'll have to implement handling for it, but that should be fairly simple if you do it analogously to what is already there. Now, back in the pipeline, change the *lang* argument in the creation of the dataset and in the call to the train loop function to the language ID that matches your data.
|
|
|
Now navigate to the implementation of the |
|
*train_loop* that is called in the pipeline. In this file, find the function called |
|
*plot_progress_spec*. This function will produce spectrogram plots during training, which is the most important way to |
|
monitor the progress of the training. In there, you may need to add an example sentence for the language of the data you |
|
are using. It should all be pretty clear from looking at it. |
|
|
|
Once this is done, we are almost finished; we just need to make the new pipeline available to the *run_training_pipeline.py* file in the top level. In said file, import the *run* function from the pipeline you just created under a descriptive alias. Then, in the *pipeline_dict*, add your imported function as the value and use a shorthand that makes sense as the key, just as described for the HiFi-GAN pipeline above. And that's it.
|
|
|
--- |
|
|
|
## Training a Model |
|
|
|
Once you have a pipeline built, training is super easy. Just activate your virtual environment and run the command below. You might want to use something like nohup to keep it running after you log out from the server (then you should also add the -u option to python) and append an & to start it in the background. You might also want to redirect stdout and stderr into a file using >, but all of that is just standard shell usage and has nothing to do with the toolkit.
|
|
|
``` |
|
python run_training_pipeline.py <shorthand of the pipeline> |
|
``` |
|
|
|
You can supply any of the following arguments, but don't have to (although for training you should definitely specify at |
|
least a GPU ID). |
|
|
|
``` |
|
--gpu_id <ID of the GPU you wish to use, as displayed with nvidia-smi, default is cpu> |
|
|
|
--resume_checkpoint <path to a checkpoint to load> |
|
|
|
--resume (if this is present, the furthest checkpoint available will be loaded automatically) |
|
|
|
--finetune (if this is present, the provided checkpoint will be fine-tuned on the data from this pipeline) |
|
|
|
--model_save_dir <path to a directory where the checkpoints should be saved> |
|
``` |
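Putting the above together, a typical detached training run might look like this (the shorthand, GPU ID and save directory are placeholders):

```
nohup python -u run_training_pipeline.py <shorthand of the pipeline> --gpu_id 0 --model_save_dir Models/my_model > train.log 2>&1 &
```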
|
|
|
After every epoch, some logs will be written to the console. If the loss becomes NaN, you'll need to use a smaller |
|
learning rate or more warmup steps in the arguments of the call to the training_loop in the pipeline you are running. |
|
|
|
If you get CUDA out-of-memory errors, you need to decrease the batch size in the arguments of the call to the training_loop in the pipeline you are running. Try decreasing the batch size in small steps until you no longer run out of memory. Decreasing the batch size may also require you to use a smaller learning rate. The use of GroupNorm should make it so that training remains mostly stable nevertheless.
|
|
|
Speaking of plots: checkpoint files and self-explanatory visualization data will appear in the directory you specified for saving the model. Since the checkpoints are quite big, only the five most recent ones will be kept. Training will stop after 500,000 steps for FastSpeech 2, and after 2,500,000 steps for HiFiGAN. Depending on the machine and configuration you are using, this will take multiple days, so verify that everything works on small tests before running the big thing. If you want to stop earlier, just kill the process; since everything is daemonic, all the child processes should die with it. In case there are some ghost processes left behind, you can use the following command to find them and kill them manually.
|
|
|
``` |
|
fuser -v /dev/nvidia* |
|
``` |
|
|
|
After training is complete, it is recommended to run |
|
*run_weight_averaging.py*. If you made no changes to the architectures and stuck to the default directory layout, it |
|
will automatically load any models you produced with one pipeline, average their parameters to get a slightly more |
|
robust model and save the result as |
|
*best.pt* in the same directory where all the corresponding checkpoints lie. This also compresses the file size |
|
significantly, so you should do this and then use the |
|
*best.pt* model for inference. |
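Assuming the default directory layout, this is simply:

```
python run_weight_averaging.py
```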
|
|
|
--- |
|
|
|
## Creating a new InferenceInterface |
|
|
|
To build a new *InferenceInterface*, which you can then use for super simple inference, we're going to use an existing one as a template again. Make a copy of an existing *InferenceInterface*. Change the name of the class in the copy and change the paths to the models to use the trained models of your choice. Instantiate the model with the same hyperparameters that you used when you created it in the corresponding training pipeline. The last thing to check is the language that you supply to the text frontend: make sure it matches what you used during training.
|
|
|
With your newly created *InferenceInterface*, you can use your trained models pretty much anywhere, e.g. in other projects. All you need is the *Utility* directory, the *Layers* directory, the *Preprocessing* directory and the *InferenceInterfaces* directory (and of course your model checkpoint). That's all the code you need; it works standalone.
|
|
|
--- |
|
|
|
## Using a trained Model for Inference |
|
|
|
An |
|
*InferenceInterface* contains two useful methods. They are |
|
*read_to_file* and |
|
*read_aloud*. |
|
|
|
- *read_to_file* takes as input a list of strings and a filename. It will synthesize the sentences in the list, concatenate them with a short pause in between, and write the result to the filepath you supply as the other argument.
|
|
|
- *read_aloud* takes just a string, which it will then convert to speech and immediately play using the system's speakers. If you set the optional argument *view* to *True* when calling it, it will also show a plot of the phonemes it produced, the spectrogram it came up with, and the wave it created from that spectrogram. So all the representations can be seen: text to phoneme, phoneme to spectrogram, and finally spectrogram to wave. A short usage sketch of both methods follows this list.
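The sketch below assumes a hypothetical interface class named *MyModel_Interface*; check your own *InferenceInterface* copy for the exact class name and constructor arguments:

```
from InferenceInterfaces.MyModel_Interface import MyModel_Interface

tts = MyModel_Interface(device="cpu")  # hypothetical interface created as described above

# synthesize a list of sentences and write them to a file, with short pauses in between
tts.read_to_file(["Hello there!", "This is synthetic speech."], "test_read_to_file.wav")

# synthesize a single sentence, play it over the speakers and show the phoneme, spectrogram and wave plots
tts.read_aloud("Hello there!", view=True)
```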
|
|
|
Those methods are used in demo code in the toolkit. In *run_interactive_demo.py* and *run_text_to_file_reader.py*, you can import *InferenceInterfaces* that you created and add them to the dictionary in each of the files with a shorthand that makes sense. In the interactive demo, you can just call the python script, then type in the shorthand when prompted and immediately listen to your synthesis saying whatever you put in next (be wary of out-of-memory errors for very long inputs). In the text reader demo script you have to call the function that wraps around the *InferenceInterface* and supply the shorthand of your choice. It should be pretty clear from looking at it.
|
|
|
--- |
|
|
|
## FAQ |
|
|
|
Here are a few points that were brought up by users: |
|
|
|
- My error message shows GPU0, even though I specified a different GPU: The way GPU selection works is that the specified GPU is set as the only visible device, in order to avoid backend operations accidentally running on other GPUs. So internally the program will name the device GPU0, because it is the only GPU it can see. It is actually running on the GPU you specified.
|
|
|
--- |
|
|
|
This toolkit has been written by Florian Lux (except for the PyTorch modules taken from [ESPnet](https://github.com/espnet/espnet) and [ParallelWaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN), as mentioned above), so if you run into problems or have questions, feel free to [write a mail](mailto:[email protected]). Also let me know if you do something cool with it. Thank you for reading.
|
|
|
## Citation |
|
|
|
``` |
|
@inproceedings{lux2021toucan, |
|
title={{The IMS Toucan system for the Blizzard Challenge 2021}}, |
|
author={Florian Lux and Julia Koch and Antje Schweitzer and Ngoc Thang Vu}, |
|
year={2021}, |
|
booktitle={Proc. Blizzard Challenge Workshop}, |
|
volume={2021}, |
|
publisher={{Speech Synthesis SIG}} |
|
} |
|
``` |
|
|