YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for Everyone
YourTTS is a model for zero-shot multi-speaker Text-to-Speech (TTS) and voice conversion, built on the VITS framework with novel modifications for multi-speaker and multilingual training. It achieves state-of-the-art results in zero-shot multi-speaker TTS and competitive performance in voice conversion on the VCTK dataset. Speakers whose voices or recording conditions differ substantially from those seen in training can also be adapted with less than a minute of fine-tuning data.
This repository provides the official VCTK checkpoint, aiming to offer a fair comparison and restore access to a full training-ready version of YourTTS.
Model Details
- Architecture: YourTTS builds upon the VITS model, integrating speaker consistency loss, multilingual data handling, and other customizations.
- Tasks Supported:
- Zero-shot multi-speaker TTS: Synthesize speech in the voice of a target speaker, even one unseen during training, from a short reference sample.
- Zero-shot voice conversion: Convert a source voice to sound like a target speaker.
- Training Data: Multilingual training was performed with more than 1,000 English speakers but only 5 French speakers and a single Portuguese speaker. Because the language batch balancer used during training allocates a significant portion of each batch to these few non-English speakers, the model can overfit them (see the sketch below).
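To make this imbalance concrete, here is an illustrative sketch (not the repository's actual sampler) of a language-level batch balancer: if every language receives roughly equal probability mass, the lone Portuguese speaker appears in about a third of all draws despite contributing a tiny fraction of the data.

```python
import random
from collections import Counter

# Hypothetical corpus mirroring the speaker counts above:
# ~1,000 English speakers, 5 French speakers, 1 Portuguese speaker.
samples = (
    [("en", f"en_spk_{i}") for i in range(1000)]
    + [("fr", f"fr_spk_{i}") for i in range(5)]
    + [("pt", "pt_spk_0")]
)

languages = sorted({lang for lang, _ in samples})
by_language = {lang: [s for s in samples if s[0] == lang] for lang in languages}

def sample_batch(batch_size=32):
    # Draw a language uniformly at random, then a sample within it, so each
    # language fills ~1/3 of the batch regardless of its speaker count.
    return [random.choice(by_language[random.choice(languages)]) for _ in range(batch_size)]

print(Counter(lang for lang, _ in sample_batch()))
# Roughly a third of the batch comes from the single Portuguese speaker.
```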
Important Notice
Previous Issues with Checkpoints
The original models were hosted at Edresson's GitHub repository, but the hosting server was wiped out, resulting in broken links. Additionally, a commercial version of YourTTS was made available through Coqui, but it only included the predictor, not the discriminator. This omission impacts training capabilities, leading to suboptimal results for some users who attempted fine-tuning.
Concerns About Previous Comparisons
Several recent works (Li et al., 2023; Wang et al., 2023; Kim et al., 2023) have compared their models against the multilingual checkpoint of YourTTS. However, the comparison may not be fair due to the limited speaker diversity in non-English languages during training. The imbalanced training can result in overfitting and reduced performance in English. Refer to Section 4.1 of the original paper for more details.
Usage
Text-to-Speech (TTS)
To run Text-to-Speech with the YourTTS model released in 🐸 TTS v0.7.0, use the following command:
tts --text "This is an example!" --model_name tts_models/multilingual/multi-dataset/your_tts --speaker_wav target_speaker_wav.wav --language_idx "en"
In this example, target_speaker_wav.wav
should be an audio sample from the target speaker.
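The same call can be made from Python through the high-level 🐸 TTS API. The snippet below assumes the `TTS.api.TTS` interface of recent releases; argument names may differ slightly in older versions.

```python
from TTS.api import TTS

# Download (on first use) and load the released YourTTS checkpoint.
tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts")

# Zero-shot synthesis: clone the voice from the reference sample.
tts.tts_to_file(
    text="This is an example!",
    speaker_wav="target_speaker_wav.wav",
    language="en",
    file_path="output.wav",
)
```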
Voice Conversion
To run voice conversion with the YourTTS model released in 🐸 TTS, use the following command:
```bash
tts --model_name tts_models/multilingual/multi-dataset/your_tts --speaker_wav target_speaker_wav.wav --reference_wav target_content_wav.wav --language_idx "en"
```
Here, `target_content_wav.wav` is the reference audio whose content you want to convert into the voice of the speaker in `target_speaker_wav.wav`.
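Voice conversion can also be scripted through the lower-level `Synthesizer` class. This is a rough sketch only: the keyword arguments mirror the CLI flags above, but the exact signatures of `Synthesizer` and of the model download helper may differ between 🐸 TTS releases.

```python
from TTS.utils.manage import ModelManager
from TTS.utils.synthesizer import Synthesizer

# Download (on first use) the released YourTTS checkpoint and its config.
manager = ModelManager()
model_path, config_path, _ = manager.download_model(
    "tts_models/multilingual/multi-dataset/your_tts"
)
synthesizer = Synthesizer(tts_checkpoint=model_path, tts_config_path=config_path)

# Convert the speech in target_content_wav.wav into the voice of the
# speaker in target_speaker_wav.wav, mirroring the CLI call above.
wav = synthesizer.tts(
    text="",
    language_name="en",
    speaker_wav="target_speaker_wav.wav",
    reference_wav="target_content_wav.wav",
)
synthesizer.save_wav(wav, "converted.wav")
```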
Fine-Tuning
YourTTS allows fine-tuning with as little as one minute of audio data, achieving high voice similarity and quality.
Audio Samples
Listen to audio samples here.
Training Instructions
To replicate Experiment 1 from the paper, use the provided training recipe in the Coqui TTS repository.
For fine-tuning:
- Download the speaker encoder and model checkpoints from this repository.
- Update the configuration (`config.json`) to point to your dataset and speaker embeddings.
- Start training:

```bash
python3 TTS/bin/train_tts.py --config_path config.json
```
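When fine-tuning from the downloaded weights rather than training from scratch, the checkpoint can usually be restored via the Trainer's `--restore_path` flag; the checkpoint filename below is only a placeholder for whichever file you downloaded from this repository.

```bash
# Restore the released YourTTS weights and continue training on your own data.
python3 TTS/bin/train_tts.py --config_path config.json --restore_path model_file.pth
```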
Checkpoints
The VCTK checkpoint is available under the CC BY-NC-ND 4.0 license.
Erratum
An error in the implementation of the speaker consistency loss was discovered after the paper's publication, affecting some fine-tuning experiments. The issue is fixed in Coqui TTS version 0.12.0 and later.
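For reference, the speaker consistency loss encourages the speaker-encoder embedding of the generated audio to match that of the ground-truth audio. The function below is an illustrative sketch of that idea, not the repository's implementation; the embeddings and the weighting factor `alpha` are assumed inputs.

```python
import torch
import torch.nn.functional as F

def speaker_consistency_loss(gt_embeddings: torch.Tensor,
                             gen_embeddings: torch.Tensor,
                             alpha: float) -> torch.Tensor:
    """Negative cosine similarity between speaker embeddings of ground-truth
    and generated audio, averaged over the batch and scaled by `alpha`."""
    cos_sim = F.cosine_similarity(gt_embeddings, gen_embeddings, dim=-1)
    return -alpha * cos_sim.mean()

# Usage: both embedding batches come from the (frozen) speaker encoder,
# applied to the ground-truth waveform and to the generated waveform.
```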
Citation
If you use this model, please cite the original paper:
@inproceedings{casanova2022yourtts,
title={YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for Everyone},
author={Casanova, Edresson and Weber, Julian and Shulby, Christopher D and others},
booktitle={International Conference on Machine Learning},
year={2022}
}