YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for Everyone
YourTTS is a model for zero-shot multi-speaker Text-to-Speech (TTS) and voice conversion, built on the VITS framework with novel modifications for multi-speaker and multilingual training. It achieves state-of-the-art results in zero-shot multi-speaker TTS and competitive performance in voice conversion on the VCTK dataset. Speakers whose voices or recording conditions differ substantially from those seen in training can also be adapted with less than a minute of fine-tuning data.
This repository provides the official VCTK checkpoint, aiming to offer a fair comparison and restore access to a full training-ready version of YourTTS.
Model Details
- Architecture: YourTTS builds upon the VITS model, integrating speaker consistency loss, multilingual data handling, and other customizations.
- Tasks Supported:
- Zero-shot multi-speaker TTS: Synthesize speech in the voice of a target speaker, even one unseen during training, from a short reference sample.
- Zero-shot voice conversion: Convert a source voice to sound like a target speaker.
- Training Data: Multilingual training was performed with more than 1,000 English speakers but only 5 French speakers and a single Portuguese speaker. Because the language batch balancer used during training allocates a significant portion of each batch to these few non-English speakers, the model can overfit them (see the sketch below).
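To make this imbalance concrete, here is an illustrative sketch (not the repository's actual sampler) of a language-level batch balancer: if every language receives roughly equal probability mass, the lone Portuguese speaker appears in about a third of all draws despite contributing a tiny fraction of the data.

```python
import random
from collections import Counter

# Hypothetical corpus mirroring the speaker counts above:
# ~1,000 English speakers, 5 French speakers, 1 Portuguese speaker.
samples = (
    [("en", f"en_spk_{i}") for i in range(1000)]
    + [("fr", f"fr_spk_{i}") for i in range(5)]
    + [("pt", "pt_spk_0")]
)

languages = sorted({lang for lang, _ in samples})
by_language = {lang: [s for s in samples if s[0] == lang] for lang in languages}

def sample_batch(batch_size=32):
    # Draw a language uniformly at random, then a sample within it, so each
    # language fills ~1/3 of the batch regardless of its speaker count.
    return [random.choice(by_language[random.choice(languages)]) for _ in range(batch_size)]

print(Counter(lang for lang, _ in sample_batch()))
# Roughly a third of the batch comes from the single Portuguese speaker.
```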
Important Notice
Previous Issues with Checkpoints
The original models were hosted at Edresson's GitHub repository, but the hosting server was wiped out, resulting in broken links. Additionally, a commercial version of YourTTS was made available through Coqui, but it only included the predictor, not the discriminator. This omission impacts training capabilities, leading to suboptimal results for some users who attempted fine-tuning.
Concerns About Previous Comparisons
Several recent works (Li et al., 2023; Wang et al., 2023; Kim et al., 2023) have compared their models against the multilingual checkpoint of YourTTS. However, the comparison may not be fair due to the limited speaker diversity in non-English languages during training. The imbalanced training can result in overfitting and reduced performance in English. Refer to Section 4.1 of the original paper for more details.
Usage
Text-to-Speech (TTS)
To run Text-to-Speech with the YourTTS model released in 🐸 TTS v0.7.0, use the following command:
tts --text "This is an example!" --model_name tts_models/multilingual/multi-dataset/your_tts --speaker_wav target_speaker_wav.wav --language_idx "en"
In this example, target_speaker_wav.wav
should be an audio sample from the target speaker.
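The same call can be made from Python through the high-level 🐸 TTS API. The snippet below assumes the `TTS.api.TTS` interface of recent releases; argument names may differ slightly in older versions.

```python
from TTS.api import TTS

# Download (on first use) and load the released YourTTS checkpoint.
tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts")

# Zero-shot synthesis: clone the voice from the reference sample.
tts.tts_to_file(
    text="This is an example!",
    speaker_wav="target_speaker_wav.wav",
    language="en",
    file_path="output.wav",
)
```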
Voice Conversion
To run voice conversion with the YourTTS model released in 🐸 TTS, use the following command:
```bash
tts --model_name tts_models/multilingual/multi-dataset/your_tts --speaker_wav target_speaker_wav.wav --reference_wav target_content_wav.wav --language_idx "en"
```
Here, `target_content_wav.wav` is the reference audio whose content you want to convert into the voice of the speaker in `target_speaker_wav.wav`.
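Voice conversion can also be scripted through the lower-level `Synthesizer` class. This is a rough sketch only: the keyword arguments mirror the CLI flags above, but the exact signatures of `Synthesizer` and of the model download helper may differ between 🐸 TTS releases.

```python
from TTS.utils.manage import ModelManager
from TTS.utils.synthesizer import Synthesizer

# Download (on first use) the released YourTTS checkpoint and its config.
manager = ModelManager()
model_path, config_path, _ = manager.download_model(
    "tts_models/multilingual/multi-dataset/your_tts"
)
synthesizer = Synthesizer(tts_checkpoint=model_path, tts_config_path=config_path)

# Convert the speech in target_content_wav.wav into the voice of the
# speaker in target_speaker_wav.wav, mirroring the CLI call above.
wav = synthesizer.tts(
    text="",
    language_name="en",
    speaker_wav="target_speaker_wav.wav",
    reference_wav="target_content_wav.wav",
)
synthesizer.save_wav(wav, "converted.wav")
```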
Fine-Tuning
YourTTS allows fine-tuning with as little as one minute of audio data, achieving high voice similarity and quality.
Audio Samples
Listen to audio samples here.
Training Instructions
To replicate Experiment 1 from the paper, use the provided training recipe in the Coqui TTS repository.
For fine-tuning:
- Download the speaker encoder and model checkpoints from this repository.
- Update the configuration (`config.json`) to point to your dataset and speaker embeddings.
- Start training:

```bash
python3 TTS/bin/train_tts.py --config_path config.json
```
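When fine-tuning from the downloaded weights rather than training from scratch, the checkpoint can usually be restored via the Trainer's `--restore_path` flag; the checkpoint filename below is only a placeholder for whichever file you downloaded from this repository.

```bash
# Restore the released YourTTS weights and continue training on your own data.
python3 TTS/bin/train_tts.py --config_path config.json --restore_path model_file.pth
```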
Checkpoints
The VCTK checkpoint is available under the CC BY-NC-ND 4.0 license.
Erratum
An error in the implementation of the speaker consistency loss was discovered after the paper's publication, affecting some fine-tuning experiments. The issue is fixed in Coqui TTS version 0.12.0 and later.
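For reference, the speaker consistency loss encourages the speaker-encoder embedding of the generated audio to match that of the ground-truth audio. The function below is an illustrative sketch of that idea, not the repository's implementation; the embeddings and the weighting factor `alpha` are assumed inputs.

```python
import torch
import torch.nn.functional as F

def speaker_consistency_loss(gt_embeddings: torch.Tensor,
                             gen_embeddings: torch.Tensor,
                             alpha: float) -> torch.Tensor:
    """Negative cosine similarity between speaker embeddings of ground-truth
    and generated audio, averaged over the batch and scaled by `alpha`."""
    cos_sim = F.cosine_similarity(gt_embeddings, gen_embeddings, dim=-1)
    return -alpha * cos_sim.mean()

# Usage: both embedding batches come from the (frozen) speaker encoder,
# applied to the ground-truth waveform and to the generated waveform.
```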
Citation
If you use this model, please cite the original paper:
@inproceedings{casanova2022yourtts,
title={YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for Everyone},
author={Casanova, Edresson and Weber, Julian and Shulby, Christopher D and others},
booktitle={International Conference on Machine Learning},
year={2022}
}