---
license: mit
tags:
- vits
- vits istft
- istft
pipeline_tag: text-to-speech
---

# VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

VITS is an end-to-end speech synthesis model that predicts a speech waveform conditioned on an input text sequence. It is a
conditional variational autoencoder (VAE) composed of a posterior encoder, a decoder, and a conditional prior. This repository
contains the weights for the official VITS checkpoint trained on the [LJ Speech](https://huggingface.co/datasets/lj_speech) dataset.

# VITS ISTFT

The new decoder synthesizes speech as natural as that synthesized by VITS while achieving a real-time factor of 0.066 on an Intel Core i7 CPU, 4.1 times faster than the original VITS. This makes it suitable for real-time and edge-device applications.

| Checkpoint | Train Hours | Speakers |
|------------|-------------|----------|
| [ljspeech_vits_ms_istft](https://huggingface.co/anhnct/ljspeech_vits_ms_istft) | 24 | 1 |
| [ljspeech_vits_mb_istft](https://huggingface.co/anhnct/ljspeech_vits_mb_istft) | 24 | 1 |
| [ljspeech_vits_istft](https://huggingface.co/anhnct/ljspeech_vits_istft) | 24 | 1 |

## Usage

To use this checkpoint, first install the latest versions of the Transformers and Accelerate libraries:

```
pip install --upgrade transformers accelerate
```

Then, run inference with the following code snippet:

```python
from transformers import AutoModel, AutoTokenizer
import torch
import numpy as np

# trust_remote_code=True allows the custom model code shipped with the checkpoint to run
model = AutoModel.from_pretrained("anhnct/ljspeech_vits_mb_istft", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("anhnct/ljspeech_vits_mb_istft")

text = "Hey, it's Hugging Face on the phone"
inputs = tokenizer(text, return_tensors="pt")

# Disable gradient tracking for inference
with torch.no_grad():
    output = model(**inputs).waveform
```

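
The real-time factor quoted earlier is the synthesis time divided by the duration of the generated audio, so values below 1 mean faster than real time. The helper below is a minimal sketch for checking it on your own hardware; `real_time_factor` and its arguments are hypothetical names introduced here, not part of the checkpoint or of Transformers, and the commented usage assumes the waveform has shape `(1, num_samples)`.

```python
import time


def real_time_factor(synthesize, sampling_rate):
    """Time one call to `synthesize` and return (rtf, waveform).

    `synthesize` is any zero-argument callable that returns a 1-D
    sequence of audio samples (list, NumPy array, or tensor).
    """
    start = time.perf_counter()
    waveform = synthesize()
    elapsed = time.perf_counter() - start

    # Duration of the generated audio in seconds
    audio_seconds = len(waveform) / sampling_rate
    return elapsed / audio_seconds, waveform


# Hypothetical usage with the objects from the snippet above
# (assumes `waveform` has shape (1, num_samples)):
#
# with torch.no_grad():
#     rtf, _ = real_time_factor(
#         lambda: model(**inputs).waveform.squeeze(0),
#         model.config.sampling_rate,
#     )
# print(f"real-time factor: {rtf:.3f}")
```

A single call includes warm-up overhead, so averaging several runs will likely give a more representative figure.
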
The resulting waveform can be saved as a `.wav` file:

```python
import scipy.io.wavfile

# Remove singleton dimensions (e.g. the batch axis) before writing
data_np = output.numpy()
data_np_squeezed = np.squeeze(data_np)
scipy.io.wavfile.write("techno.wav", rate=model.config.sampling_rate, data=data_np_squeezed)
```

Or displayed in a Jupyter Notebook / Google Colab:

```python
from IPython.display import Audio

Audio(data_np_squeezed, rate=model.config.sampling_rate)
```

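
When SciPy is given a float array it writes a floating-point WAV file, which some players and tools may not accept. The sketch below converts the samples to 16-bit PCM first. It assumes the waveform values are floats nominally in the range [-1, 1]; `float_to_int16` and the output filename in the commented usage are hypothetical names introduced for illustration.

```python
import numpy as np


def float_to_int16(waveform):
    """Convert a float waveform in [-1.0, 1.0] to 16-bit PCM samples."""
    # Clip first so out-of-range samples saturate instead of wrapping around
    clipped = np.clip(np.asarray(waveform, dtype=np.float32), -1.0, 1.0)
    return (clipped * 32767.0).astype(np.int16)


# Hypothetical usage with the array from the saving snippet above:
#
# import scipy.io.wavfile
# scipy.io.wavfile.write(
#     "techno_int16.wav",
#     rate=model.config.sampling_rate,
#     data=float_to_int16(data_np_squeezed),
# )
```
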
## License

The model is released under the [**MIT**](https://github.com/jaywalnut310/vits/blob/main/LICENSE) license.