---
license: mit
tags:
- vits
- vits istft
- istft
pipeline_tag: text-to-speech
---
# VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
VITS is an end-to-end speech synthesis model that predicts a speech waveform conditioned on an input text sequence. It is a
conditional variational autoencoder (VAE) composed of a posterior encoder, a decoder, and a conditional prior. This repository
contains the weights for the official VITS checkpoint trained on the [LJ Speech](https://huggingface.co/datasets/lj_speech) dataset.
# VITS ISTFT
VITS ISTFT uses a new decoder that synthesizes speech as natural as that synthesized by VITS while achieving a real-time factor of 0.066 on an Intel Core i7 CPU, 4.1 times faster than the original VITS. This makes it suitable for real-time and edge-device applications.

| Checkpoint | Train Hours | Speakers |
|------------|-------------|----------|
| [ljspeech_vits_ms_istft](https://huggingface.co/anhnct/ljspeech_vits_ms_istft) | 24 | 1 |
| [ljspeech_vits_mb_istft](https://huggingface.co/anhnct/ljspeech_vits_mb_istft) | 24 | 1 |
| [ljspeech_vits_istft](https://huggingface.co/anhnct/ljspeech_vits_istft) | 24 | 1 |
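
The real-time factor (RTF) quoted above is the ratio of synthesis time to the duration of the audio produced, so values below 1 mean faster than real time. The sketch below shows the arithmetic only. The function name `real_time_factor` is hypothetical, and the 22050 Hz sampling rate and timing figures are illustrative assumptions, not measurements from this checkpoint.

```python
def real_time_factor(elapsed_seconds: float, num_samples: int, sampling_rate: int) -> float:
    """Return synthesis time divided by the duration of the generated audio."""
    if num_samples <= 0 or sampling_rate <= 0:
        raise ValueError("num_samples and sampling_rate must be positive")
    audio_seconds = num_samples / sampling_rate
    return elapsed_seconds / audio_seconds


# Illustrative numbers (assumed): 0.066 s spent generating 1 s of audio at 22050 Hz
rtf = real_time_factor(elapsed_seconds=0.066, num_samples=22050, sampling_rate=22050)
print(f"RTF = {rtf:.3f}")
```

To measure it on your own hardware, one option is to wrap the model call in `time.perf_counter()` and pass the number of samples in the generated waveform along with `model.config.sampling_rate`.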
## Usage
To use this checkpoint, first install the latest versions of Transformers and Accelerate:
```
pip install --upgrade transformers accelerate
```
Then, run inference with the following code snippet:
```python
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

# trust_remote_code=True executes the model code shipped in the checkpoint repository
model = AutoModel.from_pretrained("anhnct/ljspeech_vits_mb_istft", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("anhnct/ljspeech_vits_mb_istft")

text = "Hey, it's Hugging Face on the phone"
inputs = tokenizer(text, return_tensors="pt")

# Inference only, so disable gradient tracking
with torch.no_grad():
    output = model(**inputs).waveform
```
The resulting waveform can be saved as a `.wav` file:
```python
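import scipy.io.wavfile  # import the submodule explicitly; plain `import scipy` may not load it

# Remove singleton dimensions (such as a batch axis) to get a 1-D mono signal
data_np = output.numpy()
data_np_squeezed = np.squeeze(data_np)
scipy.io.wavfile.write("techno.wav", rate=model.config.sampling_rate, data=data_np_squeezed)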
```
Or displayed in a Jupyter Notebook / Google Colab:
```python
from IPython.display import Audio
Audio(data_np_squeezed, rate=model.config.sampling_rate)
```
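When given a floating-point array, `scipy.io.wavfile.write` produces a floating-point WAV file. If a 16-bit PCM file is preferred, the samples can be converted first. This is a minimal sketch that assumes the waveform values lie roughly in the range [-1, 1]. The helper name `float_to_int16` is hypothetical and not part of this repository.

```python
import numpy as np


def float_to_int16(waveform: np.ndarray) -> np.ndarray:
    """Convert a float waveform (assumed to lie in [-1, 1]) to 16-bit PCM samples."""
    # Clip first so out-of-range samples saturate instead of wrapping around
    clipped = np.clip(waveform, -1.0, 1.0)
    return (clipped * 32767.0).astype(np.int16)


# Small synthetic example; the last sample is deliberately out of range
samples = np.array([0.0, 0.5, -1.0, 1.5], dtype=np.float32)
pcm = float_to_int16(samples)
print(pcm.dtype, pcm)
```

The converted array can then be passed as `data` to `scipy.io.wavfile.write` in place of the float array.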
## License
The model is released under the [**MIT**](https://github.com/jaywalnut310/vits/blob/main/LICENSE) license.