---
license: mit
tags:
- vits
- vits istft
- istft
pipeline_tag: text-to-speech
---
# VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
VITS is an end-to-end speech synthesis model that predicts a speech waveform conditional on an input text sequence. It is a
conditional variational autoencoder (VAE) composed of a posterior encoder, decoder, and conditional prior. This repository
contains the weights for the official VITS checkpoint trained on the [LJ Speech](https://huggingface.co/datasets/lj_speech) dataset.
# VITS ISTFT
VITS ISTFT uses a new decoder that synthesizes speech as natural as that synthesized by VITS while achieving a real-time factor of 0.066 on an Intel Core i7 CPU, 4.1 times faster than the original VITS. This makes it suitable for real-time and edge-device applications.
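The real-time factor (RTF) quoted above is the time spent synthesizing divided by the duration of the audio produced, so values below 1 mean faster than real time. The following is a minimal sketch of that calculation. The helper name `real_time_factor` and the example numbers are illustrative assumptions, not measurements from this checkpoint.

```python
def real_time_factor(synthesis_seconds: float, num_samples: int, sampling_rate: int) -> float:
    """Return synthesis time divided by the duration of the generated audio."""
    if num_samples <= 0 or sampling_rate <= 0:
        raise ValueError("num_samples and sampling_rate must be positive")
    audio_seconds = num_samples / sampling_rate
    return synthesis_seconds / audio_seconds


# Hypothetical numbers: 0.33 s of compute for 5 s of audio at 22,050 Hz
# (110,250 samples) gives an RTF of about 0.066.
rtf = real_time_factor(synthesis_seconds=0.33, num_samples=110250, sampling_rate=22050)
print(f"RTF: {rtf:.3f}")
```

To measure it on your own hardware, one option is to wrap the model call in `time.perf_counter()` and pass the elapsed time together with the number of samples in the waveform.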
| Checkpoint | Train Hours | Speakers |
|------------|-------------|----------|
| [ljspeech_vits_ms_istft](https://huggingface.co/anhnct/ljspeech_vits_ms_istft) | 24 | 1 |
| [ljspeech_vits_mb_istft](https://huggingface.co/anhnct/ljspeech_vits_mb_istft) | 24 | 1 |
| [ljspeech_vits_istft](https://huggingface.co/anhnct/ljspeech_vits_istft) | 24 | 1 |
## Usage
To use this checkpoint, first install the latest version of the Transformers library:
```
pip install --upgrade transformers accelerate
```
Then run inference with the following code snippet:
```python
from transformers import AutoModel, AutoTokenizer
import torch
import numpy as np

# trust_remote_code=True allows loading custom model code shipped with the checkpoint
model = AutoModel.from_pretrained("anhnct/ljspeech_vits_mb_istft", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("anhnct/ljspeech_vits_mb_istft")

text = "Hey, it's Hugging Face on the phone"
inputs = tokenizer(text, return_tensors="pt")

# Disable gradient tracking for inference
with torch.no_grad():
    output = model(**inputs).waveform
```
The resulting waveform can be saved as a `.wav` file:
```python
import scipy.io.wavfile

# Convert the tensor to a NumPy array and drop singleton dimensions
data_np = output.numpy()
data_np_squeezed = np.squeeze(data_np)
scipy.io.wavfile.write("techno.wav", rate=model.config.sampling_rate, data=data_np_squeezed)
```
Or displayed in a Jupyter Notebook / Google Colab:
```python
from IPython.display import Audio
Audio(data_np_squeezed, rate=model.config.sampling_rate)
```
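When a floating-point array is passed to `scipy.io.wavfile.write`, the file is stored as floating-point WAV, which some audio players do not handle. If a 16-bit PCM file is preferred, the samples can be converted first. This is a minimal sketch that assumes the waveform is a float array roughly in the range [-1.0, 1.0]. The helper name `float_to_pcm16` is hypothetical and not part of this repository.

```python
import numpy as np


def float_to_pcm16(waveform: np.ndarray) -> np.ndarray:
    """Convert a float waveform (assumed to lie in [-1.0, 1.0]) to 16-bit PCM."""
    # Clip first so out-of-range samples saturate instead of wrapping around.
    clipped = np.clip(waveform, -1.0, 1.0)
    return (clipped * 32767.0).astype(np.int16)


# Small synthetic example; the last value is out of range and gets clipped.
samples = np.array([0.0, 0.5, -1.0, 1.5], dtype=np.float32)
pcm = float_to_pcm16(samples)
```

The converted array can then be passed as `data` to `scipy.io.wavfile.write` in place of `data_np_squeezed`.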
## License
The model is licensed under the [**MIT**](https://github.com/jaywalnut310/vits/blob/main/LICENSE) license.