---
license: mit
tags:
- vits
- vits istft
- istft
pipeline_tag: text-to-speech
---

# VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

VITS is an end-to-end speech synthesis model that predicts a speech waveform conditioned on an input text sequence. It is a 
conditional variational autoencoder (VAE) composed of a posterior encoder, a decoder, and a conditional prior. This repository 
contains the weights for the official VITS checkpoint trained on the [LJ Speech](https://huggingface.co/datasets/lj_speech) dataset. 

# VITS ISTFT

VITS ISTFT uses a new decoder that synthesizes speech as natural as that synthesized by VITS while achieving a real-time factor of 0.066 on an Intel Core i7 CPU, 4.1 times faster than the original VITS. This makes it suitable for real-time and edge-device applications.

| Checkpoint | Train Hours | Speakers |
|------------|-------------|----------|
| [ljspeech_vits_ms_istft](https://huggingface.co/anhnct/ljspeech_vits_ms_istft)   | 24          | 1        |
| [ljspeech_vits_mb_istft](https://huggingface.co/anhnct/ljspeech_vits_mb_istft)   | 24          | 1        |
| [ljspeech_vits_istft](https://huggingface.co/anhnct/ljspeech_vits_istft)   | 24          | 1        |
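
The real-time factor quoted above is the time spent synthesizing divided by the duration of the audio produced, so values below 1 mean faster than real time. The following is a minimal sketch of how it could be measured; `real_time_factor` and `timed_call` are hypothetical helper names, not part of this repository or of `transformers`, and the numbers in the example are illustrative rather than measured.

```python
import time


def real_time_factor(synthesis_seconds, num_samples, sampling_rate):
    """Return synthesis time divided by the duration of the generated audio."""
    if num_samples <= 0 or sampling_rate <= 0:
        raise ValueError("num_samples and sampling_rate must be positive")
    audio_seconds = num_samples / sampling_rate
    return synthesis_seconds / audio_seconds


def timed_call(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Illustrative numbers only: 2 seconds of audio at 22,050 Hz
# produced in 0.5 seconds of compute.
rtf = real_time_factor(0.5, 44100, 22050)
print(f"RTF = {rtf:.3f}")
```

In practice, `timed_call` would wrap the model's forward pass, and `num_samples` would be the length of the returned waveform. Averaging over several runs after a warm-up call gives a more stable figure.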

## Usage

To use this checkpoint, first install the latest versions of `transformers` and `accelerate`:

```
pip install --upgrade transformers accelerate
```

Then run inference with the following code snippet:

```python
from transformers import AutoModel, AutoTokenizer
import torch
import numpy as np

model = AutoModel.from_pretrained("anhnct/ljspeech_vits_mb_istft", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("anhnct/ljspeech_vits_mb_istft")

text = "Hey, it's Hugging Face on the phone"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    output = model(**inputs).waveform
```

The resulting waveform can be saved as a `.wav` file:

```python
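import numpy as np
import scipy.io.wavfile

# Remove singleton dimensions (e.g. the batch axis) so a 1-D mono signal is written.
data_np = output.numpy()
data_np_squeezed = np.squeeze(data_np)
scipy.io.wavfile.write("techno.wav", rate=model.config.sampling_rate, data=data_np_squeezed)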
```

Or displayed in a Jupyter Notebook / Google Colab:

```python
from IPython.display import Audio

Audio(data_np_squeezed, rate=model.config.sampling_rate)
```
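
When `scipy.io.wavfile.write` receives a floating-point array, it writes a floating-point WAV file. If a 16-bit PCM file is preferred instead, the sketch below shows one way to produce it with only the Python standard library. `write_pcm16_wav` is a hypothetical helper, not part of this repository, and it assumes the samples are mono floats nominally in the range [-1, 1].

```python
import struct
import wave


def write_pcm16_wav(path, samples, sampling_rate):
    """Write mono float samples (assumed in [-1, 1]) as a 16-bit PCM WAV file."""
    frames = bytearray()
    for sample in samples:
        # Clip out-of-range values, then scale to the signed 16-bit range.
        clipped = max(-1.0, min(1.0, float(sample)))
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)      # mono
        wav_file.setsampwidth(2)      # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(bytes(frames))
```

With the variables from the snippets above, this could be called as `write_pcm16_wav("techno_pcm16.wav", data_np_squeezed.tolist(), model.config.sampling_rate)`.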

## License

The model is released under the [**MIT**](https://github.com/jaywalnut310/vits/blob/main/LICENSE) license.