|
--- |
|
license: apache-2.0 |
|
language: |
|
- zh |
|
library_name: transformers.js |
|
pipeline_tag: text-to-speech |
|
--- |
|
|
|
# VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech |
|
|
|
VITS is an end-to-end speech synthesis model that predicts a speech waveform conditioned on an input text sequence. It is a conditional variational autoencoder (VAE) composed of a posterior encoder, a decoder, and a conditional prior.
|
|
|
## Model Details |
|
|
|
- **Language:** Chinese

- **Dataset:** THCHS-30

- **Speakers:** 44

- **Training Hours:** 48
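
These numbers can be cross-checked against the checkpoint's configuration. A minimal sketch, assuming the exported config populates `num_speakers` and `sampling_rate`:

```py
from transformers import VitsConfig

# Load only the configuration (no weights) to inspect the checkpoint metadata
config = VitsConfig.from_pretrained("BricksDisplay/vits-cmn")

print(config.num_speakers)   # expected to match the 44 speakers listed above
print(config.sampling_rate)  # 16 kHz output
```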
|
|
|
## Usage |
|
|
|
Using this checkpoint with Hugging Face Transformers:
|
|
|
```py |
|
from transformers import VitsModel, VitsTokenizer |
|
from pypinyin import lazy_pinyin, Style |
|
import torch |
|
|
|
model = VitsModel.from_pretrained("BricksDisplay/vits-cmn") |
|
tokenizer = VitsTokenizer.from_pretrained("BricksDisplay/vits-cmn") |
|
|
|
text = "中文"

# Convert the Chinese text to pinyin with tone marks before tokenization
payload = ''.join(lazy_pinyin(text, style=Style.TONE, tone_sandhi=True))
|
inputs = tokenizer(payload, return_tensors="pt") |
|
|
|
with torch.no_grad(): |
|
output = model(**inputs, speaker_id=0) |
|
|
|
from IPython.display import Audio

# Listen to the generated waveform (16 kHz) in a notebook
Audio(output.waveform[0], rate=16000)
|
``` |
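
To save the generated speech to a file instead of playing it inline, here is a minimal sketch continuing from the snippet above and using `scipy` (the filename `output.wav` is just an example):

```py
import scipy.io.wavfile

# Write the waveform as a 16 kHz mono WAV file
scipy.io.wavfile.write(
    "output.wav",
    rate=model.config.sampling_rate,
    data=output.waveform[0].float().numpy(),
)
```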
|
|
|
Using this checkpoint with Transformers.js:
|
|
|
```js |
|
import { pipeline } from '@xenova/transformers'; |
|
import { pinyin } from 'pinyin-pro'; // pinyin conversion; this example uses `pinyin-pro`
|
|
|
const synthesizer = await pipeline('text-to-audio', 'BricksDisplay/vits-cmn', { quantized: false });

console.log(await synthesizer(pinyin("中文")));
|
// { |
|
// audio: Float32Array(?) [ ... ], |
|
// sampling_rate: 16000 |
|
// } |
|
``` |
|
|
|
Note: The Transformers.js (ONNX) version does not support `speaker_id`, so the speaker is fixed to 0.
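
With the Python API, other speakers can still be selected via `speaker_id`. A minimal sketch, continuing from the Python example above and assuming speaker IDs simply run from 0 to 43:

```py
import torch

# Assumption: speaker IDs run from 0 to num_speakers - 1 (44 speakers listed above)
for speaker_id in range(3):
    with torch.no_grad():
        waveform = model(**inputs, speaker_id=speaker_id).waveform
    print(speaker_id, waveform.shape)
```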