Add TTS Model: xVASynth
TTS name - xVASynth
Author - Dan Ruta (not me)
Model name - xVAPitch (v3 model, v2 is FastPitch IIRC)
Model link: https://huggingface.co/Pendrokar/xvapitch_nvidia (Note the Legal note)
Model License: CC-BY
TTS License: GPL-3
🤗 Space: https://huggingface.co/spaces/Pendrokar/xVASynth
Several questions:
- I hear you only synthesize female voices?
- Not only that, but American English voices?
- Can be either 22kHz/24kHz? [edit] ElevenLabs uses 44kHz
- Is post-synthesis super resolution to 44/48kHz allowed? (not used by Space)
- Is RVC allowed post synthesis? (not used by Space)
Sadly, the male voices in the xVASynth Space are better than most of the female voices. Except one, but that voice sounds British English. Will have to fetch the NVIDIA dataset to train a proper female American English voice. 🤔
The xVASynth Space in particular does not use or support CUDA. Loading a single model takes around 600 MB of RAM, or 2 GB of VRAM if normal xVASynth is run with CUDA. The 2-CPU-core Space hit a bottleneck once multiple people tried to use it at once, so CPU Upgrade had to be used on launch. The Real-Time Factor is close to 1.0 or lower even on CPU.
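For clarity, Real-Time Factor (RTF) here means synthesis wall-clock time divided by the duration of the audio produced, so RTF ≤ 1.0 means faster than real time. A minimal sketch of the calculation (the function name is illustrative, not from xVASynth):

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the generated audio.
    Values at or below 1.0 mean the engine keeps up with real time."""
    return synthesis_seconds / audio_seconds

# e.g. taking 1.8 s to synthesize 2.0 s of speech gives RTF 0.9
print(real_time_factor(1.8, 2.0))  # 0.9
```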
Gradio API defaults:

```python
from gradio_client import Client

client = Client("Pendrokar/xVASynth")
result = client.predict(
    "Oh, hello.",  # text to synthesize
    "ccby_nvidia_hifi_92_F",  # voice model
    "en",  # language
    1.0,  # duration is 1.0 by default, not 0.5
    0,  # pitch unused
    0.1,  # energy unused
    0,
    0,
    0,
    0,
    True,  # DeepMoji affects inference
    api_name="/predict"
)
```
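Assuming the `/predict` endpoint returns a local filepath to the generated WAV (the common behavior for Gradio audio outputs), the result can be copied somewhere permanent. This helper is a hypothetical sketch, not part of the Space:

```python
import shutil

def save_result(result_path: str, dest: str = "output.wav") -> str:
    """Copy the temporary file returned by the Gradio client to `dest`."""
    shutil.copy(result_path, dest)
    return dest
```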
This is no longer an issue for the 🤗 Space of the TTS now that an HF CPU-Upgrade grant has been given; an RTF below 1.0 is now guaranteed.
With the inclusion of VoiceCraft v2 into the TTS Arena, which @reach-vb admitted includes a non-permissive license, I am dumbfounded by the silence on adding xVASynth.
To clarify, the fine-tuned voice models quoted above are themselves made from datasets with a permissive license. It is the base model whose datasets include non-permissively licensed data. The fine-tuning is done on the base model.
Now I am not claiming that xVASynth can go toe to toe with StyleTTS 2 and XTTS, but the preliminary results for xVASynth on the cloned TTS Arena Space are not too shabby. Even if I do force xVASynth to be one of the chosen candidates.
https://huggingface.co/spaces/Pendrokar/TTS-Spaces-Arena
So... what is the issue, if VoiceCraft's non-permissive license wasn't an issue for it? Why are newer and lesser-known TTS models taking precedence for inclusion?
xVASynth base model would need to be retrained to compete with other TTS engines.