Apply for community grant: Academic project (gpu)
Hi, dear HF team!
This is an open-source implementation of Microsoft's latest text-to-speech model, VALL-E X, from the paper "Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling". It is essentially a 24-layer, 1024-dimensional GPT-style model. I have made a demo page for this model for your inspection and consideration.
It takes about 60 s to synthesize 6 s of speech on the free CPU tier, but only 2 to 3 s on a single RTX 3060. I sincerely hope that you can grant GPU resources for this project's Hugging Face Space, so that more people have the chance to play with this awesome model.
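To put those timings in perspective, here is a rough sketch that turns them into real-time factors (synthesis time divided by audio duration) and a CPU-to-GPU speedup range. The function and variable names are illustrative only and are not part of the VALL-E X code; the numbers are the approximate figures quoted above, not fresh measurements.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return synthesis time divided by audio duration (below 1.0 is faster than real time)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Approximate figures from the post: a 6 s clip, ~60 s on CPU, 2 to 3 s on an RTX 3060.
AUDIO_SECONDS = 6.0
CPU_SECONDS = 60.0
GPU_SECONDS_BEST = 2.0
GPU_SECONDS_WORST = 3.0

cpu_rtf = real_time_factor(CPU_SECONDS, AUDIO_SECONDS)
gpu_rtf_best = real_time_factor(GPU_SECONDS_BEST, AUDIO_SECONDS)
gpu_rtf_worst = real_time_factor(GPU_SECONDS_WORST, AUDIO_SECONDS)

# Speedup of the GPU over the CPU for the same clip.
speedup_low = CPU_SECONDS / GPU_SECONDS_WORST
speedup_high = CPU_SECONDS / GPU_SECONDS_BEST

print(f"CPU real-time factor: {cpu_rtf:.2f}")
print(f"GPU real-time factor: {gpu_rtf_best:.2f} to {gpu_rtf_worst:.2f}")
print(f"GPU speedup over CPU: {speedup_low:.0f}x to {speedup_high:.0f}x")
```

In other words, the CPU runs roughly ten times slower than real time, while the GPU runs two to three times faster than real time, a speedup of about 20x to 30x.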
Best Regards!
Hello, great author:
I ran into a problem when using the downloaded pretrained model to generate speech directly, following the basic usage instructions in the documentation: the generated audio is just blank noise. The same happens with audio generated through the user interface started with python -X utf8 launch-ui.py. However, the online demo you linked produces normal results, and I don't understand what went wrong. Both Vallex_checkpoint.pt and the vocos models have been downloaded.