Christoph Minixhofer (cdminix)

cdminix's activity

posted an update about 1 month ago
As part of some ongoing work, I'm releasing the largest collection to date of Docker containers for state-of-the-art voice-cloning TTS systems.
https://github.com/ttsds/datasets

Alongside it, there is also an overview of all systems (see below).
replied to their post 5 months ago
Totally agree! Tortoise doesn't seem to get benchmarked or compared against as much as other systems, and I don't know exactly why.

Not just for Tortoise, but for all these systems, it would be interesting to see how they compare to each other when fine-tuned. Unfortunately, I don't know of any benchmarks/papers that have tried to evaluate that (yet).

posted an update 5 months ago
I just added 5 more models to my open source TTS model benchmark, ttsds/benchmark.
Let's talk about the results!

Over the last couple of days, I added jbetker/tortoise-tts-v2, metavoiceio/metavoice-1B-v0.1, audo/HierSpeechpp, and the unofficial implementations of amphion/NaturalSpeech2 and amphion/valle by https://huggingface.co/amphion

Takeaways:
- TorToiSe does very well, falling into second place after StyleTTS 2, which is also ranked first in the human evaluation at TTS-AGI/TTS-Arena.
- MetaVoice-1B's overall score is dragged down by its Intelligibility Score (probably due to utterances being cut short), but it achieves #3 in Speaker Score, which indicates good voice-cloning ability.
- HierSpeech++ lands mid-pack in overall performance, but excels at the Environment Score, achieving #2. This means the model is especially good at modeling recording conditions such as the microphone and background noise.
- The Amphion models, possibly due to not being trained for as long as in the original papers, achieve relatively low scores. However, they seem to struggle for different reasons: the autoregressive VALLE models have low Intelligibility Scores (possibly due to "babbling" or early stop tokens), while NaturalSpeech2 has low Speaker and Prosody scores.
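To make the takeaways above concrete, here is a minimal sketch of how an overall score could aggregate per-factor scores. The factor names mirror the ones mentioned in the post, but the values are made up and the unweighted-mean aggregation is an assumption for illustration, not necessarily how ttsds/benchmark computes its ranking:

```python
def overall_score(factor_scores: dict[str, float]) -> float:
    """Aggregate per-factor scores into one overall score.

    Uses an unweighted mean; this is an assumption for illustration,
    the actual benchmark may weight or combine factors differently.
    """
    return sum(factor_scores.values()) / len(factor_scores)


# Hypothetical per-factor scores for one TTS system (values are invented).
scores = {
    "intelligibility": 72.0,
    "speaker": 88.0,
    "prosody": 80.0,
    "environment": 76.0,
}

print(overall_score(scores))  # prints 79.0, the mean of the four factors
```

Under this scheme, a single weak factor (like MetaVoice-1B's Intelligibility Score above) pulls the overall score down even when other factors rank highly.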

What's next?
I'm planning to add more open source TTS models like suno/bark, CAMB-AI/MARS5-TTS and fishaudio/fish-speech-1.2. I'll also write an article on these and all the other results soon, since our paper, TTSDS -- Text-to-Speech Distribution Score (2407.12707), mostly focused on establishing the benchmark itself rather than the individual TTS systems.
posted an update 5 months ago
Since new TTS (Text-to-Speech) systems are coming out at what feels like a daily pace, and it's currently hard to compare them, my latest project focuses on doing exactly that.

I was inspired by the TTS-AGI/TTS-Arena (definitely check it out if you haven't), which compares recent TTS systems using crowdsourced A/B testing.

I wanted to see if we could also do a similar evaluation with objective metrics, and it's now available here:
ttsds/benchmark
Anyone can submit a new TTS model, and I hope this provides some insight into which areas models perform well or poorly in.

The paper with all the details is available here: https://arxiv.org/abs/2407.12707