Update README.md
README.md (changed)
sdk: static
pinned: false
---
# TTSDS Benchmark

As many recent Text-to-Speech (TTS) models have shown, synthetic audio can be close to real human speech.
However, traditional evaluation methods for TTS systems need an update to keep pace with these new developments.
Our TTSDS benchmark assesses the quality of synthetic speech by considering factors like prosody, speaker identity, and intelligibility.
By comparing these factors with both real speech and noise datasets, we can better understand how synthetic speech stacks up.
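
To make the comparison idea concrete, the sketch below shows one way a single factor score could be formed: measure how far a feature's distribution over synthetic speech is from a real-speech reference and from a noise reference, then normalise. This is an illustration only, not the exact procedure TTSDS uses (see the paper linked below); the feature values and the 1-D Wasserstein distance are assumptions made for the sketch.

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Illustrative per-utterance feature values (e.g., a prosody statistic such as mean F0).
# In practice these would be extracted from the audio of each dataset.
rng = np.random.default_rng(0)
synthetic = rng.normal(200.0, 30.0, 500)  # system under test
real = rng.normal(205.0, 35.0, 500)       # real-speech reference
noise = rng.uniform(0.0, 500.0, 500)      # noise/distractor reference

# Distance of the synthetic feature distribution to each reference.
d_real = wasserstein_distance(synthetic, real)
d_noise = wasserstein_distance(synthetic, noise)

# Normalised score in [0, 100]: higher means the synthetic distribution
# sits closer to real speech than to noise for this factor.
score = 100.0 * d_noise / (d_real + d_noise)
print(f"factor score: {score:.1f}")
```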
## More information

More details can be found in our paper [*TTSDS -- Text-to-Speech Distribution Score*](https://arxiv.org/abs/2407.12707).
## Reproducibility

To reproduce our results, check out our repository [here](https://github.com/ttsds/ttsds).
## Credits
This benchmark is inspired by [TTS Arena](https://huggingface.co/spaces/TTS-AGI/TTS-Arena), which instead focuses on the subjective evaluation of TTS models.
Our benchmark would not be possible without the many open-source TTS models on Hugging Face and GitHub.
Additionally, our benchmark uses the following datasets:
- [LJSpeech](https://keithito.com/LJ-Speech-Dataset/)
- [LibriTTS](https://www.openslr.org/60/)
- [VCTK](https://datashare.ed.ac.uk/handle/10283/2950)
- [Common Voice](https://commonvoice.mozilla.org/)
- [ESC-50](https://github.com/karolpiczak/ESC-50)

It also relies on the following metrics, representations, and tools (the sketch after this list shows one common way to extract such speech representations):
- [Wav2Vec2](https://arxiv.org/abs/2006.11477)
- [HuBERT](https://arxiv.org/abs/2106.07447)
- [WavLM](https://arxiv.org/abs/2110.13900)
- [PESQ](https://en.wikipedia.org/wiki/Perceptual_Evaluation_of_Speech_Quality)
- [VoiceFixer](https://arxiv.org/abs/2204.05841)
- [WADA SNR](https://www.cs.cmu.edu/~robust/Papers/KimSternIS08.pdf)
- [Whisper](https://arxiv.org/abs/2212.04356)
- [Masked Prosody Model](https://huggingface.co/cdminix/masked_prosody_model)
- [PyWorld](https://github.com/JeremyCCHsu/Python-Wrapper-for-World-Vocoder)
- [WeSpeaker](https://arxiv.org/abs/2210.17016)
- [D-Vector](https://github.com/yistLin/dvector)
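
Several of the representations above (Wav2Vec2, HuBERT, WavLM) are self-supervised speech models. As a small, generic illustration of extracting frame-level features from such a model with the Hugging Face `transformers` library (not necessarily how the benchmark's own code does it; the checkpoint name is only an example):

```python
import torch
from transformers import AutoFeatureExtractor, AutoModel

# Example checkpoint; Wav2Vec2, HuBERT, and WavLM checkpoints are loaded the same way.
checkpoint = "facebook/wav2vec2-base"
extractor = AutoFeatureExtractor.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

# `waveform` is 16 kHz mono audio as a 1-D float array (here: one second of silence).
waveform = torch.zeros(16000)
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    # Frame-level hidden states, shape (batch, frames, hidden_size).
    features = model(**inputs).last_hidden_state
print(features.shape)
```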
Authors: Christoph Minixhofer, Ondřej Klejch, and Peter Bell of the University of Edinburgh.