buyun ryanmiao committed on
Commit 0be64e4 · verified · 1 Parent(s): 3d3fa83

Update README.md (#2)

- Update README.md (332231122f3ccbb033c8983e0def9dfcb13d2f14)


Co-authored-by: mrh <[email protected]>

Files changed (1)
  1. README.md +150 -5
README.md CHANGED
@@ -3,15 +3,160 @@ license: apache-2.0
  ---
  # Step-Audio-TTS-3B

- Step-Audio-TTS-3B is the industry's first TTS model trained on large-scale synthetic data under the LLM-Chat paradigm. It achieves SOTA CER results on SEED TTS Eval, supports control over multiple languages, emotions, and voice styles, and is also the industry's first TTS model to support rap and humming.

- Step-Audio-TTS-3B is the industry's first Text-to-Speech (TTS) model trained on a large-scale synthetic dataset using the LLM-Chat paradigm. It achieves state-of-the-art (SOTA) Character Error Rate (CER) results on the SEED TTS Eval benchmark. The model supports multiple languages, a range of emotional expressions, and diverse voice style controls. Step-Audio-TTS-3B is also the first TTS model in the industry capable of generating rap and humming.

- This repository provides the Step-Audio-TTS-3B LLM weights trained with the dual-codebook approach, a vocoder trained with the dual-codebook approach, and a vocoder trained specifically for humming.

- This repository provides the model weights for Step-Audio-TTS-3B, a large language model (LLM) for text-to-speech synthesis trained with a dual-codebook approach. It also includes a vocoder trained with the dual-codebook approach and a specialized vocoder optimized for humming generation. Together, these resources support high-quality speech synthesis and humming generation.

- For more information, please refer to our repository: [Step-Audio](https://github.com/stepfun-ai/Step-Audio).

  For more information, please refer to our repository: [Step-Audio](https://github.com/stepfun-ai/Step-Audio).
 
  ---
  # Step-Audio-TTS-3B

+ Step-Audio-TTS-3B is the industry's first Text-to-Speech (TTS) model trained on a large-scale synthetic dataset using the LLM-Chat paradigm. It achieves state-of-the-art (SOTA) Character Error Rate (CER) results on the SEED TTS Eval benchmark. The model supports multiple languages, a range of emotional expressions, and diverse voice style controls. Step-Audio-TTS-3B is also the first TTS model in the industry capable of generating rap and humming.

+ This repository provides the model weights for Step-Audio-TTS-3B, a large language model (LLM) for text-to-speech synthesis trained with a dual-codebook approach. It also includes a vocoder trained with the dual-codebook approach and a specialized vocoder optimized for humming generation. Together, these resources support high-quality speech synthesis and humming generation.
 
+ ## Content consistency (CER/WER) of Step-Audio compared with GLM-4-Voice and MinMo
 
+ <table>
+ <thead>
+ <tr>
+ <th rowspan="2">Model</th>
+ <th style="text-align:center" colspan="1">test-zh</th>
+ <th style="text-align:center" colspan="1">test-en</th>
+ </tr>
+ <tr>
+ <th style="text-align:center">CER (%) &darr;</th>
+ <th style="text-align:center">WER (%) &darr;</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td>GLM-4-Voice</td>
+ <td style="text-align:center">2.19</td>
+ <td style="text-align:center">2.91</td>
+ </tr>
+ <tr>
+ <td>MinMo</td>
+ <td style="text-align:center">2.48</td>
+ <td style="text-align:center">2.90</td>
+ </tr>
+ <tr>
+ <td><strong>Step-Audio</strong></td>
+ <td style="text-align:center"><strong>1.53</strong></td>
+ <td style="text-align:center"><strong>2.71</strong></td>
+ </tr>
+ </tbody>
+ </table>
 
+ ## Results of TTS models on the SEED test sets
+ *Step-Audio-TTS-3B-Single denotes the dual-codebook backbone with a single-codebook vocoder.*
+
+ <table>
+ <thead>
+ <tr>
+ <th rowspan="2">Model</th>
+ <th style="text-align:center" colspan="2">test-zh</th>
+ <th style="text-align:center" colspan="2">test-en</th>
+ </tr>
+ <tr>
+ <th style="text-align:center">CER (%) &darr;</th>
+ <th style="text-align:center">SS &uarr;</th>
+ <th style="text-align:center">WER (%) &darr;</th>
+ <th style="text-align:center">SS &uarr;</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td>FireRedTTS</td>
+ <td style="text-align:center">1.51</td>
+ <td style="text-align:center">0.630</td>
+ <td style="text-align:center">3.82</td>
+ <td style="text-align:center">0.460</td>
+ </tr>
+ <tr>
+ <td>MaskGCT</td>
+ <td style="text-align:center">2.27</td>
+ <td style="text-align:center">0.774</td>
+ <td style="text-align:center">2.62</td>
+ <td style="text-align:center">0.774</td>
+ </tr>
+ <tr>
+ <td>CosyVoice</td>
+ <td style="text-align:center">3.63</td>
+ <td style="text-align:center">0.775</td>
+ <td style="text-align:center">4.29</td>
+ <td style="text-align:center">0.699</td>
+ </tr>
+ <tr>
+ <td>CosyVoice 2</td>
+ <td style="text-align:center">1.45</td>
+ <td style="text-align:center">0.806</td>
+ <td style="text-align:center">2.57</td>
+ <td style="text-align:center">0.736</td>
+ </tr>
+ <tr>
+ <td>CosyVoice 2-S</td>
+ <td style="text-align:center">1.45</td>
+ <td style="text-align:center">0.812</td>
+ <td style="text-align:center">2.38</td>
+ <td style="text-align:center">0.743</td>
+ </tr>
+ <tr>
+ <td><strong>Step-Audio-TTS-3B-Single</strong></td>
+ <td style="text-align:center">1.37</td>
+ <td style="text-align:center">0.802</td>
+ <td style="text-align:center">2.52</td>
+ <td style="text-align:center">0.704</td>
+ </tr>
+ <tr>
+ <td><strong>Step-Audio-TTS-3B</strong></td>
+ <td style="text-align:center"><strong>1.31</strong></td>
+ <td style="text-align:center">0.733</td>
+ <td style="text-align:center"><strong>2.31</strong></td>
+ <td style="text-align:center">0.660</td>
+ </tr>
+ <tr>
+ <td><strong>Step-Audio-TTS</strong></td>
+ <td style="text-align:center"><strong>1.17</strong></td>
+ <td style="text-align:center">0.73</td>
+ <td style="text-align:center"><strong>2.0</strong></td>
+ <td style="text-align:center">0.660</td>
+ </tr>
+ </tbody>
+ </table>
+
+ ## Dual-codebook resynthesis performance compared with CosyVoice
 
+ <table>
+ <thead>
+ <tr>
+ <th style="text-align:center" rowspan="2">Token</th>
+ <th style="text-align:center" colspan="2">test-zh</th>
+ <th style="text-align:center" colspan="2">test-en</th>
+ </tr>
+ <tr>
+ <th style="text-align:center">CER (%) &darr;</th>
+ <th style="text-align:center">SS &uarr;</th>
+ <th style="text-align:center">WER (%) &darr;</th>
+ <th style="text-align:center">SS &uarr;</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td style="text-align:center">Groundtruth</td>
+ <td style="text-align:center">0.972</td>
+ <td style="text-align:center">-</td>
+ <td style="text-align:center">2.156</td>
+ <td style="text-align:center">-</td>
+ </tr>
+ <tr>
+ <td style="text-align:center">CosyVoice</td>
+ <td style="text-align:center">2.857</td>
+ <td style="text-align:center"><strong>0.849</strong></td>
+ <td style="text-align:center">4.519</td>
+ <td style="text-align:center"><strong>0.807</strong></td>
+ </tr>
+ <tr>
+ <td style="text-align:center">Step-Audio-TTS-3B</td>
+ <td style="text-align:center"><strong>2.192</strong></td>
+ <td style="text-align:center">0.784</td>
+ <td style="text-align:center"><strong>3.585</strong></td>
+ <td style="text-align:center">0.742</td>
+ </tr>
+ </tbody>
+ </table>
 
+ ## More information
  For more information, please refer to our repository: [Step-Audio](https://github.com/stepfun-ai/Step-Audio).
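
The CER and WER columns in the tables above are edit-distance error rates: the number of substitutions, insertions, and deletions needed to turn a transcript of the synthesized audio into the reference text, divided by the reference length. The sketch below shows that arithmetic only. It is an illustration, not the benchmark's scoring code: the function names `edit_distance`, `cer`, and `wer` are hypothetical, whitespace handling is an assumption, and the ASR model and text normalization used by SEED TTS Eval are not described in this README and are not reproduced here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of tokens."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference tokens into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference token missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis token)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance as a percentage of the reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def cer(reference, hypothesis):
    """Character error rate (%); spaces are dropped (an assumption of this sketch)."""
    return error_rate(list(reference.replace(" ", "")),
                      list(hypothesis.replace(" ", "")))


def wer(reference, hypothesis):
    """Word error rate (%); words are split on whitespace."""
    return error_rate(reference.split(), hypothesis.split())
```

For example, `wer("the cat sat on the mat", "the cat sat on mat")` counts one deleted word out of six reference words. Because the denominator is the reference length, an error rate can exceed 100% when the hypothesis contains many insertions.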