---
license: apache-2.0
pipeline_tag: text-to-speech
---
# Step-Audio-TTS-3B
Step-Audio-TTS-3B is the industry's first Text-to-Speech (TTS) model trained on a large-scale synthetic dataset using the LLM-Chat paradigm. It achieves state-of-the-art (SOTA) Character Error Rate (CER) results on the SEED TTS Eval benchmark. The model supports multiple languages, a range of emotional expressions, and diverse voice style controls. It is also the first TTS model in the industry capable of generating rap and humming.

This repository provides the model weights for Step-Audio-TTS-3B, a Large Language Model (LLM) trained with dual codebooks for text-to-speech synthesis. It also includes a vocoder trained with the dual-codebook approach and a separate vocoder optimized for humming generation. Together, these components support high-quality speech synthesis and humming generation.
## Content consistency (CER/WER) compared with GLM-4-Voice and MinMo
| Model | test-zh CER (%) ↓ | test-en WER (%) ↓ |
|---|---|---|
| GLM-4-Voice | 2.19 | 2.91 |
| MinMo | 2.48 | 2.90 |
| Step-Audio | 1.53 | 2.71 |
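
For readers unfamiliar with the metrics, the sketch below shows how CER and WER are conventionally defined: the edit distance between a reference text and a transcript of the synthesized audio, divided by the reference length. It is only an illustration of the definition. The function names are hypothetical, and the ASR model and text normalization used to produce the numbers in this card are not described here, so this code should not be expected to reproduce them.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance as a percentage of the reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def cer(reference, hypothesis):
    """Character error rate; spaces are dropped (an assumption of this sketch)."""
    return error_rate(list(reference.replace(" ", "")),
                      list(hypothesis.replace(" ", "")))


def wer(reference, hypothesis):
    """Word error rate on lowercased, whitespace-split tokens (also an assumption)."""
    return error_rate(reference.lower().split(), hypothesis.lower().split())
```

For example, `wer("the cat sat on the mat", "the cat sat on mat")` counts one deleted word out of six reference words, and `cer("今天天气很好", "今天天汽很好")` counts one substituted character out of six.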
## Results of TTS models on the SEED test sets
*Step-Audio-TTS-3B-Single denotes the dual-codebook backbone with a single-codebook vocoder.*

| Model | test-zh CER (%) ↓ | test-zh SS ↑ | test-en WER (%) ↓ | test-en SS ↑ |
|---|---|---|---|---|
| FireRedTTS | 1.51 | 0.630 | 3.82 | 0.460 |
| MaskGCT | 2.27 | 0.774 | 2.62 | 0.774 |
| CosyVoice | 3.63 | 0.775 | 4.29 | 0.699 |
| CosyVoice 2 | 1.45 | 0.806 | 2.57 | 0.736 |
| CosyVoice 2-S | 1.45 | 0.812 | 2.38 | 0.743 |
| Step-Audio-TTS-3B-Single | 1.37 | 0.802 | 2.52 | 0.704 |
| Step-Audio-TTS-3B | 1.31 | 0.733 | 2.31 | 0.660 |
| Step-Audio-TTS | 1.17 | 0.73 | 2.0 | 0.660 |
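
The SS columns are not defined in this card. Assuming SS stands for speaker similarity, a common way to compute such a score is the cosine similarity between a speaker embedding of the reference prompt and one of the generated audio. The sketch below shows only that final step on plain Python lists. The function name is hypothetical, and the embedding model that would produce the vectors is not specified here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)
```

Under this assumption, a score close to 1 means the two embeddings point in nearly the same direction, which is why higher SS is marked as better (↑).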
## Performance comparison of dual-codebook resynthesis with CosyVoice
| Token | test-zh CER (%) ↓ | test-zh SS ↑ | test-en WER (%) ↓ | test-en SS ↑ |
|---|---|---|---|---|
| Groundtruth | 0.972 | - | 2.156 | - |
| CosyVoice | 2.857 | 0.849 | 4.519 | 0.807 |
| Step-Audio-TTS-3B | 2.192 | 0.784 | 3.585 | 0.742 |
## More information
For more information, please refer to our repository: [Step-Audio](https://github.com/stepfun-ai/Step-Audio).