---
license: apache-2.0
pipeline_tag: text-to-speech
---

# Step-Audio-TTS-3B

Step-Audio-TTS-3B is the industry's first Text-to-Speech (TTS) model trained on a large-scale synthetic dataset using the LLM-Chat paradigm. It achieves state-of-the-art (SOTA) Character Error Rate (CER) results on the SEED TTS Eval benchmark. The model supports multiple languages, a range of emotional expressions, and diverse voice style controls. Step-Audio-TTS-3B is also the first TTS model in the industry capable of generating rap and humming.

This repository provides the model weights for Step-Audio-TTS-3B, a Large Language Model (LLM) trained with a dual-codebook approach for text-to-speech synthesis. It also includes a vocoder trained with the dual-codebook approach and a specialized vocoder optimized for humming generation. Together, these resources enable high-quality speech synthesis and humming generation.

## Performance comparison of content consistency (CER/WER) between Step-Audio, GLM-4-Voice, and MinMo.

| Model | test-zh CER (%) ↓ | test-en WER (%) ↓ |
|---|---|---|
| GLM-4-Voice | 2.19 | 2.91 |
| MinMo | 2.48 | 2.90 |
| Step-Audio | 1.53 | 2.71 |
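
For readers unfamiliar with the metrics in this table, the following is a minimal sketch of how CER and WER are commonly computed: the Levenshtein edit distance between a reference transcript and a hypothesis transcript, divided by the reference length. It is not the evaluation code behind the numbers above. The function names and the text normalization (dropping spaces for CER, lowercasing and whitespace splitting for WER) are assumptions made for illustration, and the hypothesis is assumed to come from transcribing the synthesized audio with an ASR system.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference token
                curr[j - 1] + 1,     # insert a hypothesis token
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance as a percentage of the reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def cer(reference, hypothesis):
    """Character Error Rate (%). Assumption: spaces are ignored."""
    return error_rate(list(reference.replace(" ", "")),
                      list(hypothesis.replace(" ", "")))


def wer(reference, hypothesis):
    """Word Error Rate (%). Assumption: lowercase, split on whitespace."""
    return error_rate(reference.lower().split(), hypothesis.lower().split())


if __name__ == "__main__":
    print(f"CER: {cer('今天天气很好', '今天天汽很好'):.2f}%")
    print(f"WER: {wer('the cat sat on the mat', 'the cat sit on mat'):.2f}%")
```

Lower is better for both metrics, which is what the ↓ in the column headers indicates.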

## Results of TTS Models on SEED Test Sets.

*Step-Audio-TTS-3B-Single denotes the dual-codebook backbone with a single-codebook vocoder.*

| Model | test-zh CER (%) ↓ | test-zh SS ↑ | test-en WER (%) ↓ | test-en SS ↑ |
|---|---|---|---|---|
| FireRedTTS | 1.51 | 0.630 | 3.82 | 0.460 |
| MaskGCT | 2.27 | 0.774 | 2.62 | 0.774 |
| CosyVoice | 3.63 | 0.775 | 4.29 | 0.699 |
| CosyVoice 2 | 1.45 | 0.806 | 2.57 | 0.736 |
| CosyVoice 2-S | 1.45 | 0.812 | 2.38 | 0.743 |
| Step-Audio-TTS-3B-Single | 1.37 | 0.802 | 2.52 | 0.704 |
| Step-Audio-TTS-3B | 1.31 | 0.733 | 2.31 | 0.660 |
| Step-Audio-TTS | 1.17 | 0.73 | 2.0 | 0.660 |
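
The SS column is not defined in this card. Assuming it denotes speaker similarity, a common way to score it is the cosine similarity between a speaker embedding of the reference prompt and one of the synthesized audio. The sketch below shows only that final scoring step. The embedding model is not specified here, so the vectors are placeholders and the function name is hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Placeholder vectors standing in for real speaker embeddings.
    prompt_embedding = [0.2, 0.7, 0.1]
    synthesized_embedding = [0.25, 0.6, 0.2]
    print(f"SS: {cosine_similarity(prompt_embedding, synthesized_embedding):.3f}")
```

Under this assumption, higher values mean the synthesized voice is closer to the prompt speaker, consistent with the ↑ in the column headers.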

## Performance comparison of Dual-codebook Resynthesis with CosyVoice.

| Token | test-zh CER (%) ↓ | test-zh SS ↑ | test-en WER (%) ↓ | test-en SS ↑ |
|---|---|---|---|---|
| Groundtruth | 0.972 | - | 2.156 | - |
| CosyVoice | 2.857 | 0.849 | 4.519 | 0.807 |
| Step-Audio-TTS-3B | 2.192 | 0.784 | 3.585 | 0.742 |

# More information

For more information, please refer to our repository: [Step-Audio](https://github.com/stepfun-ai/Step-Audio).