---
license: apache-2.0
---
# BreezyVoice

🚀 **Try out our interactive [UI playground](https://huggingface.co/spaces/Splend1dchan/BreezyVoice-Playground) now!** 🚀 

Or visit one of these resources:  
- [Playground (CLI Inference)](https://www.kaggle.com/code/a24998667/breezyvoice-playground)  
- [Model](https://huggingface.co/MediaTek-Research/BreezyVoice/tree/main)  
- [Paper](https://arxiv.org/abs/2501.17790) 


**BreezyVoice: Adapting TTS for Taiwanese Mandarin with Enhanced Polyphone Disambiguation -- Challenges and Insights**	

BreezyVoice is a voice-cloning text-to-speech system adapted specifically for Taiwanese Mandarin, offering fine-grained phonetic control through auxiliary 注音 (bopomofo) inputs. BreezyVoice is partially derived from [CosyVoice](https://github.com/FunAudioLLM/CosyVoice).

<img src="https://raw.githubusercontent.com/mtkresearch/BreezyVoice/main/images/flowchart.png" alt="Flowchart" width="750"/>

BreezyVoice outperforms competing commercial services in terms of naturalness.



<img src="https://raw.githubusercontent.com/mtkresearch/BreezyVoice/main/images/comparisons.png" alt="comparisons" width="350"/>

BreezyVoice also excels in code-switching scenarios.

| Code-Switching Term Category        | **BreezyVoice**  | Z | Y | U | M |
|-------------|--------------|---|---|---|---|
| **General Words** | **8**            | 5 | **8** | **8** | 7 |
| **Entities**| **9**         | 6 | 4 | 7 | 4 |
| **Abbreviations**   | **9**            | 8 | 6 | 6 | 7 |
| **Toponyms**| 3            | 3 | **7** | 3 | 4 |
| **Full Sentences**| 7           | 7 | **8** | 5 | 3 |

## How to Run

**Running the code from [GitHub](https://github.com/mtkresearch/BreezyVoice) by following its instructions automatically downloads the model for you.**

You can also run the model from a local path by cloning this repository:
```
git lfs install
git clone https://huggingface.co/MediaTek-Research/BreezyVoice
```

You can then use the model as outlined in the `single_inference.py` script on [GitHub](https://github.com/mtkresearch/BreezyVoice), specifying the local model path via the `model_path` parameter.
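As a rough sketch, a local invocation could look like the command below. Only `model_path` is named in this card; the other flag names and file paths are illustrative assumptions, so check the GitHub README or `python single_inference.py --help` for the authoritative argument list.

```
# Hypothetical invocation; flag names other than --model_path are assumptions.
python single_inference.py \
  --content_to_synthesize "今天天氣真好" \
  --speaker_prompt_audio_path ./data/speaker_prompt.wav \
  --speaker_prompt_text_transcription "這是一段語音樣本" \
  --output_path ./results/output.wav \
  --model_path ./BreezyVoice
```

Here `./BreezyVoice` is the directory produced by the `git clone` step above, and the speaker prompt audio plus its transcription provide the voice to be cloned.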

If you like our work, please cite:

```
@article{hsu2025breezyvoice,
  title={BreezyVoice: Adapting TTS for Taiwanese Mandarin with Enhanced Polyphone Disambiguation--Challenges and Insights},
  author={Hsu, Chan-Jan and Lin, Yi-Cheng and Lin, Chia-Chun and Chen, Wei-Chih and Chung, Ho Lam and Li, Chen-An and Chen, Yi-Chang and Yu, Chien-Yu and Lee, Ming-Ji and Chen, Chien-Cheng and others},
  journal={arXiv preprint arXiv:2501.17790},
  year={2025}
}
@article{hsu2025breeze,
  title={The Breeze 2 Herd of Models: Traditional Chinese LLMs Based on Llama with Vision-Aware and Function-Calling Capabilities},
  author={Hsu, Chan-Jan and Liu, Chia-Sheng and Chen, Meng-Hsi and Chen, Muxi and Hsu, Po-Chun and Chen, Yi-Chang and Shiu, Da-Shan},
  journal={arXiv preprint arXiv:2501.13921},
  year={2025}
}
@article{du2024cosyvoice,
  title={Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens},
  author={Du, Zhihao and Chen, Qian and Zhang, Shiliang and Hu, Kai and Lu, Heng and Yang, Yexin and Hu, Hangrui and Zheng, Siqi and Gu, Yue and Ma, Ziyang and others},
  journal={arXiv preprint arXiv:2407.05407},
  year={2024}
}
```