|
--- |
|
license: mit |
|
language: |
|
- zh |
|
- en |
|
metrics: |
|
- cer |
|
- bleu |
|
tags: |
|
- asr |
|
- automatic-speech-recognition |
|
- automatic-speech-translation |
|
- speech-translation |
|
- speech-recognition |
|
--- |
|
|
|
# MooER (摩耳): an LLM-based Speech Recognition and Translation Model from Moore Threads
|
|
|
**Online Demo**: https://mooer-speech.mthreads.com:10077/ |
|
|
|
## 🔥 Update
|
|
|
We have released a new model, *MooER-80K-v2*, trained on 80k hours of data. Click [here](https://huggingface.co/mtspeech/MooER-MTL-80K) to try the new model.
|
|
|
## 📝 Introduction
|
|
|
We introduce **MooER (摩耳)**: an LLM-based speech recognition and translation model developed by Moore Threads. With the *MooER* framework, you can transcribe speech into text (automatic speech recognition, ASR) and translate it into other languages (automatic speech translation, AST) in an end-to-end manner. The performance of *MooER* is demonstrated in the subsequent sections, and our insights into model configurations, training strategies, and more are provided in our [technical report](https://arxiv.org/abs/2408.05101).
|
|
|
For the usage of the model files, please refer to our [GitHub](https://github.com/MooreThreads/MooER).
|
|
|
<br> |
|
<p align="center"> |
|
<img src="assets/framework.png" width="600"/> |
|
</p>
|
<br> |
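To make the framework above concrete, below is a minimal PyTorch sketch of this style of pipeline: a frozen speech encoder produces acoustic features, a lightweight adapter downsamples them and projects them into the LLM embedding space, and the LLM decodes the transcription or translation conditioned on a task prompt. All module names and dimensions here are illustrative placeholders, not the actual MooER implementation; see our [GitHub](https://github.com/MooreThreads/MooER) for the real code.

```python
# Illustrative encoder-adapter-LLM pipeline (NOT the actual MooER code;
# all dimensions and module names are placeholders for clarity).
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Stacks adjacent encoder frames and projects them into the LLM embedding space."""
    def __init__(self, enc_dim=512, llm_dim=4096, stride=4):
        super().__init__()
        self.stride = stride
        self.proj = nn.Linear(enc_dim * stride, llm_dim)

    def forward(self, feats):                      # feats: (B, T, enc_dim)
        B, T, D = feats.shape
        T = T - T % self.stride                    # drop tail frames
        stacked = feats[:, :T].reshape(B, T // self.stride, D * self.stride)
        return self.proj(stacked)                  # (B, T // stride, llm_dim)

class SpeechLLM(nn.Module):
    """Frozen speech encoder + trainable adapter + LLM, trained end to end."""
    def __init__(self, encoder, adapter, llm, prompt_embeds):
        super().__init__()
        self.encoder, self.adapter, self.llm = encoder, adapter, llm
        self.prompt_embeds = prompt_embeds         # (1, P, llm_dim) task prompt

    def forward(self, audio_feats):
        with torch.no_grad():                      # the speech encoder stays frozen
            enc_out = self.encoder(audio_feats)
        audio_embeds = self.adapter(enc_out)
        prompt = self.prompt_embeds.expand(audio_embeds.size(0), -1, -1)
        # The LLM consumes [audio embeddings; prompt embeddings] and decodes text.
        return self.llm(torch.cat([audio_embeds, prompt], dim=1))
```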
|
|
|
## 🥊 Evaluation Results
|
|
|
We describe the training data and present the evaluation results below. For more comprehensive information, please refer to our [report](https://arxiv.org/pdf/2408.05101).
|
|
|
### Training Data
|
|
|
We utilize 5k hours of data (MT5K) to train our basic *MooER-5K* model. The data sources include: |
|
|
|
| Dataset | Duration | |
|
|---------------|---------------| |
|
| aishell2 | 137h | |
|
| librispeech | 131h | |
|
| multi_cn | 100h | |
|
| wenetspeech | 1361h | |
|
| in-house data | 3274h | |
|
|
|
Note that the data from the open-source datasets were randomly selected from their full training sets. The in-house data, collected internally without transcripts, were transcribed using a third-party ASR service.
|
|
|
Since all of the above datasets were originally designed only for the speech recognition task, no translation labels are available. To train our speech translation model, we used a third-party translation service to generate pseudo-labels. No data filtering techniques were applied.
|
|
|
We have also trained models with 80k hours of data (*MooER-80K* and *MooER-80K-v2*); see the update above and the results below.
|
|
|
### Speech Recognition |
|
|
|
Speech recognition performance is evaluated using CER for Chinese and WER for English, both in percent.
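Both metrics are the Levenshtein (edit) distance between the hypothesis and the reference, normalized by the reference length, with characters as units for CER and words for WER. A minimal reference implementation is sketched below; the actual scoring pipeline behind the table may additionally apply text normalization.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences, via dynamic programming."""
    dp = list(range(len(hyp) + 1))                 # distances against the empty reference
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution (free on match)
    return dp[-1]

def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance over characters / reference length."""
    return edit_distance(list(ref), list(hyp)) / max(len(ref), 1)

def wer(ref: str, hyp: str) -> float:
    """Word error rate: the same computation over whitespace-separated words."""
    return edit_distance(ref.split(), hyp.split()) / max(len(ref.split()), 1)
```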
|
|
|
<table> |
|
<tr> |
|
<th>Language</th> |
|
<th>Testset</th> |
|
<th>Paraformer-large</th> |
|
<th>SenseVoice-small</th> |
|
<th>Qwen-audio</th> |
|
<th>Whisper-large-v3</th> |
|
<th>SeamlessM4T-v2</th> |
|
<th>MooER-5K</th> |
|
<th>MooER-80K</th> |
|
<th>MooER-80K-v2</th> |
|
</tr> |
|
<tr> |
|
<td rowspan="7">Chinese</td> |
|
<td>aishell1</td> |
|
<td>1.93</td> |
|
<td>3.03</td> |
|
<td>1.43</td> |
|
<td>7.86</td> |
|
<td>4.09</td> |
|
<td>1.93</td> |
|
<td>1.25</td> |
|
<td>1.00</td> |
|
</tr> |
|
<tr> |
|
<td>aishell2_ios</td> |
|
<td>2.85</td> |
|
<td>3.79</td> |
|
<td>3.57</td> |
|
<td>5.38</td> |
|
<td>4.81</td> |
|
<td>3.17</td> |
|
<td>2.67</td> |
|
<td>2.62</td> |
|
</tr> |
|
<tr> |
|
<td>test_magicdata</td> |
|
<td>3.66</td> |
|
<td>3.81</td> |
|
<td>5.31</td> |
|
<td>8.36</td> |
|
<td>9.69</td> |
|
<td>3.48</td> |
|
<td>2.52</td> |
|
<td>2.17</td> |
|
</tr> |
|
<tr> |
|
<td>test_thchs</td> |
|
<td>3.99</td> |
|
<td>5.17</td> |
|
<td>4.86</td> |
|
<td>9.06</td> |
|
<td>7.14</td> |
|
<td>4.11</td> |
|
<td>3.14</td> |
|
<td>3.00</td> |
|
</tr> |
|
<tr> |
|
<td>fleurs cmn_dev</td> |
|
<td>5.56</td> |
|
<td>6.39</td> |
|
<td>10.54</td> |
|
<td>4.54</td> |
|
<td>7.12</td> |
|
<td>5.81</td> |
|
<td>5.23</td> |
|
<td>5.15</td> |
|
</tr> |
|
<tr> |
|
<td>fleurs cmn_test</td> |
|
<td>6.92</td> |
|
<td>7.36</td> |
|
<td>11.07</td> |
|
<td>5.24</td> |
|
<td>7.66</td> |
|
<td>6.77</td> |
|
<td>6.18</td> |
|
<td>6.14</td> |
|
</tr> |
|
<tr> |
|
<td>average</td> |
|
<td><strong>4.15</strong></td> |
|
<td><strong>4.93</strong></td> |
|
<td><strong>6.13</strong></td> |
|
<td><strong>6.74</strong></td> |
|
<td><strong>6.75</strong></td> |
|
<td><strong>4.21</strong></td> |
|
<td><strong>3.50</strong></td> |
|
<td><strong>3.35</strong></td> |
|
</tr> |
|
<tr> |
|
<td rowspan="7">English</td> |
|
<td>librispeech test_clean</td> |
|
<td>14.15</td> |
|
<td>4.07</td> |
|
<td>2.15</td> |
|
<td>3.42</td> |
|
<td>2.77</td> |
|
<td>7.78</td> |
|
<td>4.11</td> |
|
<td>3.57</td> |
|
</tr> |
|
<tr> |
|
<td>librispeech test_other</td> |
|
<td>22.99</td> |
|
<td>8.26</td> |
|
<td>4.68</td> |
|
<td>5.62</td> |
|
<td>5.25</td> |
|
<td>15.25</td> |
|
<td>9.99</td> |
|
<td>9.09</td> |
|
</tr> |
|
<tr> |
|
<td>fleurs eng_dev</td> |
|
<td>24.93</td> |
|
<td>12.92</td> |
|
<td>22.53</td> |
|
<td>11.63</td> |
|
<td>11.36</td> |
|
<td>18.89</td> |
|
<td>13.32</td> |
|
<td>13.12</td> |
|
</tr> |
|
<tr> |
|
<td>fleurs eng_test</td> |
|
<td>26.81</td> |
|
<td>13.41</td> |
|
<td>22.51</td> |
|
<td>12.57</td> |
|
<td>11.82</td> |
|
<td>20.41</td> |
|
<td>14.97</td> |
|
<td>14.74</td> |
|
</tr> |
|
<tr> |
|
<td>gigaspeech dev</td> |
|
<td>24.23</td> |
|
<td>19.44</td> |
|
<td>12.96</td> |
|
<td>19.18</td> |
|
<td>28.01</td> |
|
<td>23.46</td> |
|
<td>16.92</td> |
|
<td>17.34</td> |
|
</tr> |
|
<tr> |
|
<td>gigaspeech test</td> |
|
<td>23.07</td> |
|
<td>16.65</td> |
|
<td>13.26</td> |
|
<td>22.34</td> |
|
<td>28.65</td> |
|
<td>22.09</td> |
|
<td>16.64</td> |
|
<td>16.97</td> |
|
</tr> |
|
<tr> |
|
<td>average</td> |
|
<td><strong>22.70</strong></td> |
|
<td><strong>12.46</strong></td> |
|
<td><strong>13.02</strong></td> |
|
<td><strong>12.46</strong></td> |
|
<td><strong>14.64</strong></td> |
|
<td><strong>17.98</strong></td> |
|
<td><strong>12.66</strong></td> |
|
<td><strong>12.47</strong></td> |
|
</tr> |
|
</table> |
|
|
|
### Speech Translation (zh -> en) |
|
|
|
For speech translation, performance is evaluated using the BLEU score.
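For reference, corpus-level BLEU can be computed with the `sacrebleu` package. The exact tokenization and scoring configuration behind the table below is described in our report, so treat this as a sketch with made-up sentences:

```python
import sacrebleu  # pip install sacrebleu

# Illustrative hypotheses/references; real scoring uses the test sets below.
hyps = ["the weather is nice today", "he bought a new phone"]
refs = [["the weather is great today", "he purchased a new phone"]]

bleu = sacrebleu.corpus_bleu(hyps, refs)  # default 13a tokenizer suits English output
print(f"BLEU = {bleu.score:.1f}")
```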
|
|
|
| Testset | Speech-LLaMA | Whisper-large-v3 | Qwen-audio | Qwen2-audio | SeamlessM4T-v2 | MooER-5K | MooER-5K-MTL | |
|
|--------|-------------|-------------------|------------|-------------|-----------------|--------|--------------| |
|
|CoVoST1 zh2en | - | 13.5 | 13.5 | - | 25.3 | - | **30.2** | |
|
|CoVoST2 zh2en | 12.3 | 12.2 | 15.7 | 24.4 | 22.2 | 23.4 | **25.2** | |
|
|CCMT2019 dev | - | 15.9 | 12.0 | - | 14.8 | - | **19.6** | |
|
|
|
|
|
## 🏁 Getting Started
|
|
|
Please visit our [GitHub](https://github.com/MooreThreads/MooER) for setup and usage instructions.
|
|
|
|
|
## 🧾 License
|
|
|
Please see the [LICENSE](LICENSE). |
|
|
|
|
|
## 📚 Citation
|
|
|
If you find MooER useful for your research, please 🌟 this repo and cite our work using the following BibTeX:
|
|
|
```bibtex |
|
@article{liang2024mooer, |
|
title = {MooER: an LLM-based Speech Recognition and Translation Model from Moore Threads}, |
|
author = {Zhenlin Liang and Junhao Xu and Yi Liu and Yichao Hu and Jian Li and Yajun Zheng and Meng Cai and Hua Wang},
|
journal = {arXiv preprint arXiv:2408.05101}, |
|
url = {https://arxiv.org/abs/2408.05101}, |
|
year = {2024} |
|
} |
|
``` |
|
|
|
## π§ Contact |
|
|
|
If you encounter any problems, feel free to open a discussion.
|
|
|
Moore Threads Website: **https://www.mthreads.com/** |
|
|
|
<br> |
|
<p align="left"> |
|
<img src="assets/MTLogo.png" width="300"/> |
|
</p>
|
<br> |