|
--- |
|
license: mit |
|
language: |
|
- zh |
|
- en |
|
metrics: |
|
- cer |
|
- bleu |
|
tags: |
|
- asr |
|
- automatic-speech-recognition |
|
- automatic-speech-translation |
|
- speech-translation |
|
- speech-recognition |
|
--- |
|
|
|
# MooER (摩耳): an LLM-based Speech Recognition and Translation Model from Moore Threads
|
|
|
**Online Demo**: https://mooer-speech.mthreads.com:10077/ |
|
|
|
## 🔥 Update
|
|
|
We have released a new model, *MooER-80K-v2*, trained on 80k hours of data. Click [here](https://huggingface.co/mtspeech/MooER-MTL-80K) to try the new model.
|
|
|
## 📝 Introduction
|
|
|
We introduce **MooER (摩耳)**: an LLM-based speech recognition and translation model developed by Moore Threads. With the *MooER* framework, you can transcribe speech into text (automatic speech recognition, ASR) and translate it into other languages (automatic speech translation, AST) in an end-to-end manner. The performance of *MooER* is demonstrated in the subsequent sections, and our insights into model configurations, training strategies, and more are provided in our [technical report](https://arxiv.org/abs/2408.05101).
|
|
|
For the usage of the model files, please refer to our [GitHub](https://github.com/MooreThreads/MooER).
|
|
|
<br> |
|
<p align="center"> |
|
<img src="assets/framework.png" width="600"/> |
|
</p>
|
<br> |
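To make the framework above concrete, below is a minimal PyTorch sketch of this style of pipeline: a frozen speech encoder produces acoustic features, a lightweight adapter downsamples them and projects them into the LLM embedding space, and the LLM decodes the transcription or translation conditioned on a task prompt. All module names and dimensions here are illustrative placeholders, not the actual MooER implementation; see our [GitHub](https://github.com/MooreThreads/MooER) for the real code.

```python
# Illustrative encoder-adapter-LLM pipeline (NOT the actual MooER code;
# all dimensions and module names are placeholders for clarity).
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Stacks adjacent encoder frames and projects them into the LLM embedding space."""
    def __init__(self, enc_dim=512, llm_dim=4096, stride=4):
        super().__init__()
        self.stride = stride
        self.proj = nn.Linear(enc_dim * stride, llm_dim)

    def forward(self, feats):                      # feats: (B, T, enc_dim)
        B, T, D = feats.shape
        T = T - T % self.stride                    # drop tail frames
        stacked = feats[:, :T].reshape(B, T // self.stride, D * self.stride)
        return self.proj(stacked)                  # (B, T // stride, llm_dim)

class SpeechLLM(nn.Module):
    """Frozen speech encoder + trainable adapter + LLM, trained end to end."""
    def __init__(self, encoder, adapter, llm, prompt_embeds):
        super().__init__()
        self.encoder, self.adapter, self.llm = encoder, adapter, llm
        self.prompt_embeds = prompt_embeds         # (1, P, llm_dim) task prompt

    def forward(self, audio_feats):
        with torch.no_grad():                      # the speech encoder stays frozen
            enc_out = self.encoder(audio_feats)
        audio_embeds = self.adapter(enc_out)
        prompt = self.prompt_embeds.expand(audio_embeds.size(0), -1, -1)
        # The LLM consumes [audio embeddings; prompt embeddings] and decodes text.
        return self.llm(torch.cat([audio_embeds, prompt], dim=1))
```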
|
|
|
## 🥊 Evaluation Results
|
|
|
We describe the training data and present the evaluation results below. For more comprehensive information, please refer to our [report](https://arxiv.org/pdf/2408.05101).
|
|
|
### Training Data
|
|
|
We utilize 5k hours of data (MT5K) to train our basic *MooER-5K* model. The data sources include: |
|
|
|
| Dataset | Duration | |
|
|---------------|---------------| |
|
| aishell2 | 137h | |
|
| librispeech | 131h | |
|
| multi_cn | 100h | |
|
| wenetspeech | 1361h | |
|
| in-house data | 3274h | |
|
|
|
Note that the data from the open-source datasets were randomly selected from their full training sets. The in-house data, collected internally without transcripts, were transcribed using a third-party ASR service.
|
|
|
Since all of the above datasets were originally designed only for the speech recognition task, no translation labels are available. To train our speech translation model, we used a third-party translation service to generate pseudo-labels. No data filtering techniques were applied.
|
|
|
We have also trained models with 80k hours of data (*MooER-80K* and *MooER-80K-v2*); see the update above and the results below.
|
|
|
### Speech Recognition |
|
|
|
Speech recognition performance is evaluated using CER for Chinese and WER for English, both in percent.
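Both metrics are the Levenshtein (edit) distance between the hypothesis and the reference, normalized by the reference length, with characters as units for CER and words for WER. A minimal reference implementation is sketched below; the actual scoring pipeline behind the table may additionally apply text normalization.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences, via dynamic programming."""
    dp = list(range(len(hyp) + 1))                 # distances against the empty reference
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution (free on match)
    return dp[-1]

def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance over characters / reference length."""
    return edit_distance(list(ref), list(hyp)) / max(len(ref), 1)

def wer(ref: str, hyp: str) -> float:
    """Word error rate: the same computation over whitespace-separated words."""
    return edit_distance(ref.split(), hyp.split()) / max(len(ref.split()), 1)
```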
|
|
|
<table> |
|
<tr> |
|
<th>Language</th> |
|
<th>Testset</th> |
|
<th>Paraformer-large</th> |
|
<th>SenseVoice-small</th> |
|
<th>Qwen-audio</th> |
|
<th>Whisper-large-v3</th> |
|
<th>SeamlessM4T-v2</th> |
|
<th>MooER-5K</th> |
|
<th>MooER-80K</th> |
|
<th>MooER-80K-v2</th> |
|
</tr> |
|
<tr> |
|
<td rowspan="7">Chinese</td> |
|
<td>aishell1</td> |
|
<td>1.93</td> |
|
<td>3.03</td> |
|
<td>1.43</td> |
|
<td>7.86</td> |
|
<td>4.09</td> |
|
<td>1.93</td> |
|
<td>1.25</td> |
|
<td>1.00</td> |
|
</tr> |
|
<tr> |
|
<td>aishell2_ios</td> |
|
<td>2.85</td> |
|
<td>3.79</td> |
|
<td>3.57</td> |
|
<td>5.38</td> |
|
<td>4.81</td> |
|
<td>3.17</td> |
|
<td>2.67</td> |
|
<td>2.62</td> |
|
</tr> |
|
<tr> |
|
<td>test_magicdata</td> |
|
<td>3.66</td> |
|
<td>3.81</td> |
|
<td>5.31</td> |
|
<td>8.36</td> |
|
<td>9.69</td> |
|
<td>3.48</td> |
|
<td>2.52</td> |
|
<td>2.17</td> |
|
</tr> |
|
<tr> |
|
<td>test_thchs</td> |
|
<td>3.99</td> |
|
<td>5.17</td> |
|
<td>4.86</td> |
|
<td>9.06</td> |
|
<td>7.14</td> |
|
<td>4.11</td> |
|
<td>3.14</td> |
|
<td>3.00</td> |
|
</tr> |
|
<tr> |
|
<td>fleurs cmn_dev</td> |
|
<td>5.56</td> |
|
<td>6.39</td> |
|
<td>10.54</td> |
|
<td>4.54</td> |
|
<td>7.12</td> |
|
<td>5.81</td> |
|
<td>5.23</td> |
|
<td>5.15</td> |
|
</tr> |
|
<tr> |
|
<td>fleurs cmn_test</td> |
|
<td>6.92</td> |
|
<td>7.36</td> |
|
<td>11.07</td> |
|
<td>5.24</td> |
|
<td>7.66</td> |
|
<td>6.77</td> |
|
<td>6.18</td> |
|
<td>6.14</td> |
|
</tr> |
|
<tr> |
|
<td>average</td> |
|
<td><strong>4.15</strong></td> |
|
<td><strong>4.93</strong></td> |
|
<td><strong>6.13</strong></td> |
|
<td><strong>6.74</strong></td> |
|
<td><strong>6.75</strong></td> |
|
<td><strong>4.21</strong></td> |
|
<td><strong>3.50</strong></td> |
|
<td><strong>3.35</strong></td> |
|
</tr> |
|
<tr> |
|
<td rowspan="7">English</td> |
|
<td>librispeech test_clean</td> |
|
<td>14.15</td> |
|
<td>4.07</td> |
|
<td>2.15</td> |
|
<td>3.42</td> |
|
<td>2.77</td> |
|
<td>7.78</td> |
|
<td>4.11</td> |
|
<td>3.57</td> |
|
</tr> |
|
<tr> |
|
<td>librispeech test_other</td> |
|
<td>22.99</td> |
|
<td>8.26</td> |
|
<td>4.68</td> |
|
<td>5.62</td> |
|
<td>5.25</td> |
|
<td>15.25</td> |
|
<td>9.99</td> |
|
<td>9.09</td> |
|
</tr> |
|
<tr> |
|
<td>fleurs eng_dev</td> |
|
<td>24.93</td> |
|
<td>12.92</td> |
|
<td>22.53</td> |
|
<td>11.63</td> |
|
<td>11.36</td> |
|
<td>18.89</td> |
|
<td>13.32</td> |
|
<td>13.12</td> |
|
</tr> |
|
<tr> |
|
<td>fleurs eng_test</td> |
|
<td>26.81</td> |
|
<td>13.41</td> |
|
<td>22.51</td> |
|
<td>12.57</td> |
|
<td>11.82</td> |
|
<td>20.41</td> |
|
<td>14.97</td> |
|
<td>14.74</td> |
|
</tr> |
|
<tr> |
|
<td>gigaspeech dev</td> |
|
<td>24.23</td> |
|
<td>19.44</td> |
|
<td>12.96</td> |
|
<td>19.18</td> |
|
<td>28.01</td> |
|
<td>23.46</td> |
|
<td>16.92</td> |
|
<td>17.34</td> |
|
</tr> |
|
<tr> |
|
<td>gigaspeech test</td> |
|
<td>23.07</td> |
|
<td>16.65</td> |
|
<td>13.26</td> |
|
<td>22.34</td> |
|
<td>28.65</td> |
|
<td>22.09</td> |
|
<td>16.64</td> |
|
<td>16.97</td> |
|
</tr> |
|
<tr> |
|
<td>average</td> |
|
<td><strong>22.70</strong></td> |
|
<td><strong>12.46</strong></td> |
|
<td><strong>13.02</strong></td> |
|
<td><strong>12.46</strong></td> |
|
<td><strong>14.64</strong></td> |
|
<td><strong>17.98</strong></td> |
|
<td><strong>12.66</strong></td> |
|
<td><strong>12.47</strong></td> |
|
</tr> |
|
</table> |
|
|
|
### Speech Translation (zh -> en) |
|
|
|
For speech translation, performance is evaluated using the BLEU score.
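For reference, corpus-level BLEU can be computed with the `sacrebleu` package. The exact tokenization and scoring configuration behind the table below is described in our report, so treat this as a sketch with made-up sentences:

```python
import sacrebleu  # pip install sacrebleu

# Illustrative hypotheses/references; real scoring uses the test sets below.
hyps = ["the weather is nice today", "he bought a new phone"]
refs = [["the weather is great today", "he purchased a new phone"]]

bleu = sacrebleu.corpus_bleu(hyps, refs)  # default 13a tokenizer suits English output
print(f"BLEU = {bleu.score:.1f}")
```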
|
|
|
| Testset | Speech-LLaMA | Whisper-large-v3 | Qwen-audio | Qwen2-audio | SeamlessM4T-v2 | MooER-5K | MooER-5K-MTL | |
|
|--------|-------------|-------------------|------------|-------------|-----------------|--------|--------------| |
|
|CoVoST1 zh2en | - | 13.5 | 13.5 | - | 25.3 | - | **30.2** | |
|
|CoVoST2 zh2en | 12.3 | 12.2 | 15.7 | 24.4 | 22.2 | 23.4 | **25.2** | |
|
|CCMT2019 dev | - | 15.9 | 12.0 | - | 14.8 | - | **19.6** | |
|
|
|
|
|
## 🏁 Getting Started
|
|
|
Please visit our [GitHub](https://github.com/MooreThreads/MooER) for setup and usage instructions.
|
|
|
|
|
## 🧾 License
|
|
|
Please see the [LICENSE](LICENSE). |
|
|
|
|
|
## 📚 Citation
|
|
|
If you find MooER useful for your research, please 🌟 this repo and cite our work using the following BibTeX:
|
|
|
```bibtex |
|
@article{liang2024mooer, |
|
title = {MooER: an LLM-based Speech Recognition and Translation Model from Moore Threads}, |
|
author = {Zhenlin Liang and Junhao Xu and Yi Liu and Yichao Hu and Jian Li and Yajun Zheng and Meng Cai and Hua Wang},
|
journal = {arXiv preprint arXiv:2408.05101}, |
|
url = {https://arxiv.org/abs/2408.05101}, |
|
year = {2024} |
|
} |
|
``` |
|
|
|
## π§ Contact |
|
|
|
If you encounter any problems, feel free to open a discussion.
|
|
|
Moore Threads Website: **https://www.mthreads.com/** |
|
|
|
<br> |
|
<p align="left"> |
|
<img src="assets/MTLogo.png" width="300"/> |
|
</p>
|
<br> |