---
license: mit
datasets:
- pkupie/mc2_corpus
language:
- bo
- ug
- mn
- kk
---
# MC^2XLMR-large
[Github Repo](https://github.com/luciusssss/mc2_corpus)
We continually pretrain XLM-RoBERTa-large on [MC^2](https://huggingface.co/datasets/pkupie/mc2_corpus), a corpus covering Tibetan, Uyghur, Kazakh (in the Kazakh Arabic script), and Mongolian (in the traditional Mongolian script).
See details in the [paper](https://arxiv.org/abs/2311.08348).
*We have also released another model trained on MC^2: [MC^2Llama-13B](https://huggingface.co/pkupie/mc2-llama-13b).*
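## Usage

Since this model is a continually pretrained XLM-RoBERTa-large checkpoint, it can be loaded with the standard `transformers` masked-LM classes. Below is a minimal sketch; the Hub ID `pkupie/mc2-xlmr-large` and the Tibetan example sentence are illustrative assumptions, so check this repo's page for the exact model ID.

```python
# Hypothetical Hub ID -- verify against the actual repository name.
MODEL_ID = "pkupie/mc2-xlmr-large"

if __name__ == "__main__":
    # Requires `pip install transformers torch`.
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)

    # Fill a masked token; XLM-R tokenizers use the <mask> token.
    text = "བོད་<mask>"  # illustrative Tibetan fragment
    inputs = tokenizer(text, return_tensors="pt")
    logits = model(**inputs).logits

    # Locate the mask position and decode the top prediction.
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    predicted = tokenizer.decode(logits[0, mask_pos].argmax())
    print(predicted)
```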
## Citation
```bibtex
@article{zhang2024mc,
  title={MC$^2$: Towards Transparent and Culturally-Aware NLP for Minority Languages in China},
  author={Zhang, Chen and Tao, Mingxu and Huang, Quzhe and Lin, Jiuheng and Chen, Zhibin and Feng, Yansong},
  journal={arXiv preprint arXiv:2311.08348},
  year={2024}
}
```