---
license: mit
datasets:
- pkupie/mc2_corpus
language:
- bo
- ug
- mn
- kk
---

# [MC^2XLMR-large]

[Github Repo](https://github.com/luciusssss/mc2_corpus)

We continually pretrain XLM-RoBERTa-large with [MC^2](https://huggingface.co/datasets/pkupie/mc2_corpus), which supports Tibetan, Uyghur, Kazakh in the Kazakh Arabic script, and Mongolian in the traditional Mongolian script.

See details in the [paper](https://arxiv.org/abs/2311.08348).

## Citation

```
@misc{zhang2023mc2,
      title={MC^2: A Multilingual Corpus of Minority Languages in China},
      author={Chen Zhang and Mingxu Tao and Quzhe Huang and Jiuheng Lin and Zhibin Chen and Yansong Feng},
      year={2023},
      eprint={2311.08348},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
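
## Usage sketch

Since the model is a continually pretrained XLM-RoBERTa-large, it can presumably be queried like any other masked language model. The following is a minimal sketch, not an official recipe: it assumes the `transformers` library is installed, that the checkpoint retains its masked-language-modeling head, and that the tokenizer uses the standard XLM-RoBERTa mask token `<mask>`. The helper names `mask_first`, `load_fill_mask`, and `predict_masked` are hypothetical, and the model ID is left as an argument because this card does not state the repository ID.

```python
from typing import List, Tuple

# Assumption: the checkpoint keeps the standard XLM-RoBERTa mask token.
MASK_TOKEN = "<mask>"


def mask_first(text: str, target: str, mask_token: str = MASK_TOKEN) -> str:
    """Replace the first occurrence of `target` in `text` with the mask token."""
    if target not in text:
        raise ValueError(f"{target!r} not found in text")
    return text.replace(target, mask_token, 1)


def load_fill_mask(model_id: str):
    """Build a fill-mask pipeline for a checkpoint on the Hugging Face Hub.

    `model_id` must be supplied by the caller (hypothetical parameter).
    """
    # Imported lazily so that mask_first works without transformers installed.
    from transformers import pipeline

    return pipeline("fill-mask", model=model_id)


def predict_masked(
    model_id: str, text: str, target: str, top_k: int = 5
) -> List[Tuple[str, float]]:
    """Mask `target` in `text` and return the top-k (token, score) predictions."""
    fill_mask = load_fill_mask(model_id)
    masked = mask_first(text, target)
    # Each prediction is a dict that includes "token_str" and "score".
    return [(p["token_str"], p["score"]) for p in fill_mask(masked, top_k=top_k)]
```

To try it, call `predict_masked` with this model's Hub repository ID, a sentence in one of the supported languages, and a word from that sentence to mask. Comparing the predictions against those of the original `xlm-roberta-large` checkpoint is one informal way to see the effect of continual pretraining.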