---
license: apache-2.0
language:
- ar
- bn
- cs
- de
- es
- en
- el
- fr
- id
- it
- he
- hu
- ja
- kk
- ko
- ro
- ru
- az
- uk
- ur
- vi
- zh
- ms
- nl
- ne
- th
- tr
- pt
- pl
base_model:
- Qwen/Qwen2-7B
---

# Marco-LLM-GLO

## Introduction

Marco-LLM is a series of advanced multilingual language models designed to bridge the performance gap between high-resource and low-resource languages. This repository contains the Marco-LLM base language model with 7 billion parameters. The model has undergone extensive multilingual continual pretraining on a diverse corpus of over 5 trillion tokens, with a particular focus on improving performance in low-resource languages while preserving strong capabilities in high-resource languages such as English and Chinese.

Compared with state-of-the-art open-source language models, Marco-LLM shows significant improvements on multilingual tasks, including machine translation, question answering, and cross-lingual reasoning.

For more details, please refer to our [Hugging Face page](https://huggingface.co/AIDC-AI/Marco-LLM-GLO).

## Model Details

Marco-LLM includes a 7B-parameter model based on the Transformer architecture. Its key features are:

- **Multilingual Training**: The model is trained on a large-scale multilingual dataset covering 29 languages, including both high-resource languages (e.g., English, Chinese) and low-resource languages (e.g., Kazakh, Nepali).
- **Enhanced Tokenizer**: An improved tokenizer is used to better handle multilingual data, improving tokenization efficiency and accuracy across languages.
- **Post-Training**: Marco-LLM supports various post-training methods, such as Supervised Fine-tuning (SFT) and Direct Preference Optimization (DPO), to further enhance performance on specific tasks and languages.

## Usage

We do not recommend using the base language model directly for text generation tasks. Instead, apply post-training methods such as Supervised Fine-tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), or continued pretraining to adapt the model to your use case. A minimal loading sketch is provided in the Quickstart section at the end of this card.

## Citation

If you find our work helpful, please cite it:

```
@article{marco_llm_2024,
  title={Marco-LLM: Bridging Languages via Massive Multilingual Training for Cross-Lingual Enhancement},
  journal={arXiv preprint arXiv:2412.04003},
  year={2024},
  url={https://arxiv.org/abs/2412.04003}
}
```
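
## Quickstart

Below is a minimal sketch of loading the base checkpoint with the Hugging Face `transformers` library. It assumes the repository id `AIDC-AI/Marco-LLM-GLO` from the link above and that `accelerate` is installed for `device_map="auto"`; adjust precision and device placement to your hardware. Because this is a base model, expect continuation-style completions rather than instruction-following answers.

```python
# Minimal sketch: load the Marco-LLM-GLO base checkpoint with transformers.
# Assumes the repository id AIDC-AI/Marco-LLM-GLO and that `accelerate` is
# installed for device_map="auto"; adjust dtype/device for your hardware.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AIDC-AI/Marco-LLM-GLO"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # keep the checkpoint's native precision
    device_map="auto",    # place weights on available GPU(s)/CPU
)

# Base models continue text rather than follow instructions, so use a
# completion-style prompt.
prompt = "Kazakh is a Turkic language spoken mainly in"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```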