---
license: apache-2.0
language:
- ar
- bn
- cs
- de
- es
- en
- el
- fr
- id
- it
- he
- hu
- ja
- kk
- ko
- ro
- ru
- az
- uk
- ur
- vi
- zh
- ms
- nl
- ne
- th
- tr
- pt
- pl
base_model:
- Qwen/Qwen2-7B
---

# Marco-LLM-GLO

## Introduction

Marco-LLM is a series of advanced multilingual language models designed to bridge the performance gap between high-resource and low-resource languages. This repository contains the Marco-LLM base language model with 7 billion parameters.

The model has undergone extensive multilingual continual pretraining on a diverse dataset of over 5 trillion tokens, with a particular focus on enhancing performance in low-resource languages while maintaining strong capabilities in high-resource languages such as English and Chinese.

Compared to state-of-the-art open-source language models, Marco-LLM demonstrates significant improvements on multilingual tasks, including machine translation, question answering, and reasoning across multiple languages.

For more details, please refer to our [Hugging Face page](https://huggingface.co/AIDC-AI/Marco-LLM-GLO).

## Model Details

Marco-LLM includes a 7B-parameter model based on the Transformer architecture. The key features of Marco-LLM are:

- Multilingual Training: The model is trained on a large-scale multilingual dataset covering 29 languages, including both high-resource languages (e.g., English, Chinese) and low-resource languages (e.g., Kazakh, Nepali).
- Enhanced Tokenizer: An improved tokenizer is used to better handle multilingual data, ensuring higher efficiency and accuracy in tokenization (see the tokenization sketch after this list).
- Post-Training: Marco-LLM supports various post-training methods, such as Supervised Fine-tuning (SFT) and Direct Preference Optimization (DPO), to further enhance performance on specific tasks and languages.
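
As an illustration of the tokenizer point above, the following minimal sketch (not an official example) loads the released tokenizer with Hugging Face `transformers` and compares how it segments a high-resource and a low-resource language. The sample sentences are arbitrary, and the checkpoint name is taken from the link above.

```python
# Minimal sketch: compare tokenization across a high- and a low-resource language.
# Assumes `transformers` is installed and the public checkpoint below is accessible.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("AIDC-AI/Marco-LLM-GLO")

samples = {
    "en": "Hello, how are you today?",
    "kk": "Сәлем, бүгін қалайсыз?",  # Kazakh, one of the covered low-resource languages
}
for lang, text in samples.items():
    tokens = tokenizer.tokenize(text)
    print(f"{lang}: {len(tokens)} tokens -> {tokens}")
```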

## Usage

It is not advised to use the base language model directly for text generation. Instead, apply post-training methods such as Supervised Fine-tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), or continued pretraining to adapt the model to specific use cases.
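
For reference, a minimal loading sketch with Hugging Face `transformers` is shown below. Because this is a base model, the output is a plain continuation of the prompt rather than an instruction-following reply; the dtype and device settings are assumptions about your hardware.

```python
# Minimal sketch: load the base checkpoint and generate a continuation.
# bfloat16 and device_map="auto" assume a suitable GPU; drop them for CPU-only use.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AIDC-AI/Marco-LLM-GLO"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "Die Hauptstadt von Kasachstan ist"  # German prompt; the model simply continues it
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```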
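
If you adapt the model with SFT, one possible starting point is the TRL library. The sketch below is only illustrative: `sft_data.jsonl` is a hypothetical dataset with a `text` column, the output directory and hyperparameters are placeholders, and this is not the recipe used to train Marco-LLM.

```python
# Illustrative SFT sketch using TRL (assumed dependency, recent version);
# the dataset path and arguments are placeholders, not the authors' setup.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Hypothetical JSONL file whose "text" field holds formatted training examples.
dataset = load_dataset("json", data_files="sft_data.jsonl", split="train")

trainer = SFTTrainer(
    model="AIDC-AI/Marco-LLM-GLO",  # TRL loads the model and tokenizer from this ID
    args=SFTConfig(output_dir="marco-llm-7b-sft"),
    train_dataset=dataset,
)
trainer.train()
```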

## Citation

If you find our work helpful, please consider citing it:
```bibtex
@article{marco_llm_2024,
  title={Marco-LLM: Bridging Languages via Massive Multilingual Training for Cross-Lingual Enhancement},
  journal={arXiv preprint arXiv:2412.04003},
  year={2024},
  url={https://arxiv.org/abs/2412.04003}
}
```