---
license: apache-2.0
language:
- ar
- bn
- cs
- de
- es
- en
- el
- fr
- id
- it
- he
- hu
- ja
- kk
- ko
- ro
- ru
- az
- uk
- ur
- vi
- zh
- ms
- nl
- ne
- th
- tr
- pt
- pl
base_model:
- Qwen/Qwen2-7B
---
# Marco-LLM-GLO
## Introduction
Marco-LLM is a series of advanced multilingual language models designed to bridge the performance gap between high-resource and low-resource languages. This repository contains the Marco-LLM base language model with 7 billion parameters.
The model has undergone extensive multilingual continual pretraining on a diverse dataset containing over 5 trillion tokens, with a particular focus on enhancing performance in low-resource languages while maintaining strong capabilities in high-resource languages like English and Chinese.
Compared to state-of-the-art open-source language models, Marco-LLM demonstrates significant improvements in multilingual tasks, including machine translation, question answering, and reasoning across multiple languages.
For more details, please refer to our [Hugging Face page](https://huggingface.co/AIDC-AI/Marco-LLM-GLO).
## Model Details
Marco-LLM-GLO is a 7B-parameter Transformer language model built on Qwen2-7B. Its key features are:
- Multilingual Training: The model is trained on a large-scale multilingual dataset covering 29 languages, including both high-resource languages (e.g., English, Chinese) and low-resource languages (e.g., Kazakh, Nepali).
- Enhanced Tokenizer: An improved tokenizer is used to better handle multilingual data, ensuring higher efficiency and accuracy in tokenization (see the sketch after this list).
- Post-Training: Marco-LLM supports various post-training methods, such as Supervised Fine-tuning (SFT) and Direct Preference Optimization (DPO), to further enhance performance for specific tasks and languages.
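To illustrate the tokenizer point above, here is a minimal sketch that loads the tokenizer and compares token counts across scripts. It assumes the repository follows the standard Hugging Face `transformers` interface; the repository id is taken from the link above, and the example strings are placeholders.

```python
# Minimal sketch, assuming the standard Hugging Face transformers interface.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("AIDC-AI/Marco-LLM-GLO")

# Compare token counts for the same greeting in high- and low-resource languages;
# fewer tokens per sentence generally means more efficient coverage of that script.
for text in ["Hello, world!", "你好，世界！", "Сәлем, әлем!", "नमस्ते संसार!"]:
    ids = tokenizer(text)["input_ids"]
    print(f"{len(ids):3d} tokens  {text}")
```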
## Usage
We do not recommend using the base language model directly for text generation. Instead, apply post-training methods such as Supervised Fine-tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), or continued pretraining to adapt the model to specific use cases, as in the sketch below.
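As a hedged sketch of the recommended SFT route, one option is the TRL library's `SFTTrainer` (sketched against recent TRL versions that provide `SFTConfig`). The dataset and output directory below are illustrative placeholders, not part of this release.

```python
# Hedged sketch of supervised fine-tuning with TRL's SFTTrainer.
# Dataset and output directory are illustrative placeholders.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")  # placeholder SFT dataset

trainer = SFTTrainer(
    model="AIDC-AI/Marco-LLM-GLO",  # loaded via from_pretrained under the hood
    train_dataset=dataset,
    args=SFTConfig(output_dir="Marco-LLM-GLO-SFT"),
)
trainer.train()
```

Preference tuning with DPO follows the same pattern via TRL's `DPOTrainer`.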
## Citation
If you find our work helpful, please cite it:
```
@article{marcollm2024,
  title={Marco-LLM: Bridging Languages via Massive Multilingual Training for Cross-Lingual Enhancement},
  journal={arXiv preprint arXiv:2412.04003},
  year={2024},
  url={https://arxiv.org/abs/2412.04003}
}
```