---
license: apache-2.0
language:
- ar
- bn
- cs
- de
- es
- en
- el
- fr
- id
- it
- he
- hu
- ja
- kk
- ko
- ro
- ru
- az
- uk
- ur
- vi
- zh
- ms
- nl
- ne
- th
- tr
- pt
- pl
base_model:
- Qwen/Qwen2-7B
---


# Marco-LLM-GLO

## Introduction

Marco-LLM is a series of advanced multilingual language models designed to bridge the performance gap between high-resource and low-resource languages. This repository contains the 7-billion-parameter Marco-LLM base language model.

The model has undergone extensive multilingual continual pretraining on a diverse dataset containing over 5 trillion tokens, with a particular focus on enhancing performance in low-resource languages while maintaining strong capabilities in high-resource languages like English and Chinese.

Compared to state-of-the-art open-source language models, Marco-LLM demonstrates significant improvements in multilingual tasks, including machine translation, question answering, and reasoning across multiple languages.
For more details, please refer to our [Hugging Face page](https://huggingface.co/AIDC-AI/Marco-LLM-GLO).

## Model Details

Marco-LLM includes a 7B parameter model based on the Transformer architecture. The key features of Marco-LLM are:

- Multilingual Training: The model is trained on a large-scale multilingual dataset covering 29 languages, including both high-resource languages (e.g., English, Chinese) and low-resource languages (e.g., Kazakh, Nepali).

- Enhanced Tokenizer: An improved tokenizer is used to better handle multilingual data, ensuring higher efficiency and accuracy in tokenization.

- Post-Training: Marco-LLM supports various post-training methods, such as Supervised Fine-tuning (SFT) and Direct Preference Optimization (DPO), to further enhance performance for specific tasks and languages.
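As a quick sanity check of the multilingual tokenizer described above, the sketch below tokenizes short sentences in a few of the covered languages and prints the token counts. It assumes the `AIDC-AI/Marco-LLM-GLO` checkpoint is reachable via the Hugging Face Hub (or cached locally); the sample sentences are illustrative only.

```python
# Sketch: probing the multilingual tokenizer (assumes the checkpoint
# "AIDC-AI/Marco-LLM-GLO" can be downloaded from the Hugging Face Hub).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("AIDC-AI/Marco-LLM-GLO")

# Illustrative sentences in high- and low-resource languages.
samples = {
    "en": "The weather is nice today.",
    "zh": "今天天气很好。",
    "kk": "Бүгін ауа райы жақсы.",
    "ne": "आज मौसम राम्रो छ।",
}
for lang, text in samples.items():
    ids = tokenizer(text)["input_ids"]
    print(f"{lang}: {len(ids)} tokens")
```

A tokenizer with good multilingual coverage should produce comparably compact token sequences for the low-resource languages, rather than falling back to byte-level splits.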

## Usage

We do not recommend using the base language model for direct text generation. Instead, apply post-training methods such as Supervised Fine-tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), or continued pretraining to adapt the model to your specific use case.
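For reference, the sketch below shows one way to load the checkpoint with `transformers` for inspection or as a starting point for fine-tuning. The prompt and generation settings are illustrative assumptions; note that, as a base model, the output is a raw text continuation, not an instruction-following answer.

```python
# Sketch: loading the base model (e.g., as a starting point for SFT).
# Assumes a transformers version with qwen2 support and access to the Hub.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AIDC-AI/Marco-LLM-GLO"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # halve memory vs. fp32
    device_map="auto",           # place layers on available GPUs/CPU
)

# A base model continues text; it does not follow chat-style instructions.
prompt = "Translate to French: Hello, world! ->"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

For task- or language-specific behavior, wrap this checkpoint in an SFT or DPO training loop rather than prompting it directly.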


## Citation

If you find our work helpful, please cite it:
```
@article{marco-llm-2024,
  title={Marco-LLM: Bridging Languages via Massive Multilingual Training for Cross-Lingual Enhancement},
  journal={arXiv preprint arXiv:2412.04003},
  year={2024},
  url={https://arxiv.org/abs/2412.04003}
}
```