---
license: apache-2.0
language:
- ar
- bn
- cs
- de
- es
- en
- el
- fr
- id
- it
- he
- hu
- ja
- kk
- ko
- ro
- ru
- az
- uk
- ur
- vi
- zh
- ms
- nl
- ne
- th
- tr
- pt
- pl
base_model:
- Qwen/Qwen2-7B
---

# Marco-LLM-GLO

## Introduction

Marco-LLM is a series of advanced multilingual language models designed to bridge the performance gap between high-resource and low-resource languages. This repository contains the Marco-LLM base language model with 7 billion parameters.

The model has undergone extensive multilingual continual pretraining on a diverse dataset of over 5 trillion tokens, with a particular focus on enhancing performance in low-resource languages while maintaining strong capabilities in high-resource languages such as English and Chinese.

Compared to state-of-the-art open-source language models, Marco-LLM demonstrates significant improvements on multilingual tasks, including machine translation, question answering, and reasoning across multiple languages.

For more details, please refer to our [Hugging Face page](https://huggingface.co/AIDC-AI/Marco-LLM-GLO).

## Model Details

Marco-LLM includes a 7B-parameter model based on the Transformer architecture. The key features of Marco-LLM are:

- Multilingual Training: The model is trained on a large-scale multilingual dataset covering 29 languages, including both high-resource languages (e.g., English, Chinese) and low-resource languages (e.g., Kazakh, Nepali).
- Enhanced Tokenizer: An improved tokenizer is used to better handle multilingual data, ensuring higher efficiency and accuracy in tokenization (see the tokenization sketch after this list).
- Post-Training: Marco-LLM supports various post-training methods, such as Supervised Fine-tuning (SFT) and Direct Preference Optimization (DPO), to further enhance performance on specific tasks and languages.
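
As an illustration of the tokenizer point above, the following minimal sketch (not an official example) loads the released tokenizer with Hugging Face `transformers` and compares how it segments a high-resource and a low-resource language. The sample sentences are arbitrary, and the checkpoint name is taken from the link above.

```python
# Minimal sketch: compare tokenization across a high- and a low-resource language.
# Assumes `transformers` is installed and the public checkpoint below is accessible.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("AIDC-AI/Marco-LLM-GLO")

samples = {
    "en": "Hello, how are you today?",
    "kk": "Сәлем, бүгін қалайсыз?",  # Kazakh, one of the covered low-resource languages
}
for lang, text in samples.items():
    tokens = tokenizer.tokenize(text)
    print(f"{lang}: {len(tokens)} tokens -> {tokens}")
```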

## Usage

It is not advised to use the base language model directly for text generation. Instead, apply post-training methods such as Supervised Fine-tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), or continued pretraining to adapt the model to specific use cases.
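
For reference, a minimal loading sketch with Hugging Face `transformers` is shown below. Because this is a base model, the output is a plain continuation of the prompt rather than an instruction-following reply; the dtype and device settings are assumptions about your hardware.

```python
# Minimal sketch: load the base checkpoint and generate a continuation.
# bfloat16 and device_map="auto" assume a suitable GPU; drop them for CPU-only use.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AIDC-AI/Marco-LLM-GLO"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "Die Hauptstadt von Kasachstan ist"  # German prompt; the model simply continues it
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```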
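
If you adapt the model with SFT, one possible starting point is the TRL library. The sketch below is only illustrative: `sft_data.jsonl` is a hypothetical dataset with a `text` column, the output directory and hyperparameters are placeholders, and this is not the recipe used to train Marco-LLM.

```python
# Illustrative SFT sketch using TRL (assumed dependency, recent version);
# the dataset path and arguments are placeholders, not the authors' setup.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Hypothetical JSONL file whose "text" field holds formatted training examples.
dataset = load_dataset("json", data_files="sft_data.jsonl", split="train")

trainer = SFTTrainer(
    model="AIDC-AI/Marco-LLM-GLO",  # TRL loads the model and tokenizer from this ID
    args=SFTConfig(output_dir="marco-llm-7b-sft"),
    train_dataset=dataset,
)
trainer.train()
```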

## Citation

If you find our work helpful, please consider citing it:
```bibtex
@article{marco_llm_2024,
  title={Marco-LLM: Bridging Languages via Massive Multilingual Training for Cross-Lingual Enhancement},
  journal={arXiv preprint arXiv:2412.04003},
  year={2024},
  url={https://arxiv.org/abs/2412.04003}
}
```