---
license: mit
language:
  - en
  - zh
  - id
---

MeRALiON-LLaMA-3-8B-Instruct

MeRALiON-LLaMA-3-8B-Instruct is a large language model (LLM) designed to excel at multilingual understanding and instruction-following tasks. The model builds on the Llama-3-8B architecture and was continually pretrained from Llama-3-8B-Base, enhanced through an extensive, meticulously curated continued-pretraining process and a careful merge of model weights.

Model Overview

MeRALiON-LLaMA-3-8B-Instruct is trained primarily on English, Chinese, and Indonesian, with particular emphasis on strengthening understanding and generation in Southeast Asian language contexts, especially Chinese and Indonesian. By applying corpus-mixing strategies developed for regional multilingual datasets, we diversified the training content through domain classification, hyperparameter tuning, and replay strategies. These measures help the model retain prior knowledge without catastrophic forgetting while significantly improving the quality and contextual accuracy of its responses in these languages.
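
The exact data sources and mixing ratios are not released in this card. As an illustration only, the sketch below shows one way such a mixture can be expressed in code: hypothetical language weights plus a replay fraction drawn from data resembling the original pretraining distribution.

import random

# Hypothetical sampling weights; the actual MeRALiON mixture is not specified here.
LANGUAGE_WEIGHTS = {"en": 0.45, "zh": 0.30, "id": 0.25}
REPLAY_FRACTION = 0.15  # share of examples replayed from original-distribution data

def sample_document(corpora, replay_corpus):
    """Pick the next training document.

    corpora: dict mapping language code -> list of new (SEA-focused) documents
    replay_corpus: list of documents approximating the original pretraining distribution
    """
    # Occasionally replay "old" data to mitigate catastrophic forgetting.
    if random.random() < REPLAY_FRACTION:
        return random.choice(replay_corpus)
    # Otherwise sample a language by mixture weight, then a document from it.
    lang = random.choices(list(LANGUAGE_WEIGHTS), weights=list(LANGUAGE_WEIGHTS.values()))[0]
    return random.choice(corpora[lang])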

Key advancements include:

  • Extended Pretraining: Continued pretraining on over 120 billion tokens of primarily English, Chinese, and Indonesian text.
  • SEA Multilingual Corpus Mixing: Drawing on strategies from Southeast Asian multilingual corpora to enhance language understanding and generation capabilities.
  • Domain-Diversified Pretraining Corpus: Careful selection and classification of training data from a wide range of topics and genres.
  • Optimized Training Techniques: Implementing replay strategies and carefully selected hyperparameters to ensure stability, maintain quality, and avoid catastrophic forgetting.
  • Instruction Tuning via Model Merging: Rather than following a standard instruction-tuning pipeline, this model was derived by merging the official Llama-3.1-8B base and Llama-3.1-8B-Instruct weights, producing strong instruction-following capabilities without additional supervised instruction data.

Highlights

  • Enhanced Performance: MeRALiON-LLaMA-3-8B-Instruct demonstrates improved results on benchmarks including cross-MMLU, cross-LogiQA, cross-XQuAD, IndoMMLU, and CNEval, surpassing the capabilities of the official Llama-3 models.
  • Extensive Multilingual Support: Strong coverage of English, Chinese, and Indonesian text, coupled with strategies inspired by Southeast Asian multilingual approaches, ensures robust understanding of and responsiveness to diverse linguistic inputs.

Model Specifications

  • Model Type: Decoder
  • Architecture: Llama-3.1-8B
  • Context Length: 8192 tokens
  • Languages: English, Chinese, Indonesian
  • License: Llama3 Community License

Benchmark Performance

MeRALiON-LLaMA-3-8B-Instruct achieves notable improvements over official Llama-3 base and instruction-tuned models, highlighting the impact of our continued pretraining strategies. Through techniques such as corpus mixing, replay to prevent forgetting, and careful model merging, this model not only enhances general reasoning capabilities but also excels across multilingual and domain-specific benchmarks. In addition, we employed an LLM-based evaluation pipeline to standardize the judging process across varied output formats, ensuring fair and consistent comparisons. Building on the robust instruction-following proficiency of Llama-3.1-8B, MeRALiON-LLaMA-3-8B-Instruct extends its strengths to Southeast Asian languages, including Chinese and Indonesian.
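
The judge model and prompt used in that evaluation pipeline are not specified here. Under those assumptions, the following minimal sketch shows the general LLM-as-judge pattern of comparing a model answer to a reference and extracting a binary verdict.

# Minimal LLM-as-judge sketch (illustrative; not the exact prompt or judge used for these results).
JUDGE_PROMPT = (
    "Question:\n{question}\n\n"
    "Reference answer:\n{reference}\n\n"
    "Model answer:\n{candidate}\n\n"
    "Does the model answer convey the same meaning as the reference answer? "
    "Reply with exactly one word: correct or incorrect."
)

def judge(question, reference, candidate, judge_llm):
    """Return True if the judge LLM deems the candidate answer correct.

    judge_llm: any callable mapping a prompt string to a response string.
    """
    verdict = judge_llm(JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate))
    return verdict.strip().lower().startswith("correct")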

Key highlights from the evaluations include:

  • Cross-MMLU, Cross-LogiQA: Enhanced reasoning and question-answering capabilities illustrate that continued pretraining improves multilingual understanding and accuracy over baseline Llama models.

  • IndoMMLU and CNEval: Performance boosts in Indonesian and Chinese benchmarks highlight that careful corpus mixing and replay strategies help maintain and improve language-specific strengths.

Cross-MMLU

| Model Series | Model | Link | English | Chinese | Indonesian | Malay | Avg (En/Zh/Id/Ms) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA Series | MeRALiON-LLaMA-3-8B-Instruct | | 0.847 | 0.693 | 0.713 | 0.613 | 0.717 |
| LLaMA Series | Meta-Llama-3.1-8B-Instruct | Link | 0.82 | 0.633 | 0.66 | 0.647 | 0.690 |
| LLaMA Series | Llama3-8B-CPT-SEA-LION-v2.1-Instruct | Link | 0.753 | 0.667 | 0.693 | 0.64 | 0.688 |
| LLaMA Series | Meta-Llama-3-8B-Instruct | Link | 0.767 | 0.653 | 0.573 | 0.573 | 0.642 |
| Non-LLaMA Series | GPT4o-0513 | Link | 0.927 | 0.887 | 0.88 | 0.907 | 0.900 |
| Non-LLaMA Series | Gemma-2-9B-IT | Link | 0.84 | 0.793 | 0.78 | 0.747 | 0.790 |
| Non-LLaMA Series | Gemma2-9B-CPT-SEA-Lion-v3-Instruct | Link | 0.847 | 0.787 | 0.793 | 0.733 | 0.790 |
| Non-LLaMA Series | Qwen2.5-7B-Instruct | Link | 0.847 | 0.84 | 0.753 | 0.713 | 0.788 |
| Non-LLaMA Series | SeaLLMs-v3-7B-Chat | Link | 0.833 | 0.727 | 0.74 | 0.687 | 0.747 |

Cross-LogiQA

| Model Series | Model | Link | English | Chinese | Indonesian | Malay | Avg (En/Zh/Id/Ms) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA Series | Meta-Llama-3.1-8B-Instruct | Link | 0.585 | 0.585 | 0.455 | 0.523 | 0.537 |
| LLaMA Series | MeRALiON-LLaMA-3-8B-Instruct | | 0.591 | 0.528 | 0.494 | 0.489 | 0.526 |
| LLaMA Series | Llama3-8B-CPT-SEA-LION-v2.1-Instruct | Link | 0.528 | 0.517 | 0.403 | 0.443 | 0.473 |
| Non-LLaMA Series | Qwen2.5-7B-Instruct | Link | 0.693 | 0.71 | 0.631 | 0.534 | 0.642 |
| Non-LLaMA Series | Gemma-2-9B-IT | Link | 0.659 | 0.636 | 0.585 | 0.602 | 0.621 |
| Non-LLaMA Series | Gemma2-9B-CPT-SEA-Lion-v3-Instruct | Link | 0.636 | 0.642 | 0.557 | 0.551 | 0.597 |
| Non-LLaMA Series | SeaLLMs-v3-7B-Chat | Link | 0.568 | 0.585 | 0.494 | 0.517 | 0.541 |

IndoMMLU

| Model Series | Model | Link | Accuracy |
| --- | --- | --- | --- |
| LLaMA Series | MeRALiON-LLaMA-3-8B-Instruct | | 0.576 |
| LLaMA Series | Llama3-8B-CPT-SEA-LION-v2.1-Instruct | Link | 0.560 |
| LLaMA Series | Meta-Llama-3.1-8B-Instruct | Link | 0.548 |
| LLaMA Series | Meta-Llama-3-8B-Instruct | Link | 0.521 |
| Non-LLaMA Series | GPT4o-0513 | Link | 0.760 |
| Non-LLaMA Series | Gemma2-9B-CPT-SEA-Lion-v3-Instruct | Link | 0.626 |
| Non-LLaMA Series | Gemma-2-9B-IT | Link | 0.621 |
| Non-LLaMA Series | Qwen2.5-7B-Instruct | Link | 0.582 |
| Non-LLaMA Series | SeaLLMs-v3-7B-Chat | Link | 0.541 |

CNEval

| Model Series | Model | Link | Accuracy |
| --- | --- | --- | --- |
| LLaMA Series | MeRALiON-LLaMA-3-8B-Instruct | | 0.514 |
| LLaMA Series | Llama3-8B-CPT-SEA-LION-v2.1-Instruct | Link | 0.505 |
| LLaMA Series | Llama3-8B-CPT-SEA-Lion-v2-Instruct | Link | 0.495 |
| LLaMA Series | Meta-Llama-3-8B-Instruct | Link | 0.467 |
| LLaMA Series | Meta-Llama-3.1-8B-Instruct | Link | 0.457 |
| Non-LLaMA Series | Qwen2-7B-Instruct | Link | 0.829 |
| Non-LLaMA Series | GPT4o-0513 | Link | 0.81 |
| Non-LLaMA Series | Qwen2.5-7B-Instruct | Link | 0.8 |
| Non-LLaMA Series | Gemma2-9B-CPT-SEA-Lion-v3-Instruct | Link | 0.59 |
| Non-LLaMA Series | Gemma-2-9B-IT | Link | 0.581 |

These results collectively show how the MeRALiON-LLaMA-3-8B-Instruct model builds upon the strengths of official Llama-3.1 variants. The techniques we employed can serve as a blueprint, potentially guiding future refinements and adaptations for other models and language sets.

Instruction-Following

By merging the official Llama-3.1-8B-base and Llama-3.1-8B-instruct weights, we inherit strong instruction-following behavior without additional instruction-tuning steps. The model can follow various user prompts accurately and coherently, producing well-structured, contextually relevant responses.
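
The exact merging recipe and ratio are not documented in this card. As a rough, hypothetical sketch of the general technique, a weight-space merge of two same-architecture checkpoints can be written as a parameter-wise linear interpolation:

import torch
from transformers import AutoModelForCausalLM

# Illustrative linear interpolation of two same-architecture checkpoints.
# The checkpoints and the 0.5 ratio below are examples, not the settings used for MeRALiON.
ALPHA = 0.5  # hypothetical weight given to the instruct checkpoint

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B", torch_dtype=torch.bfloat16)
instruct = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct", torch_dtype=torch.bfloat16)

instruct_state = instruct.state_dict()
merged_state = {
    # merged = (1 - alpha) * base + alpha * instruct, for every parameter tensor
    name: (1 - ALPHA) * param + ALPHA * instruct_state[name]
    for name, param in base.state_dict().items()
}

base.load_state_dict(merged_state)
base.save_pretrained("merged-llama-3.1-8b")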

Usage

MeRALiON-LLaMA-3-8B-Instruct can be deployed using the 🤗 Transformers library. With careful device mapping and dtype settings, users can achieve efficient and high-quality text generation.

Example:

import transformers
import torch

model_id = "MERaLiON/MeRALiON-LLaMA-3-8B-Instruct"

# Load the model in bfloat16 and let Accelerate place it on the available devices.
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

# Chat-style input: a list of {"role", "content"} messages.
messages = [
    {"role": "user", "content": "What is the sentiment of the following sentence?\nSentence: This book is incredibly dull.\nAnswer:"},
]

outputs = pipeline(
    messages,
    max_new_tokens=256,
)
# The pipeline returns the full conversation; the last message is the model's reply.
print(outputs[0]["generated_text"][-1])

Note: We use the same chat format as the official Llama-3.1-8B-Instruct model.
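
For users who prefer explicit control over formatting and generation, the same chat template can be applied through the tokenizer. The sketch below is one way to do this; the prompt and generation length are illustrative.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MERaLiON/MeRALiON-LLaMA-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [
    {"role": "user", "content": "Translate to Indonesian: 'The weather is nice today.'"},
]

# apply_chat_template wraps each turn in the Llama-3.1 special tokens
# and appends the header for the assistant's reply.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))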

Caveats and Limitations

Like many LLMs, MeRALiON-LLaMA-3-8B-Instruct may hallucinate or produce irrelevant or incorrect content. While we have taken steps to mitigate these issues, users are advised to critically evaluate outputs, especially in high-stakes applications. The model has not undergone explicit safety alignment and filtering; users should implement their own safeguards, content moderation, and evaluation strategies.

Safety and Liability

This model is not strongly safety-aligned. Users are responsible for implementing their own safety checks and mitigations. The authors and affiliated institutions are not liable for any damages or losses arising from the use of this model.

Technical Specifications

MeRALiON-LLaMA-3-8B-Instruct underwent continued pretraining using computational resources provided by Singapore NSCC Aspire2A+ and The TPU Research Cloud. We utilized diverse data sources and adaptive strategies to ensure stable training without catastrophic forgetting.

Data and Licensing

All data used for continued pretraining and model merging adheres to commercially permissible licenses. We have ensured that sources are free of restricted content to the best of our abilities. Details on the dataset and licensing will be provided in the future.

Call for Contributions

We invite researchers, developers, and community members to contribute by:

  • Identifying and reporting issues or biases.
  • Providing additional pretraining or instruction data.
  • Suggesting enhancements to documentation or evaluation metrics.
  • Extending the model to support additional languages or domains.

Please visit our repository for more information and contribution guidelines.

The Team

  • Huang Xin
  • Tarun Kumar Vangani
  • Minh Duc Pham
  • Wang Bin
  • Liu Zhengyuan

Acknowledgements

Our work is supported by the resources and platforms provided by Singapore NSCC Aspire2A+ and The TPU Research Cloud. We thank all contributors and collaborators who have made this effort possible.

Contact

For additional information or inquiries, please reach out to us via our contact form (link to be provided) or check the GitHub repository for the latest updates and information.

Disclaimer

This repository contains the weights for a model not specifically aligned for safety. Users are advised to perform their own due diligence, safety fine-tuning, and compliance measures. The authors disclaim liability for any direct or indirect damages resulting from model use.