Quantization made by Richard Erkhov.

emma-500-llama2-7b - GGUF

Model creator: https://huggingface.co/MaLA-LM/
Original model: https://huggingface.co/MaLA-LM/emma-500-llama2-7b/

Name	Quant method	Size
emma-500-llama2-7b.Q2_K.gguf	Q2_K	2.36GB
emma-500-llama2-7b.IQ3_XS.gguf	IQ3_XS	2.6GB
emma-500-llama2-7b.IQ3_S.gguf	IQ3_S	2.75GB
emma-500-llama2-7b.Q3_K_S.gguf	Q3_K_S	2.75GB
emma-500-llama2-7b.IQ3_M.gguf	IQ3_M	2.9GB
emma-500-llama2-7b.Q3_K.gguf	Q3_K	3.07GB
emma-500-llama2-7b.Q3_K_M.gguf	Q3_K_M	3.07GB
emma-500-llama2-7b.Q3_K_L.gguf	Q3_K_L	3.35GB
emma-500-llama2-7b.IQ4_XS.gguf	IQ4_XS	3.4GB
emma-500-llama2-7b.Q4_0.gguf	Q4_0	3.56GB
emma-500-llama2-7b.IQ4_NL.gguf	IQ4_NL	3.58GB
emma-500-llama2-7b.Q4_K_S.gguf	Q4_K_S	3.59GB
emma-500-llama2-7b.Q4_K.gguf	Q4_K	3.8GB
emma-500-llama2-7b.Q4_K_M.gguf	Q4_K_M	3.8GB
emma-500-llama2-7b.Q4_1.gguf	Q4_1	3.95GB
emma-500-llama2-7b.Q5_0.gguf	Q5_0	4.33GB
emma-500-llama2-7b.Q5_K_S.gguf	Q5_K_S	4.33GB
emma-500-llama2-7b.Q5_K.gguf	Q5_K	4.45GB
emma-500-llama2-7b.Q5_K_M.gguf	Q5_K_M	4.45GB
emma-500-llama2-7b.Q5_1.gguf	Q5_1	4.72GB
emma-500-llama2-7b.Q6_K.gguf	Q6_K	5.15GB
emma-500-llama2-7b.Q8_0.gguf	Q8_0	6.67GB
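
As a rough aid for choosing among the files above, the sketch below picks the largest quantization that fits a given memory budget and shows one possible way to run it. The 1 GB headroom for context and runtime overhead is an assumed rule of thumb, not a measured figure. The helper names `pick_quant`, `gguf_filename`, and `generate_with_gguf` are hypothetical. The last function assumes the separate `llama-cpp-python` package is installed and that the file has already been downloaded locally.

```python
# Sizes in GB copied from the table above (a subset, for brevity).
QUANT_SIZES_GB = {
    "Q2_K": 2.36,
    "Q3_K_M": 3.07,
    "Q4_K_M": 3.8,
    "Q5_K_M": 4.45,
    "Q6_K": 5.15,
    "Q8_0": 6.67,
}


def pick_quant(budget_gb, headroom_gb=1.0):
    """Return the largest listed quant that fits in budget_gb minus headroom_gb, or None.

    headroom_gb is an assumed allowance for the KV cache and runtime overhead.
    """
    usable = budget_gb - headroom_gb
    fitting = {name: size for name, size in QUANT_SIZES_GB.items() if size <= usable}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


def gguf_filename(quant):
    """Build the file name used in this repository for a given quant method."""
    return f"emma-500-llama2-7b.{quant}.gguf"


def generate_with_gguf(model_path, prompt, max_tokens=64):
    """Generate a continuation from a local GGUF file.

    Assumes llama-cpp-python is installed; it is imported lazily so the
    helpers above work without it.
    """
    from llama_cpp import Llama

    llm = Llama(model_path=model_path, n_ctx=2048)
    result = llm(prompt, max_tokens=max_tokens)
    return result["choices"][0]["text"]


if __name__ == "__main__":
    quant = pick_quant(budget_gb=6.0)
    print(quant, gguf_filename(quant))
```

Larger files generally preserve more of the original model's quality, so the sketch prefers the biggest file that fits rather than the smallest.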

Original model description:

license: llama2
datasets:
- MaLA-LM/mala-monolingual-split
base_model:
- meta-llama/Llama-2-7b-hf
library_name: transformers

EMMA-500: Enhancing Massively Multilingual Adaptation of Large Language Models

Model Description

EMMA-500 is a multilingual language model designed to improve language representation, especially in low-resource languages, through continual pre-training of Llama 2 7B. It is trained on the MaLA Corpus, which spans over 500 languages and 74 billion tokens, and it performs well on multilingual tasks such as commonsense reasoning, machine translation, open-ended generation, and text classification.

EMMA-500 outperforms other Llama 2-based models in diverse multilingual settings while maintaining robustness in specialized tasks.

Model Details

Architecture: Built on Llama 2 7B with enhanced language adaptation through continual pre-training.
Languages: Supports 546 languages with substantial training data (over 100k tokens each).
Data Mix: A diverse mix of text from domains like code, books, instruction data, and more.
Key Tasks: Commonsense reasoning, machine translation, text classification, natural language inference, code generation, and open-ended generation.

Data Access

Usage

You can use EMMA-500 for multilingual text generation. Below is an example to generate text using the model:

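from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the original (unquantized) checkpoint and its tokenizer from the Hugging Face Hub.
model_name = "MaLA-LM/emma-500-llama2-7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Tokenize the prompt into PyTorch tensors.
input_text = "Once upon a time"
inputs = tokenizer(input_text, return_tensors="pt")

# Set max_new_tokens explicitly; otherwise generate() uses the model's default length limit.
outputs = model.generate(**inputs, max_new_tokens=50)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))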

Model Performance

EMMA-500 was evaluated across multiple benchmarks and tasks, demonstrating:

Lowest negative log-likelihood in intrinsic evaluations.
Significant improvements in commonsense reasoning, machine translation, and open-ended generation.
Stronger results than all other Llama 2-based models in text classification and natural language inference.
Enhanced performance in code generation and machine reading comprehension (MRC).
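
For readers unfamiliar with the intrinsic metric in the first item, the sketch below shows how an average negative log-likelihood (NLL) can be computed from the probabilities a model assigns to the reference tokens. The function name `average_nll` and the probabilities are made up for illustration, and the evaluation behind the results above may aggregate differently (for example, summing over a whole sequence rather than averaging per token).

```python
import math


def average_nll(token_probs):
    """Mean negative log-likelihood, in nats, over per-token probabilities.

    token_probs holds the probability the model assigned to each reference
    token. Lower values mean the model found the text less surprising.
    """
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


if __name__ == "__main__":
    # Hypothetical probabilities for the same text under two models.
    confident_model = [0.9, 0.8, 0.95]
    uncertain_model = [0.3, 0.2, 0.4]
    print(average_nll(confident_model))
    print(average_nll(uncertain_model))
```

A model that assigns higher probability to the reference tokens gets a lower NLL, which is why "lowest" is the desirable direction here.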

Challenges remain in low-resource languages, where the model tends to have higher Self-BLEU scores, indicating reduced output diversity.
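
To make the link between Self-BLEU and diversity concrete, the sketch below scores each generated sample against all the others and averages the results. It is a simplified illustration, not the evaluation code behind the results above. It assumes whitespace tokenization (which does not suit every script), uses n-grams only up to length 2, and applies no smoothing. The function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=2):
    """Simplified BLEU: clipped n-gram precisions, geometric mean, brevity penalty."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        total = sum(cand_counts.values())
        if total == 0:
            return 0.0
        # Clip each n-gram count by its highest count in any single reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        matched = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        if matched == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(matched / total))
    # Brevity penalty against the reference length closest to the candidate.
    closest = min((len(r) for r in references),
                  key=lambda length: (abs(length - len(candidate)), length))
    if len(candidate) >= closest:
        penalty = 1.0
    else:
        penalty = math.exp(1 - closest / len(candidate))
    return penalty * math.exp(sum(log_precisions) / max_n)


def self_bleu(samples, max_n=2):
    """Average BLEU of each sample scored against all the other samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    tokenized = [s.split() for s in samples]
    scores = []
    for i, cand in enumerate(tokenized):
        refs = tokenized[:i] + tokenized[i + 1:]
        scores.append(bleu(cand, refs, max_n))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    repetitive = ["the cat sat", "the cat sat", "the cat ran"]
    diverse = ["the cat sat", "a dog barked", "birds fly south"]
    print(self_bleu(repetitive), self_bleu(diverse))
```

Samples that repeat each other share many n-grams and so score high against one another, which is why a higher Self-BLEU is read as lower diversity.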

Citation

@article{ji2024emma500enhancingmassivelymultilingual,
      title={{EMMA}-500: Enhancing Massively Multilingual Adaptation of Large Language Models}, 
      author={Shaoxiong Ji and Zihao Li and Indraneil Paul and Jaakko Paavola and Peiqin Lin and Pinzhen Chen and Dayyán O'Brien and Hengyu Luo and Hinrich Schütze and Jörg Tiedemann and Barry Haddow},
      year={2024},
      journal={arXiv preprint 2409.17892},
      url={https://arxiv.org/abs/2409.17892}, 
}

Acknowledgements

We extend our thanks to the language communities and contributors who helped source, clean, and validate the diverse data used in the MaLA Corpus. Their efforts are invaluable in supporting linguistic diversity in AI research.

This work was carried out by researchers at Helsinki-NLP in collaboration with partners from TU Darmstadt, the University of Edinburgh, and LMU Munich. It is funded by HPLT and UTTER.