|
---
language:
- ko
- en
- zh
- ja
license: other
library_name: transformers
tags:
- pytorch
license_name: gemma-terms-of-use
license_link: https://ai.google.dev/gemma/terms
pipeline_tag: text-generation
model-index:
- name: gemma-mling-7b
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: IFEval (0-Shot)
      type: HuggingFaceH4/ifeval
      args:
        num_few_shot: 0
    metrics:
    - type: inst_level_strict_acc and prompt_level_strict_acc
      value: 20.29
      name: strict accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=beomi/gemma-mling-7b
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: BBH (3-Shot)
      type: BBH
      args:
        num_few_shot: 3
    metrics:
    - type: acc_norm
      value: 17.63
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=beomi/gemma-mling-7b
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MATH Lvl 5 (4-Shot)
      type: hendrycks/competition_math
      args:
        num_few_shot: 4
    metrics:
    - type: exact_match
      value: 4.15
      name: exact match
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=beomi/gemma-mling-7b
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GPQA (0-shot)
      type: Idavidrein/gpqa
      args:
        num_few_shot: 0
    metrics:
    - type: acc_norm
      value: 0.0
      name: acc_norm
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=beomi/gemma-mling-7b
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MuSR (0-shot)
      type: TAUR-Lab/MuSR
      args:
        num_few_shot: 0
    metrics:
    - type: acc_norm
      value: 6.85
      name: acc_norm
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=beomi/gemma-mling-7b
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU-PRO (5-shot)
      type: TIGER-Lab/MMLU-Pro
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 18.14
      name: accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=beomi/gemma-mling-7b
      name: Open LLM Leaderboard
---
|
|
|
# Gemma-Mling: Multilingual Gemma |
|
|
|
> Update @ 2024.04.15: First release of Gemma-Mling 7B model |
|
|
|
**Original Gemma Model Page**: [Gemma](https://ai.google.dev/gemma/docs) |
|
|
|
This model card corresponds to the 7B base version of the **Gemma-Mling** model,
continually pretrained mainly on Korean/English/Chinese/Japanese text plus the Glot500 multilingual corpus.
|
|
|
**Resources and Technical Documentation**: |
|
|
|
* [Original Google's Gemma-7B](https://huggingface.co/google/gemma-7b) |
|
* [Training Code @ Github: Gemma-EasyLM](https://github.com/Beomi/Gemma-EasyLM) |
|
|
|
**Terms of Use**: [Terms](https://www.kaggle.com/models/google/gemma/license/consent) |
|
|
|
**Citation** |
|
|
|
```bibtex
@misc{gemma_mling_7b,
  author = { Junbum Lee and Taekyoon Choi },
  title = { gemma-mling-7b },
  year = 2024,
  url = { https://huggingface.co/beomi/gemma-mling-7b },
  publisher = { Hugging Face }
}
```
|
|
|
**Model Developers**: Junbum Lee (Beomi) & Taekyoon Choi (Taekyoon) |
|
|
|
## Model Information |
|
|
|
### Usage |
|
|
|
Below we share some code snippets to help you quickly get started with running the model. First make sure to `pip install -U transformers`, then copy the snippet from the section that is relevant for your use case.
|
|
|
#### Running the model on a CPU |
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("beomi/gemma-mling-7b")
model = AutoModelForCausalLM.from_pretrained("beomi/gemma-mling-7b")

input_text = "머신러닝과 딥러닝의 차이는"
input_ids = tokenizer(input_text, return_tensors="pt")

outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))
```
|
|
|
|
|
#### Running the model on a single / multi GPU |
|
|
|
```python
# pip install accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("beomi/gemma-mling-7b")
model = AutoModelForCausalLM.from_pretrained("beomi/gemma-mling-7b", device_map="auto")

input_text = "머신러닝과 딥러닝의 차이는"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))
```
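
If GPU memory is limited, the weights can also be loaded in half precision. The snippet below is a minimal sketch using the standard `torch_dtype` argument of `transformers` (a generic library option, not something specific to this model); quality should stay close to full precision, but verify on your own prompts.

```python
# pip install accelerate
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("beomi/gemma-mling-7b")
# Load the weights in float16 to roughly halve GPU memory usage.
model = AutoModelForCausalLM.from_pretrained(
    "beomi/gemma-mling-7b",
    device_map="auto",
    torch_dtype=torch.float16,
)

input_text = "머신러닝과 딥러닝의 차이는"
input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)

outputs = model.generate(**input_ids, max_new_tokens=64)
print(tokenizer.decode(outputs[0]))
```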
|
|
|
### Inputs and outputs |
|
|
|
* **Input:** Text string, such as a question, a prompt, or a document to be summarized.
* **Output:** Generated multilingual text in response to the input, such as an answer to a question or a summary of a document.
|
|
|
## Implementation Information |
|
|
|
Details about the model internals. |
|
|
|
### Software |
|
|
|
Training was done using [beomi/Gemma-EasyLM](https://github.com/Beomi/Gemma-EasyLM). |
|
|
|
### Dataset |
|
|
|
We trained on a mixture of multilingual datasets for a total of 100B tokens.
The released model is the best-performing checkpoint according to the evaluation below.

For the Korean and English portion, we used a sampled llama2ko training dataset that combines the two languages at a 1:1 ratio. An illustrative interleaving sketch follows the dataset table below.
|
|
|
| Dataset                  | Jsonl (GB) | Sampled |
|--------------------------|------------|---------|
| range3/cc100-ja          | 96.39      | No      |
| Skywork/SkyPile-150B     | 100.57     | Yes     |
| llama2ko dataset (ko/en) | 108.5      | Yes     |
| cis-lmu/Glot500          | 181.24     | No      |
| Total                    | 486.7      |         |
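
For illustration only, here is a minimal sketch of how a multilingual mixture like the one above could be streamed and interleaved with the `datasets` library. The dataset names come from the table above, but the sampling probabilities are placeholders, not the ratios actually used to train Gemma-Mling.

```python
from datasets import load_dataset, interleave_datasets

# Stream the corpora so they do not need to be fully downloaded up front.
ja = load_dataset("range3/cc100-ja", split="train", streaming=True)
zh = load_dataset("Skywork/SkyPile-150B", split="train", streaming=True)

# Placeholder sampling weights; the real ko/en/zh/ja mixture ratios
# are not published in this card.
mixed = interleave_datasets([ja, zh], probabilities=[0.5, 0.5], seed=42)

# Both corpora are assumed to expose a "text" column.
for i, example in enumerate(mixed):
    print(example["text"][:80])
    if i >= 2:
        break
```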
|
|
|
## Training Progress |
|
|
|
- Report Link: https://api.wandb.ai/links/tgchoi/6lt0ce3s |
|
|
|
## Evaluation |
|
|
|
Model evaluation metrics and results. |
|
|
|
### Evaluation Scripts |
|
|
|
- For Knowledge / KoBest / XCOPA / XWinograd
  - [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) v0.4.2

```bash
!git clone https://github.com/EleutherAI/lm-evaluation-harness.git
!cd lm-evaluation-harness && pip install -r requirements.txt && pip install -e .

!lm_eval --model hf \
    --model_args pretrained=beomi/gemma-mling-7b,dtype="float16" \
    --tasks "haerae,kobest,kmmlu_direct,cmmlu,ceval-valid,mmlu,xwinograd,xcopa" \
    --num_fewshot "0,5,5,5,5,5,0,5" \
    --device cuda
```
|
- For JP Eval Harness
  - [Stability-AI/lm-evaluation-harness (`jp-stable` branch)](https://github.com/Stability-AI/lm-evaluation-harness/tree/jp-stable)

```bash
!git clone -b jp-stable https://github.com/Stability-AI/lm-evaluation-harness.git
!cd lm-evaluation-harness && pip install -e ".[ja]"
!pip install 'fugashi[unidic]' && python -m unidic download

!cd lm-evaluation-harness && python main.py \
    --model hf-causal \
    --model_args pretrained=beomi/gemma-mling-7b,torch_dtype='auto' \
    --tasks "jcommonsenseqa-1.1-0.3,jnli-1.3-0.3,marc_ja-1.1-0.3,jsquad-1.1-0.3,jaqket_v2-0.2-0.3,xlsum_ja,mgsm" \
    --num_fewshot "3,3,3,2,1,1,5"
```
|
|
|
### Benchmark Results |
|
|
|
Scores are accuracy (ACC) unless a different metric is noted next to the task.

| Category                             | Metric               | Shots  | Score |
|--------------------------------------|----------------------|--------|-------|
| **Knowledge (5-shot)**               | MMLU                 |        | 61.76 |
|                                      | KMMLU (Exact Match)  |        | 42.75 |
|                                      | CMMLU                |        | 50.93 |
|                                      | JMLU                 |        |       |
|                                      | C-EVAL               |        | 50.07 |
|                                      | HAERAE               | 0-shot | 63.89 |
| **KoBest (5-shot)**                  | BoolQ                |        | 85.47 |
|                                      | COPA                 |        | 83.5  |
|                                      | Hellaswag (acc-norm) |        | 63.2  |
|                                      | Sentineg             |        | 97.98 |
|                                      | WiC                  |        | 70.95 |
| **XCOPA (5-shot)**                   | IT                   |        | 72.8  |
|                                      | ID                   |        | 76.4  |
|                                      | TH                   |        | 60.2  |
|                                      | TR                   |        | 65.6  |
|                                      | VI                   |        | 77.2  |
|                                      | ZH                   |        | 80.2  |
| **JP Eval Harness (Prompt ver 0.3)** | JcommonsenseQA       | 3-shot | 85.97 |
|                                      | JNLI                 | 3-shot | 39.11 |
|                                      | Marc_ja              | 3-shot | 96.48 |
|                                      | JSquad (Exact Match) | 2-shot | 70.69 |
|                                      | Jaqket (Exact Match) | 1-shot | 81.53 |
|                                      | MGSM                 | 5-shot | 28.8  |
| **XWinograd (0-shot)**               | EN                   |        | 89.03 |
|                                      | FR                   |        | 72.29 |
|                                      | JP                   |        | 82.69 |
|                                      | PT                   |        | 73.38 |
|                                      | RU                   |        | 68.57 |
|                                      | ZH                   |        | 79.17 |
|
|
|
|
|
|
|
## Usage and Limitations |
|
|
|
These models have certain limitations that users should be aware of. |
|
|
|
### Intended Usage |
|
|
|
Open Large Language Models (LLMs) have a wide range of applications across various industries and domains. The following list of potential uses is not comprehensive. The purpose of this list is to provide contextual information about the possible use-cases that the model creators considered as part of model training and development.

* Content Creation and Communication
  * Text Generation: These models can be used to generate creative text formats such as poems, scripts, code, marketing copy, and email drafts.
* Research and Education
  * Natural Language Processing (NLP) Research: These models can serve as a foundation for researchers to experiment with NLP techniques, develop algorithms, and contribute to the advancement of the field.
  * Language Learning Tools: Support interactive language learning experiences, aiding in grammar correction or providing writing practice.
  * Knowledge Exploration: Assist researchers in exploring large bodies of text by generating summaries or answering questions about specific topics.
|
|
|
### Limitations |
|
|
|
* Training Data
  * The quality and diversity of the training data significantly influence the model's capabilities. Biases or gaps in the training data can lead to limitations in the model's responses.
  * The scope of the training dataset determines the subject areas the model can handle effectively.
* Context and Task Complexity
  * LLMs are better at tasks that can be framed with clear prompts and instructions. Open-ended or highly complex tasks might be challenging.
  * A model's performance can be influenced by the amount of context provided (longer context generally leads to better outputs, up to a certain point).
* Language Ambiguity and Nuance
  * Natural language is inherently complex. LLMs might struggle to grasp subtle nuances, sarcasm, or figurative language.
* Factual Accuracy
  * LLMs generate responses based on information they learned from their training datasets, but they are not knowledge bases. They may generate incorrect or outdated factual statements.
* Common Sense
  * LLMs rely on statistical patterns in language. They might lack the ability to apply common sense reasoning in certain situations.
|
|
|
### Ethical Considerations and Risks |
|
|
|
The development of large language models (LLMs) raises several ethical concerns. In creating an open model, we have carefully considered the following:

* Bias and Fairness
  * LLMs trained on large-scale, real-world text data can reflect socio-cultural biases embedded in the training material. These models underwent careful scrutiny; input data pre-processing is described and posterior evaluations are reported in this card.
|
* Misinformation and Misuse
  * LLMs can be misused to generate text that is false, misleading, or harmful.
  * Guidelines are provided for responsible use with the model, see the [Responsible Generative AI Toolkit](http://ai.google.dev/gemma/responsible).
* Transparency and Accountability
  * This model card summarizes details on the models' architecture, capabilities, limitations, and evaluation processes.
  * A responsibly developed open model offers the opportunity to share innovation by making LLM technology accessible to developers and researchers across the AI ecosystem.

Risks identified and mitigations:

* Perpetuation of biases: It's encouraged to perform continuous monitoring (using evaluation metrics, human review) and the exploration of de-biasing techniques during model training, fine-tuning, and other use cases.
* Generation of harmful content: Mechanisms and guidelines for content safety are essential. Developers are encouraged to exercise caution and implement appropriate content safety safeguards based on their specific product policies and application use cases.
* Misuse for malicious purposes: Technical limitations and developer and end-user education can help mitigate against malicious applications of LLMs. Educational resources and reporting mechanisms for users to flag misuse are provided. Prohibited uses of Gemma models are outlined in the [Gemma Prohibited Use Policy](https://ai.google.dev/gemma/prohibited_use_policy).
* Privacy violations: Models were trained on data filtered for removal of PII (Personally Identifiable Information). Developers are encouraged to adhere to privacy regulations with privacy-preserving techniques.
|
|
|
## Acknowledgement |
|
|
|
Training was supported by the [TPU Research Cloud](https://sites.research.google/trc/) program.
|
## [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)
|
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_beomi__gemma-mling-7b) |
|
|
|
| Metric              | Value |
|---------------------|------:|
| Avg.                | 11.18 |
| IFEval (0-Shot)     | 20.29 |
| BBH (3-Shot)        | 17.63 |
| MATH Lvl 5 (4-Shot) |  4.15 |
| GPQA (0-shot)       |  0.00 |
| MuSR (0-shot)       |  6.85 |
| MMLU-PRO (5-shot)   | 18.14 |
|
|
|
|