---
library_name: transformers
tags:
- gemma2
- instruct
- bggpt
- insait
license: gemma
language:
- bg
- en
base_model:
- google/gemma-2-2b-it
- google/gemma-2-2b
pipeline_tag: text-generation
---
# INSAIT-Institute/BgGPT-Gemma-2-2.6B-IT-v1.0
![image/png](https://cdn-uploads.huggingface.co/production/uploads/637e1f8cf7e01589cc17bf7e/p6d0YFHjWCQ3S12jWqO1m.png)
INSAIT introduces **BgGPT-Gemma-2-2.6B-IT-v1.0**, a state-of-the-art Bulgarian language model based on **google/gemma-2-2b** and **google/gemma-2-2b-it**.
BgGPT-Gemma-2-2.6B-IT-v1.0 is **free to use** and distributed under the [Gemma Terms of Use](https://ai.google.dev/gemma/terms).
This model was created by [`INSAIT`](https://insait.ai/), part of Sofia University St. Kliment Ohridski, in Sofia, Bulgaria.
# Model description
The model was built on top of Google’s Gemma 2 2B open models.
It was continuously pre-trained on around 100 billion tokens (85 billion in Bulgarian) using the Branch-and-Merge strategy INSAIT presented at [EMNLP’24](https://aclanthology.org/2024.findings-emnlp.1000/),
allowing the model to gain outstanding Bulgarian cultural and linguistic capabilities while retaining its English performance.
During the pre-training stage, we use various datasets, including Bulgarian web crawl data, freely available datasets such as Wikipedia, a range of specialized Bulgarian datasets sourced by the INSAIT Institute,
and machine translations of popular English datasets.
The model was then instruction-fine-tuned on a newly constructed Bulgarian instruction dataset created using real-world conversations.
For more information, check our [blog post](https://models.bggpt.ai/blog/).
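The merge step at the heart of Branch-and-Merge can be pictured as weight-space averaging of branch checkpoints trained on different data slices. Below is a minimal, hypothetical sketch of that idea; the function, weighting scheme, and iteration schedule are illustrative, not the exact recipe from the paper:
```python
import torch

def merge_branches(state_dicts, weights=None):
    """Illustrative weight-space merge of branch checkpoints
    that were trained from the same base model.

    state_dicts: list of model state_dicts (one per branch)
    weights: optional per-branch mixing coefficients summing to 1
    """
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for name in state_dicts[0]:
        # weighted average of the same parameter across branches
        merged[name] = sum(w * sd[name].float() for w, sd in zip(weights, state_dicts))
    return merged
```
In Branch-and-Merge, this kind of merging is applied iteratively, which is what lets the model absorb new Bulgarian data without catastrophically forgetting its English capabilities.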
# Benchmarks and Results
![image/png](https://cdn-uploads.huggingface.co/production/uploads/65fefdc282708115868203aa/9pp8aD1yvoW-cJWzhbHXk.png)
![image/png](https://cdn-uploads.huggingface.co/production/uploads/65fefdc282708115868203aa/33CjjtmCeAcw5qq8DEtJj.png)
We evaluate our models on a set of standard English benchmarks, their translated versions in Bulgarian, as well as Bulgarian-specific benchmarks we collected:
- **Winogrande challenge**: testing commonsense reasoning and world knowledge
- **Hellaswag**: testing sentence completion
- **ARC Easy/Challenge**: testing logical reasoning
- **TriviaQA**: testing trivia knowledge
- **GSM-8k**: solving multi-step grade-school mathematics word problems
- **Exams**: solving high school problems from natural and social sciences
- **MON**: solving exams across various subjects for grades 4 to 12, sourced from the Bulgarian Ministry of Education and Science
These benchmarks test logical reasoning, mathematics, knowledge, language understanding and other skills of the models and are provided at https://github.com/insait-institute/lm-evaluation-harness-bg.
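As an illustration, with a recent version of the EleutherAI harness that the repository above is based on, an evaluation run looks roughly like this; the exact CLI and the Bulgarian task names depend on the repository's version, and `hellaswag_bg` here is illustrative:
```
lm_eval --model hf \
    --model_args pretrained=INSAIT-Institute/BgGPT-Gemma-2-2.6B-IT-v1.0,dtype=bfloat16 \
    --tasks hellaswag_bg \
    --batch_size auto
```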
The graphs above show the performance of BgGPT 2.6B compared to other small open language models such as Microsoft's Phi 3.5 and Alibaba's Qwen 2.5 3B.
The BgGPT model not only surpasses them, but also **retains English performance** inherited from the original Google Gemma 2 models upon which it is based.
# Use in 🤗 Transformers
First install the latest version of the transformers library:
```
pip install -U 'transformers[torch]'
```
Then load the model in transformers:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "INSAIT-Institute/BgGPT-Gemma-2-2.6B-IT-v1.0",
    torch_dtype=torch.bfloat16,   # load weights in bfloat16
    attn_implementation="eager",  # Gemma 2 does not support flash attention
    device_map="auto",
)
```
# Recommended Parameters
For optimal performance, we recommend the following parameters for text generation, as we have extensively tested our model with them:
```python
from transformers import GenerationConfig
generation_params = GenerationConfig(
    max_new_tokens=2048,      # maximum number of tokens to generate
    temperature=0.1,
    top_k=25,
    top_p=1,
    repetition_penalty=1.1,
    eos_token_id=[1, 107],    # stop at <eos> (1) or <end_of_turn> (107)
    do_sample=True
)
```
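The two `eos_token_id` values are Gemma 2's `<eos>` and `<end_of_turn>` special tokens; you can confirm the mapping directly from the tokenizer:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("INSAIT-Institute/BgGPT-Gemma-2-2.6B-IT-v1.0")
print(tokenizer.convert_ids_to_tokens([1, 107]))  # expected: ['<eos>', '<end_of_turn>']
```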
Increasing the temperature should also work adequately; we have simply tested the model most extensively with the settings above.
# Instruction format
In order to leverage instruction fine-tuning, your prompt should begin with a beginning-of-sequence token `<bos>` and be formatted in the Gemma 2 chat template. `<bos>` should appear only once, as the first token of the chat sequence.
E.g.
```
<bos><start_of_turn>user
Кога е основан Софийският университет?<end_of_turn>
<start_of_turn>model
```
This format is also available as a [chat template](https://huggingface.co/docs/transformers/main/chat_templating) via the `apply_chat_template()` method:
```python
tokenizer = AutoTokenizer.from_pretrained(
    "INSAIT-Institute/BgGPT-Gemma-2-2.6B-IT-v1.0",
    use_default_system_prompt=False,
)

messages = [
    # "When was Sofia University founded?"
    {"role": "user", "content": "Кога е основан Софийският университет?"},
]

# Build the model inputs with the Gemma 2 chat template
inputs = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True,
    return_dict=True,
).to(model.device)

outputs = model.generate(
    **inputs,
    generation_config=generation_params,
)
print(tokenizer.decode(outputs[0]))
```
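If you want tokens printed as they are produced (for example, in an interactive demo), transformers' `TextStreamer` can be passed to `generate`. A minimal sketch, reusing the `model`, `tokenizer`, `inputs`, and `generation_params` objects defined above:
```python
from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(
    **inputs,
    generation_config=generation_params,
    streamer=streamer,  # prints the response incrementally to stdout
)
```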
**Important Note:** Models based on Gemma 2 such as BgGPT-Gemma-2-2.6B-IT-v1.0 do not support flash attention. Using it results in degraded performance.
# Use with vLLM
Example usage with vLLM:
```python
from vllm import LLM, SamplingParams
from vllm.inputs import TokensPrompt
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(
"INSAIT-Institute/BgGPT-Gemma-2-2.6B-IT-v1.0",
use_default_system_prompt=False,
)
sampling_params = SamplingParams(
    max_tokens=2048,
    temperature=0.1,
    top_k=25,
    top_p=1,
    repetition_penalty=1.1,
    stop_token_ids=[1, 107],  # stop at <eos> (1) or <end_of_turn> (107)
)

llm = LLM(
    model="INSAIT-Institute/BgGPT-Gemma-2-2.6B-IT-v1.0",
    dtype="bfloat16",
    enforce_eager=True,  # run in eager mode instead of capturing CUDA graphs
)
messages = [
    # "When was Sofia University founded?"
    {"role": "user", "content": "Кога е основан Софийският университет?"},
]

# Render the chat template to a string; the template already inserts <bos>,
# so tokenize with add_special_tokens=False to avoid adding it a second time
formatted_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

input_ids = tokenizer(
    formatted_prompt,
    add_special_tokens=False,
).input_ids

prompt = TokensPrompt(prompt_token_ids=input_ids)
output = llm.generate(
prompt,
sampling_params
)
generated_text = output[0].outputs[0].text
print(generated_text)
```
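vLLM can also serve the model behind an OpenAI-compatible API. A minimal sketch, using vLLM's standard server entrypoint and default port:
```
python -m vllm.entrypoints.openai.api_server \
    --model INSAIT-Institute/BgGPT-Gemma-2-2.6B-IT-v1.0 \
    --dtype bfloat16 \
    --enforce-eager
```
Any OpenAI-compatible client can then send chat completion requests to `http://localhost:8000/v1`; the server applies the model's chat template itself, so no manual prompt formatting is needed.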
# Use with GGML / llama.cpp
The model and instructions for usage in GGUF format are available at [INSAIT-Institute/BgGPT-Gemma-2-2.6B-IT-v1.0-GGUF](https://huggingface.co/INSAIT-Institute/BgGPT-Gemma-2-2.6B-IT-v1.0-GGUF).
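As a quick sketch, a downloaded GGUF file can be run with llama.cpp's CLI; the filename below is illustrative, and flag names vary slightly between llama.cpp versions. The `-cnv` flag starts an interactive chat that uses the chat template embedded in the GGUF metadata:
```
llama-cli -m bggpt-gemma-2-2.6b-it-v1.0.Q8_0.gguf -cnv --temp 0.1 --repeat-penalty 1.1
```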
# Community Feedback
We welcome feedback from the community to help improve BgGPT. If you have suggestions, encounter any issues, or have ideas for improvements, please:
- Share your experience using the model through Hugging Face's community discussion feature or
- Contact us at [[email protected]](mailto:[email protected])
Your real-world usage and insights are valuable in helping us optimize the model's performance and behaviour for various use cases.
# Summary
- **Finetuned from:** [google/gemma-2-2b-it](https://huggingface.co/google/gemma-2-2b-it) and [google/gemma-2-2b](https://huggingface.co/google/gemma-2-2b)
- **Model type:** Causal decoder-only transformer language model
- **Languages:** Bulgarian and English
- **Contact:** [[email protected]](mailto:[email protected])
- **License:** BgGPT is distributed under the [Gemma Terms of Use](https://huggingface.co/INSAIT-Institute/BgGPT-Gemma-2-2.6B-IT-v1.0/raw/main/LICENSE)