---
library_name: transformers
tags:
- gemma2
- instruct
- bggpt
- insait
license: gemma
language:
- bg
- en
base_model:
- google/gemma-2-9b-it
- google/gemma-2-9b
pipeline_tag: text-generation
---

[![QuantFactory Banner](https://lh7-rt.googleusercontent.com/docsz/AD_4nXeiuCm7c8lEwEJuRey9kiVZsRn2W-b4pWlu3-X534V3YmVuVc2ZL-NXg2RkzSOOS2JXGHutDuyyNAUtdJI65jGTo8jT9Y99tMi4H4MqL44Uc5QKG77B0d6-JfIkZHFaUA71-RtjyYZWVIhqsNZcx8-OMaA?key=xt3VSDoCbmTY7o-cwwOFwQ)](https://hf.co/QuantFactory)

# QuantFactory/BgGPT-Gemma-2-9B-IT-v1.0-GGUF
This is a quantized version of [INSAIT-Institute/BgGPT-Gemma-2-9B-IT-v1.0](https://huggingface.co/INSAIT-Institute/BgGPT-Gemma-2-9B-IT-v1.0), created using llama.cpp.

# Original Model Card

# INSAIT-Institute/BgGPT-Gemma-2-9B-IT-v1.0

![image/png](https://cdn-uploads.huggingface.co/production/uploads/637e1f8cf7e01589cc17bf7e/p6d0YFHjWCQ3S12jWqO1m.png)

INSAIT introduces **BgGPT-Gemma-2-9B-IT-v1.0**, a state-of-the-art Bulgarian language model based on **google/gemma-2-9b** and **google/gemma-2-9b-it**.
BgGPT-Gemma-2-9B-IT-v1.0 is **free to use** and is distributed under the [Gemma Terms of Use](https://ai.google.dev/gemma/terms).
This model was created by [`INSAIT`](https://insait.ai/), part of Sofia University St. Kliment Ohridski, in Sofia, Bulgaria.

# Model description

The model was built on top of Google's Gemma 2 9B open models.
It was continuously pre-trained on around 100 billion tokens (85 billion in Bulgarian) using the Branch-and-Merge strategy INSAIT presented at [EMNLP'24](https://aclanthology.org/2024.findings-emnlp.1000/), allowing the model to gain outstanding Bulgarian cultural and linguistic capabilities while retaining its English performance.
During the pre-training stage, we used various datasets, including Bulgarian web crawl data, freely available datasets such as Wikipedia, a range of specialized Bulgarian datasets sourced by the INSAIT Institute, and machine translations of popular English datasets.
The model was then instruction-fine-tuned on a newly constructed Bulgarian instruction dataset created from real-world conversations.
For more information, check our [blogpost](https://models.bggpt.ai/blog/).

# Benchmarks and Results

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65fefdc282708115868203aa/5knpdR-QDSuM3WlpRxe-M.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65fefdc282708115868203aa/TY8F34DpUf7uXbsFVywn2.png)

We evaluate our models on a set of standard English benchmarks, a translated version of them in Bulgarian, as well as Bulgarian-specific benchmarks we collected:

- **Winogrande challenge**: testing world knowledge and understanding
- **Hellaswag**: testing sentence completion
- **ARC Easy/Challenge**: testing logical reasoning
- **TriviaQA**: testing trivia knowledge
- **GSM-8k**: solving grade-school math word problems
- **Exams**: solving high school problems from natural and social sciences
- **MON**: contains exams across various subjects for grades 4 to 12

These benchmarks test logical reasoning, mathematics, knowledge, language understanding and other skills of the models and are provided at https://github.com/insait-institute/lm-evaluation-harness-bg.
The graphs above show the performance of BgGPT 9B and BgGPT 27B compared to other large open models. The results show the excellent abilities of both 9B and 27B models in Bulgarian, which allow them to **outperform much larger models**, including Alibaba's Qwen 2.5 72B and Meta's Llama 3.1 70B. Furthermore, both BgGPT 9B and BgGPT 27B **significantly improve upon the previous version of BgGPT** based on Mistral-7B ([BgGPT-7B-Instruct-v0.2](https://huggingface.co/INSAIT-Institute/BgGPT-7B-Instruct-v0.2), shown in grey in the figure).
Finally, our models retain the **excellent English performance** inherited from the original Google Gemma 2 models on which they are based.


# Use in 🤗 Transformers
First install the latest version of the transformers library:
```
pip install -U 'transformers[torch]'
```
Then load the model in transformers:
```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "INSAIT-Institute/BgGPT-Gemma-2-9B-IT-v1.0",
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",
    device_map="auto",
)
```

# Recommended Parameters

For optimal performance, we recommend the following text-generation parameters, which we have extensively tested with our model:

```python
from transformers import GenerationConfig

generation_params = GenerationConfig(
    max_new_tokens=2048,  # choose the maximum number of generated tokens
    temperature=0.1,
    top_k=25,
    top_p=1,
    repetition_penalty=1.1,
    eos_token_id=[1, 107],
)
```

In principle, increasing the temperature should also work adequately.

# Instruction format

To leverage instruction fine-tuning, your prompt should begin with the beginning-of-sequence token `<bos>` and be formatted in the Gemma 2 chat template. `<bos>` should only appear as the first token in a chat sequence.

For example:
```
<bos><start_of_turn>user
Кога е основан Софийският университет?<end_of_turn>
<start_of_turn>model

```
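
The template above is plain string formatting, so the prompt can also be assembled by hand. A minimal sketch (the helper name `build_gemma2_prompt` is ours, not part of the model's API):

```python
# Minimal sketch: assemble a single-turn Gemma 2 chat prompt by hand.
# Note: most tokenizers prepend <bos> automatically, so if you tokenize
# this string yourself, pass add_special_tokens=False to avoid a
# duplicate <bos> token.
def build_gemma2_prompt(user_message: str) -> str:
    return (
        "<bos><start_of_turn>user\n"
        f"{user_message}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )

print(build_gemma2_prompt("Кога е основан Софийският университет?"))
```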

This format is also available as a [chat template](https://huggingface.co/docs/transformers/main/chat_templating) via the `apply_chat_template()` method:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "INSAIT-Institute/BgGPT-Gemma-2-9B-IT-v1.0",
    use_default_system_prompt=False,
)

messages = [
    {"role": "user", "content": "Кога е основан Софийският университет?"},
]

input_ids = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True,
    return_dict=True,
)

# `model` and `generation_params` are defined in the snippets above.
outputs = model.generate(
    **input_ids,
    generation_config=generation_params,
)
print(tokenizer.decode(outputs[0]))
```

**Important Note:** Models based on Gemma 2, such as BgGPT-Gemma-2-9B-IT-v1.0, do not support flash attention. Using it results in degraded performance, which is why the loading snippet above sets `attn_implementation="eager"`.

# Use with GGML / llama.cpp

The model and instructions for usage in GGUF format are available at [INSAIT-Institute/BgGPT-Gemma-2-9B-IT-v1.0-GGUF](https://huggingface.co/INSAIT-Institute/BgGPT-Gemma-2-9B-IT-v1.0-GGUF).
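
As a rough sketch, a downloaded GGUF file can be run locally with llama.cpp's `llama-cli`, reusing the sampling parameters recommended above. The file name and quantization level below are assumptions; check the GGUF repository for the files actually provided:

```shell
# Hedged sketch: interactive chat with a GGUF quantization via llama.cpp.
# The file name (and quantization level, e.g. Q4_K_M) is an assumption.
# --conversation applies the chat template stored in the GGUF metadata.
llama-cli \
  -m BgGPT-Gemma-2-9B-IT-v1.0.Q4_K_M.gguf \
  --conversation \
  --temp 0.1 --top-k 25 --top-p 1.0 --repeat-penalty 1.1
```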

# Community Feedback

We welcome feedback from the community to help improve BgGPT. If you have suggestions, encounter any issues, or have ideas for improvements, please:
- Share your experience using the model through Hugging Face's community discussion feature, or
- Contact us at [[email protected]](mailto:[email protected])

Your real-world usage and insights are valuable in helping us optimize the model's performance and behaviour for various use cases.

# Summary
- **Finetuned from:** [google/gemma-2-9b-it](https://huggingface.co/google/gemma-2-9b-it); [google/gemma-2-9b](https://huggingface.co/google/gemma-2-9b)
- **Model type:** Causal decoder-only transformer language model
- **Language:** Bulgarian and English
- **Contact:** [[email protected]](mailto:[email protected])
- **License:** BgGPT is distributed under the [Gemma Terms of Use](https://huggingface.co/INSAIT-Institute/BgGPT-Gemma-2-9B-IT-v1.0/raw/main/LICENSE)