---
language:
- bg
- en
license: apache-2.0
library_name: transformers
tags:
- mistral
- instruct
- bggpt
- insait
base_model: mistralai/Mistral-7B-v0.1
pipeline_tag: text-generation
model-index:
- name: BgGPT-7B-Instruct-v0.2
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: AI2 Reasoning Challenge (25-Shot)
      type: ai2_arc
      config: ARC-Challenge
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: acc_norm
      value: 60.58
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=INSAIT-Institute/BgGPT-7B-Instruct-v0.2
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag (10-Shot)
      type: hellaswag
      split: validation
      args:
        num_few_shot: 10
    metrics:
    - type: acc_norm
      value: 82.18
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=INSAIT-Institute/BgGPT-7B-Instruct-v0.2
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU (5-Shot)
      type: cais/mmlu
      config: all
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 60.5
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=INSAIT-Institute/BgGPT-7B-Instruct-v0.2
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: TruthfulQA (0-shot)
      type: truthful_qa
      config: multiple_choice
      split: validation
      args:
        num_few_shot: 0
    metrics:
    - type: mc2
      value: 54.63
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=INSAIT-Institute/BgGPT-7B-Instruct-v0.2
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Winogrande (5-shot)
      type: winogrande
      config: winogrande_xl
      split: validation
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 76.48
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=INSAIT-Institute/BgGPT-7B-Instruct-v0.2
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GSM8k (5-shot)
      type: gsm8k
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 44.12
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=INSAIT-Institute/BgGPT-7B-Instruct-v0.2
      name: Open LLM Leaderboard
---
|
# INSAIT-Institute/BgGPT-7B-Instruct-v0.2 |
|
|
|
 |
|
|
|
Meet BgGPT-7B, a Bulgarian language model trained from mistralai/Mistral-7B-v0.1. BgGPT is distributed under the Apache 2.0 license.
|
|
|
This model was created by [`INSAIT Institute`](https://insait.ai/), part of Sofia University, in Sofia, Bulgaria. |
|
|
|
This is an improved version of the model - v0.2. |
|
|
|
## Model description |
|
|
|
The model was continuously pretrained to acquire its Bulgarian language and cultural capabilities using multiple datasets, including Bulgarian web crawl data, a range of specialized Bulgarian datasets sourced by INSAIT Institute, and machine translations of popular English datasets.
This Bulgarian data was augmented with English datasets to retain English and logical reasoning skills.
|
|
|
The model's tokenizer has been extended to encode Bulgarian words written in Cyrillic more efficiently.
This increases both the throughput on Cyrillic text and the model's performance.
|
|
|
## Instruction format |
|
|
|
To leverage instruction fine-tuning, surround your prompt with the `[INST]` and `[/INST]` tokens.
The very first instruction should begin with the beginning-of-sequence token `<s>`; subsequent instructions should not.
The assistant's generation ends with the end-of-sequence token `</s>`.
|
|
|
For example:

```python
# Turns: "When was Sofia University founded?" / the assistant's answer / "Who founded it?"
text = (
    "<s>[INST] Кога е основан Софийският университет? [/INST]"
    "Софийският университет „Св. Климент Охридски“ е създаден на 1 октомври 1888 г.</s> "
    "[INST] Кой го е основал? [/INST]"
)
```
|
|
|
This format is also available as a [chat template](https://huggingface.co/docs/transformers/main/chat_templating) via the tokenizer's `apply_chat_template()` method.
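As a rough illustration of the layout described above, the sketch below assembles a prompt by hand from a list of chat messages. The helper name `build_prompt` is hypothetical and not part of Transformers. It mirrors the spacing of the example in this section, and the chat template shipped with the tokenizer may differ in whitespace details.

```python
def build_prompt(messages):
    """Assemble a prompt in the [INST] ... [/INST] layout (illustrative only).

    `messages` is a list of {"role": ..., "content": ...} dicts, the same
    shape that chat templates consume.
    """
    prompt = "<s>"  # the beginning-of-sequence token appears once, at the start
    for message in messages:
        if message["role"] == "user":
            prompt += f"[INST] {message['content']} [/INST]"
        elif message["role"] == "assistant":
            # each assistant turn is closed by the end-of-sequence token
            prompt += f"{message['content']}</s> "
        else:
            raise ValueError(f"unsupported role: {message['role']!r}")
    return prompt


messages = [
    {"role": "user", "content": "Кога е основан Софийският университет?"},
    {"role": "assistant", "content": "Софийският университет „Св. Климент Охридски“ е създаден на 1 октомври 1888 г."},
    {"role": "user", "content": "Кой го е основал?"},
]
prompt = build_prompt(messages)
```

With a loaded tokenizer, `tokenizer.apply_chat_template(messages, tokenize=False)` applies the model's own template to the same `messages` list and should be preferred over hand-built strings.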
|
|
|
## Benchmarks |
|
|
|
The model comes with a set of benchmarks that are translations of the corresponding English benchmarks. These are provided at [`https://github.com/insait-institute/lm-evaluation-harness-bg`](https://github.com/insait-institute/lm-evaluation-harness-bg).

Since this is an improved version of v0.1 of the same model, we include benchmark comparisons.
|
|
|
 |
|
|
|
 |
|
|
|
 |
|
|
|
## Summary |
|
- **Finetuned from:** [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) |
|
- **Model type:** Causal decoder-only transformer language model |
|
- **Language:** Bulgarian and English |
|
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0.html) |
|
- **Contact:** [[email protected]](mailto:[email protected]) |
|
|
|
## Use in 🤗Transformers |
|
First, install the direct dependencies:

```bash
pip install transformers torch accelerate
```

For faster inference with FlashAttention-2, also install these dependencies:

```bash
pip install packaging ninja
pip install flash-attn
```
|
Then load the model and tokenizer in Transformers:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "INSAIT-Institute/BgGPT-7B-Instruct-v0.2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # place weights on the available devices (uses accelerate)
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # optional; requires flash-attn
)
```
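Once the model is loaded, a reply can be generated by rendering the conversation with the chat template and calling `generate()`. The following is a minimal sketch rather than an official recipe: the helper name `generate_reply`, the greedy decoding, and the `max_new_tokens` default are assumptions chosen for illustration, and it assumes the rendered template already starts with `<s>`.

```python
def generate_reply(messages, model_id="INSAIT-Institute/BgGPT-7B-Instruct-v0.2", max_new_tokens=256):
    """Return the model's reply to a list of {"role", "content"} chat messages."""
    # Heavy imports live inside the function so that defining it stays cheap.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, device_map="auto", torch_dtype=torch.bfloat16
    )

    # Render the conversation as text with the tokenizer's chat template.
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    # Assumption: the rendered prompt already contains <s>, so special tokens
    # are not added a second time here.
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)

    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)

    # Keep only the newly generated tokens (everything after the prompt).
    new_tokens = output_ids[0, inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate_reply([{"role": "user", "content": "Кога е основан Софийският университет?"}])` downloads the weights on first use. For repeated calls, load the model and tokenizer once and reuse them instead of reloading inside the function.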
|
|
|
## Use with GGML / llama.cpp |
|
|
|
The model is available in GGUF format at [INSAIT-Institute/BgGPT-7B-Instruct-v0.2-GGUF](https://huggingface.co/INSAIT-Institute/BgGPT-7B-Instruct-v0.2-GGUF).
|
|
|
# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) |
|
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_INSAIT-Institute__BgGPT-7B-Instruct-v0.2).
|
|
|
| Metric                          |Value|
|---------------------------------|----:|
|Avg.                             |63.08|
|AI2 Reasoning Challenge (25-Shot)|60.58|
|HellaSwag (10-Shot)              |82.18|
|MMLU (5-Shot)                    |60.50|
|TruthfulQA (0-shot)              |54.63|
|Winogrande (5-shot)              |76.48|
|GSM8k (5-shot)                   |44.12|
|
|
|
|