---
language:
- pl
license: apache-2.0
library_name: transformers
tags:
- finetuned
base_model: speakleash/Bielik-11B-v2
inference:
  parameters:
    temperature: 0.2
widget:
- messages:
  - role: user
    content: Co przedstawia polskie godło?
extra_gated_description: If you want to learn more about how you can use the model,
  please refer to our <a href="https://bielik.ai/terms/">Terms of Use</a>.
model-index:
- name: Bielik-11B-v2.1-Instruct
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: IFEval (0-Shot)
      type: HuggingFaceH4/ifeval
      args:
        num_few_shot: 0
    metrics:
    - type: inst_level_strict_acc and prompt_level_strict_acc
      value: 50.9
      name: strict accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=speakleash/Bielik-11B-v2.1-Instruct
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: BBH (3-Shot)
      type: BBH
      args:
        num_few_shot: 3
    metrics:
    - type: acc_norm
      value: 36.29
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=speakleash/Bielik-11B-v2.1-Instruct
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MATH Lvl 5 (4-Shot)
      type: hendrycks/competition_math
      args:
        num_few_shot: 4
    metrics:
    - type: exact_match
      value: 0.6
      name: exact match
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=speakleash/Bielik-11B-v2.1-Instruct
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GPQA (0-shot)
      type: Idavidrein/gpqa
      args:
        num_few_shot: 0
    metrics:
    - type: acc_norm
      value: 11.63
      name: acc_norm
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=speakleash/Bielik-11B-v2.1-Instruct
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MuSR (0-shot)
      type: TAUR-Lab/MuSR
      args:
        num_few_shot: 0
    metrics:
    - type: acc_norm
      value: 10.52
      name: acc_norm
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=speakleash/Bielik-11B-v2.1-Instruct
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU-PRO (5-shot)
      type: TIGER-Lab/MMLU-Pro
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 27.18
      name: accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=speakleash/Bielik-11B-v2.1-Instruct
      name: Open LLM Leaderboard
---

<p align="center">
  <img src="https://huggingface.co/speakleash/Bielik-11B-v2.1-Instruct/raw/main/speakleash_cyfronet.png">
</p>

# Bielik-11B-v2.1-Instruct

Bielik-11B-v2.1-Instruct is a generative text model with 11 billion parameters. 
It is an instruction fine-tuned version of [Bielik-11B-v2](https://huggingface.co/speakleash/Bielik-11B-v2). 
The model is the result of a unique collaboration between the open-science/open-source project SpeakLeash and the High Performance Computing (HPC) center ACK Cyfronet AGH. 
It was developed and trained on Polish text corpora that were carefully selected and processed by the SpeakLeash team, using Polish large-scale computing infrastructure 
within the PLGrid environment at ACK Cyfronet AGH. 
The creation and training of Bielik-11B-v2.1-Instruct was supported by computational grant PLG/2024/016951 and conducted on the Athena and Helios supercomputers, 
providing the cutting-edge technology and computational resources essential for large-scale machine learning. 
As a result, the model understands and processes the Polish language exceptionally well, providing accurate responses and performing a variety of linguistic tasks with high precision.

🗣️ Chat Arena<span style="color:red;">*</span>: https://arena.speakleash.org.pl/

<span style="color:red;">*</span>Chat Arena is a platform for testing and comparing different AI language models, allowing users to evaluate their performance and quality.

## Model

The [SpeakLeash](https://speakleash.org/) team maintains its own set of Polish instructions, which is continuously expanded and refined by annotators. A portion of these instructions, manually verified and corrected, was used for training. Moreover, due to the limited availability of high-quality Polish instructions, synthetic instructions were generated with [Mixtral 8x22B](https://huggingface.co/mistralai/Mixtral-8x22B-v0.1) and used in training. The full training dataset comprised over 20 million instructions, totaling more than 10 billion tokens. Because the instructions varied in quality, training on them naively degraded the model's performance. To counteract this while still making use of these datasets, several improvements were introduced (a minimal sketch of the resulting loss follows the list):
* Weighted token-level loss - a strategy inspired by [offline reinforcement learning](https://arxiv.org/abs/2005.01643) and [C-RLFT](https://arxiv.org/abs/2309.11235)
* Adaptive learning rate inspired by the study on [Learning Rates as a Function of Batch Size](https://arxiv.org/abs/2006.09092)
* Masked prompt tokens
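
Purely for illustration, the first and third of these techniques can be combined in a few lines of PyTorch. This is a minimal sketch under assumed conventions (a per-token quality `weights` tensor and prompt tokens marked with the label `-100`), not the actual training implementation:

```python
import torch.nn.functional as F

def weighted_masked_loss(logits, labels, weights):
    """Weighted cross-entropy over response tokens only.

    labels == -100 marks prompt tokens (masked out of the loss);
    weights scales each token by the quality of its source example.
    Both conventions are assumptions of this sketch.
    """
    # Shift for next-token prediction.
    logits = logits[:, :-1, :].contiguous()
    labels = labels[:, 1:].contiguous()
    weights = weights[:, 1:].contiguous()

    per_token = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        labels.view(-1),
        reduction="none",
        ignore_index=-100,  # prompt tokens contribute zero loss
    )
    mask = (labels.view(-1) != -100).float()
    per_token = per_token * weights.view(-1) * mask
    return per_token.sum() / mask.sum().clamp(min=1)
```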

To align the model with user preferences, we tested several techniques: DPO, PPO, KTO, and SiMPO. In the end, the [DPO-Positive](https://arxiv.org/abs/2402.13228) method was employed, using both generated and manually corrected examples scored by a metamodel. The dataset comprised over 60,000 examples of varying lengths, covering different aspects of response style; it was filtered and evaluated by the reward model to select pairs with an appropriate margin between the chosen and rejected responses. A novelty of our DPO-P setup was the inclusion of multi-turn conversations.
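
For reference, DPO-Positive extends the standard DPO objective with a penalty term that stops the policy from lowering the likelihood of the chosen response below that of the reference model. A minimal PyTorch sketch of the loss (the `beta` and `lam` values are illustrative, not the hyperparameters used in training):

```python
import torch
import torch.nn.functional as F

def dpop_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected,
              beta=0.1, lam=5.0):
    """DPO-Positive loss over summed per-example log-probs.

    pi_* are log-probs under the trained policy, ref_* under the
    frozen reference model. The clamped term penalizes any drop in
    the chosen response's likelihood relative to the reference.
    """
    chosen_margin = pi_chosen - ref_chosen
    rejected_margin = pi_rejected - ref_rejected
    penalty = torch.clamp(ref_chosen - pi_chosen, min=0.0)
    logits = beta * (chosen_margin - rejected_margin - lam * penalty)
    return -F.logsigmoid(logits).mean()
```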

Bielik-11B-v2.1-Instruct has been trained with the use of an original open-source framework called [ALLaMo](https://github.com/chrisociepa/allamo), implemented by [Krzysztof Ociepa](https://www.linkedin.com/in/krzysztof-ociepa-44886550/). This framework allows users to train language models with an architecture similar to LLaMA and Mistral in a fast and efficient way.


### Model description:

* **Developed by:** [SpeakLeash](https://speakleash.org/) & [ACK Cyfronet AGH](https://www.cyfronet.pl/)
* **Language:** Polish
* **Model type:** causal decoder-only
* **Finetuned from:** [Bielik-11B-v2](https://huggingface.co/speakleash/Bielik-11B-v2)
* **License:** Apache 2.0 and [Terms of Use](https://bielik.ai/terms/)
* **Model ref:** speakleash:a05d7fe0995e191985a863b48a39259b


### Quantized models:
We know that some people want to explore smaller models or don't have the resources to run a full model. Therefore, we have prepared quantized versions of the Bielik-11B-v2.1-Instruct model in separate repositories:
- [GGUF - Q4_K_M, Q5_K_M, Q6_K, Q8_0](https://huggingface.co/speakleash/Bielik-11B-v2.1-Instruct-GGUF)
- [GPTQ - 4bit](https://huggingface.co/speakleash/Bielik-11B-v2.1-Instruct-GPTQ)
- [FP8](https://huggingface.co/speakleash/Bielik-11B-v2.1-Instruct-FP8) (vLLM, SGLang - Ada Lovelace, Hopper optimized)
- [GGUF - experimental - IQ imatrix IQ1_M, IQ2_XXS, IQ3_XXS, IQ4_XS and calibrated Q4_K_M, Q5_K_M, Q6_K, Q8_0](https://huggingface.co/speakleash/Bielik-11B-v2.1-Instruct-GGUF-IQ-Imatrix)

Please note that quantized models may offer lower quality of generated answers compared to the full-sized variants.
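
As a quick illustration, the FP8 checkpoint can be served with vLLM much like the full-precision model. A minimal sketch, assuming a recent vLLM installation and an FP8-capable GPU (the sampling settings are illustrative):

```python
from vllm import LLM, SamplingParams

# Load the FP8 repository listed above (Ada Lovelace / Hopper GPUs).
llm = LLM(model="speakleash/Bielik-11B-v2.1-Instruct-FP8")

# ChatML prompt, matching the template described in the next section.
prompt = "<s><|im_start|> user\nJakie mamy pory roku?<|im_end|> \n<|im_start|> assistant\n"
params = SamplingParams(temperature=0.2, max_tokens=256, stop=["<|im_end|>"])

outputs = llm.generate(prompt, params)
print(outputs[0].outputs[0].text)
```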

  
### Chat template

Bielik-11B-v2.1-Instruct uses [ChatML](https://github.com/cognitivecomputations/OpenChatML) as the prompt format.

E.g.
```
prompt = "<s><|im_start|> user\nJakie mamy pory roku?<|im_end|> \n<|im_start|> assistant\n"
completion = "W Polsce mamy 4 pory roku: wiosna, lato, jesień i zima.<|im_end|> \n"
```

This format is available as a [chat template](https://huggingface.co/docs/transformers/main/chat_templating) via the `apply_chat_template()` method:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" # the device to load the model onto

model_name = "speakleash/Bielik-11B-v2.1-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

messages = [
    {"role": "system", "content": "Odpowiadaj krótko, precyzyjnie i wyłącznie w języku polskim."},
    {"role": "user", "content": "Jakie mamy pory roku w Polsce?"},
    {"role": "assistant", "content": "W Polsce mamy 4 pory roku: wiosna, lato, jesień i zima."},
    {"role": "user", "content": "Która jest najcieplejsza?"}
]

# Render the conversation with the model's ChatML chat template
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt")

model_inputs = input_ids.to(device)
model.to(device)

# Sample a continuation of the conversation
generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])
```

The fully formatted input conversation produced by `apply_chat_template` in the previous example:

```
<s><|im_start|> system
Odpowiadaj krótko, precyzyjnie i wyłącznie w języku polskim.<|im_end|> 
<|im_start|> user
Jakie mamy pory roku w Polsce?<|im_end|> 
<|im_start|> assistant
W Polsce mamy 4 pory roku: wiosna, lato, jesień i zima.<|im_end|> 
<|im_start|> user
Która jest najcieplejsza?<|im_end|>
```


## Evaluation

Bielik-11B-v2.1-Instruct has been evaluated on several benchmarks to assess its performance across various tasks and languages. These benchmarks include:

1. Open PL LLM Leaderboard
2. Open LLM Leaderboard
3. Polish MT-Bench
4. Polish EQ-Bench (Emotional Intelligence Benchmark)
5. MixEval
6. Chat Arena PL

The following sections provide detailed results for each of these benchmarks, demonstrating the model's capabilities in both Polish and English language tasks.

### Open PL LLM Leaderboard

Models have been evaluated on the [Open PL LLM Leaderboard](https://huggingface.co/spaces/speakleash/open_pl_llm_leaderboard) in a 5-shot setting. The benchmark covers NLP tasks such as sentiment analysis, categorization, and text classification, but does not test conversational skills. The Average column is the mean score across all tasks, normalized by baseline scores.
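
The leaderboard defines the exact normalization; purely as an illustration, a common baseline normalization maps each task's baseline score to 0 and a perfect score to 100 before averaging:

```python
def normalize(score: float, baseline: float, maximum: float = 100.0) -> float:
    # Illustrative only; the leaderboard's exact formula may differ.
    return 100.0 * (score - baseline) / (maximum - baseline)
```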
         

| Model                           | Parameters (B)| Average |
|---------------------------------|------------|---------|
| Meta-Llama-3.1-405B-Instruct-FP8,API | 405        | 69.44   |
| Mistral-Large-Instruct-2407     | 123        | 69.11   |
| Qwen2-72B-Instruct              | 72         | 65.87   |
| Bielik-11B-v2.2-Instruct        | 11         | 65.57   |
| Meta-Llama-3.1-70B-Instruct     | 70         | 65.49   |
| **Bielik-11B-v2.1-Instruct**        | **11**         | **65.45**   |
| Mixtral-8x22B-Instruct-v0.1     | 141        | 65.23   |
| Bielik-11B-v2.0-Instruct        | 11         | 64.98   |
| Meta-Llama-3-70B-Instruct       | 70         | 64.45   |
| Athene-70B                      | 70         | 63.65   |
| WizardLM-2-8x22B                | 141        | 62.35   |
| Qwen1.5-72B-Chat                | 72         | 58.67   |
| Qwen2-57B-A14B-Instruct         | 57         | 56.89   |
| glm-4-9b-chat                   | 9          | 56.61   |
| aya-23-35B                      | 35         | 56.37   |
| Phi-3.5-MoE-instruct            | 41.9       | 56.34   |
| openchat-3.5-0106-gemma         | 7          | 55.69   |
| Mistral-Nemo-Instruct-2407      | 12         | 55.27   |
| SOLAR-10.7B-Instruct-v1.0       | 10.7       | 55.24   |
| Mixtral-8x7B-Instruct-v0.1      | 46.7       | 55.07   |
| Bielik-7B-Instruct-v0.1         | 7          | 44.70   |
| trurl-2-13b-academic            | 13         | 36.28   |
| trurl-2-7b                      | 7          | 26.93   |

The results from the Open PL LLM Leaderboard demonstrate the exceptional performance of Bielik-11B-v2.1-Instruct:

1. Superior performance in its class: Bielik-11B-v2.1-Instruct outperforms all non-Bielik models with fewer than 70B parameters. This is a significant achievement, showcasing its efficiency and effectiveness despite having fewer parameters than many competitors.

2. Competitive with larger models: with a score of 65.45, Bielik-11B-v2.1-Instruct performs on par with models in the 70B parameter range, achieving comparable results to much larger models and demonstrating its advanced architecture and training methodology.

3. Substantial improvement over the previous version: the model shows a marked improvement over its predecessor, Bielik-7B-Instruct-v0.1, which scored 44.70. This leap in performance highlights the successful enhancements and optimizations implemented in this newer version.

4. Leading position for Polish language models: in the context of Polish language models, Bielik-11B-v2.1-Instruct stands out as a leader. There are no other competitive models specifically tailored for the Polish language that match its performance, making it a crucial resource for Polish NLP tasks.

These results underscore Bielik-11B-v2.1-Instruct's position as a state-of-the-art model for Polish language processing, offering high performance with relatively modest computational requirements.

#### Open PL LLM Leaderboard - Generative Tasks Performance

This section presents a focused comparison of generative Polish language task performance between Bielik models and GPT-3.5. The evaluation is limited to generative tasks due to the constraints of assessing OpenAI models. The comprehensive nature and associated costs of the benchmark explain the limited number of models evaluated.

| Model                         | Parameters (B) | Average (generative) |
|-------------------------------|----------------|---------------|
| **Bielik-11B-v2.1-Instruct**      | 11             | **66.58**         |
| Bielik-11B-v2.2-Instruct      | 11             | 66.11         |
| Bielik-11B-v2.0-Instruct      | 11             | 65.58         |
| gpt-3.5-turbo-instruct        | Unknown        | 55.65         |

The performance variation among Bielik versions is minimal, indicating consistent quality across iterations. Bielik-11B-v2.1-Instruct shows a 19.6% relative advantage over gpt-3.5-turbo-instruct (66.58 vs. 55.65).


### Open LLM Leaderboard

The Open LLM Leaderboard evaluates models on various English language tasks, providing insights into the model's performance across different linguistic challenges.

| Model                    | AVG   | arc_challenge | hellaswag | truthfulqa_mc2 | mmlu  | winogrande | gsm8k |
|--------------------------|-------|---------------|-----------|----------------|-------|------------|-------|
| Bielik-11B-v2.2-Instruct | 69.86 | 59.90         | 80.16     | 58.34          | 64.34 | 75.30      | 81.12 |
| **Bielik-11B-v2.1-Instruct** | **69.82** | 59.56 | 80.20 | 59.35 | 64.18 | 75.06 | 80.59 |
| Bielik-11B-v2.0-Instruct | 68.04 | 58.62 | 78.65 | 54.65 | 63.71 | 76.32 | 76.27 |
| Bielik-11B-v2          | 65.87 | 60.58         | 79.84     | 46.13          | 63.06 | 77.82      | 67.78 |
| Mistral-7B-Instruct-v0.2 | 65.71 | 63.14         | 84.88     | 68.26          | 60.78 | 77.19      | 40.03 |
| Bielik-7B-Instruct-v0.1  | 51.26 | 47.53         | 68.91     | 49.47          | 46.18 | 65.51      | 29.95 |



Bielik-11B-v2.1-Instruct shows impressive performance on English language tasks:

1. Significant improvement over its base model (4-point increase).
2. Substantial 18-point improvement over Bielik-7B-Instruct-v0.1.

These results demonstrate Bielik-11B-v2.1-Instruct's versatility in both Polish and English, highlighting the effectiveness of its instruction tuning process.

### Polish MT-Bench
The Bielik-11B-v2.1-Instruct (16-bit) model was also evaluated using the MT-Bench benchmark. Model quality was assessed with both the original English version (unmodified) and a Polish version created by SpeakLeash, in which the tasks and evaluation are in Polish and the task content was adapted to the context of the Polish language.

#### MT-Bench English
| Model           | Score    |
|-----------------|----------|
| **Bielik-11B-v2.1** | **8.537500** |
| Bielik-11B-v2.2 | 8.390625 |
| Bielik-11B-v2.0 | 8.159375 |

#### MT-Bench Polish
| Model                               | Parameters (B) | Score    |
|-------------------------------------|----------------|----------|
| Qwen2-72B-Instruct                  | 72             | 8.775000 |
| Mistral-Large-Instruct-2407 (123B)  | 123            | 8.662500 |
| gemma-2-27b-it                      | 27             | 8.618750 |
| Mixtral-8x22b                       | 141            | 8.231250 |
| Meta-Llama-3.1-405B-Instruct        | 405            | 8.168750 |
| Meta-Llama-3.1-70B-Instruct         | 70             | 8.150000 |
| Bielik-11B-v2.2-Instruct             | 11             | 8.115625 |
| **Bielik-11B-v2.1-Instruct**            | **11**             | **7.996875** |
| gpt-3.5-turbo                       | Unknown        | 7.868750 |
| Mixtral-8x7b                        | 46.7           | 7.637500 |
| Bielik-11B-v2.0-Instruct            | 11             | 7.562500 |
| Mistral-Nemo-Instruct-2407          | 12             | 7.368750 |
| openchat-3.5-0106-gemma             | 7              | 6.812500 |
| Mistral-7B-Instruct-v0.2            | 7              | 6.556250 |
| Meta-Llama-3.1-8B-Instruct          | 8              | 6.556250 |
| Bielik-7B-Instruct-v0.1             | 7              | 6.081250 |
| Mistral-7B-Instruct-v0.3            | 7              | 5.818750 |
| Polka-Mistral-7B-SFT                | 7              | 4.518750 |
| trurl-2-7b                          | 7              | 2.762500 |

Key observations on Bielik-11B-v2.1 performance:

1. Strong performance among mid-sized models: Bielik-11B-v2.1-Instruct scored **7.996875**, placing it ahead of several well-known models such as gpt-3.5-turbo (7.868750) and Mixtral-8x7b (7.637500). This indicates that Bielik-11B-v2.1-Instruct is competitive among mid-sized models, particularly those in the 11B-70B parameter range.

2. Competitive against larger models: Bielik-11B-v2.1-Instruct performs close to Meta-Llama-3.1-70B-Instruct (8.150000), Meta-Llama-3.1-405B-Instruct (8.168750), and even Mixtral-8x22b (8.231250), all of which have significantly more parameters. This efficiency relative to size makes it an attractive option for tasks where resource constraints are a consideration. Bielik generated 100% of its answers in Polish, while models not specifically trained for Polish sometimes answer Polish questions in English.

3. Significant improvement over previous versions: compared to its predecessor, **Bielik-7B-Instruct-v0.1**, which scored **6.081250**, Bielik-11B-v2.1-Instruct improves by almost **2 points**, highlighting substantial advancements in model quality, optimization, and training methodology.

For more information, including answers to the test tasks and scores in each category, visit the [MT-Bench PL](https://huggingface.co/spaces/speakleash/mt-bench-pl) website.

### Polish EQ-Bench

[Polish Emotional Intelligence Benchmark for LLMs](https://huggingface.co/spaces/speakleash/polish_eq-bench)

| Model                         | Parameters (B) | Score |
|-------------------------------|--------|-------|
| Mistral-Large-Instruct-2407   | 123    | 78.07 |
| Meta-Llama-3.1-405B-Instruct-FP8 | 405    | 77.23 |
| gpt-4o-2024-08-06             | ?      | 75.15 |
| gpt-4-turbo-2024-04-09        | ?      | 74.59 |
| Meta-Llama-3.1-70B-Instruct   | 70     | 72.53 |
| Qwen2-72B-Instruct            | 72     | 71.23 |
| Meta-Llama-3-70B-Instruct     | 70     | 71.21 |
| gpt-4o-mini-2024-07-18        | ?      | 71.15 |
| WizardLM-2-8x22B              | 141    | 69.56 |
| Bielik-11B-v2.2-Instruct      | 11     | 69.05 |
| Bielik-11B-v2.0-Instruct      | 11     | 68.24 |
| Qwen1.5-72B-Chat              | 72     | 68.03 |
| Mixtral-8x22B-Instruct-v0.1   | 141    | 67.63 |
| **Bielik-11B-v2.1-Instruct**      | **11**     | **60.07** |
| Qwen1.5-32B-Chat              | 32     | 59.63 |
| openchat-3.5-0106-gemma       | 7      | 59.58 |
| aya-23-35B                    | 35     | 58.41 |
| gpt-3.5-turbo                 | ?      | 57.7  |
| Qwen2-57B-A14B-Instruct       | 57     | 57.64 |
| Mixtral-8x7B-Instruct-v0.1    | 47     | 57.61 |
| SOLAR-10.7B-Instruct-v1.0     | 10.7   | 55.21 |
| Mistral-7B-Instruct-v0.2      | 7      | 47.02 |



### MixEval

MixEval is a ground-truth-based English benchmark designed to evaluate Large Language Models (LLMs) efficiently and effectively. Key features of MixEval include:

1. Derived from off-the-shelf benchmark mixtures
2. Highly capable model ranking with a 0.96 correlation to Chatbot Arena
3. Local and quick execution, requiring only 6% of the time and cost compared to running MMLU

This benchmark provides a robust and time-efficient method for assessing LLM performance, making it a valuable tool for ongoing model evaluation and comparison.

| Model                         | MixEval | MixEval-Hard |
|-------------------------------|---------|--------------|
| **Bielik-11B-v2.1-Instruct**      | **74.55**   | **45.00**        |
| Bielik-11B-v2.2-Instruct      | 72.35   | 39.65        |
| Bielik-11B-v2.0-Instruct      | 72.10   | 40.20        |
| Mistral-7B-Instruct-v0.2      | 70.00   | 36.20        |

The results show that Bielik-11B-v2.1-Instruct performs well on the MixEval benchmark, achieving a score of 74.55 on the standard MixEval and 45.00 on MixEval-Hard. Notably, Bielik-11B-v2.1-Instruct significantly outperforms Mistral-7B-Instruct-v0.2 on both metrics, demonstrating its improved capabilities despite being based on a similar architecture.


### Chat Arena PL

Chat Arena PL is a human-evaluated benchmark that provides a direct comparison of model performance through head-to-head battles. Unlike the automated benchmarks mentioned above, this evaluation relies on human judgment to assess the quality and effectiveness of model responses. The results offer valuable insights into how different models perform in real-world, conversational scenarios as perceived by human evaluators.
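
For context, ELO-style ratings are updated after each battle based on the expected outcome implied by the two models' current ratings. A minimal sketch (the K-factor and any arena-specific details are assumptions of this illustration):

```python
def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0) -> float:
    """Return model A's new rating after one battle.

    score_a is 1.0 for a win, 0.5 for a draw, 0.0 for a loss.
    """
    expected_a = 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))
    return rating_a + k * (score_a - expected_a)
```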

Results accessed on 2024-08-26.

| # | Model | Battles | Won | Lost | Draws | Win % | ELO |
|---|-------|-------|---------|-----------|--------|-------------|-----|
| 1 | Bielik-11B-v2.2-Instruct | 92 | 72 | 14 | 6 | 83.72% | 1234 |
| 2 | **Bielik-11B-v2.1-Instruct** | 240 | 171 | 50 | 19 | **77.38%** | 1174 |
| 3 | gpt-4o-mini | 639 | 402 | 117 | 120 | 77.46% | 1141 |
| 4 | Mistral Large 2 (2024-07) | 324 | 188 | 69 | 67 | 73.15% | 1125 |
| 5 | Llama-3.1-405B | 548 | 297 | 144 | 107 | 67.35% | 1090 |
| 6 | Bielik-11B-v2.0-Instruct | 1289 | 695 | 352 | 242 | 66.38% | 1059 |
| 7 | Llama-3.1-70B | 498 | 221 | 187 | 90 | 54.17% | 1033 |
| 8 | Bielik-1-7B | 2041 | 1029 | 638 | 374 | 61.73% | 1020 |
| 9 | Mixtral-8x22B-v0.1 | 432 | 166 | 167 | 99 | 49.85% | 1018 |
| 10 | Qwen2-72B | 451 | 179 | 177 | 95 | 50.28% | 1011 |
| 11 | gpt-3.5-turbo | 2186 | 1007 | 731 | 448 | 57.94% | 1008 |
| 12 | Llama-3.1-8B | 440 | 155 | 227 | 58 | 40.58% | 975 |
| 13 | Mixtral-8x7B-v0.1 | 1997 | 794 | 804 | 399 | 49.69% | 973 |
| 14 | Llama-3-70b | 2008 | 733 | 909 | 366 | 44.64% | 956 |
| 15 | Mistral Nemo (2024-07) | 301 | 84 | 164 | 53 | 33.87% | 954 |
| 16 | Llama-3-8b | 1911 | 473 | 1091 | 347 | 30.24% | 909 |
| 17 | gemma-7b-it | 1928 | 418 | 1221 | 289 | 25.5% | 888 |

The results show that Bielik-11B-v2.1-Instruct outperforms almost all other models in this benchmark. This impressive performance demonstrates its effectiveness in real-world conversational scenarios, as judged by human evaluators.

## Limitations and Biases

Bielik-11B-v2.1-Instruct is a quick demonstration that the base model can be easily fine-tuned to achieve compelling and promising performance. It does not have any moderation mechanisms. We look forward to engaging with the community on ways to make the model respect guardrails, allowing for deployment in environments requiring moderated outputs.

Bielik-11B-v2.1-Instruct can produce factually incorrect output and should not be relied on to produce factually accurate information. It was trained on various public datasets. While great effort has been taken to clean the training data, the model may still generate lewd, false, biased, or otherwise offensive outputs.

## Citation
Please cite this model using the following format:

```
@misc{Bielik11Bv21i,
    title     = {Bielik-11B-v2.1-Instruct model card},
    author    = {Ociepa, Krzysztof and Flis, Łukasz and Kinas, Remigiusz and Gwoździej, Adrian and Wróbel, Krzysztof and {SpeakLeash Team} and {Cyfronet Team}},
    year      = {2024},
    url       = {https://huggingface.co/speakleash/Bielik-11B-v2.1-Instruct},
    note      = {Accessed: 2024-09-10}, % change this date
    urldate   = {2024-09-10} % change this date
}
@unpublished{Bielik11Bv21a,
  author = {Ociepa, Krzysztof and Flis, Łukasz and Kinas, Remigiusz and Gwoździej, Adrian and Wróbel, Krzysztof},
  title  = {Bielik: A Family of Large Language Models for the Polish Language - Development, Insights, and Evaluation},
  year   = {2024},
}
```

## Responsible for training the model

* [Krzysztof Ociepa](https://www.linkedin.com/in/krzysztof-ociepa-44886550/)<sup>SpeakLeash</sup> - team leadership, conceptualizing, data preparation, process optimization and oversight of training
* [Łukasz Flis](https://www.linkedin.com/in/lukasz-flis-0a39631/)<sup>Cyfronet AGH</sup> - coordinating and supervising the training
* [Remigiusz Kinas](https://www.linkedin.com/in/remigiusz-kinas/)<sup>SpeakLeash</sup> - conceptualizing and coordinating DPO training, data preparation
* [Adrian Gwoździej](https://www.linkedin.com/in/adrgwo/)<sup>SpeakLeash</sup> - data preparation and ensuring data quality
* [Krzysztof Wróbel](https://www.linkedin.com/in/wrobelkrzysztof/)<sup>SpeakLeash</sup> - benchmarks


The model could not have been created without the commitment and work of the entire SpeakLeash team, whose contribution is invaluable. Thanks to the hard work of many individuals, it was possible to gather a large amount of content in Polish and establish collaboration between the open-science SpeakLeash project and the HPC center: ACK Cyfronet AGH. Individuals who contributed to the creation of the model:
[Sebastian Kondracki](https://www.linkedin.com/in/sebastian-kondracki/),
[Igor Ciuciura](https://www.linkedin.com/in/igor-ciuciura-1763b52a6/),
[Paweł Kiszczak](https://www.linkedin.com/in/paveu-kiszczak/),
[Szymon Baczyński](https://www.linkedin.com/in/szymon-baczynski/),
[Jacek Chwiła](https://www.linkedin.com/in/jacek-chwila/),
[Maria Filipkowska](https://www.linkedin.com/in/maria-filipkowska/),
[Jan Maria Kowalski](https://www.linkedin.com/in/janmariakowalski/),
[Karol Jezierski](https://www.linkedin.com/in/karol-jezierski/),
[Kacper Milan](https://www.linkedin.com/in/kacper-milan/),
[Jan Sowa](https://www.linkedin.com/in/janpiotrsowa/),
[Len Krawczyk](https://www.linkedin.com/in/magdalena-krawczyk-7810942ab/),
[Marta Seidler](https://www.linkedin.com/in/marta-seidler-751102259/),
[Agnieszka Ratajska](https://www.linkedin.com/in/agnieszka-ratajska/),
[Krzysztof Koziarek](https://www.linkedin.com/in/krzysztofkoziarek/),
[Szymon Pepliński](http://linkedin.com/in/szymonpeplinski/),
[Zuzanna Dabić](https://www.linkedin.com/in/zuzanna-dabic/),
[Filip Bogacz](https://linkedin.com/in/Fibogacci),
[Agnieszka Kosiak](https://www.linkedin.com/in/agn-kosiak),
[Izabela Babis](https://www.linkedin.com/in/izabela-babis-2274b8105/),
[Nina Babis](https://www.linkedin.com/in/nina-babis-00055a140/).

Members of the ACK Cyfronet AGH team providing valuable support and expertise: 
[Szymon Mazurek](https://www.linkedin.com/in/sz-mazurek-ai/),
[Marek Magryś](https://www.linkedin.com/in/magrys/),
[Mieszko Cholewa](https://www.linkedin.com/in/mieszko-cholewa-613726301/).

## Contact Us

If you have any questions or suggestions, please use the discussion tab. If you want to contact us directly, join our [Discord SpeakLeash](https://discord.gg/pv4brQMDTy).

# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_speakleash__Bielik-11B-v2.1-Instruct)

|      Metric       |Value|
|-------------------|----:|
|Avg.               |22.85|
|IFEval (0-Shot)    |50.90|
|BBH (3-Shot)       |36.29|
|MATH Lvl 5 (4-Shot)| 0.60|
|GPQA (0-shot)      |11.63|
|MuSR (0-shot)      |10.52|
|MMLU-PRO (5-shot)  |27.18|