---
license: llama3.1
datasets:
- LVSTCK/macedonian-corpus-raw
language:
- mk
base_model:
- meta-llama/Llama-3.1-8B-Instruct
---

## Model Summary

This model is a Macedonian-language adaptation of Llama 3.1 8B. It was continually pretrained for one epoch on a deduplicated version of the Macedonian Corpus Raw dataset (approximately 1.6 billion tokens), making it well suited to Macedonian-language tasks such as text classification, text generation, and translation.
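
Below is a minimal usage sketch with the Hugging Face `transformers` library. The repository id `LVSTCK/domestic-yak-8B` is an assumption inferred from the model name and the instruct variant linked later in this card; adjust it if the actual id differs.

```python
# Minimal sketch: load the base model and generate Macedonian text.
# NOTE: the repo id below is assumed from the model name in this card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LVSTCK/domestic-yak-8B"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 8B weights fit on a single ~24 GB GPU in bf16
    device_map="auto",
)

prompt = "Скопје е главниот град на"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
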
### Results

The table below compares the performance of our model, domestic-yak-8B, with its base model, Llama 3.1-8B Instruct, evaluated using the [macedonian-llm-eval](https://github.com/LVSTCK/macedonian-llm-eval) benchmark.

As shown in the table, domestic-yak-8B outperforms its base model on every task.

| **Task**          | **domestic-yak-8B** | **Llama 3.1-8B Instruct** |
|-------------------|---------------------|---------------------------|
| **ARC Easy**      | **0.5244 ± 0.0102** | 0.4453 ± 0.0102           |
| **ARC Challenge** | **0.3183 ± 0.0136** | 0.2824 ± 0.0132           |
| **BoolQ**         | **0.7676 ± 0.0074** | 0.7639 ± 0.0074           |
| **HellaSwag**     | **0.4324 ± 0.0049** | 0.3740 ± 0.0048           |
| **Openbook QA**   | **0.2920 ± 0.0204** | 0.2520 ± 0.0194           |
| **PIQA**          | **0.6687 ± 0.0110** | 0.5865 ± 0.0115           |
| **NQ Open**       | **0.0416 ± 0.0033** | 0.0335 ± 0.0030           |
| **WinoGrande**    | **0.6259 ± 0.0136** | 0.5683 ± 0.0139           |

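Assuming the ± values are standard errors of the mean score, as lm-evaluation-harness-style benchmarks typically report (an assumption, not stated in this card), the short sketch below shows how such a pair is derived from per-example correctness:

```python
# Illustrative only: how an "accuracy ± standard error" pair is typically
# computed from per-example 0/1 correctness scores.
# Assumption: the benchmark reports the standard error of the mean accuracy.
import math

def acc_with_stderr(correct_flags: list[int]) -> tuple[float, float]:
    n = len(correct_flags)
    acc = sum(correct_flags) / n
    stderr = math.sqrt(acc * (1 - acc) / n)  # standard error of a proportion
    return acc, stderr

acc, se = acc_with_stderr([1, 0, 1, 1, 0, 1, 1, 1])
print(f"{acc:.4f} ± {se:.4f}")
```
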
## Key Details

- **Language:** Macedonian (`mk`)
- **Base Model:** [Meta Llama 3.1 8B Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
- **Dataset:** [LVSTCK/macedonian-corpus-raw](https://huggingface.co/datasets/LVSTCK/macedonian-corpus-raw) (deduplicated version)
- **Training Tokens:** ~1.6 billion
- **Pretraining Epochs:** 1 epoch
- **Pretraining Objective:** Causal language modeling (full-parameter continued pretraining; see the sketch after this list)

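As a rough illustration of this setup, the sketch below performs full-parameter continued pretraining with a causal language modeling objective using the Hugging Face `Trainer`. The starting checkpoint matches the `base_model` field in this card's metadata; the text column name, sequence length, and hyperparameters are illustrative assumptions, not the actual training configuration.

```python
# Illustrative sketch of full-parameter continued pretraining with a causal LM
# objective. Hyperparameters, sequence length, and the "text" column name are
# assumptions for illustration, not the configuration used for this model.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_id = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_id)  # all weights trainable

dataset = load_dataset("LVSTCK/macedonian-corpus-raw", split="train")

def tokenize(batch):
    # Assumes the corpus exposes a "text" column.
    return tokenizer(batch["text"], truncation=True, max_length=4096)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="domestic-yak-8B",
        num_train_epochs=1,                 # one pass over the corpus
        per_device_train_batch_size=1,
        gradient_accumulation_steps=32,
        learning_rate=2e-5,
        bf16=True,
    ),
    train_dataset=tokenized,
    # mlm=False -> causal language modeling (next-token prediction)
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```
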
## Limitations

- **Biases:** The model may reflect biases present in the training dataset. The corpus was cleaned and deduplicated, but further bias mitigation may be necessary for sensitive applications.
- **Domain Specificity:** Although the dataset covers diverse domains, performance may vary on niche or underrepresented topics. For example, the corpus is heavily skewed toward news-style texts, while domains such as science or medicine are less represented.
- **Chat Capabilities:** This is the base (pretrained-only) release, so its chat capabilities may be limited. For conversational use, see the [instruct version](https://huggingface.co/LVSTCK/domestic-yak-8B-instruct); a minimal chat sketch follows this list.
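
A minimal chat sketch for the linked instruct version is shown below, assuming it ships a standard chat template with its tokenizer (an assumption; check the instruct model card for the exact usage):

```python
# Minimal chat sketch for the instruct version (assumes a standard chat template
# is bundled with the tokenizer; verify against the instruct model card).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LVSTCK/domestic-yak-8B-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Кажи ми нешто за Охридското Езеро."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```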