---
license: llama3.1
datasets:
- LVSTCK/macedonian-corpus-raw
language:
- mk
base_model:
- meta-llama/Llama-3.1-8B-Instruct
---

## Model Summary
This model is a Macedonian-language adaptation of Llama 3.1 8B. It was continually pretrained for one epoch on a deduplicated version of the Macedonian Corpus Raw dataset (roughly 1.6 billion tokens), making it well suited for Macedonian-language tasks such as text classification, language generation, and translation.
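
For quick experimentation, the model can be loaded with the Hugging Face `transformers` library. The snippet below is a minimal generation sketch; the repository id `LVSTCK/domestic-yak-8B` is an assumption inferred from the instruct version linked under Limitations.

```python
# Minimal generation sketch for the base model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LVSTCK/domestic-yak-8B"  # assumed repo id, inferred from the instruct variant
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # an 8B model in bf16 fits on a single ~24 GB GPU
    device_map="auto",
)

# This is a base model, so prompt it with text to continue rather than chat turns.
prompt = "Скопје е главен град на"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```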

### Results
The table below compares the performance of our model, domestic-yak-8B, with its base model, Llama 3.1-8B Instruct, evaluated using the [macedonian-llm-eval](https://github.com/LVSTCK/macedonian-llm-eval) benchmark.

As the table shows, domestic-yak-8B outperforms Llama 3.1-8B Instruct on every task.

| **Task**          | **domestic-yak-8B** | **Llama 3.1-8B Instruct** |
|-------------------|---------------------|---------------------------|
| **ARC Easy**      | **0.5244 ± 0.0102** | 0.4453 ± 0.0102 |
| **ARC Challenge** | **0.3183 ± 0.0136** | 0.2824 ± 0.0132 |
| **BoolQ**         | **0.7676 ± 0.0074** | 0.7639 ± 0.0074 |
| **HellaSwag**     | **0.4324 ± 0.0049** | 0.3740 ± 0.0048 |
| **OpenBookQA**    | **0.2920 ± 0.0204** | 0.2520 ± 0.0194 |
| **PIQA**          | **0.6687 ± 0.0110** | 0.5865 ± 0.0115 |
| **NQ Open**       | **0.0416 ± 0.0033** | 0.0335 ± 0.0030 |
| **WinoGrande**    | **0.6259 ± 0.0136** | 0.5683 ± 0.0139 |

## Key Details
- **Language:** Macedonian (`mk`)
- **Base Model:** [Meta Llama 3.1 8B](https://huggingface.co/meta-llama/Llama-3.1-8B)
- **Dataset:** [LVSTCK/macedonian-corpus-raw](https://huggingface.co/datasets/LVSTCK/macedonian-corpus-raw) (deduplicated version)
- **Training Tokens:** ~1.6 billion
- **Pretraining Epochs:** 1
- **Pretraining Objective:** Causal language modeling (full-parameter continued pretraining; all weights updated). A hedged training sketch follows this list.
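
The exact training configuration is not published here; the following is a minimal sketch of what full-parameter continued pretraining with the Hugging Face `Trainer` could look like. The starting checkpoint follows the `base_model` field in the metadata above; the text field name, sequence length, and all hyperparameters are assumptions, not the authors' configuration.

```python
# Hedged sketch of full-parameter continued pretraining with a causal LM objective.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_id = "meta-llama/Llama-3.1-8B-Instruct"  # per the model card metadata
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)  # all weights remain trainable

corpus = load_dataset("LVSTCK/macedonian-corpus-raw", split="train")

def tokenize(batch):
    # "text" is an assumed field name; adjust to the dataset's actual schema.
    return tokenizer(batch["text"], truncation=True, max_length=4096)

tokenized = corpus.map(tokenize, batched=True, remove_columns=corpus.column_names)

# mlm=False yields the standard causal (next-token prediction) objective.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="domestic-yak-8B",
    num_train_epochs=1,             # matches the one-epoch pretraining above
    per_device_train_batch_size=1,  # assumed; scale via gradient accumulation
    gradient_accumulation_steps=32,
    learning_rate=2e-5,             # assumed
    bf16=True,
)

Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()
```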

## Limitations
- **Biases:** The model may reflect biases present in the training data. The corpus was cleaned and deduplicated, but further bias mitigation may be necessary for sensitive applications.
- **Domain Specificity:** Although the dataset spans diverse domains, performance may vary on niche or underrepresented topics. The corpus is heavily skewed toward news-style text, while domains such as science and medicine are less represented.
- **Chat Capabilities:** This is the base model, so its chat capabilities are limited. For conversational use, see the [instruct version](https://huggingface.co/LVSTCK/domestic-yak-8B-instruct); a usage sketch follows below.
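
For chat, the instruct variant would typically be driven through its chat template. The snippet below is a hedged sketch assuming `LVSTCK/domestic-yak-8B-instruct` ships a Llama 3.1-style chat template.

```python
# Hedged sketch: chatting with the instruct variant via its chat template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LVSTCK/domestic-yak-8B-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Кажи ми нешто за Скопје."}]  # "Tell me something about Skopje."
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=200)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```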