---
license: llama3.1
datasets:
- LVSTCK/macedonian-corpus-raw
language:
- mk
base_model:
- meta-llama/Llama-3.1-8B-Instruct
---

## Model Summary

This model is a Macedonian-language adaptation of Llama 3.1 8B. It was continually pretrained for one epoch on a deduplicated version of the Macedonian Corpus Raw dataset (approximately 1.6 billion tokens), making it well suited to Macedonian-language tasks such as text classification, text generation, and translation.
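
Below is a minimal usage sketch with the Hugging Face `transformers` library. The repository id `LVSTCK/domestic-yak-8B` is an assumption inferred from the model name and the instruct variant linked later in this card; adjust it if the actual id differs.

```python
# Minimal sketch: load the base model and generate Macedonian text.
# NOTE: the repo id below is assumed from the model name in this card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LVSTCK/domestic-yak-8B"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 8B weights fit on a single ~24 GB GPU in bf16
    device_map="auto",
)

prompt = "Скопје е главниот град на"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
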
### Results

The table below compares the performance of our model, domestic-yak-8B, with its base model, Llama 3.1-8B Instruct, evaluated using the [macedonian-llm-eval](https://github.com/LVSTCK/macedonian-llm-eval) benchmark.

As shown in the table, domestic-yak-8B outperforms its base model on every task.

| **Task**          | **domestic-yak-8B** | **Llama 3.1-8B Instruct** |
|-------------------|---------------------|---------------------------|
| **ARC Easy**      | **0.5244 ± 0.0102** | 0.4453 ± 0.0102           |
| **ARC Challenge** | **0.3183 ± 0.0136** | 0.2824 ± 0.0132           |
| **BoolQ**         | **0.7676 ± 0.0074** | 0.7639 ± 0.0074           |
| **HellaSwag**     | **0.4324 ± 0.0049** | 0.3740 ± 0.0048           |
| **Openbook QA**   | **0.2920 ± 0.0204** | 0.2520 ± 0.0194           |
| **PIQA**          | **0.6687 ± 0.0110** | 0.5865 ± 0.0115           |
| **NQ Open**       | **0.0416 ± 0.0033** | 0.0335 ± 0.0030           |
| **WinoGrande**    | **0.6259 ± 0.0136** | 0.5683 ± 0.0139           |

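Assuming the ± values are standard errors of the mean score, as lm-evaluation-harness-style benchmarks typically report (an assumption, not stated in this card), the short sketch below shows how such a pair is derived from per-example correctness:

```python
# Illustrative only: how an "accuracy ± standard error" pair is typically
# computed from per-example 0/1 correctness scores.
# Assumption: the benchmark reports the standard error of the mean accuracy.
import math

def acc_with_stderr(correct_flags: list[int]) -> tuple[float, float]:
    n = len(correct_flags)
    acc = sum(correct_flags) / n
    stderr = math.sqrt(acc * (1 - acc) / n)  # standard error of a proportion
    return acc, stderr

acc, se = acc_with_stderr([1, 0, 1, 1, 0, 1, 1, 1])
print(f"{acc:.4f} ± {se:.4f}")
```
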
## Key Details

- **Language:** Macedonian (`mk`)
- **Base Model:** [Meta Llama 3.1 8B Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
- **Dataset:** [LVSTCK/macedonian-corpus-raw](https://huggingface.co/datasets/LVSTCK/macedonian-corpus-raw) (deduplicated version)
- **Training Tokens:** ~1.6 billion
- **Pretraining Epochs:** 1 epoch
- **Pretraining Objective:** Causal language modeling (full-parameter continued pretraining; see the sketch after this list)

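As a rough illustration of this setup, the sketch below performs full-parameter continued pretraining with a causal language modeling objective using the Hugging Face `Trainer`. The starting checkpoint matches the `base_model` field in this card's metadata; the text column name, sequence length, and hyperparameters are illustrative assumptions, not the actual training configuration.

```python
# Illustrative sketch of full-parameter continued pretraining with a causal LM
# objective. Hyperparameters, sequence length, and the "text" column name are
# assumptions for illustration, not the configuration used for this model.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_id = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_id)  # all weights trainable

dataset = load_dataset("LVSTCK/macedonian-corpus-raw", split="train")

def tokenize(batch):
    # Assumes the corpus exposes a "text" column.
    return tokenizer(batch["text"], truncation=True, max_length=4096)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="domestic-yak-8B",
        num_train_epochs=1,                 # one pass over the corpus
        per_device_train_batch_size=1,
        gradient_accumulation_steps=32,
        learning_rate=2e-5,
        bf16=True,
    ),
    train_dataset=tokenized,
    # mlm=False -> causal language modeling (next-token prediction)
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```
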
## Limitations

- **Biases:** The model may reflect biases present in the training dataset. The corpus was cleaned and deduplicated, but further bias mitigation may be necessary for sensitive applications.
- **Domain Specificity:** Although the dataset covers diverse domains, performance may vary on niche or underrepresented topics. For example, the corpus is heavily skewed toward news-style texts, while domains such as science or medicine are less represented.
- **Chat Capabilities:** This is the base (pretrained-only) release, so its chat capabilities may be limited. For conversational use, see the [instruct version](https://huggingface.co/LVSTCK/domestic-yak-8B-instruct); a minimal chat sketch follows this list.
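
A minimal chat sketch for the linked instruct version is shown below, assuming it ships a standard chat template with its tokenizer (an assumption; check the instruct model card for the exact usage):

```python
# Minimal chat sketch for the instruct version (assumes a standard chat template
# is bundled with the tokenizer; verify against the instruct model card).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LVSTCK/domestic-yak-8B-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Кажи ми нешто за Охридското Езеро."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```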