# Ahma-3b for Finnish

Ahma is a 3B-parameter decoder-only transformer model based on Meta's Llama (v1) architecture, pretrained on the Finnish language. The original Llama model architecture was introduced in [this paper](https://arxiv.org/abs/2302.13971) and first released at [this page](https://github.com/facebookresearch/llama).

What does Ahma mean? Ahma is the Finnish word for wolverine!

There are two different-sized Ahma models, both pretrained from scratch for 139B tokens:

| Model | Context length | Layers | Dim | Heads | Params |
|:------|:---------------|:-------|:-----|:------|:-------|
| [Ahma-3B](https://huggingface.co/Finnish-NLP/Ahma-3B) | 2048 | 26 | 3200 | 32 | 3.6B |
| [Ahma-7B](https://huggingface.co/Finnish-NLP/Ahma-7B) | 2048 | 32 | 4096 | 32 | 7.0B |
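
As a rough sanity check on the table, the parameter counts follow from the listed shapes. Below is a back-of-the-envelope sketch for Ahma-3B; the SwiGLU intermediate width (8640) and vocabulary size (64k) are assumptions not given in the table, and small terms (norms, rotary embeddings) are ignored:

```python
# Approximate decoder-only Llama parameter count from the Ahma-3B row above.
dim, layers, vocab, ffn = 3200, 26, 64_000, 8640  # ffn and vocab are assumed

attn = 4 * dim * dim          # q, k, v and output projections
mlp = 3 * dim * ffn           # gate, up and down projections (SwiGLU)
embeddings = 2 * vocab * dim  # input embeddings plus an untied LM head

total = layers * (attn + mlp) + embeddings
print(f"~{total / 1e9:.2f}B parameters")  # ~3.63B, close to the listed 3.6B
```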

This model was pretrained only in a self-supervised way, without any supervised fine-tuning.

### How to use

If you want to use this model for instruction-following, you need to use the same prompt format we used in the second stage of the pretraining (basically the same format that Meta used in their Llama2 models). **Note: do not use "LlamaTokenizer" from the transformers library; always use AutoTokenizer instead, or use the plain sentencepiece tokenizer.** Here is an example using the instruction-following prompt format, with some generation arguments you can modify for your use:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
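# The lines below complete the example as a minimal sketch: the model id is
# the real Hugging Face repo, but the Finnish prompt text and the generation
# argument values are illustrative assumptions you should tune yourself.
tokenizer = AutoTokenizer.from_pretrained("Finnish-NLP/Ahma-3B")
model = AutoModelForCausalLM.from_pretrained("Finnish-NLP/Ahma-3B")

# Llama2-style instruction wrapping, as used in the second pretraining stage.
prompt = "[INST] Kerro kolme faktaa ahmoista. [/INST]"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.6,
    repetition_penalty=1.2,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```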

The final training dataset had 23 billion words (calculated with the regex "\w+").

The first stage:

| Dataset | Words | Ratio |
|:--------|:------|:------|
| CulturaX | 12.820B | 59.88\% |
| HPLT v1.2 | 5.034B | 23.51\% |
| Suomi24 | 3.018B | 14.09\% |

The second stage:

| Dataset | Words | Ratio |
|:--------|:------|:------|
| CulturaX (cleaner sample using KenLM perplexity score) | 2.252B | 55.48\% |
| Wikipedia | 0.095B | 2.34\% |
| STT | 0.253B | 6.23\% |
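
The stage tables above define weighted sampling mixtures over the sources. This excerpt doesn't show how the sampling was implemented; one common way to realize such ratios with the Hugging Face `datasets` library is `interleave_datasets`, sketched here with placeholder data files for the sources that lack a known hub id:

```python
from datasets import interleave_datasets, load_dataset

# CulturaX is on the Hub; the other two paths are placeholders for local
# copies of HPLT v1.2 and Suomi24.
culturax = load_dataset("uonlp/CulturaX", "fi", split="train", streaming=True)
hplt = load_dataset("json", data_files="hplt_v1_2_fi.jsonl", split="train", streaming=True)
suomi24 = load_dataset("json", data_files="suomi24_fi.jsonl", split="train", streaming=True)

# First-stage ratios from the table, renormalized because the excerpt above
# omits the smaller sources.
ratios = [0.5988, 0.2351, 0.1409]
probabilities = [r / sum(ratios) for r in ratios]

mixed = interleave_datasets(
    [culturax, hplt, suomi24],
    probabilities=probabilities,
    seed=42,
)
```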

This Ahma model was primarily evaluated using [FIN-bench by TurkuNLP](https://github.com/TurkuNLP/FIN-bench), and the same evaluation was carried out for other relevant Finnish models for comparison. Below are the results with 0-shot and 3-shot settings in FIN-bench:

| Benchmark | Ahma 3B (instruct prompt format) 0-shot | Ahma 7B (instruct prompt format) 0-shot | FinGPT 8B 0-shot | Viking 7B 0-shot | Poro 34B (8bit quant) 0-shot |
|:----------|:----------------------------------------|:----------------------------------------|:-----------------|:-----------------|:-----------------------------|
| Analogies | 50.77 | TBA | 49.23 | 40.00 | 54.62 |
| Arithmetic | 27.64 | TBA | 33.15 | 30.16 | 30.34 |
| Cause and Effect | 59.48 | TBA | 66.01 | 58.82 | 62.74 |

| Benchmark | Ahma 3B (instruct prompt format) 3-shot | Ahma 7B (instruct prompt format) 3-shot | FinGPT 8B 3-shot | Viking 7B 3-shot | Poro 34B (8bit quant) 3-shot |
|:----------|:----------------------------------------|:----------------------------------------|:-----------------|:-----------------|:-----------------------------|
| Analogies | 52.31 | TBA | 40.77 | 54.62 | 76.92 |
| Arithmetic | 44.59 | TBA | 43.63 | 45.78 | 53.68 |
| Cause and Effect | 61.44 | TBA | 64.05 | 58.17 | 67.32 |
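
For reference, "3-shot" here means three solved examples are prepended to each test question before the model answers. FIN-bench's harness handles this automatically, but the mechanics look roughly like the sketch below (separators and Finnish labels are illustrative, not the harness's exact format):

```python
# Generic few-shot prompt assembly; the actual FIN-bench formatting may differ.
shots = [
    ("Kysymys: ...", "Vastaus: ..."),
    ("Kysymys: ...", "Vastaus: ..."),
    ("Kysymys: ...", "Vastaus: ..."),
]
test_question = "Kysymys: ..."

prompt = "\n\n".join(f"{q}\n{a}" for q, a in shots)
prompt += f"\n\n{test_question}\nVastaus:"
```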

In a 3-shot setting, the results are more mixed.

This Ahma model was also evaluated using [MTBench Finnish by LumiOpen](https://github.com/LumiOpen/FastChat/tree/main/fastchat/llm_judge), even though this Ahma model is not fine-tuned for chat. Since MTBench also evaluates multi-turn chats, while the Ahma models were only pretrained with single-turn instruction-following examples, we report the MTBench Finnish results separately for single-turn and multi-turn evaluation examples. The [Poro 34B Chat](https://huggingface.co/LumiOpen/Poro-34B-chat) model's results are copied from its model card for comparison.

| Benchmark | Ahma 3B (instruct prompt format) single-turn | Ahma 3B (instruct prompt format) multi-turn | Ahma 7B (instruct prompt format) single-turn | Ahma 7B (instruct prompt format) multi-turn | Poro 34B Chat multi-turn |
|:----------|:---------------------------------------------|:--------------------------------------------|:---------------------------------------------|:--------------------------------------------|:-------------------------|
| Coding | 1.00 | 1.00 | TBA | TBA | 3.05 |
| Extraction | 2.00 | 1.55 | TBA | TBA | 6.05 |
| Humanities | 4.05 | 3.25 | TBA | TBA | 9.6 |
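
Because Ahma follows the Llama2 prompt convention, a multi-turn conversation would be serialized roughly as below. This is a sketch of the standard Llama2 convention, not code from the Ahma authors; since Ahma was only pretrained on single-turn examples, everything past the first turn is effectively out of distribution, which likely explains the single-turn versus multi-turn gap above:

```python
# Standard Llama2-style multi-turn serialization (sketch).
def to_llama2_chat(history: list[tuple[str, str]], next_user_msg: str) -> str:
    """history holds (user, assistant) pairs from completed turns."""
    text = "".join(
        f"<s>[INST] {user} [/INST] {assistant} </s>"
        for user, assistant in history
    )
    return text + f"<s>[INST] {next_user_msg} [/INST]"
```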