# Ahma-3b for Finnish

Ahma is a 3B-parameter decoder-only transformer model based on Meta's Llama (v1) architecture, pretrained on the Finnish language. The original Llama model architecture was introduced in [this paper](https://arxiv.org/abs/2302.13971) and first released at [this page](https://github.com/facebookresearch/llama).

What does Ahma mean? Ahma is the Finnish word for wolverine!

There are two different-sized Ahma models, both pretrained from scratch for 139B tokens:

| Model | Context length | Layers | Dim | Heads | Params |
|:------|:---------------|:-------|:-----|:------|:-------|
| [Ahma-3B](https://huggingface.co/Finnish-NLP/Ahma-3B) | 2048 | 26 | 3200 | 32 | 3.6B |
| [Ahma-7B](https://huggingface.co/Finnish-NLP/Ahma-7B) | 2048 | 32 | 4096 | 32 | 7.0B |
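
As a rough sanity check on the table, the parameter counts follow from the listed shapes. Below is a back-of-the-envelope sketch for Ahma-3B; the SwiGLU intermediate width (8640) and vocabulary size (64k) are assumptions not given in the table, and small terms (norms, rotary embeddings) are ignored:

```python
# Approximate decoder-only Llama parameter count from the Ahma-3B row above.
dim, layers, vocab, ffn = 3200, 26, 64_000, 8640  # ffn and vocab are assumed

attn = 4 * dim * dim          # q, k, v and output projections
mlp = 3 * dim * ffn           # gate, up and down projections (SwiGLU)
embeddings = 2 * vocab * dim  # input embeddings plus an untied LM head

total = layers * (attn + mlp) + embeddings
print(f"~{total / 1e9:.2f}B parameters")  # ~3.63B, close to the listed 3.6B
```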

This model was pretrained only in a self-supervised way, without any supervised fine-tuning.

### How to use

If you want to use this model for instruction-following, you need to use the same prompt format we used in the second stage of the pretraining (basically the same format that Meta used in their Llama2 models). **Note: do not use "LlamaTokenizer" from the transformers library; always use AutoTokenizer instead, or use the plain sentencepiece tokenizer.** Here is an example using the instruction-following prompt format, with some generation arguments you can modify for your use:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
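# The lines below complete the example as a minimal sketch: the model id is
# the real Hugging Face repo, but the Finnish prompt text and the generation
# argument values are illustrative assumptions you should tune yourself.
tokenizer = AutoTokenizer.from_pretrained("Finnish-NLP/Ahma-3B")
model = AutoModelForCausalLM.from_pretrained("Finnish-NLP/Ahma-3B")

# Llama2-style instruction wrapping, as used in the second pretraining stage.
prompt = "[INST] Kerro kolme faktaa ahmoista. [/INST]"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.6,
    repetition_penalty=1.2,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```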

The final training dataset had 23 billion words (calculated with the regex "\w+").

The first stage:

| Dataset | Words | Ratio |
|:--------|:------|:------|
| CulturaX | 12.820B | 59.88\% |
| HPLT v1.2 | 5.034B | 23.51\% |
| Suomi24 | 3.018B | 14.09\% |

The second stage:

| Dataset | Words | Ratio |
|:--------|:------|:------|
| CulturaX (cleaner sample using KenLM perplexity score) | 2.252B | 55.48\% |
| Wikipedia | 0.095B | 2.34\% |
| STT | 0.253B | 6.23\% |
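
The stage tables above define weighted sampling mixtures over the sources. This excerpt doesn't show how the sampling was implemented; one common way to realize such ratios with the Hugging Face `datasets` library is `interleave_datasets`, sketched here with placeholder data files for the sources that lack a known hub id:

```python
from datasets import interleave_datasets, load_dataset

# CulturaX is on the Hub; the other two paths are placeholders for local
# copies of HPLT v1.2 and Suomi24.
culturax = load_dataset("uonlp/CulturaX", "fi", split="train", streaming=True)
hplt = load_dataset("json", data_files="hplt_v1_2_fi.jsonl", split="train", streaming=True)
suomi24 = load_dataset("json", data_files="suomi24_fi.jsonl", split="train", streaming=True)

# First-stage ratios from the table, renormalized because the excerpt above
# omits the smaller sources.
ratios = [0.5988, 0.2351, 0.1409]
probabilities = [r / sum(ratios) for r in ratios]

mixed = interleave_datasets(
    [culturax, hplt, suomi24],
    probabilities=probabilities,
    seed=42,
)
```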

This Ahma model was primarily evaluated using [FIN-bench by TurkuNLP](https://github.com/TurkuNLP/FIN-bench), and the same evaluation was carried out for other relevant Finnish models for comparison. Below are the results with 0-shot and 3-shot settings in FIN-bench:

| Benchmark | Ahma 3B (instruct prompt format) 0-shot | Ahma 7B (instruct prompt format) 0-shot | FinGPT 8B 0-shot | Viking 7B 0-shot | Poro 34B (8bit quant) 0-shot |
|:----------|:----------------------------------------|:----------------------------------------|:-----------------|:-----------------|:-----------------------------|
| Analogies | 50.77 | TBA | 49.23 | 40.00 | 54.62 |
| Arithmetic | 27.64 | TBA | 33.15 | 30.16 | 30.34 |
| Cause and Effect | 59.48 | TBA | 66.01 | 58.82 | 62.74 |

| Benchmark | Ahma 3B (instruct prompt format) 3-shot | Ahma 7B (instruct prompt format) 3-shot | FinGPT 8B 3-shot | Viking 7B 3-shot | Poro 34B (8bit quant) 3-shot |
|:----------|:----------------------------------------|:----------------------------------------|:-----------------|:-----------------|:-----------------------------|
| Analogies | 52.31 | TBA | 40.77 | 54.62 | 76.92 |
| Arithmetic | 44.59 | TBA | 43.63 | 45.78 | 53.68 |
| Cause and Effect | 61.44 | TBA | 64.05 | 58.17 | 67.32 |
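
For reference, "3-shot" here means three solved examples are prepended to each test question before the model answers. FIN-bench's harness handles this automatically, but the mechanics look roughly like the sketch below (separators and Finnish labels are illustrative, not the harness's exact format):

```python
# Generic few-shot prompt assembly; the actual FIN-bench formatting may differ.
shots = [
    ("Kysymys: ...", "Vastaus: ..."),
    ("Kysymys: ...", "Vastaus: ..."),
    ("Kysymys: ...", "Vastaus: ..."),
]
test_question = "Kysymys: ..."

prompt = "\n\n".join(f"{q}\n{a}" for q, a in shots)
prompt += f"\n\n{test_question}\nVastaus:"
```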

In a 3-shot setting, the results are more mixed.

This Ahma model was also evaluated using [MTBench Finnish by LumiOpen](https://github.com/LumiOpen/FastChat/tree/main/fastchat/llm_judge), even though this Ahma model is not fine-tuned for chat. Since MTBench also evaluates multi-turn chats, while the Ahma models were only pretrained with single-turn instruction-following examples, we report the MTBench Finnish results separately for single-turn and multi-turn evaluation examples. The [Poro 34B Chat](https://huggingface.co/LumiOpen/Poro-34B-chat) model's results are copied from its model card for comparison.

| Benchmark | Ahma 3B (instruct prompt format) single-turn | Ahma 3B (instruct prompt format) multi-turn | Ahma 7B (instruct prompt format) single-turn | Ahma 7B (instruct prompt format) multi-turn | Poro 34B Chat multi-turn |
|:----------|:---------------------------------------------|:--------------------------------------------|:---------------------------------------------|:--------------------------------------------|:-------------------------|
| Coding | 1.00 | 1.00 | TBA | TBA | 3.05 |
| Extraction | 2.00 | 1.55 | TBA | TBA | 6.05 |
| Humanities | 4.05 | 3.25 | TBA | TBA | 9.6 |
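
Because Ahma follows the Llama2 prompt convention, a multi-turn conversation would be serialized roughly as below. This is a sketch of the standard Llama2 convention, not code from the Ahma authors; since Ahma was only pretrained on single-turn examples, everything past the first turn is effectively out of distribution, which likely explains the single-turn versus multi-turn gap above:

```python
# Standard Llama2-style multi-turn serialization (sketch).
def to_llama2_chat(history: list[tuple[str, str]], next_user_msg: str) -> str:
    """history holds (user, assistant) pairs from completed turns."""
    text = "".join(
        f"<s>[INST] {user} [/INST] {assistant} </s>"
        for user, assistant in history
    )
    return text + f"<s>[INST] {next_user_msg} [/INST]"
```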