chrisociepa committed
Commit: cadc6fe
Parent(s): 6fb9f0c

Update README.md

README.md CHANGED

@@ -13,11 +13,11 @@ pipeline_tag: text-generation

# APT-1B-Base

## Introduction

At [Azurro](https://azurro.pl), we consistently place importance on using Open Source technologies, both in our projects and in our everyday lives. We have decided to share a base language model trained by us. We are confident that smaller language models have great potential, and direct access to them for everyone interested in such models further democratizes this significant and dynamically changing field.

## Statements

Training large language models requires a lot of computing power and is typically the domain of the major players on the market. However, does this mean that individuals or small companies cannot train language models capable of performing specific tasks? We decided to answer this question and train our own language model from scratch.

We have made the following statements:

@@ -35,7 +35,7 @@ All the currently available language models have been trained mainly with Englis

It is important to remember that models are only as good as the data they are trained on. Given the small size of the model, we trained it on carefully selected texts. This is why we have not used corpora such as Common Crawl, which contain a lot of poor-quality data. Our team prepared a set of sources that were then processed and used for training the model.

## Model

APT-1B-Base is a base model introducing a new series of APT (Azurro Pretrained Transformer) models. It has been trained with an original open source framework called [ALLaMo](https://github.com/chrisociepa/allamo). This framework allows the user to train language models similar to Meta AI's LLaMA models quickly and efficiently.

@@ -45,48 +45,58 @@ APT-1B-Base is an autoregressive language model based on the architecture of a t

A special tokenizer has been prepared and trained for the purpose of training the model.

### Model description:

* **Developed by:** [Azurro](https://azurro.pl)
* **Language:** Polish
* **Model type:** causal decoder-only
* **License:** CC BY NC 4.0 (non-commercial use)

### Model details:

| **Hyperparameter**   | **Value** |
|----------------------|-----------|
| Model Parameters     | 1060M     |
| Sequence Length      | 2048      |
| Vocabulary Size      | 8000      |
| Layers               | 20        |
| Heads                | 16        |
| d_head               | 128       |
| d_model              | 2048      |
| Dropout              | 0.0       |
| Bias                 | No        |
| Positional Encoding  | RoPE      |
| Activation Function  | SwiGLU    |
| Normalizing Function | RMSNorm   |
| Intermediate Size    | 5632      |
| Norm Epsilon         | 1e-06     |
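
As a sanity check, the reported parameter count can be reproduced approximately from the hyperparameters above. The sketch below assumes a LLaMA-style decoder (untied input and output embeddings, SwiGLU feed-forward with the listed intermediate size, RMSNorm weight vectors only, no biases); this breakdown is our reading of the table, not something stated in the card.

```
# Rough parameter count from the hyperparameters listed above (assumptions noted in the text).
vocab, d_model, n_layers, d_ff = 8000, 2048, 20, 5632

embeddings = vocab * d_model              # input token embeddings
attention  = 4 * d_model * d_model        # Q, K, V and output projections
mlp        = 3 * d_model * d_ff           # gate, up and down projections (SwiGLU)
norms      = 2 * d_model                  # two RMSNorm weight vectors per layer
per_layer  = attention + mlp + norms

total = embeddings + n_layers * per_layer + d_model + vocab * d_model  # + final norm + LM head
print(f"{total / 1e6:.0f}M")              # ~1060M, matching the table
```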

### Tokenizer details:

* type: BPE
* special tokens: 7
* alphabet size: 112
* vocabulary size: 8000
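
A quick way to inspect these tokenizer properties is to load it through the standard HuggingFace API. This is only a sketch and assumes the tokenizer is published in the same Azurro/APT-1B-Base repository:

```
import transformers

# Load the BPE tokenizer (assumed to ship with the model repository).
tokenizer = transformers.AutoTokenizer.from_pretrained('Azurro/APT-1B-Base')

print(tokenizer.vocab_size)                   # expected to be close to the 8000-entry vocabulary
print(tokenizer.tokenize("Witaj, świecie!"))  # Polish text split into BPE subword units
```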

## Training

### Training hyperparameters:

| **Hyperparameter**          | **Value** |
|-----------------------------|-----------|
| Micro Batch Size            | 1         |
| Gradient Accumulation Steps | 264       |
| Batch Size                  | 540672    |
| Learning Rate               | 3e-04     |
| Optimizer                   | AdamW     |
| β1, β2                      | 0.9, 0.95 |
| Adam_eps                    | 1e-8      |
| Weight Decay                | 0.1       |
| Grad Clip                   | 1.0       |
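
The reported batch size appears to be measured in tokens: one optimizer step accumulates 264 micro batches of a single 2048-token sequence each. This is our arithmetic reading of the table, not a statement from the card:

```
micro_batch_size = 1        # sequences per forward/backward pass
grad_accum_steps = 264      # micro batches accumulated per optimizer step
sequence_length  = 2048     # tokens per sequence

tokens_per_step = micro_batch_size * grad_accum_steps * sequence_length
print(tokens_per_step)      # 540672, the "Batch Size" value reported above
```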

### Dataset

Collecting a large amount of high quality training data is a great challenge. Over the past years at Azurro, we have carried out many projects involving Big Data processing, and this extensive experience allowed us to prepare a carefully selected training dataset quickly and efficiently.

@@ -96,7 +106,7 @@ Our training dataset contains:

* Polish Wikipedia: 970 million tokens
* web crawl data: 813 million tokens

## How to Use

Our model is fully compatible with HuggingFace - you can use it right away.

@@ -114,21 +124,21 @@ import transformers

```
model = transformers.AutoModelForCausalLM.from_pretrained('Azurro/APT-1B-Base', torch_dtype=torch.bfloat16)
```
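
For a quick end-to-end test, the model can be paired with its tokenizer and sampled from. The snippet below is a minimal sketch built on the standard transformers API; the Polish prompt and the generation settings are illustrative choices, not recommendations from the card:

```
import torch
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained('Azurro/APT-1B-Base')
model = transformers.AutoModelForCausalLM.from_pretrained('Azurro/APT-1B-Base', torch_dtype=torch.bfloat16)

# APT-1B-Base is a base model, so it simply continues the given text.
prompt = "Najważniejszym celem człowieka jest"
inputs = tokenizer(prompt, return_tensors='pt')

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.8)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```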

## Limitations and Biases

APT-1B-Base is not intended for deployment without fine-tuning. It should not be used for human-facing interactions without further guardrails and user consent.

APT-1B-Base can produce factually incorrect output, and should not be relied on to produce factually accurate information. APT-1B-Base was trained on various public datasets. While great efforts have been taken to clean the pretraining data, it is possible that this model could generate lewd, biased or otherwise offensive outputs.

## License

Because of an unclear legal situation, we have decided to publish the model under the CC BY NC 4.0 license, which allows for non-commercial use. The model can be used for scientific purposes and privately, as long as the license conditions are met.

## Disclaimer

The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model.

## Citation

Please cite this model using the following format:

```