chrisociepa committed
Commit: cadc6fe
Parent(s): 6fb9f0c

Update README.md

README.md CHANGED

@@ -13,11 +13,11 @@ pipeline_tag: text-generation

# APT-1B-Base

## Introduction

At [Azurro](https://azurro.pl), we consistently place importance on using Open Source technologies, both in our projects and in our everyday lives. We have decided to share a base language model trained by us. We are confident that smaller language models have great potential, and direct access to them for everyone interested in such models further democratizes this significant and dynamically changing field.

## Statements

Training large language models requires a lot of computing power and is typically the domain of the major players on the market. However, does this mean that individuals or small companies cannot train language models capable of performing specific tasks? We decided to answer this question and train our own language model from scratch.

We have made the following statements:

@@ -35,7 +35,7 @@ All the currently available language models have been trained mainly with Englis

It is important to remember that models are only as good as the data they are trained on. Given the small size of the model, we trained it on carefully selected texts. This is why we have not used corpora such as Common Crawl, which contain a lot of poor-quality data. Our team prepared a set of sources that were then processed and used for training the model.

## Model

APT-1B-Base is a base model introducing a new series of APT (Azurro Pretrained Transformer) models. It has been trained with an original open source framework called [ALLaMo](https://github.com/chrisociepa/allamo). This framework allows the user to train language models similar to Meta AI's LLaMA models quickly and efficiently.

@@ -45,48 +45,58 @@ APT-1B-Base is an autoregressive language model based on the architecture of a t

A special tokenizer has been prepared and trained for the purpose of training the model.

### Model description:

* **Developed by:** [Azurro](https://azurro.pl)
* **Language:** Polish
* **Model type:** causal decoder-only
* **License:** CC BY NC 4.0 (non-commercial use)

### Model details:

| **Hyperparameter**   | **Value** |
|----------------------|-----------|
| Model Parameters     | 1060M     |
| Sequence Length      | 2048      |
| Vocabulary Size      | 8000      |
| Layers               | 20        |
| Heads                | 16        |
| d_head               | 128       |
| d_model              | 2048      |
| Dropout              | 0.0       |
| Bias                 | No        |
| Positional Encoding  | RoPE      |
| Activation Function  | SwiGLU    |
| Normalizing Function | RMSNorm   |
| Intermediate Size    | 5632      |
| Norm Epsilon         | 1e-06     |
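
As a sanity check, the reported parameter count can be reproduced approximately from the hyperparameters above. The sketch below assumes a LLaMA-style decoder (untied input and output embeddings, SwiGLU feed-forward with the listed intermediate size, RMSNorm weight vectors only, no biases); this breakdown is our reading of the table, not something stated in the card.

```
# Rough parameter count from the hyperparameters listed above (assumptions noted in the text).
vocab, d_model, n_layers, d_ff = 8000, 2048, 20, 5632

embeddings = vocab * d_model              # input token embeddings
attention  = 4 * d_model * d_model        # Q, K, V and output projections
mlp        = 3 * d_model * d_ff           # gate, up and down projections (SwiGLU)
norms      = 2 * d_model                  # two RMSNorm weight vectors per layer
per_layer  = attention + mlp + norms

total = embeddings + n_layers * per_layer + d_model + vocab * d_model  # + final norm + LM head
print(f"{total / 1e6:.0f}M")              # ~1060M, matching the table
```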

### Tokenizer details:

* type: BPE
* special tokens: 7
* alphabet size: 112
* vocabulary size: 8000
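
A quick way to inspect these tokenizer properties is to load it through the standard HuggingFace API. This is only a sketch and assumes the tokenizer is published in the same Azurro/APT-1B-Base repository:

```
import transformers

# Load the BPE tokenizer (assumed to ship with the model repository).
tokenizer = transformers.AutoTokenizer.from_pretrained('Azurro/APT-1B-Base')

print(tokenizer.vocab_size)                   # expected to be close to the 8000-entry vocabulary
print(tokenizer.tokenize("Witaj, świecie!"))  # Polish text split into BPE subword units
```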

## Training

### Training hyperparameters:

| **Hyperparameter**          | **Value** |
|-----------------------------|-----------|
| Micro Batch Size            | 1         |
| Gradient Accumulation Steps | 264       |
| Batch Size                  | 540672    |
| Learning Rate               | 3e-04     |
| Optimizer                   | AdamW     |
| β1, β2                      | 0.9, 0.95 |
| Adam_eps                    | 1e-8      |
| Weight Decay                | 0.1       |
| Grad Clip                   | 1.0       |
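
The reported batch size appears to be measured in tokens: one optimizer step accumulates 264 micro batches of a single 2048-token sequence each. This is our arithmetic reading of the table, not a statement from the card:

```
micro_batch_size = 1        # sequences per forward/backward pass
grad_accum_steps = 264      # micro batches accumulated per optimizer step
sequence_length  = 2048     # tokens per sequence

tokens_per_step = micro_batch_size * grad_accum_steps * sequence_length
print(tokens_per_step)      # 540672, the "Batch Size" value reported above
```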

### Dataset

Collecting a large amount of high quality training data is a great challenge. Over the past years at Azurro, we have carried out many projects involving Big Data processing, and this extensive experience allowed us to prepare a carefully selected training dataset quickly and efficiently.

@@ -96,7 +106,7 @@ Our training dataset contains:

* Polish Wikipedia: 970 million tokens
* web crawl data: 813 million tokens

## How to Use

Our model is fully compatible with HuggingFace - you can use it right away.

@@ -114,21 +124,21 @@ import transformers

```
model = transformers.AutoModelForCausalLM.from_pretrained('Azurro/APT-1B-Base', torch_dtype=torch.bfloat16)
```
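
For a quick end-to-end test, the model can be paired with its tokenizer and sampled from. The snippet below is a minimal sketch built on the standard transformers API; the Polish prompt and the generation settings are illustrative choices, not recommendations from the card:

```
import torch
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained('Azurro/APT-1B-Base')
model = transformers.AutoModelForCausalLM.from_pretrained('Azurro/APT-1B-Base', torch_dtype=torch.bfloat16)

# APT-1B-Base is a base model, so it simply continues the given text.
prompt = "Najważniejszym celem człowieka jest"
inputs = tokenizer(prompt, return_tensors='pt')

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.8)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```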

## Limitations and Biases

APT-1B-Base is not intended for deployment without fine-tuning. It should not be used for human-facing interactions without further guardrails and user consent.

APT-1B-Base can produce factually incorrect output, and should not be relied on to produce factually accurate information. APT-1B-Base was trained on various public datasets. While great efforts have been taken to clean the pretraining data, it is possible that this model could generate lewd, biased or otherwise offensive outputs.

## License

Because of an unclear legal situation, we have decided to publish the model under the CC BY NC 4.0 license, which allows for non-commercial use. The model can be used for scientific purposes and privately, as long as the license conditions are met.

## Disclaimer

The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model.

## Citation

Please cite this model using the following format:

```