Update README.md
README.md CHANGED
@@ -2,7 +2,7 @@
 license: apache-2.0
 ---
 # Introduction
-CSMPT7b is a large Czech language model continously pretrained from English [MPT7b](https://huggingface.co/mosaicml/mpt-7b) model. Model
+CSMPT7b is a large Czech language model continuously pretrained for 272b training steps from the English [MPT7b](https://huggingface.co/mosaicml/mpt-7b) model. The model was pretrained on the ~67b-token [Large Czech Collection](https://huggingface.co/datasets/BUT-FIT/but_lcc) with a Czech tokenizer, obtained using our vocabulary swap method (see below).
 
 # Eval
 Dev eval at CS-HellaSwag (automatically translated HellaSwag benchmark).
@@ -48,7 +48,33 @@ Figure 3: Test loss closeup, testing performed on split of internal-corpus #1. S
 
 
 ## Training Method
-
+### Vocabulary Swap
+The vocabulary swap was done the same way as for our [Czech-GPT-2](https://huggingface.co/BUT-FIT/Czech-GPT-2-XL-133k) model (see its model card for a comprehensive description).
+We managed to align 4,177 English tokens with their corresponding Czech tokens.
+
+## Hyperparameters
+Hyperparameters not mentioned below were kept the same as for MPT.
+| **Name**                   | **Value**   | **Note**                                                                                         |
+|----------------------------|-------------|--------------------------------------------------------------------------------------------------|
+| training sw                | llm-foundry | We've done some minor patching (e.g., to allow DDP sync over file)                               |
+| dataset_type               | Concat      | Sequences at the model's input were concatenated up to `$max_seq_len`, divided by the EOS token  |
+| tokenizer_size             | 64k         | Same as in [Czech-GPT-2](https://huggingface.co/BUT-FIT/Czech-GPT-2-XL-133k)                     |
+| max_seq_len                | 2048        |                                                                                                  |
+| batch_size                 | 1024        |                                                                                                  |
+| learning_rate              | 1.0e-4      |                                                                                                  |
+| optimizer                  | LionW       |                                                                                                  |
+| optimizer_betas            | 0.9/0.95    |                                                                                                  |
+| optimizer_weight_decay     | 0           |                                                                                                  |
+| optimizer_eps              | 1.0e-08     |                                                                                                  |
+| gradient_clipping_max_norm | 1.0         |                                                                                                  |
+| attn_impl                  | flash2      | We used the triton flash-attn 1 implementation for the initial ~60k steps                        |
+| positional_encoding        | alibi       |                                                                                                  |
+| fsdp                       | FULL_SHARD  | (we had implementation issues with hybrid sharding in llm-foundry)                               |
+| precision                  | bf16        |                                                                                                  |
+| scheduler                  | cosine      |                                                                                                  |
+| scheduler_warmup           | 100 steps   |                                                                                                  |
+| scheduler_steps            | 170,000     |                                                                                                  |
+| scheduler_alpha            | 0.1         | So the LR at the last step is 0.1 * (vanilla LR)                                                 |
 
 
 # Usage
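The vocabulary swap added above can be hard to picture from prose alone. Below is a minimal, illustrative sketch (not the authors' code; their actual procedure is described in the linked Czech-GPT-2 model card) of how the 4,177 aligned token pairs could be used: embedding vectors of aligned English tokens are copied from the MPT-7b embedding matrix into the positions of their Czech counterparts in a freshly initialized Czech embedding matrix, while unaligned rows keep a random init. The function name `swap_vocab_embeddings` and the `aligned_pairs` argument are assumptions made for the example.

```python
# Illustrative sketch of a vocabulary-swap initialization (assumed, not the authors' code).
import torch


def swap_vocab_embeddings(
    src_embedding: torch.Tensor,           # (src_vocab_size, d_model), taken from MPT-7b
    tgt_vocab_size: int,                   # e.g., 64k for the Czech tokenizer
    aligned_pairs: list[tuple[int, int]],  # (english_token_id, czech_token_id) pairs
) -> torch.Tensor:
    d_model = src_embedding.shape[1]
    # Fresh random init for the new (Czech) embedding matrix.
    tgt_embedding = torch.empty(tgt_vocab_size, d_model)
    torch.nn.init.normal_(tgt_embedding, mean=0.0, std=0.02)
    # Copy over the vectors of tokens that were aligned across the two tokenizers,
    # so continued pretraining starts from partially meaningful Czech embeddings.
    for src_id, tgt_id in aligned_pairs:
        tgt_embedding[tgt_id] = src_embedding[src_id].clone()
    return tgt_embedding
```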
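The `dataset_type: Concat` row in the table above is likewise easier to read with a concrete example. The sketch below shows the general packing scheme it describes (tokenized documents joined by the EOS token and cut into `max_seq_len` blocks); it is an assumption about the technique in general, not llm-foundry's actual implementation, and `pack_sequences` is a hypothetical helper name.

```python
# Illustrative sketch of "Concat"-style sequence packing (assumed, not llm-foundry code).
from typing import Iterable, Iterator


def pack_sequences(
    tokenized_docs: Iterable[list[int]],
    eos_id: int,
    max_seq_len: int = 2048,
) -> Iterator[list[int]]:
    buffer: list[int] = []
    for doc in tokenized_docs:
        # Documents are concatenated and separated by the EOS token.
        buffer.extend(doc)
        buffer.append(eos_id)
        # Emit fixed-length training examples as soon as enough tokens accumulate.
        while len(buffer) >= max_seq_len:
            yield buffer[:max_seq_len]
            buffer = buffer[max_seq_len:]
    # Any trailing partial block would be dropped or padded in practice.
```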
@@ -92,7 +118,7 @@ with torch.autocast('cuda', dtype=torch.bfloat16):
 
 ```
 # Training Data
-We release most (95.79%) of our training data corpus [BUT-Large Czech Collection](https://huggingface.co/datasets/BUT-FIT/but_lcc).
+We release most (95.79%) of our training data corpus as [BUT-Large Czech Collection](https://huggingface.co/datasets/BUT-FIT/but_lcc).
 
 
 # Our Release Plan