Update README.md
README.md CHANGED
@@ -27,14 +27,24 @@ We encountered loss spikes during training. As the model always recovered, and o
- (b) in preliminary ablations, they only appear for continuously pretrained models. While we do not know why they appear, we hypothesize this might be linked to the theory of [Adam instability in time-domain correlation of update vectors](https://arxiv.org/pdf/2304.09871.pdf). However,
such instabilities were previously observed only for much larger models (larger than 65B).

### Corpora

The model was trained on three corpora, which were hot-swapped during training. These were collected and filtered over the course of the training.

- Corpus #1 was the same we used for our [Czech GPT-2](https://huggingface.co/BUT-FIT/Czech-GPT-2-XL-133k) training (15,621,685,248 tokens).
- Corpus #2 contained 67,981,934,592 tokens, coming mostly from the HPLT and CulturaX corpora.
- Corpus #3 is Corpus #2 after we removed a proportion of inappropriate content (which had evaded our other checks) using a linear classifier (see the filtering sketch after this list).

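As a rough illustration of the kind of linear-classifier filter used for Corpus #3 (not our actual pipeline, labels, features, or threshold, all of which are placeholders here), a TF-IDF plus logistic-regression scorer could drop documents that score as inappropriate:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy labelled data: 1 = inappropriate, 0 = acceptable (placeholders, not our labels).
train_texts = ["some unwanted spam text", "ordinary encyclopedic article", "more spam", "news paragraph"]
train_labels = [1, 0, 1, 0]

vectorizer = TfidfVectorizer(max_features=50_000)
clf = LogisticRegression(max_iter=1000)
clf.fit(vectorizer.fit_transform(train_texts), train_labels)

# Score unlabelled corpus documents and keep only those below a chosen threshold.
corpus_docs = ["a clean document", "spam spam spam"]
scores = clf.predict_proba(vectorizer.transform(corpus_docs))[:, 1]
filtered = [doc for doc, s in zip(corpus_docs, scores) if s < 0.5]  # 0.5 is an arbitrary cut-off
print(filtered)
```
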
<img src="figures/tloss_full.png" width="900"/>
Figure 1: Training loss.

<img src="figures/tloss_closeup.png" width="900"/>
Figure 2: Training loss closeup. We mark the two hot-swap points, where training corpus #1 was switched for internal-corpus #2 and internal-corpus #2.1, respectively.

Additionally, we performed two ablations:

- (a) After the first hot swap, we continued training on corpus #1 for a while.
- (b) At step 94,000 the training loss stopped decreasing, then increased, and only around step 120,000 (near hot swap #2) started decreasing again. To test whether this was an effect of the hot swap, we resumed training from step 93,000 using corpus #3, with the optimizer states reinitialized (a minimal sketch of this setup follows below).

<img src="figures/vloss_closeup.png" width="900"/>
Figure 3: Test loss closeup, testing performed on a split of internal-corpus #1. See the Figure 2 description for an explanation of the ablations.

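For ablation (b), the key steps are restoring the model weights from the step-93,000 checkpoint, creating a fresh optimizer so that Adam's moment estimates are discarded, and continuing on corpus #3. The snippet below is only a minimal sketch of this pattern with a toy model; the checkpoint path and dataloader helper are hypothetical and this is not our training code:

```python
import torch
from torch import nn

# Toy stand-in for the language model; checkpoint path is hypothetical.
model = nn.Linear(16, 16)
torch.save({"model": model.state_dict(), "step": 93_000}, "checkpoint_step_93000.pt")

# --- resuming with reinitialized optimizer states (the ablation setup) ---
ckpt = torch.load("checkpoint_step_93000.pt", map_location="cpu")
model.load_state_dict(ckpt["model"])                         # weights from step 93,000
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)   # fresh Adam moments
start_step = ckpt["step"]

# Hot swap: from here on, batches would come from a loader built over corpus #3,
# e.g. train_loader = make_dataloader("corpus_3")  # hypothetical helper
```
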
## Training Method

@@ -83,7 +93,7 @@ with torch.autocast('cuda', dtype=torch.bfloat16):

# Training Data

We release most (95.79%) of our training data as the [BUT-Large Czech Collection](https://huggingface.co/datasets/BUT-FIT/but_lcc) corpus.
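Assuming the released corpus follows the usual Hugging Face datasets layout (the split name and streaming option below are assumptions; check the dataset card for the actual configuration), it could be loaded like this:

```python
from datasets import load_dataset

# Stream the released corpus from the Hub; the "train" split is an assumption,
# see https://huggingface.co/datasets/BUT-FIT/but_lcc for the actual layout.
lcc = load_dataset("BUT-FIT/but_lcc", split="train", streaming=True)
for example in lcc.take(3):   # peek at a few documents without downloading everything
    print(example)
```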

# Our Release Plan

@@ -98,10 +108,24 @@ We release most of our training data here \[TBD MDocekal.\].

For further questions, email `[email protected]`.

# Disclaimer

This is a probabilistic model, so its outputs are stochastic. The authors are not responsible for the model outputs. Use at your own risk.

# Acknowledgement

This work was supported by the NAKI III program of the Ministry of Culture of the Czech Republic, project semANT ---
"Sémantický průzkumník textového kulturního dědictví", grant no. `DH23P03OVV060`, and
by the Ministry of Education, Youth and Sports of the Czech Republic through e-INFRA CZ (ID: `90254`).

# Citation

```bibtex
@article{benczechmark,
  author        = {Martin Fajčík and Martin Dočekal and Jan Doležal and Karel Beneš and Michal Hradiš},
  title         = {BenCzechMark: Machine Language Understanding Benchmark for Czech Language},
  journal       = {arXiv preprint arXiv:insert-arxiv-number-here},
  year          = {2024},
  month         = {March},
  eprint        = {insert-arxiv-number-here},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
}
```
|