Update README.md

The model was trained on 3 corpora, which were hot-swapped during the training.

<img src="figures/tloss_full.png" width="900"/>

_Figure 1: Training loss._

<img src="figures/tloss_closeup.png" width="900"/>

_Figure 2: Training loss closeup. We mark the two hot-swap points, where the training corpus was switched from corpus #1 to internal-corpus #2 and then to internal-corpus #2.1._

Additionally, we performed two ablations:

- (a) After the first hot swap, we continued training on corpus #1 for a while.
- (b) At step 94,000, the training loss stopped decreasing, increased, and around step 120,000 (near hot swap #2) started decreasing again. To ablate whether this was an effect of the hot swap, we resumed training from step 93,000 using corpus #3, with the optimizer states reinitialized (see the sketch below).
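
A minimal sketch of how ablation (b) can be set up, assuming a PyTorch / Hugging Face training loop; the checkpoint path, model class, and hyperparameters are illustrative, not the actual training configuration:

```python
import torch
from transformers import AutoModelForCausalLM

# Hypothetical checkpoint directory; the real run restarted from an internal step-93,000 checkpoint.
checkpoint_dir = "checkpoints/step_93000"

# Restore the model weights saved at step 93,000.
model = AutoModelForCausalLM.from_pretrained(checkpoint_dir)

# Deliberately build a fresh optimizer instead of loading its saved state,
# so momentum/variance statistics do not carry over across the corpus switch.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.1)

# Training then resumes on corpus #3 (data loading and the loop itself are omitted).
```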

<img src="figures/vloss_closeup.png" width="900"/>

_Figure 3: Test loss closeup; testing was performed on a split of internal-corpus #1. See the Figure 2 description for an explanation of the ablations._

## Training Method

To transfer knowledge from the English model to Czech, we developed a simple method that (i) aligns several tokens between the two vocabularies and (ii) copies the embeddings from the original language to the new language.

<img src="figures/tllama_test.png" width="900"/>

_Figure 4: Ablation: test perplexity over the course of training for the vocabulary swap method on TinyLLAMA. Our method (green curve) vs. training TinyLLAMA from scratch (blue curve)._

The vocabulary swap was done in the same way as in our [Czech-GPT-2](https://huggingface.co/BUT-FIT/Czech-GPT-2-XL-133k) model (check it out for a comprehensive description).
We managed to align 4,177 English tokens with corresponding Czech tokens.
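
To make step (ii) concrete, here is a minimal sketch of the embedding-copy idea, assuming the models and tokenizers load through Hugging Face `transformers`; the model identifiers and the string-matching alignment are illustrative placeholders, not the actual implementation (the Czech-GPT-2 card linked above describes the real procedure).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative identifiers, not the actual models/tokenizers used for training.
src_tok = AutoTokenizer.from_pretrained("english-base-model")   # original (English) vocabulary
tgt_tok = AutoTokenizer.from_pretrained("czech-tokenizer")      # new (Czech) vocabulary
model = AutoModelForCausalLM.from_pretrained("english-base-model")

# (i) Align tokens shared by the two vocabularies.
# Here we simply match identical token strings; the real alignment may differ.
src_vocab, tgt_vocab = src_tok.get_vocab(), tgt_tok.get_vocab()
aligned = {tgt_id: src_vocab[tok] for tok, tgt_id in tgt_vocab.items() if tok in src_vocab}

# (ii) Copy the embeddings of aligned tokens from the English model into the
# embedding matrix resized for the Czech vocabulary; unaligned rows keep their
# freshly initialized (or leftover) values.
old_emb = model.get_input_embeddings().weight.data.clone()
model.resize_token_embeddings(len(tgt_tok))
new_emb = model.get_input_embeddings().weight.data
with torch.no_grad():
    for tgt_id, src_id in aligned.items():
        new_emb[tgt_id] = old_emb[src_id]
```

Matching identical token strings is only a rough stand-in for the alignment step described above, but it shows where the 4,177 aligned token embeddings enter the new model.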