Update README.md

The model was trained on 3 corpora, which were hot-swapped during the training.

<img src="figures/tloss_full.png" width="900"/>

_Figure 1: Training loss._

<img src="figures/tloss_closeup.png" width="900"/>

_Figure 2: Training loss closeup. We mark the two hot-swap points, where the training corpus was switched from corpus #1 to internal-corpus #2 and then to internal-corpus #2.1._

Additionally, we performed two ablations:

- (a) After the first hot swap, we continued training on corpus #1 for a while.
- (b) At step 94,000, the training loss stopped decreasing, increased, and around step 120,000 (near hot swap #2) started decreasing again. To ablate whether this was an effect of the hot swap, we resumed training from step 93,000 using corpus #3, with the optimizer states reinitialized (see the sketch below).
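
A minimal sketch of how ablation (b) can be set up, assuming a PyTorch / Hugging Face training loop; the checkpoint path, model class, and hyperparameters are illustrative, not the actual training configuration:

```python
import torch
from transformers import AutoModelForCausalLM

# Hypothetical checkpoint directory; the real run restarted from an internal step-93,000 checkpoint.
checkpoint_dir = "checkpoints/step_93000"

# Restore the model weights saved at step 93,000.
model = AutoModelForCausalLM.from_pretrained(checkpoint_dir)

# Deliberately build a fresh optimizer instead of loading its saved state,
# so momentum/variance statistics do not carry over across the corpus switch.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.1)

# Training then resumes on corpus #3 (data loading and the loop itself are omitted).
```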

<img src="figures/vloss_closeup.png" width="900"/>

_Figure 3: Test loss closeup; testing was performed on a split of internal-corpus #1. See the Figure 2 description for an explanation of the ablations._

## Training Method

To transfer knowledge from the English model to Czech, we developed a simple method that (i) aligns several tokens between the two vocabularies and (ii) copies the embeddings from the original language to the new language.

<img src="figures/tllama_test.png" width="900"/>

_Figure 4: Ablation: test perplexity over the course of training for the vocabulary swap method on TinyLLAMA. Our method (green curve) vs. training TinyLLAMA from scratch (blue curve)._

The vocabulary swap was done in the same way as in our [Czech-GPT-2](https://huggingface.co/BUT-FIT/Czech-GPT-2-XL-133k) model (check it out for a comprehensive description).
We managed to align 4,177 English tokens with corresponding Czech tokens.
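
To make step (ii) concrete, here is a minimal sketch of the embedding-copy idea, assuming the models and tokenizers load through Hugging Face `transformers`; the model identifiers and the string-matching alignment are illustrative placeholders, not the actual implementation (the Czech-GPT-2 card linked above describes the real procedure).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative identifiers, not the actual models/tokenizers used for training.
src_tok = AutoTokenizer.from_pretrained("english-base-model")   # original (English) vocabulary
tgt_tok = AutoTokenizer.from_pretrained("czech-tokenizer")      # new (Czech) vocabulary
model = AutoModelForCausalLM.from_pretrained("english-base-model")

# (i) Align tokens shared by the two vocabularies.
# Here we simply match identical token strings; the real alignment may differ.
src_vocab, tgt_vocab = src_tok.get_vocab(), tgt_tok.get_vocab()
aligned = {tgt_id: src_vocab[tok] for tok, tgt_id in tgt_vocab.items() if tok in src_vocab}

# (ii) Copy the embeddings of aligned tokens from the English model into the
# embedding matrix resized for the Czech vocabulary; unaligned rows keep their
# freshly initialized (or leftover) values.
old_emb = model.get_input_embeddings().weight.data.clone()
model.resize_token_embeddings(len(tgt_tok))
new_emb = model.get_input_embeddings().weight.data
with torch.no_grad():
    for tgt_id, src_id in aligned.items():
        new_emb[tgt_id] = old_emb[src_id]
```

Matching identical token strings is only a rough stand-in for the alignment step described above, but it shows where the 4,177 aligned token embeddings enter the new model.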