Commit be07e94 by mfajcik (verified), 1 parent: 6668b28

Update README.md

Files changed (1)
  1. README.md +5 -3
README.md CHANGED
@@ -40,11 +40,13 @@ Figure 1: Training loss.
  <img src="figures/tloss_closeup.png" width="900"/>
  Figure 2: Training loss closeup. We mark the two hot-swap points, where training corpus #1 was switched for internal-corpus #2 and internal-corpus #2.1, respectively. The flat region between 112k and 119.5k steps is caused by missing data: due to an accident, we lost these logs.
 
- Additionally, we perform two ablations:
+ In Figure 2, we perform two ablations:
 
- - (a) After the first hot swap, we continued training on corpus #1 for a while.
+ - (a) After the first hot swap, we continued training on corpus #1 for a while. Result: The test loss is slightly better, which indicates a slight difference between the distributions of corpus #1 and corpus #2.
  - (b) At step 94,000, the training loss stopped decreasing, increased, and around step 120,000 (near hot swap #2) started decreasing again. To ablate whether this was an effect of the hot swap,
- - we resumed training from step 93,000 using corpus #3. The optimizer states were reinitialized.
+ - we resumed training from step 93,000 using corpus #3. The optimizer states were reinitialized. Result: Neither corpus #3 nor optimizer state reinitialization seems to mitigate the local divergence at step 94,000.
+
+ -
  <img src="figures/vloss_closeup.png" width="900"/>
  Figure 3: Test loss closeup, testing performed on a split of internal-corpus #1. See the Figure 2 description for an explanation of the ablations.
 
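
Ablation (b) resumes from an earlier checkpoint while discarding the optimizer state. The sketch below illustrates that idea only; it is not the training code behind this README. The checkpoint layout, the function name `resume_from_checkpoint`, and the Adam-style state fields (`exp_avg`, `exp_avg_sq`, `t`) are all hypothetical, since the README does not say which optimizer or checkpoint format was used.

```python
import copy


def resume_from_checkpoint(checkpoint, reinit_optimizer=True):
    """Return (weights, optimizer_state, step) for a resumed run.

    `checkpoint` is a hypothetical dict with the keys "step", "weights"
    and "optimizer_state"; the real checkpoint format is not documented.
    """
    # Model weights are always restored from the checkpoint.
    weights = copy.deepcopy(checkpoint["weights"])

    if reinit_optimizer:
        # Assumed Adam-style state: zeroed moment estimates, step counter reset.
        optimizer_state = {
            "exp_avg": [0.0] * len(weights),
            "exp_avg_sq": [0.0] * len(weights),
            "t": 0,
        }
    else:
        # Plain resume: keep the accumulated optimizer statistics.
        optimizer_state = copy.deepcopy(checkpoint["optimizer_state"])

    return weights, optimizer_state, checkpoint["step"]


# Toy checkpoint standing in for the one saved at step 93,000.
checkpoint = {
    "step": 93_000,
    "weights": [0.12, -0.40, 0.07],
    "optimizer_state": {
        "exp_avg": [0.01, -0.02, 0.00],
        "exp_avg_sq": [0.001, 0.004, 0.000],
        "t": 93_000,
    },
}

weights, opt_state, step = resume_from_checkpoint(checkpoint, reinit_optimizer=True)
```

Running the same resume with `reinit_optimizer=False` gives the control case: if the loss curve behaves the same either way, the optimizer state is unlikely to be the cause of the divergence.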