jsaizant committed
Commit d7fb243 · verified · 1 Parent(s): 59aec5d

Update README.md

Files changed (1): README.md (+1 -1)
README.md CHANGED
@@ -285,7 +285,7 @@ The pre-training corpus comprises data from 35 European languages and 92 program
 The initial 1.5 training epochs used 2.4 trillion tokens, obtained by manually adjusting data proportion to balance the representation
 and give more importance to Spain’s co-official (Spanish, Catalan, Galician, and Basque). This way, we downsampled code and English data to half,
 Spanish co-official languages were oversampled by 2x, and the remaining languages were kept in their original proportions.
-Following, during the following two epochs, the Colossal OSCAR dataset was replaced with the FineWebEdu dataset.
+Following, during the following epochs (still training), the Colossal OSCAR dataset was replaced with the FineWebEdu dataset.
 This adjustment resulted in a total of 2.68 trillion tokens, distributed as outlined below:

 ![lang distrib](./images/corpus_languages.png)
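
For context, a minimal sketch of the reweighting arithmetic described in the diffed paragraph (code and English downsampled to half, Spain’s co-official languages oversampled by 2x, remaining languages unchanged). The source names and token counts below are illustrative assumptions, not the actual corpus statistics from the README:

```python
# Hypothetical per-source token counts in billions (made-up numbers,
# only to illustrate the multipliers described in the README).
raw_tokens = {
    "code": 600.0,
    "english": 900.0,
    "spanish": 150.0,
    "catalan": 40.0,
    "galician": 10.0,
    "basque": 8.0,
    "other_languages": 500.0,
}

# Multipliers as described: 0.5 for code/English, 2.0 for co-official
# languages, 1.0 (original proportion) for everything else.
weights = {
    "code": 0.5,
    "english": 0.5,
    "spanish": 2.0,
    "catalan": 2.0,
    "galician": 2.0,
    "basque": 2.0,
    "other_languages": 1.0,
}

# Effective token budget per source after reweighting, plus its share of the mix.
effective = {src: n * weights[src] for src, n in raw_tokens.items()}
total = sum(effective.values())
for src, n in sorted(effective.items(), key=lambda kv: -kv[1]):
    print(f"{src:>16}: {n:7.1f}B tokens ({100 * n / total:5.1f}% of mix)")
```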