Update README.md
README.md CHANGED
@@ -285,7 +285,7 @@ The pre-training corpus comprises data from 35 European languages and 92 programming languages
285 | The initial 1.5 training epochs used 2.4 trillion tokens, obtained by manually adjusting data proportions to balance the representation
286 | and give more importance to Spain's co-official languages (Spanish, Catalan, Galician, and Basque). To this end, code and English data were downsampled to half,
287 | Spanish co-official languages were oversampled by 2x, and the remaining languages were kept in their original proportions.
288 | - Following, during the following
288 | + Subsequently, during the following epochs (training is still in progress), the Colossal OSCAR dataset was replaced with the FineWebEdu dataset.
289 | This adjustment resulted in a total of 2.68 trillion tokens, distributed as outlined below:
290 |
291 | ![lang distrib](./images/corpus_languages.png)
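As a rough illustration of the re-balancing scheme described in the hunk above (code and English downsampled to half, Spain's co-official languages oversampled by 2x, everything else kept at its original proportion), here is a minimal Python sketch. The source names and token counts are hypothetical placeholders, not the corpus's actual statistics, and this is not the project's real data pipeline.

```python
# Illustrative sketch only: applying the sampling multipliers described above
# to per-source token counts. All counts are hypothetical placeholder values.

# Hypothetical raw token counts per data source, in billions of tokens.
raw_tokens_b = {
    "code": 600.0,
    "english": 800.0,
    "spanish": 150.0,
    "catalan": 20.0,
    "galician": 5.0,
    "basque": 3.0,
    "other_european": 400.0,
}

# Multipliers as described in the README: code and English halved,
# Spain's co-official languages doubled, everything else unchanged (1x).
multipliers = {
    "code": 0.5,
    "english": 0.5,
    "spanish": 2.0,
    "catalan": 2.0,
    "galician": 2.0,
    "basque": 2.0,
}

# Apply each source's multiplier, defaulting to 1.0 for unlisted sources.
adjusted = {src: count * multipliers.get(src, 1.0) for src, count in raw_tokens_b.items()}

total = sum(adjusted.values())
for src, count in sorted(adjusted.items(), key=lambda kv: -kv[1]):
    print(f"{src:>15}: {count:8.1f}B tokens ({count / total:6.2%} of the epoch mix)")
```

The multiplier table can be read as epoch-level sampling weights: a multiplier of 2 means data from that source is, on average, seen twice as often per epoch as its raw share would imply, while 0.5 means half as often.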