Update README.md
README.md CHANGED
@@ -285,7 +285,7 @@ The pre-training corpus comprises data from 35 European languages and 92 programming languages
285 | The initial 1.5 training epochs used 2.4 trillion tokens, obtained by manually adjusting data proportions to balance the representation
286 | and give more importance to Spain's co-official languages (Spanish, Catalan, Galician, and Basque). To this end, code and English data were downsampled to half,
287 | Spanish co-official languages were oversampled by 2x, and the remaining languages were kept in their original proportions.
288 | - Following, during the following
288 | + Subsequently, during the following epochs (training is still in progress), the Colossal OSCAR dataset was replaced with the FineWebEdu dataset.
289 | This adjustment resulted in a total of 2.68 trillion tokens, distributed as outlined below:
290 |
291 | ![lang distrib](./images/corpus_languages.png)
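As a rough illustration of the re-balancing scheme described in the hunk above (code and English downsampled to half, Spain's co-official languages oversampled by 2x, everything else kept at its original proportion), here is a minimal Python sketch. The source names and token counts are hypothetical placeholders, not the corpus's actual statistics, and this is not the project's real data pipeline.

```python
# Illustrative sketch only: applying the sampling multipliers described above
# to per-source token counts. All counts are hypothetical placeholder values.

# Hypothetical raw token counts per data source, in billions of tokens.
raw_tokens_b = {
    "code": 600.0,
    "english": 800.0,
    "spanish": 150.0,
    "catalan": 20.0,
    "galician": 5.0,
    "basque": 3.0,
    "other_european": 400.0,
}

# Multipliers as described in the README: code and English halved,
# Spain's co-official languages doubled, everything else unchanged (1x).
multipliers = {
    "code": 0.5,
    "english": 0.5,
    "spanish": 2.0,
    "catalan": 2.0,
    "galician": 2.0,
    "basque": 2.0,
}

# Apply each source's multiplier, defaulting to 1.0 for unlisted sources.
adjusted = {src: count * multipliers.get(src, 1.0) for src, count in raw_tokens_b.items()}

total = sum(adjusted.values())
for src, count in sorted(adjusted.items(), key=lambda kv: -kv[1]):
    print(f"{src:>15}: {count:8.1f}B tokens ({count / total:6.2%} of the epoch mix)")
```

The multiplier table can be read as epoch-level sampling weights: a multiplier of 2 means data from that source is, on average, seen twice as often per epoch as its raw share would imply, while 0.5 means half as often.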