jsaizant committed on
Commit 59aec5d · verified · 1 Parent(s): 5c5381e

Update README.md

Files changed (1)
  1. README.md +17 -17
README.md CHANGED
@@ -281,18 +281,19 @@ for output in outputs:

### Pretraining Data

- The training corpus consists of 2.4 trillion tokens, including 35 European languages and 92 programming languages. It amounts to a total of 33TB of pre-processed text.
- Languages were sampled manually by giving x2 oversampling to Spain's co-official languages (Spanish, Catalan, Galician and Basque), code was undersampled by half,
- and the rest of the languages were kept as is, resulting in the following distribution:
+ The pre-training corpus comprises data from 35 European languages and 92 programming languages, with detailed data sources provided below.
+ The initial 1.5 training epochs used 2.4 trillion tokens, obtained by manually adjusting data proportions to balance the representation
+ and give more importance to Spain's co-official languages (Spanish, Catalan, Galician, and Basque). To this end, code and English data were downsampled to half,
+ Spain's co-official languages were oversampled by 2x, and the remaining languages were kept in their original proportions.
+ Then, during the following two epochs, the Colossal OSCAR dataset was replaced with the FineWebEdu dataset.
+ This adjustment resulted in a total of 2.68 trillion tokens, distributed as outlined below:

![lang distrib](./images/corpus_languages.png)

- This highly multilingual corpus is predominantly composed of data from Colossal OSCAR,
- which contributes a significant 66.06% of the total tokens.
- Following this, Starcoder provides 11.91%, and HPLT adds 3.34%.
- The next largest sources are French PD at 3.12% and Proof Pile at 1.98%.
- Other notable contributions include Macocu, Pile of Law, and Eurlex, each contributing around 1.5% to 1.3%.
- These major sources collectively form the bulk of the corpus, ensuring a rich and diverse dataset for training the language model.
+ The pretraining corpus is predominantly composed of data from Colossal OSCAR, which contributes a significant 53.05% of the total tokens.
+ Following this, Starcoder provides 13.67%, and FineWebEdu (a 350B-token subset) adds 10.24%. The next largest sources are HPLT at 4.21% and French-PD at 3.59%.
+ Other notable contributions include MaCoCu, Legal-ES, and EurLex, each contributing between 1.41% and 1.72%.
+ These major sources collectively form the bulk of the corpus, ensuring a rich and diverse dataset for training the language model.
The remaining 10% comes from smaller sources in various languages.

Feel free to click the expand button below to see the full list of sources.
@@ -431,8 +432,6 @@ To consult the data summary document with the respective licences, please send a

</details>

- The model is being trained for 3 epochs meaning that the total number of tokens seen during pre-training will amount to roughly 9.2 trillion tokens (currently still training).
-
We provide an extense Datasheet section following the best practices defined by [(Gebru et al., 2021)](https://arxiv.org/pdf/1803.09010).

<details>
@@ -442,7 +441,7 @@ We provide an extense Datasheet section following the best practices defined by

**For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled? Please provide a description.**

- The purpose of creating this dataset is to pre-train a family of multilingual models with high performance in a large number of
+ The purpose of creating this dataset is to pre-train the Salamandra family of multilingual models with high performance in a large number of
European languages (35) and code (including 92 different programming languages). In addition, we aim to represent especially the co-official
languages of Spain: Spanish, Catalan, Galician, and Basque. This is the reason why we carry out an oversampling of these languages.

@@ -464,6 +463,7 @@ and public institutions, which can be found in detail in the acknowledgements.
**Who funded the creation of the dataset? If there is an associated grant, please provide the name of the grantor and the grant name and number.**

This work/research has been promoted and financed by the Government of Catalonia through the [Aina project](https://projecteaina.cat/).
+
This work is funded by the _Ministerio para la Transformación Digital y de la Función Pública_ - Funded by EU – NextGenerationEU
within the framework of [ILENIA Project](https://proyectoilenia.es/) with reference 2022/TL22/00215337.

@@ -490,10 +490,10 @@ We provide a complete list of dataset sources at the end of this section.
**How many instances are there in total (of each type, if appropriate)?**

The dataset contains a diverse range of instances across multiple languages, with notable adjustments for certain languages. English
- represents the largest portion, accounting for 39.08% of the total data. Spanish was upsampled by a factor of 2, bringing its share to 16.59%,
- while Catalan (1.84%), Basque (0.26%), and Galician (0.36%) were also upsampled by 2. On the other hand, code-related data was downsampled
- by half, making up 6.42% of the total. Other prominent languages include French (6.59%), Russian (5.39%), German (4.25%), and Hungarian
- (3.93%), with several additional languages contributing between 1% and 2%, and smaller portions represented by a variety of others.
+ represents the largest portion, accounting for 39.31% of the total data. Spanish was upsampled by a factor of 2, bringing its share to 16.12%,
+ while Catalan (1.97%), Basque (0.24%), and Galician (0.31%) were also upsampled by 2. On the other hand, code-related data was downsampled
+ by half, making up 5.78% of the total. Other prominent languages include French (6.6%), Russian (5.56%), German (4.79%), and Hungarian
+ (4.59%), with several additional languages contributing between 1% and 2%, and smaller portions represented by a variety of others.

**Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set? If the dataset is a sample, then what is the larger set? Is the sample representative of the larger set (e.g., geographic coverage)? If so, please describe how this representativeness was validated/verified. If it is not representative of the larger set, please describe why not (e.g., to cover a more diverse range of instances, because instances were withheld or unavailable).**

@@ -628,7 +628,7 @@ and the [Ungoliant](https://github.com/oscar-project/ungoliant) pipeline was use

**Has the dataset been used for any tasks already? If so, please provide a description.**

- Pre-train the ALIA model and the Salamandra model family.
+ Pre-train the Salamandra model family.

**What (other) tasks could the dataset be used for?**

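A note on the sampling arithmetic described in the updated "Pretraining Data" paragraph: the sketch below shows one minimal way such a re-weighting could be computed. It is an illustration under stated assumptions, not the project's actual data pipeline; the weight table simply encodes the rules quoted in the diff (2x for Spain's co-official languages, 0.5x for English and code, 1x for everything else), and the raw token counts are invented placeholders rather than real corpus statistics.

```python
# Minimal sketch (assumption: NOT the project's actual pipeline) of how the
# stated sampling rules could turn raw token counts into an effective mixture:
# 2x for Spain's co-official languages, 0.5x for English and code, 1x otherwise.

SAMPLING_WEIGHTS = {
    "es": 2.0, "ca": 2.0, "gl": 2.0, "eu": 2.0,  # oversampled co-official languages
    "en": 0.5, "code": 0.5,                      # downsampled to half
}  # any source not listed keeps weight 1.0

def effective_mixture(raw_tokens: dict) -> dict:
    """Apply the weights and return each source's percentage of the weighted total."""
    weighted = {src: count * SAMPLING_WEIGHTS.get(src, 1.0) for src, count in raw_tokens.items()}
    total = sum(weighted.values())
    return {src: round(100 * count / total, 2) for src, count in weighted.items()}

# Hypothetical raw token counts in billions of tokens; placeholders for illustration only.
raw_counts = {"en": 1200, "es": 180, "ca": 22, "gl": 3, "eu": 2,
              "code": 260, "fr": 140, "ru": 120, "other": 550}
print(effective_mixture(raw_counts))
```

With these made-up counts, the co-official languages roughly double their share while English and code shrink, which is the qualitative effect the paragraph describes.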