gonzalez-agirre committed on
Commit 5110c34
1 Parent(s): cd6011a

Update README.md

Files changed (1)
  1. README.md +24 -12
README.md CHANGED
@@ -146,7 +146,7 @@ In order to fully take advantage of the English Pre-Training of the original Fal
 
 ### Training data
 
-Once the model has been successfully initialized, we continue its pre-training in the two target languages: Catalan and Spanish. We also kept a small amount of English in order to avoid catastrophic forgetting. The composition of our 26B token dataset used to train this model is the following:
+The training corpus consists of 26B tokens drawn from several corpora gathered from web crawls and from public corpora.
 
 | Dataset | Language | Tokens (per-epoch) | Epochs |
 |---------------------|----------|--------------------|--------------|
@@ -164,7 +164,7 @@ Once the model has been successfully initialized, we continue its pre-training i
 | Wikipedia | ca | 228.01M | 3.570361212 |
 | Vilaweb | ca | 50.34M | 2.142216727 |
 
-The resulting dataset has the following language distribution:
+The dataset has the following language distribution:
 
 |Language|%|
 |---|---|
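
An editorial aside on reading the dataset table above: the "Epochs" column is fractional, so a source's actual contribution to the training mix is its per-epoch token count multiplied by its epoch count. A minimal sketch of that arithmetic, using only the two rows visible in these hunks (the full table lives in README.md) and assuming the ~26B figure is the sum of this product over all rows:

```python
# Illustrative only: effective token contribution implied by the
# "Tokens (per-epoch)" and "Epochs" columns of the dataset table.
rows = [
    # (dataset, language, tokens per epoch, epochs) -- the two rows shown above
    ("Wikipedia", "ca", 228.01e6, 3.570361212),
    ("Vilaweb", "ca", 50.34e6, 2.142216727),
]

for name, lang, tokens_per_epoch, epochs in rows:
    effective = tokens_per_epoch * epochs
    print(f"{name} ({lang}): {effective / 1e6:.2f}M effective tokens")

# Summing this product over the complete table should approximate the
# ~26B-token training budget described in the README.
```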
@@ -172,16 +172,11 @@ The resulting dataset has the following language distribution:
 |Es|41.38%|
 |Ca|41.79%|
 
-
-## Training and evaluation data
-
-More information needed
-
-## Training procedure
-
+## Training procedure
+
+The training corpus has been tokenized using a byte-level version of [Byte-Pair Encoding (BPE)](https://github.com/openai/gpt-2), as used in the original [RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model, with a vocabulary size of 50,262 tokens. Once the model had been successfully initialized, we continued its pre-training in the three target languages: Catalan, Spanish, and English. We kept a small amount of English in order to avoid catastrophic forgetting. Training lasted a total of 96 hours on 8 NVIDIA H100 GPUs with 80GB of RAM each.
+
 ### Training hyperparameters
 
 The following hyperparameters were used during training:
@@ -203,9 +198,9 @@ The following hyperparameters were used during training:
 ![Validation Loss](https://huggingface.co/BSC-LT/falcon_7b_CPT_open_data_26B_tokens_balanced_es_ca/resolve/main/images/validation_loss_condor.png)
 ![Accuracy](https://huggingface.co/BSC-LT/falcon_7b_CPT_open_data_26B_tokens_balanced_es_ca/resolve/main/images/accuracy_condor.png)
 
-## Eval results
+### Validation results
 
-It achieves the following results on the evaluation set:
+It achieves the following results on the validation set:
 - Loss: 2.1504
 - Accuracy: 0.5258
 
@@ -214,4 +209,21 @@ It achieves the following results on the evaluation set:
 - Transformers 4.30.2
 - Pytorch 2.0.0
 - Datasets 2.13.1
-- Tokenizers 0.13.3
+- Tokenizers 0.13.3
+
+## Additional information
+
+### Author
+Language Technologies Unit at the Barcelona Supercomputing Center ([email protected])
+
+### Contact information
+For further information, send an email to [email protected]
+
+### Copyright
+Copyright (c) 2023 Langtech Unit at Barcelona Supercomputing Center
+
+### Licensing information
+[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
+
+### Funding
+This work was funded by the [Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya](https://politiquesdigitals.gencat.cat/ca/inici/index.html#googtrans(ca|en)) within the framework of [Projecte AINA](https://politiquesdigitals.gencat.cat/ca/economia/catalonia-ai/aina). This work was also partially funded by the [Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA)](https://portal.mineco.gob.es/en-us/digitalizacionIA/Pages/sedia.aspx) within the framework of the Plan-TL.
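
Stepping back from the diff itself: a minimal sketch of loading the resulting checkpoint under the framework versions pinned above. The repository id is inferred from the image URLs in this README, and the `trust_remote_code` flag is an assumption about how the Falcon architecture is packaged for transformers 4.30.x rather than a documented requirement:

```python
# Environment roughly matching the "Framework versions" list above, e.g.:
#   pip install transformers==4.30.2 torch==2.0.0 datasets==2.13.1 tokenizers==0.13.3
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repository id inferred from the image URLs in this README.
model_id = "BSC-LT/falcon_7b_CPT_open_data_26B_tokens_balanced_es_ca"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Falcon checkpoints of this era typically ship custom modeling code,
# hence trust_remote_code=True (assumption, not stated in the card).
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

prompt = "El mercat del barri és"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```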