gonzalez-agirre committed • Commit 5110c34 • Parent(s): cd6011a
Update README.md

README.md CHANGED
@@ -146,7 +146,7 @@ In order to fully take advantage of the English Pre-Training of the original Fal
 
 ### Training data
 
-
+The training corpus consists of 26B tokens from several corpora gathered from web crawls and public corpora.
 
 | Dataset | Language | Tokens (per epoch) | Epochs |
 |---------------------|----------|--------------------|--------------|
@@ -164,7 +164,7 @@ Once the model has been successfully initialized, we continue its pre-training i
 | Wikipedia | ca | 228.01M | 3.570361212 |
 | Vilaweb | ca | 50.34M | 2.142216727 |
 
-The resulting dataset has the following language distribution:
+The dataset has the following language distribution:
 
 |Language|%|
 |---|---|
@@ -172,16 +172,11 @@ The resulting dataset has the following language distribution:
 |Es|41.38%|
 |Ca|41.79%|
 
-
-## Training and evaluation data
-
-More information needed
-
-## Training procedure
-
+## Training procedure
+
+The training corpus has been tokenized using a byte-level version of [Byte-Pair Encoding (BPE)](https://github.com/openai/gpt-2), as used in the original [RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model, with a vocabulary size of 50,262 tokens. Once the model had been successfully initialized, we continued its pre-training in the three target languages: Catalan, Spanish, and English. We kept a small amount of English in order to avoid catastrophic forgetting. The training lasted a total of 96 hours on 8 NVIDIA H100 GPUs with 80 GB of memory each.
 
 ### Training hyperparameters
 
 The following hyperparameters were used during training:
@@ -203,9 +198,9 @@ The following hyperparameters were used during training:
 ![Validation Loss](https://huggingface.co/BSC-LT/falcon_7b_CPT_open_data_26B_tokens_balanced_es_ca/resolve/main/images/validation_loss_condor.png)
 ![Accuracy](https://huggingface.co/BSC-LT/falcon_7b_CPT_open_data_26B_tokens_balanced_es_ca/resolve/main/images/accuracy_condor.png)
 
-
-It achieves the following results on the evaluation set:
+### Validation results
+
+It achieves the following results on the validation set:
 - Loss: 2.1504
 - Accuracy: 0.5258
@@ -214,4 +209,21 @@ It achieves the following results on the evaluation set:
 - Transformers 4.30.2
 - Pytorch 2.0.0
 - Datasets 2.13.1
-- Tokenizers 0.13.3
+- Tokenizers 0.13.3
+
+## Additional information
+
+### Author
+Language Technologies Unit at the Barcelona Supercomputing Center ([email protected])
+
+### Contact information
+For further information, send an email to [email protected]
+
+### Copyright
+Copyright (c) 2023 Langtech Unit at Barcelona Supercomputing Center
+
+### Licensing information
+[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
+
+### Funding
+This work was funded by the [Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya](https://politiquesdigitals.gencat.cat/ca/inici/index.html#googtrans(ca|en)) within the framework of [Projecte AINA](https://politiquesdigitals.gencat.cat/ca/economia/catalonia-ai/aina). This work was also partially funded by the [Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA)](https://portal.mineco.gob.es/en-us/digitalizacionIA/Pages/sedia.aspx) within the framework of the Plan-TL.
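In the training-data table above, the "Tokens (per epoch)" and "Epochs" columns combine to give each dataset's effective contribution to the 26B-token corpus. A minimal sketch of that arithmetic, using only the two rows visible in this diff (the remaining rows are not shown here):

```python
# Effective tokens per dataset, assuming effective = tokens-per-epoch * epochs.
# Only the two rows visible in the diff are included; the card states the
# complete corpus totals 26B tokens.
datasets = {
    "Wikipedia (ca)": (228.01e6, 3.570361212),
    "Vilaweb (ca)": (50.34e6, 2.142216727),
}

for name, (tokens_per_epoch, epochs) in datasets.items():
    effective = tokens_per_epoch * epochs
    print(f"{name}: {effective / 1e6:.2f}M effective tokens")
# Wikipedia (ca): ~814.08M, Vilaweb (ca): ~107.84M
```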
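The training-procedure paragraph added in the diff describes a continually pre-trained Falcon checkpoint. A minimal loading sketch, assuming the weights are published under the repo id that appears in the image URLs above (`BSC-LT/falcon_7b_CPT_open_data_26B_tokens_balanced_es_ca`):

```python
# Sketch: load the continually pre-trained checkpoint referenced by this card.
# The repo id is assumed from the image URLs above; adjust if the weights live elsewhere.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BSC-LT/falcon_7b_CPT_open_data_26B_tokens_balanced_es_ca"

tokenizer = AutoTokenizer.from_pretrained(model_id)
print(len(tokenizer))  # compare against the vocabulary size stated in the card

# trust_remote_code=True is typically needed for Falcon checkpoints under
# Transformers 4.30.x, which predates native Falcon support; device_map="auto"
# requires the `accelerate` package.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

prompt = "El Barcelona Supercomputing Center és"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```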
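On the validation results reported above: if the loss is a mean per-token cross-entropy in nats (an assumption, the card does not state the unit), the corresponding perplexity follows directly:

```python
import math

# Assumption: the reported validation loss is mean per-token cross-entropy in nats.
val_loss = 2.1504
print(f"validation perplexity ≈ {math.exp(val_loss):.2f}")  # ≈ 8.59
```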