Update README.md
README.md CHANGED
@@ -175,12 +175,13 @@ The dataset has the following language distribution:
 |Es|41.38%|
 |Ca|41.79%|
 
+Note: We kept a small amount of English data in order to avoid catastrophic forgetting.
+
 ## Training procedure
 
 The training corpus has been tokenized using a byte version of [Byte-Pair Encoding (BPE)](https://github.com/openai/gpt-2) used
 in the original [RoBERTA](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model with a vocabulary size of 50,257 tokens.
-
-We kept a small amount of English data in order to avoid catastrophic forgetting.
+After training a new tokenizer and adapting falcon-7b's embedding layer, we continued its pre-training in three target languages: Catalan, Spanish, and English.
 The training lasted a total of 320 hours on 8 NVIDIA H100 GPUs with 80GB RAM.
 
 
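As a rough illustration of the training-procedure change in the hunk above, here is a minimal sketch (not the authors' code) of training a byte-level BPE tokenizer with a 50,257-token vocabulary and adapting falcon-7b's embedding layer before continued pre-training, using the Hugging Face `tokenizers` and `transformers` libraries. The corpus file name, the `tiiuae/falcon-7b` checkpoint id, and the use of `resize_token_embeddings` are assumptions; the card does not state how the embedding layer was actually adapted.

```python
# Minimal sketch, not the authors' pipeline: train a byte-level BPE tokenizer
# (GPT-2/RoBERTa style) with a 50,257-token vocabulary, then resize the
# falcon-7b embedding layer to the new vocabulary before continued pre-training.
import os

from tokenizers import ByteLevelBPETokenizer
from transformers import AutoModelForCausalLM

# 1) Train the tokenizer on the Catalan/Spanish/English corpus
#    ("corpus_ca_es_en.txt" is a hypothetical file name).
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus_ca_es_en.txt"],
    vocab_size=50_257,                  # vocabulary size stated in the card
    special_tokens=["<|endoftext|>"],
)
os.makedirs("new_tokenizer", exist_ok=True)
tokenizer.save_model("new_tokenizer")   # writes vocab.json and merges.txt

# 2) Load the base checkpoint and resize its input/output embeddings to the
#    new vocabulary (one simple adaptation; the card does not specify how the
#    new embedding vectors were initialized).
model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b")
model.resize_token_embeddings(50_257)
model.save_pretrained("falcon-7b-adapted")

# Continued pre-training would then run a standard causal-LM loop (e.g. with
# transformers.Trainer) over the mixed Catalan/Spanish/English corpus.
```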
@@ -217,7 +218,7 @@ The Language Technologies Unit from Barcelona Supercomputing Center.
 For further information, please send an email to <[email protected]>.
 
 ### Copyright
-Copyright (c) 2023 Langtech Unit at Barcelona Supercomputing Center.
+Copyright (c) 2023 by Language Technologies Unit at Barcelona Supercomputing Center.
 
 ### License
 [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
@@ -225,7 +226,7 @@ Copyright (c) 2023 Langtech Unit at Barcelona Supercomputing Center.
 ### Funding
 This work was partially funded by:
 - The [Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya](https://politiquesdigitals.gencat.cat/ca/inici/index.html#googtrans(ca|en) within the framework of [Projecte AINA](https://politiquesdigitals.gencat.cat/ca/economia/catalonia-ai/aina).
-- The [Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA)](https://portal.mineco.gob.es/en-us/digitalizacionIA/Pages/sedia.aspx) within the framework of the [Plan
+- The [Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA)](https://portal.mineco.gob.es/en-us/digitalizacionIA/Pages/sedia.aspx) within the framework of the [Plan de Impulso de las Tecnologías del Lenguaje](https://plantl.mineco.gob.es/Paginas/index.aspx).
 
 ### Disclaimer
 