Update README.md
README.md CHANGED
@@ -175,12 +175,13 @@ The dataset has the following language distribution:
 |Es|41.38%|
 |Ca|41.79%|
 
+Note: We kept a small amount of English data in order to avoid catastrophic forgetting.
+
 ## Training procedure
 
 The training corpus has been tokenized using a byte version of [Byte-Pair Encoding (BPE)](https://github.com/openai/gpt-2) used
 in the original [RoBERTA](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model with a vocabulary size of 50,257 tokens.
-
-We kept a small amount of English data in order to avoid catastrophic forgetting.
+After training a new tokenizer and adapting falcon-7b's embedding layer, we continued its pre-training in three target languages: Catalan, Spanish, and English.
 The training lasted a total of 320 hours on 8 NVIDIA H100 GPUs with 80GB RAM.
 
 
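As a rough illustration of the training-procedure change in the hunk above, here is a minimal sketch (not the authors' code) of training a byte-level BPE tokenizer with a 50,257-token vocabulary and adapting falcon-7b's embedding layer before continued pre-training, using the Hugging Face `tokenizers` and `transformers` libraries. The corpus file name, the `tiiuae/falcon-7b` checkpoint id, and the use of `resize_token_embeddings` are assumptions; the card does not state how the embedding layer was actually adapted.

```python
# Minimal sketch, not the authors' pipeline: train a byte-level BPE tokenizer
# (GPT-2/RoBERTa style) with a 50,257-token vocabulary, then resize the
# falcon-7b embedding layer to the new vocabulary before continued pre-training.
import os

from tokenizers import ByteLevelBPETokenizer
from transformers import AutoModelForCausalLM

# 1) Train the tokenizer on the Catalan/Spanish/English corpus
#    ("corpus_ca_es_en.txt" is a hypothetical file name).
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus_ca_es_en.txt"],
    vocab_size=50_257,                  # vocabulary size stated in the card
    special_tokens=["<|endoftext|>"],
)
os.makedirs("new_tokenizer", exist_ok=True)
tokenizer.save_model("new_tokenizer")   # writes vocab.json and merges.txt

# 2) Load the base checkpoint and resize its input/output embeddings to the
#    new vocabulary (one simple adaptation; the card does not specify how the
#    new embedding vectors were initialized).
model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b")
model.resize_token_embeddings(50_257)
model.save_pretrained("falcon-7b-adapted")

# Continued pre-training would then run a standard causal-LM loop (e.g. with
# transformers.Trainer) over the mixed Catalan/Spanish/English corpus.
```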
@@ -217,7 +218,7 @@ The Language Technologies Unit from Barcelona Supercomputing Center.
 For further information, please send an email to <[email protected]>.
 
 ### Copyright
-Copyright (c) 2023 Langtech Unit at Barcelona Supercomputing Center.
+Copyright (c) 2023 by Language Technologies Unit at Barcelona Supercomputing Center.
 
 ### License
 [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
@@ -225,7 +226,7 @@ Copyright (c) 2023 Langtech Unit at Barcelona Supercomputing Center.
 ### Funding
 This work was partially funded by:
 - The [Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya](https://politiquesdigitals.gencat.cat/ca/inici/index.html#googtrans(ca|en) within the framework of [Projecte AINA](https://politiquesdigitals.gencat.cat/ca/economia/catalonia-ai/aina).
-- The [Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA)](https://portal.mineco.gob.es/en-us/digitalizacionIA/Pages/sedia.aspx) within the framework of the [Plan
+- The [Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA)](https://portal.mineco.gob.es/en-us/digitalizacionIA/Pages/sedia.aspx) within the framework of the [Plan de Impulso de las Tecnologías del Lenguaje](https://plantl.mineco.gob.es/Paginas/index.aspx).
 
 ### Disclaimer
 