Update README.md
README.md
```diff
@@ -179,9 +179,9 @@ Note: A small amount of English data was kept to avoid catastrophic forgetting.
 
 ## Training procedure
 
-The training corpus has been tokenized using a byte version of [Byte-Pair Encoding (BPE)](https://github.com/openai/gpt-2)
-
-
+The training corpus has been tokenized using a byte version of [Byte-Pair Encoding (BPE)](https://github.com/openai/gpt-2) with a vocabulary size of 50,257 tokens.
+After training a new tokenizer and adapting [falcon-7b](https://huggingface.co/tiiuae/falcon-7b)'s embedding layer, the model was
+further pre-trained in three target languages: Catalan, Spanish and English.
 The training lasted a total of 320 hours on 8 NVIDIA H100 GPUs with 80GB RAM.
 
 
```
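For the tokenization step the diff describes, a minimal sketch of training a byte-level BPE tokenizer with the stated 50,257-token vocabulary, using the Hugging Face `tokenizers` library, could look like this. The corpus file names, `min_frequency`, and the special token are assumptions for illustration, not details from the model card:

```python
from tokenizers import ByteLevelBPETokenizer

# Byte-level BPE, as in GPT-2; vocabulary size matches the card (50,257).
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus_ca.txt", "corpus_es.txt", "corpus_en.txt"],  # hypothetical corpus shards
    vocab_size=50257,
    min_frequency=2,                    # assumption: not stated in the card
    special_tokens=["<|endoftext|>"],   # assumption: GPT-2-style end-of-text token
)
tokenizer.save_model("new_tokenizer")   # writes vocab.json and merges.txt
```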
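Adapting falcon-7b's embedding layer to the new vocabulary might then look like the sketch below, using `transformers`. The card does not describe the exact adaptation strategy, so this only resizes the embedding matrix; in practice the rows would also need to be remapped or re-initialized, since token IDs from the new tokenizer no longer line up with Falcon's original vocabulary:

```python
from transformers import AutoModelForCausalLM, GPT2Tokenizer

model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b")
# GPT2Tokenizer loads the byte-level BPE vocab.json/merges.txt saved above.
new_tokenizer = GPT2Tokenizer.from_pretrained("new_tokenizer")

# Resize the input embeddings (and the tied LM head, if any) from Falcon's
# original vocabulary to the new 50,257-token vocabulary. Surviving rows keep
# their old positions, so tokens whose IDs changed must still be remapped or
# re-initialized before continued pre-training (not shown; not in the card).
model.resize_token_embeddings(len(new_tokenizer))
```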
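The continued pre-training itself is standard causal language modelling; a heavily simplified sketch with `Trainer` follows, reusing `model` and `new_tokenizer` from the snippets above. The dataset path, sequence length, batch size, and other hyperparameters are placeholders, since the card only reports the hardware (8 H100 80GB GPUs) and the total wall time (320 hours):

```python
from datasets import load_dataset
from transformers import (DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Hypothetical text corpus; the real Catalan/Spanish/English mixture used for
# this model is not distributed with the card.
dataset = load_dataset("text", data_files={"train": "corpus_*.txt"})["train"]
dataset = dataset.map(
    lambda batch: new_tokenizer(batch["text"], truncation=True, max_length=2048),
    batched=True,
    remove_columns=["text"],
)

# GPT-2-style tokenizers have no pad token; reuse EOS for batch padding.
new_tokenizer.pad_token = new_tokenizer.eos_token

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="falcon-7b-ca-es-en",  # hypothetical name
        per_device_train_batch_size=1,    # assumption; card gives no batch size
        gradient_accumulation_steps=16,   # assumption
        bf16=True,                        # H100s support bfloat16
        num_train_epochs=1,               # assumption
    ),
    train_dataset=dataset,
    # mlm=False gives the plain next-token (causal LM) objective.
    data_collator=DataCollatorForLanguageModeling(new_tokenizer, mlm=False),
)
trainer.train()
```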