gonzalez-agirre committed
Commit 7aecd66
Parent(s): 698101b
Update README.md
README.md CHANGED
@@ -148,22 +148,6 @@ At the time of submission, no measures have been taken to estimate the bias and

We adapted the original Falcon-7B model to Spanish and Catalan by swapping the tokenizer and adjusting the embedding layer. The adaptation procedure is explained in this [blog](https://medium.com/@mpamies247/ee1ebc70bc79).

-### New vocabulary
-We trained a new BPE tokenizer for Catalan and Spanish (with equal representation of both) and mixed a small amount of English into the training data, since English is present in the original model's training data.
-The resulting data has the following language distribution:
-
-|Language|%|
-|---|---|
-|En|16.84%|
-|Es|41.38%|
-|Ca|41.79%|
-
-This drastically reduced the number of tokens required to tokenize text in the target languages, while English tokenization shows only a small increase.
-
-### Embedding Layer Initialization
-To take full advantage of the English pre-training of the original Falcon model, we re-use the embedding weights of the original model for the tokens shared between the two tokenizers (the new and the old one). The remaining embedding weights are initialized to the mean of the original embedding matrix.
-
-
## Training

### Training data
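The removed "New vocabulary" section above describes training a new BPE tokenizer on a Catalan/Spanish mix with a little English. As a rough illustration only, a tokenizer of this kind could be trained with the Hugging Face `tokenizers` library along the following lines; the corpus file names, vocabulary size, and special tokens are assumptions, not values taken from this commit.

```python
# Minimal sketch, not the authors' actual training script.
# Corpus files, vocab size, and special tokens are illustrative assumptions.
import os
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["ca_corpus.txt", "es_corpus.txt", "en_corpus.txt"],  # Catalan/Spanish in equal parts plus some English
    vocab_size=65024,       # assumed here to mirror the original Falcon vocabulary size
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)

os.makedirs("adapted_tokenizer", exist_ok=True)
tokenizer.save_model("adapted_tokenizer")
```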
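The removed section also claims that the new vocabulary drastically reduces the token count for Catalan and Spanish text while leaving English roughly unchanged. A quick way to sanity-check such a claim is to compare tokenized lengths under both tokenizers; the adapted-model path below is a placeholder, not a real repository name.

```python
# Compare tokenizer fertility (tokens per sentence) before and after the vocabulary swap.
# "path/to/adapted-model" is a placeholder.
from transformers import AutoTokenizer

old_tok = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
new_tok = AutoTokenizer.from_pretrained("path/to/adapted-model")

samples = {
    "ca": "El nou tokenitzador hauria de necessitar menys peces per al català.",
    "es": "El nuevo tokenizador debería necesitar menos piezas para el español.",
    "en": "English sentences should tokenize to roughly the same number of pieces.",
}

for lang, text in samples.items():
    old_len = len(old_tok(text)["input_ids"])
    new_len = len(new_tok(text)["input_ids"])
    print(f"{lang}: {old_len} tokens -> {new_len} tokens")
```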
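Finally, the removed "Embedding Layer Initialization" section describes keeping the original embedding rows for tokens shared by the old and new vocabularies and initializing all other rows to the mean embedding. A hedged PyTorch sketch of that idea (the function and variable names are illustrative, not from the commit):

```python
# Sketch of the embedding re-initialization described in the removed section:
# copy rows for tokens present in both vocabularies, use the mean embedding for new tokens.
import torch

def reinit_embeddings(old_embedding: torch.Tensor, old_vocab: dict, new_vocab: dict) -> torch.Tensor:
    """old_embedding: (old_vocab_size, dim); *_vocab: token -> id mappings."""
    mean_vector = old_embedding.mean(dim=0)                # average of all original rows
    new_embedding = mean_vector.repeat(len(new_vocab), 1)  # (new_vocab_size, dim), every row = mean
    shared = 0
    for token, new_id in new_vocab.items():
        old_id = old_vocab.get(token)
        if old_id is not None:                             # token exists in both vocabularies
            new_embedding[new_id] = old_embedding[old_id]
            shared += 1
    print(f"copied {shared} shared rows, mean-initialized {len(new_vocab) - shared} new rows")
    return new_embedding
```

In practice the resulting matrix would replace the model's input embeddings (and any tied output embeddings) after resizing them to the new vocabulary, e.g. with `model.resize_token_embeddings(len(new_vocab))` from `transformers`.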