Update README.md
README.md CHANGED
@@ -91,7 +91,7 @@ pipeline_tag: text-generation

 ## Model description

-
+**Ǎguila-7B** is a transformer-based causal language model for Catalan, Spanish, and English.
 It is based on the [Falcon-7B](https://huggingface.co/tiiuae/falcon-7b) model and has been trained on a 26B token
 trilingual corpus collected from publicly available corpora and crawlers.

@@ -99,7 +99,7 @@ trilingual corpus collected from publicly available corpora and crawlers.
 ## Intended uses and limitations

 The **Ǎguila-7B** model is ready-to-use only for causal language modeling to perform text-generation tasks.
-However, it is intended to be fine-tuned
+However, it is intended to be fine-tuned for downstream tasks.

 ## How to use

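The body of the `## How to use` section falls outside the hunk context shown here. As a rough illustration of the intended text-generation use, a minimal sketch with the Hugging Face transformers pipeline follows; the repository id `projecte-aina/aguila-7b`, the bfloat16 dtype, and the sampling settings are assumptions, not values taken from this diff.

```python
# Minimal text-generation sketch for an adapted Falcon-style causal LM.
# NOTE: the checkpoint id below is an assumption; replace it with the
# actual Ǎguila-7B repository id if it differs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "projecte-aina/aguila-7b"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # assumption: half precision to fit a 7B model on one GPU
    device_map="auto",
    trust_remote_code=True,       # Falcon-based checkpoints may ship custom modeling code
)

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
output = generator(
    "El mercat del barri és",     # Catalan prompt; Spanish and English work as well
    max_new_tokens=50,
    do_sample=True,
    top_k=10,
)
print(output[0]["generated_text"])
```

Since the card states the model is intended to be fine-tuned for downstream tasks, raw generations should be treated as a starting point rather than task-ready output.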
@@ -141,15 +141,15 @@ on multiple web sources. We intend to conduct research in these areas in the fut

 ## Language adaptation

-We adapted the original Falcon-7B model to Spanish and Catalan by swapping the tokenizer and adjusting the embedding layer.
+We adapted the original [Falcon-7B](https://huggingface.co/tiiuae/falcon-7b) model to Spanish and Catalan by swapping the tokenizer and adjusting the embedding layer.

-The adaptation procedure is explained in this
+The adaptation procedure is explained in [this blog post](https://medium.com/@mpamies247/ee1ebc70bc79).

 ## Training

 ### Training data

-The training corpus consists 26B tokens of several corpora gathered from web crawlings and public domain data.
+The training corpus consists of 26B tokens of several corpora gathered from web crawlings and public domain data.

 | Dataset | Language | Tokens (per-epoch) | Epochs |
 |---------------------|----------|--------------------|--------------|
@@ -170,10 +170,10 @@ The training corpus consists 26B tokens of several corpora gathered from web cra
 The dataset has the following language distribution:

 |Language|Percentage|
-
-|En|16.84
-|Es|41.38
-|Ca|41.79
+|--------|----------|
+| En | 16.84% |
+| Es | 41.38% |
+| Ca | 41.79% |

 Note: A small amount of English data was kept to avoid catastrophic forgetting.

@@ -181,7 +181,7 @@ Note: A small amount of English data was kept to avoid catastrophic forgetting.

 The training corpus has been tokenized using a byte version of [Byte-Pair Encoding (BPE)](https://github.com/openai/gpt-2) used
 in the original [RoBERTA](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model with a vocabulary size of 50,257 tokens.
-After training a new tokenizer and adapting falcon-7b's embedding layer, we continued its pre-training in three target languages: Catalan, Spanish, and English.
+After training a new tokenizer and adapting [falcon-7b](https://huggingface.co/tiiuae/falcon-7b)'s embedding layer, we continued its pre-training in three target languages: Catalan, Spanish, and English.
 The training lasted a total of 320 hours on 8 NVIDIA H100 GPUs with 80GB RAM.

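The adaptation summarized above (training a new BPE tokenizer and adjusting Falcon-7B's embedding layer before continued pre-training) can be sketched with the standard transformers API. This is only an illustration: the local tokenizer path is hypothetical, and the copy-shared-tokens initialization is a common heuristic, not necessarily the procedure described in the linked blog post.

```python
# Illustrative sketch of a tokenizer swap plus embedding-layer adjustment.
# The tokenizer path is hypothetical and the initialization heuristic is an
# assumption; the actual Ǎguila-7B procedure is described in the blog post
# linked in the "Language adaptation" section.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b", trust_remote_code=True)
old_tok = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
new_tok = AutoTokenizer.from_pretrained("./trilingual-bpe-tokenizer")  # hypothetical path, 50,257-token vocab

# Keep the original embedding matrix around before resizing.
old_emb = base.get_input_embeddings().weight.data.clone()
old_vocab = old_tok.get_vocab()

# Resize the token embeddings (and the LM head, if untied) to the new vocabulary size.
base.resize_token_embeddings(len(new_tok))
new_emb = base.get_input_embeddings().weight.data

# Heuristic initialization: reuse the old vector for any token string present
# in both vocabularies; unmatched tokens keep whatever rows
# resize_token_embeddings left in place.
with torch.no_grad():
    for token, new_id in new_tok.get_vocab().items():
        old_id = old_vocab.get(token)
        if old_id is not None and old_id < old_emb.shape[0]:
            new_emb[new_id] = old_emb[old_id]

base.save_pretrained("./falcon-7b-ca-es-en")  # starting point for continued pre-training
new_tok.save_pretrained("./falcon-7b-ca-es-en")
```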
@@ -191,9 +191,9 @@ The training lasted a total of 320 hours on 8 NVIDIA H100 GPUs with 80GB RAM.
 - distributed_type: multi-GPU
 - num_devices: 8
 - train_batch_size: 1
-- eval_batch_size:
+- eval_batch_size: 1
 - total_train_batch_size: 8
-- total_eval_batch_size:
+- total_eval_batch_size: 8
 - optimizer: Adam
 - betas: (0.9,0.999)
 - epsilon: 1e-08
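For readers reconstructing the setup, the corrected hyperparameters above map onto a standard transformers multi-GPU run roughly as follows. The output path and the bf16 flag are assumptions, and the actual training stack used for Ǎguila-7B is not specified in this diff.

```python
# Rough mapping of the listed hyperparameters onto transformers TrainingArguments.
# The output directory and bf16 flag are assumptions; with 8 devices and no
# gradient accumulation, the effective batch sizes work out to 1 * 8 = 8,
# matching total_train_batch_size and total_eval_batch_size.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./aguila-7b-checkpoints",  # hypothetical path
    per_device_train_batch_size=1,         # train_batch_size: 1
    per_device_eval_batch_size=1,          # eval_batch_size: 1
    adam_beta1=0.9,                        # betas: (0.9, 0.999)
    adam_beta2=0.999,
    adam_epsilon=1e-8,                     # epsilon: 1e-08
    bf16=True,                             # assumption: mixed precision on H100 GPUs
)
```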