mapama247 committed
Commit
0f53b7c
1 Parent(s): d30688b

Update README.md

Files changed (1)
  1. README.md +12 -12
README.md CHANGED
@@ -91,7 +91,7 @@ pipeline_tag: text-generation
 
 ## Model description
 
-The **Ǎguila-7B** is a transformer-based causal language model for Catalan, Spanish, and English.
+**Ǎguila-7B** is a transformer-based causal language model for Catalan, Spanish, and English.
 It is based on the [Falcon-7B](https://huggingface.co/tiiuae/falcon-7b) model and has been trained on a 26B token
 trilingual corpus collected from publicly available corpora and crawlers.
 
@@ -99,7 +99,7 @@ trilingual corpus collected from publicly available corpora and crawlers.
 ## Intended uses and limitations
 
 The **Ǎguila-7B** model is ready-to-use only for causal language modeling to perform text-generation tasks.
-However, it is intended to be fine-tuned on a generative downstream task.
+However, it is intended to be fine-tuned for downstream tasks.
 
 ## How to use
 
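The "How to use" section itself is unchanged by this commit and not shown in the diff. For orientation only, a minimal text-generation call could look like the sketch below; the Hub repo id and the `trust_remote_code` flag are assumptions, not taken from this diff.

```python
# Minimal text-generation sketch (illustrative; repo id is an assumption, not from this diff).
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "projecte-aina/aguila-7b"  # assumed Hub repo id for Ǎguila-7B

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,  # Falcon-derived checkpoints may ship custom modeling code
)

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(generator("El mercat del barri és", max_new_tokens=25)[0]["generated_text"])
```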
@@ -141,15 +141,15 @@ on multiple web sources. We intend to conduct research in these areas in the fut
 
 ## Language adaptation
 
-We adapted the original Falcon-7B model to Spanish and Catalan by swapping the tokenizer and adjusting the embedding layer.
+We adapted the original [Falcon-7B](https://huggingface.co/tiiuae/falcon-7b) model to Spanish and Catalan by swapping the tokenizer and adjusting the embedding layer.
 
-The adaptation procedure is explained in this [blog](https://medium.com/@mpamies247/ee1ebc70bc79).
+The adaptation procedure is explained in [this blog post](https://medium.com/@mpamies247/ee1ebc70bc79).
 
 ## Training
 
 ### Training data
 
-The training corpus consists 26B tokens of several corpora gathered from web crawlings and public domain data.
+The training corpus consists of 26B tokens of several corpora gathered from web crawlings and public domain data.
 
 | Dataset | Language | Tokens (per-epoch) | Epochs |
 |---------------------|----------|--------------------|--------------|
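The adaptation mentioned in this hunk (tokenizer swap plus embedding adjustment) is documented in the linked blog post. As a rough sketch only, and not the authors' actual pipeline, the operation could be approximated as follows; the tokenizer and output paths are placeholders.

```python
# Rough sketch of a tokenizer swap + embedding resize (placeholder paths; not the authors' exact procedure).
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b", trust_remote_code=True)
new_tokenizer = AutoTokenizer.from_pretrained("./trilingual-bpe-tokenizer")  # hypothetical local tokenizer

# Grow/shrink the input (and tied output) embeddings to the new vocabulary size.
# Rows for tokens unseen by the original model start untrained and are learned
# during the continued pre-training described in the Training section.
base_model.resize_token_embeddings(len(new_tokenizer))

base_model.save_pretrained("./falcon-7b-ca-es-adapted")
new_tokenizer.save_pretrained("./falcon-7b-ca-es-adapted")
```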
@@ -170,10 +170,10 @@ The training corpus consists 26B tokens of several corpora gathered from web cra
 The dataset has the following language distribution:
 
 |Language|Percentage|
-|---|---|
-|En|16.84%|
-|Es|41.38%|
-|Ca|41.79%|
+|--------|----------|
+| En | 16.84% |
+| Es | 41.38% |
+| Ca | 41.79% |
 
 Note: A small amount of English data was kept to avoid catastrophic forgetting.
 
@@ -181,7 +181,7 @@ Note: A small amount of English data was kept to avoid catastrophic forgetting.
 
 The training corpus has been tokenized using a byte version of [Byte-Pair Encoding (BPE)](https://github.com/openai/gpt-2) used
 in the original [RoBERTA](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model with a vocabulary size of 50,257 tokens.
-After training a new tokenizer and adapting falcon-7b's embedding layer, we continued its pre-training in three target languages: Catalan, Spanish, and English.
+After training a new tokenizer and adapting [falcon-7b](https://huggingface.co/tiiuae/falcon-7b)'s embedding layer, we continued its pre-training in three target languages: Catalan, Spanish, and English.
 The training lasted a total of 320 hours on 8 NVIDIA H100 GPUs with 80GB RAM.
 
 
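Given the GPT-2/RoBERTa-style byte-level BPE and the 50,257-token vocabulary mentioned in this hunk, training such a tokenizer with the Hugging Face `tokenizers` library might look like the sketch below; the corpus file names are placeholders and this is not the project's actual script.

```python
# Sketch: byte-level BPE tokenizer with a 50,257-token vocabulary (placeholder file names).
import os
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["ca_corpus.txt", "es_corpus.txt", "en_corpus.txt"],  # placeholder corpus shards
    vocab_size=50257,
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)

os.makedirs("trilingual-bpe-tokenizer", exist_ok=True)
tokenizer.save_model("trilingual-bpe-tokenizer")  # writes vocab.json and merges.txt
```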
@@ -191,9 +191,9 @@ The training lasted a total of 320 hours on 8 NVIDIA H100 GPUs with 80GB RAM.
 - distributed_type: multi-GPU
 - num_devices: 8
 - train_batch_size: 1
-- eval_batch_size: 1
+- eval_batch_size: 1
 - total_train_batch_size: 8
-- total_eval_batch_size: 8
+- total_eval_batch_size: 8
 - optimizer: Adam
 - betas: (0.9,0.999)
 - epsilon: 1e-08
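For reference, the optimizer settings listed in this hunk map directly onto a standard PyTorch Adam configuration. The snippet below is illustrative only: the module is a stand-in, the learning rate is not shown in this hunk and is therefore omitted, and the total batch size of 8 follows from 8 devices with a per-device batch size of 1.

```python
# Illustrative mapping of the listed settings to PyTorch (stand-in module; not the training script).
import torch

model = torch.nn.Linear(4544, 4544)  # stand-in module; 4544 is Falcon-7B's hidden size
optimizer = torch.optim.Adam(
    model.parameters(),
    betas=(0.9, 0.999),  # betas: (0.9,0.999)
    eps=1e-08,           # epsilon: 1e-08
)
# Effective (total) train batch size: 8 devices x per-device batch size 1 = 8 sequences per step.
```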
 