Update README.md
README.md CHANGED
@@ -91,7 +91,7 @@ pipeline_tag: text-generation

 ## Model description

-
+**Ǎguila-7B** is a transformer-based causal language model for Catalan, Spanish, and English.
 It is based on the [Falcon-7B](https://huggingface.co/tiiuae/falcon-7b) model and has been trained on a 26B token
 trilingual corpus collected from publicly available corpora and crawlers.

@@ -99,7 +99,7 @@ trilingual corpus collected from publicly available corpora and crawlers.
 ## Intended uses and limitations

 The **Ǎguila-7B** model is ready-to-use only for causal language modeling to perform text-generation tasks.
-However, it is intended to be fine-tuned
+However, it is intended to be fine-tuned for downstream tasks.

 ## How to use

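The body of the `## How to use` section falls outside the hunk context shown here. As a rough illustration of the intended text-generation use, a minimal sketch with the Hugging Face transformers pipeline follows; the repository id `projecte-aina/aguila-7b`, the bfloat16 dtype, and the sampling settings are assumptions, not values taken from this diff.

```python
# Minimal text-generation sketch for an adapted Falcon-style causal LM.
# NOTE: the checkpoint id below is an assumption; replace it with the
# actual Ǎguila-7B repository id if it differs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "projecte-aina/aguila-7b"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # assumption: half precision to fit a 7B model on one GPU
    device_map="auto",
    trust_remote_code=True,       # Falcon-based checkpoints may ship custom modeling code
)

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
output = generator(
    "El mercat del barri és",     # Catalan prompt; Spanish and English work as well
    max_new_tokens=50,
    do_sample=True,
    top_k=10,
)
print(output[0]["generated_text"])
```

Since the card states the model is intended to be fine-tuned for downstream tasks, raw generations should be treated as a starting point rather than task-ready output.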
@@ -141,15 +141,15 @@ on multiple web sources. We intend to conduct research in these areas in the fut

 ## Language adaptation

-We adapted the original Falcon-7B model to Spanish and Catalan by swapping the tokenizer and adjusting the embedding layer.
+We adapted the original [Falcon-7B](https://huggingface.co/tiiuae/falcon-7b) model to Spanish and Catalan by swapping the tokenizer and adjusting the embedding layer.

-The adaptation procedure is explained in this
+The adaptation procedure is explained in [this blog post](https://medium.com/@mpamies247/ee1ebc70bc79).

 ## Training

 ### Training data

-The training corpus consists 26B tokens of several corpora gathered from web crawlings and public domain data.
+The training corpus consists of 26B tokens of several corpora gathered from web crawlings and public domain data.

 | Dataset | Language | Tokens (per-epoch) | Epochs |
 |---------------------|----------|--------------------|--------------|
@@ -170,10 +170,10 @@ The training corpus consists 26B tokens of several corpora gathered from web cra
 The dataset has the following language distribution:

 |Language|Percentage|
-
-|En|16.84
-|Es|41.38
-|Ca|41.79
+|--------|----------|
+| En | 16.84% |
+| Es | 41.38% |
+| Ca | 41.79% |

 Note: A small amount of English data was kept to avoid catastrophic forgetting.

@@ -181,7 +181,7 @@ Note: A small amount of English data was kept to avoid catastrophic forgetting.

 The training corpus has been tokenized using a byte version of [Byte-Pair Encoding (BPE)](https://github.com/openai/gpt-2) used
 in the original [RoBERTA](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model with a vocabulary size of 50,257 tokens.
-After training a new tokenizer and adapting falcon-7b's embedding layer, we continued its pre-training in three target languages: Catalan, Spanish, and English.
+After training a new tokenizer and adapting [falcon-7b](https://huggingface.co/tiiuae/falcon-7b)'s embedding layer, we continued its pre-training in three target languages: Catalan, Spanish, and English.
 The training lasted a total of 320 hours on 8 NVIDIA H100 GPUs with 80GB RAM.

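The adaptation summarized above (training a new BPE tokenizer and adjusting Falcon-7B's embedding layer before continued pre-training) can be sketched with the standard transformers API. This is only an illustration: the local tokenizer path is hypothetical, and the copy-shared-tokens initialization is a common heuristic, not necessarily the procedure described in the linked blog post.

```python
# Illustrative sketch of a tokenizer swap plus embedding-layer adjustment.
# The tokenizer path is hypothetical and the initialization heuristic is an
# assumption; the actual Ǎguila-7B procedure is described in the blog post
# linked in the "Language adaptation" section.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b", trust_remote_code=True)
old_tok = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
new_tok = AutoTokenizer.from_pretrained("./trilingual-bpe-tokenizer")  # hypothetical path, 50,257-token vocab

# Keep the original embedding matrix around before resizing.
old_emb = base.get_input_embeddings().weight.data.clone()
old_vocab = old_tok.get_vocab()

# Resize the token embeddings (and the LM head, if untied) to the new vocabulary size.
base.resize_token_embeddings(len(new_tok))
new_emb = base.get_input_embeddings().weight.data

# Heuristic initialization: reuse the old vector for any token string present
# in both vocabularies; unmatched tokens keep whatever rows
# resize_token_embeddings left in place.
with torch.no_grad():
    for token, new_id in new_tok.get_vocab().items():
        old_id = old_vocab.get(token)
        if old_id is not None and old_id < old_emb.shape[0]:
            new_emb[new_id] = old_emb[old_id]

base.save_pretrained("./falcon-7b-ca-es-en")  # starting point for continued pre-training
new_tok.save_pretrained("./falcon-7b-ca-es-en")
```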
@@ -191,9 +191,9 @@ The training lasted a total of 320 hours on 8 NVIDIA H100 GPUs with 80GB RAM.
 - distributed_type: multi-GPU
 - num_devices: 8
 - train_batch_size: 1
-- eval_batch_size:
+- eval_batch_size: 1
 - total_train_batch_size: 8
-- total_eval_batch_size:
+- total_eval_batch_size: 8
 - optimizer: Adam
 - betas: (0.9,0.999)
 - epsilon: 1e-08
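For readers reconstructing the setup, the corrected hyperparameters above map onto a standard transformers multi-GPU run roughly as follows. The output path and the bf16 flag are assumptions, and the actual training stack used for Ǎguila-7B is not specified in this diff.

```python
# Rough mapping of the listed hyperparameters onto transformers TrainingArguments.
# The output directory and bf16 flag are assumptions; with 8 devices and no
# gradient accumulation, the effective batch sizes work out to 1 * 8 = 8,
# matching total_train_batch_size and total_eval_batch_size.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./aguila-7b-checkpoints",  # hypothetical path
    per_device_train_batch_size=1,         # train_batch_size: 1
    per_device_eval_batch_size=1,          # eval_batch_size: 1
    adam_beta1=0.9,                        # betas: (0.9, 0.999)
    adam_beta2=0.999,
    adam_epsilon=1e-8,                     # epsilon: 1e-08
    bf16=True,                             # assumption: mixed precision on H100 GPUs
)
```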