Update README.md
README.md
metrics:
- bleu
library_name: fairseq
---

## Projecte Aina's English-Catalan machine translation model

## Model description

This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of English-Catalan datasets which, after filtering and cleaning, comprised 30.023.034 sentence pairs. The model was evaluated on the Flores and NTREX evaluation datasets.

## Intended uses and limitations
[…]

However, we are well aware that our models may be biased. We intend to conduct research […]
### Training data

The model was trained on a combination of several datasets, including data collected from Opus, HPLT and other sources.
### Training procedure

### Data preparation

All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75. This is done using sentence embeddings calculated with [LaBSE](https://huggingface.co/sentence-transformers/LaBSE). The filtered datasets are then concatenated to form a final corpus of 30.023.034 sentence pairs; before training, punctuation is normalized using a modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py).
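The project's actual filtering script is not published here, but the idea is easy to reproduce with the sentence-transformers library. Below is a minimal sketch of this kind of LaBSE-based similarity filtering; the function name and corpus handling are illustrative assumptions, not the project's code.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# LaBSE maps sentences from different languages into a shared embedding
# space, so a good translation pair should have high cosine similarity.
model = SentenceTransformer("sentence-transformers/LaBSE")

def filter_pairs(en_sents, ca_sents, threshold=0.75):
    """Keep pairs whose LaBSE embeddings have cosine similarity >= threshold."""
    en_emb = model.encode(en_sents, normalize_embeddings=True)
    ca_emb = model.encode(ca_sents, normalize_embeddings=True)
    # With L2-normalized embeddings, the row-wise dot product is the cosine.
    sims = np.einsum("ij,ij->i", en_emb, ca_emb)
    return [pair for pair, s in zip(zip(en_sents, ca_sents), sims) if s >= threshold]

kept = filter_pairs(["The cat sleeps."], ["El gat dorm."])
```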
#### Tokenization

All data is tokenized with SentencePiece, using a model with a 50.000-token vocabulary learned from the combination of all filtered training data. This SentencePiece model is included.
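The card does not state the SentencePiece model type or training options; a minimal sketch with the sentencepiece Python API (file names are placeholders) might look like this:

```python
import sentencepiece as spm

# Learn a joint 50k-token model from the concatenated, filtered
# training text (one sentence per line, English and Catalan mixed).
spm.SentencePieceTrainer.train(
    input="train.en-ca.txt",  # placeholder path to the combined corpus
    model_prefix="spm_en_ca",
    vocab_size=50000,
)

# Tokenize raw text into subword pieces with the learned model.
sp = spm.SentencePieceProcessor(model_file="spm_en_ca.model")
print(sp.encode("This model was trained from scratch.", out_type=str))
```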
#### Hyperparameters

The model is based on the Transformer-XLarge architecture proposed by [Subramanian et al.](https://aclanthology.org/2021.wmt-1.18.pdf).
The following hyperparameters were set in the Fairseq toolkit:

| Hyperparameter   | Value                             |
|------------------|-----------------------------------|
| Architecture     | transformer_vaswani_wmt_en_de_big |
| Embedding size   | 1024                              |
| Feedforward size | 4096                              |
| Number of heads  | 16                                |
| …                | …                                 |
| Dropout          | 0.1                               |
| Label smoothing  | 0.1                               |

The model was trained for a total of 16.000 updates. Weights were saved every 1.000 updates, and the reported results are the average of the last 6 checkpoints.
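Fairseq ships a scripts/average_checkpoints.py utility for this; the snippet below sketches the same idea in plain PyTorch. The checkpoint file names are placeholders, and storing the weights under a "model" key is an assumption based on common fairseq checkpoint layout.

```python
import torch

# Placeholder paths to the last 6 saved checkpoints.
paths = [f"checkpoints/checkpoint_{n}.pt" for n in range(11, 17)]

avg = None
for path in paths:
    # fairseq checkpoints usually keep the weights under the "model" key.
    state = torch.load(path, map_location="cpu")["model"]
    if avg is None:
        avg = {k: v.clone().float() for k, v in state.items()}
    else:
        for k, v in state.items():
            avg[k] += v.float()

# Element-wise mean of the parameters across the 6 checkpoints.
avg = {k: v / len(paths) for k, v in avg.items()}
torch.save({"model": avg}, "checkpoints/checkpoint_avg_last6.pt")
```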
## Evaluation
[…]

Below are the evaluation results for machine translation from English to Catalan:
| Test set             | SoftCatalà | Google Translate | aina-translator-en-ca |
|----------------------|------------|------------------|-----------------------|
| Spanish Constitution | 32,6       | 37,8             | **41,2**              |
| United Nations       | 39,0       | 40,5             | **41,2**              |
| European Commission  | 49,1       | **52,0**         | 51,0                  |
| Flores 101 dev       | 41,0       | **45,1**         | 43,3                  |
| Flores 101 devtest   | 42,1       | **46,0**         | 44,1                  |
| Cybersecurity        | 42,5       | **48,1**         | 45,8                  |
| wmt 19 biomedical    | 21,7       | 25,5             | **26,7**              |
| wmt 13 news          | 34,9       | **35,7**         | 34,0                  |
| **Average**          | 37,9       | **41,34**        | 40,91                 |
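The scores above are BLEU. A common way to compute corpus BLEU is the sacreBLEU Python API; whether this exact setup was used here is an assumption, and the data loading is omitted.

```python
import sacrebleu

# System outputs and reference translations, one string per sentence.
hypotheses = ["El gat dorm al sofà.", "Bon dia a tothom."]
references = [["El gat dorm al sofà.", "Bon dia a tothom."]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")
```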
## Additional information
[…]