aleixsant committed
Commit 5197d82
1 Parent(s): 069cfa5

Update README.md

Files changed (1)
  1. README.md +21 -45
README.md CHANGED
@@ -9,13 +9,12 @@ metrics:
  - bleu
  library_name: fairseq
  ---
- ## Aina Project's English-Catalan machine translation model

  ## Model description

  This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of English-Catalan datasets,
- up to 11 million sentences. Additionally, the model is evaluated on several public datasets comprising 5 different domains (general, adminstrative, technology,
- biomedical, and news).

  ## Intended uses and limitations

@@ -52,54 +51,31 @@ However, we are well aware that our models may be biased. We intend to conduct r

  ### Training data

- The model was trained on a combination of the following datasets:
-
- | Dataset | Sentences |
- |--------------------|----------------|
- | Global Voices | 21.342 |
- | Memories Lluires | 1.173.055 |
- | Wikimatrix | 1.205.908 |
- | TED Talks | 50.979 |
- | Tatoeba | 5.500 |
- | CoVost 2 ca-en | 79.633 |
- | CoVost 2 en-ca | 263.891 |
- | Europarl | 1.965.734 |
- | jw300 | 97.081 |
- | Crawled Generalitat| 38.595 |
- | Opus Books | 4.580 |
- | CC Aligned | 5.787.682 |
- | COVID_Wikipedia | 1.531 |
- | EuroBooks | 3.746 |
- | Gnome | 2.183 |
- | KDE 4 | 144.153 |
- | OpenSubtitles | 427.913 |
- | QED | 69.823 |
- | Ubuntu | 6.781 |
- | Wikimedia | 208.073 |
- |--------------------|----------------|
- | **Total** | **11.558.183** |
 
  ### Training procedure

  ### Data preparation

- All datasets are concatenated and filtered using the [mBERT Gencata parallel filter](https://huggingface.co/projecte-aina/mbert-base-gencata).
- Before training, the punctuation is normalized using a modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py)
-

  #### Tokenization

- All data is tokenized using sentencepiece, using 50 thousand token sentencepiece model learned from the combination of all filtered training data.
- This model is included.

  #### Hyperparameters

  The model is based on the Transformer-XLarge proposed by [Subramanian et al.](https://aclanthology.org/2021.wmt-1.18.pdf)
  The following hyperparameters were set on the Fairseq toolkit:

  | Hyperparameter | Value |
  |------------------------------------|----------------------------------|
- | Architecture | transformer_vaswani_wmt_en_de_bi |
  | Embedding size | 1024 |
  | Feedforward size | 4096 |
  | Number of heads | 16 |
@@ -118,7 +94,7 @@ The following hyperparameters were set on the Fairseq toolkit:
  | Dropout | 0.1 |
  | Label smoothing | 0.1 |

- The model was trained for a total of 45.000 updates. Weights were saved every 1000 updates and reported results are the average of the last 32 checkpoints.

  ## Evaluation

@@ -141,15 +117,15 @@ Below are the evaluation results on the machine translation from English to Cata

  | Test set | SoftCatalà | Google Translate | aina-translator-en-ca |
  |----------------------|------------|------------------|---------------|
- | Spanish Constitution | 32,6 | 37,6 | **37,7** |
- | United Nations | 39,0 | 39,7 | **39,8** |
- | European Commission | 49,1 | **52** | 49,5 |
- | Flores 101 dev | 41,0 | 41,6 | **42,9** |
- | Flores 101 devtest | 42,1 | 42,2 | **44,0** |
- | Cybersecurity | 42,5 | **46,5** | 45,8 |
- | wmt 19 biomedical | 21,7 | **25,2** | 25,1 |
- | wmt 13 news | 34,9 | 33,8 | **35,6** |
- | **Average** | 37,9 | 39,8 | **40,1** |

  ## Additional information

 
  - bleu
  library_name: fairseq
  ---
+ ## Projecte Aina's English-Catalan machine translation model

  ## Model description

  This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of English-Catalan datasets,
+ which, after filtering and cleaning, comprised 30.023.034 sentence pairs. The model was evaluated on the Flores and NTREX evaluation datasets.

  ## Intended uses and limitations

 

  ### Training data

+ The model was trained on a combination of several datasets, including data collected from Opus, HPLT and other sources.

  ### Training procedure

  ### Data preparation

+ All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
+ This is done using sentence embeddings calculated with [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
+ The filtered datasets are then concatenated to form a final corpus of 30.023.034 sentence pairs. Before training, the punctuation is normalized using a modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py).
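
As an illustration of this filtering step, below is a minimal sketch using the sentence-transformers LaBSE model; the function, file handling, and batch size are assumptions for illustration, not the project's actual script.

```python
# Illustrative sketch of LaBSE-based filtering (not the project's actual
# script): deduplicate pairs, embed both sides, and keep pairs whose cosine
# similarity is at least 0.75.
from sentence_transformers import SentenceTransformer

labse = SentenceTransformer("sentence-transformers/LaBSE")

def filter_pairs(src_lines, tgt_lines, threshold=0.75):
    # Drop exact duplicate sentence pairs while preserving order.
    pairs = list(dict.fromkeys(zip(src_lines, tgt_lines)))
    src, tgt = zip(*pairs)
    # With normalize_embeddings=True the dot product equals cosine similarity.
    src_emb = labse.encode(list(src), normalize_embeddings=True, batch_size=128)
    tgt_emb = labse.encode(list(tgt), normalize_embeddings=True, batch_size=128)
    sims = (src_emb * tgt_emb).sum(axis=1)
    return [pair for pair, sim in zip(pairs, sims) if sim >= threshold]
```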

  #### Tokenization

+ All data is tokenized with SentencePiece, using a 50 thousand token SentencePiece model learned from the combination of all filtered training data.
+ This SentencePiece model is included.
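
A sketch of how a 50 thousand token SentencePiece model of this kind can be learned and applied follows; the file names and model prefix are illustrative assumptions.

```python
# Sketch: learn a shared 50k-token SentencePiece model on the concatenated,
# filtered training data. File names and the model prefix are illustrative.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="train.en-ca.txt",      # concatenation of all filtered data
    model_prefix="spm_en_ca_50k",
    vocab_size=50000,
)

# The resulting model is then used to tokenize both sides of the corpus.
sp = spm.SentencePieceProcessor(model_file="spm_en_ca_50k.model")
print(sp.encode("This is a test sentence.", out_type=str))
```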

  #### Hyperparameters

  The model is based on the Transformer-XLarge proposed by [Subramanian et al.](https://aclanthology.org/2021.wmt-1.18.pdf)
  The following hyperparameters were set on the Fairseq toolkit:

  | Hyperparameter | Value |
  |------------------------------------|----------------------------------|
+ | Architecture | transformer_vaswani_wmt_en_de_big |
  | Embedding size | 1024 |
  | Feedforward size | 4096 |
  | Number of heads | 16 |
 
  | Dropout | 0.1 |
  | Label smoothing | 0.1 |
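
A hypothetical `fairseq-train` invocation consistent with this table is sketched below; the data directory and every flag not listed in the table are assumptions, not the project's exact training command.

```python
# Hypothetical fairseq-train invocation matching the hyperparameters above.
# transformer_vaswani_wmt_en_de_big already implies the 1024/4096/16 sizes;
# they are spelled out only to mirror the table. Paths are illustrative.
import subprocess

subprocess.run([
    "fairseq-train", "data-bin/en-ca",
    "--arch", "transformer_vaswani_wmt_en_de_big",
    "--encoder-embed-dim", "1024", "--decoder-embed-dim", "1024",
    "--encoder-ffn-embed-dim", "4096", "--decoder-ffn-embed-dim", "4096",
    "--encoder-attention-heads", "16", "--decoder-attention-heads", "16",
    "--dropout", "0.1",
    "--criterion", "label_smoothed_cross_entropy",
    "--label-smoothing", "0.1",
    "--optimizer", "adam",
    "--save-interval-updates", "1000",
    "--max-update", "16000",
], check=True)
```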

+ The model was trained for a total of 16.000 updates. Weights were saved every 1.000 updates and the reported results are the average of the last 6 checkpoints.
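
Averaging of this kind is typically done with the `average_checkpoints.py` script that ships with the fairseq repository; the sketch below uses illustrative paths.

```python
# Sketch: average the last 6 update checkpoints with fairseq's bundled
# average_checkpoints.py helper. Checkpoint and output paths are illustrative.
import subprocess

subprocess.run([
    "python", "scripts/average_checkpoints.py",
    "--inputs", "checkpoints/en-ca",
    "--num-update-checkpoints", "6",
    "--output", "checkpoints/en-ca/checkpoint.avg6.pt",
], check=True)
```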

  ## Evaluation

  | Test set | SoftCatalà | Google Translate | aina-translator-en-ca |
  |----------------------|------------|------------------|---------------|
+ | Spanish Constitution | 32,6 | 37,8 | **41,2** |
+ | United Nations | 39,0 | 40,5 | **41,2** |
+ | European Commission | 49,1 | **52,0** | 51,0 |
+ | Flores 101 dev | 41,0 | **45,1** | 43,3 |
+ | Flores 101 devtest | 42,1 | **46,0** | 44,1 |
+ | Cybersecurity | 42,5 | **48,1** | 45,8 |
+ | wmt 19 biomedical | 21,7 | 25,5 | **26,7** |
+ | wmt 13 news | 34,9 | **35,7** | 34,0 |
+ | **Average** | 37,9 | **41,3** | 40,9 |
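
BLEU scores such as those in this table are conventionally computed with sacreBLEU; below is a minimal scoring sketch with illustrative file names.

```python
# Sketch: score detokenized system output against a reference with sacreBLEU.
# File names are illustrative.
import sacrebleu

with open("flores101.devtest.ca") as f:
    refs = [line.rstrip("\n") for line in f]
with open("model_output.ca") as f:
    hyps = [line.rstrip("\n") for line in f]

bleu = sacrebleu.corpus_bleu(hyps, [refs])
print(f"BLEU = {bleu.score:.1f}")
```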

  ## Additional information