Update README.md
README.md
metrics:
- bleu
library_name: fairseq
---

## Projecte Aina's English-Catalan machine translation model

## Model description

This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of English-Catalan datasets which, after filtering and cleaning, comprised 30.023.034 sentence pairs. The model was evaluated on the Flores and NTREX evaluation datasets.

## Intended uses and limitations
[…]

However, we are well aware that our models may be biased. We intend to conduct research […]
### Training data

The model was trained on a combination of several datasets, including data collected from Opus, HPLT and other sources.
### Training procedure

### Data preparation

All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75. This is done using sentence embeddings calculated with [LaBSE](https://huggingface.co/sentence-transformers/LaBSE). The filtered datasets are then concatenated to form a final corpus of 30.023.034 sentence pairs; before training, punctuation is normalized using a modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py).
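The project's actual filtering script is not published here, but the idea is easy to reproduce with the sentence-transformers library. Below is a minimal sketch of this kind of LaBSE-based similarity filtering; the function name and corpus handling are illustrative assumptions, not the project's code.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# LaBSE maps sentences from different languages into a shared embedding
# space, so a good translation pair should have high cosine similarity.
model = SentenceTransformer("sentence-transformers/LaBSE")

def filter_pairs(en_sents, ca_sents, threshold=0.75):
    """Keep pairs whose LaBSE embeddings have cosine similarity >= threshold."""
    en_emb = model.encode(en_sents, normalize_embeddings=True)
    ca_emb = model.encode(ca_sents, normalize_embeddings=True)
    # With L2-normalized embeddings, the row-wise dot product is the cosine.
    sims = np.einsum("ij,ij->i", en_emb, ca_emb)
    return [pair for pair, s in zip(zip(en_sents, ca_sents), sims) if s >= threshold]

kept = filter_pairs(["The cat sleeps."], ["El gat dorm."])
```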
#### Tokenization

All data is tokenized with SentencePiece, using a model with a 50.000-token vocabulary learned from the combination of all filtered training data. This SentencePiece model is included.
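The card does not state the SentencePiece model type or training options; a minimal sketch with the sentencepiece Python API (file names are placeholders) might look like this:

```python
import sentencepiece as spm

# Learn a joint 50k-token model from the concatenated, filtered
# training text (one sentence per line, English and Catalan mixed).
spm.SentencePieceTrainer.train(
    input="train.en-ca.txt",  # placeholder path to the combined corpus
    model_prefix="spm_en_ca",
    vocab_size=50000,
)

# Tokenize raw text into subword pieces with the learned model.
sp = spm.SentencePieceProcessor(model_file="spm_en_ca.model")
print(sp.encode("This model was trained from scratch.", out_type=str))
```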
#### Hyperparameters

The model is based on the Transformer-XLarge architecture proposed by [Subramanian et al.](https://aclanthology.org/2021.wmt-1.18.pdf).
The following hyperparameters were set in the Fairseq toolkit:

| Hyperparameter   | Value                             |
|------------------|-----------------------------------|
| Architecture     | transformer_vaswani_wmt_en_de_big |
| Embedding size   | 1024                              |
| Feedforward size | 4096                              |
| Number of heads  | 16                                |
| …                | …                                 |
| Dropout          | 0.1                               |
| Label smoothing  | 0.1                               |

The model was trained for a total of 16.000 updates. Weights were saved every 1.000 updates, and the reported results are the average of the last 6 checkpoints.
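Fairseq ships a scripts/average_checkpoints.py utility for this; the snippet below sketches the same idea in plain PyTorch. The checkpoint file names are placeholders, and storing the weights under a "model" key is an assumption based on common fairseq checkpoint layout.

```python
import torch

# Placeholder paths to the last 6 saved checkpoints.
paths = [f"checkpoints/checkpoint_{n}.pt" for n in range(11, 17)]

avg = None
for path in paths:
    # fairseq checkpoints usually keep the weights under the "model" key.
    state = torch.load(path, map_location="cpu")["model"]
    if avg is None:
        avg = {k: v.clone().float() for k, v in state.items()}
    else:
        for k, v in state.items():
            avg[k] += v.float()

# Element-wise mean of the parameters across the 6 checkpoints.
avg = {k: v / len(paths) for k, v in avg.items()}
torch.save({"model": avg}, "checkpoints/checkpoint_avg_last6.pt")
```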
## Evaluation
[…]

Below are the evaluation results for machine translation from English to Catalan:
| Test set             | SoftCatalà | Google Translate | aina-translator-en-ca |
|----------------------|------------|------------------|-----------------------|
| Spanish Constitution | 32,6       | 37,8             | **41,2**              |
| United Nations       | 39,0       | 40,5             | **41,2**              |
| European Commission  | 49,1       | **52,0**         | 51,0                  |
| Flores 101 dev       | 41,0       | **45,1**         | 43,3                  |
| Flores 101 devtest   | 42,1       | **46,0**         | 44,1                  |
| Cybersecurity        | 42,5       | **48,1**         | 45,8                  |
| wmt 19 biomedical    | 21,7       | 25,5             | **26,7**              |
| wmt 13 news          | 34,9       | **35,7**         | 34,0                  |
| **Average**          | 37,9       | **41,34**        | 40,91                 |
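The scores above are BLEU. A common way to compute corpus BLEU is the sacreBLEU Python API; whether this exact setup was used here is an assumption, and the data loading is omitted.

```python
import sacrebleu

# System outputs and reference translations, one string per sentence.
hypotheses = ["El gat dorm al sofà.", "Bon dia a tothom."]
references = [["El gat dorm al sofà.", "Bon dia a tothom."]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")
```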
## Additional information
[…]