Update README.md
README.md
CHANGED
@@ -51,7 +51,8 @@ However, we are well aware that our models may be biased. We intend to conduct r
 
 ### Training data
 
-The model was trained on a combination of several datasets, including data collected from Opus, HPLT
+The model was trained on a combination of several datasets, including data collected from [Opus](https://opus.nlpl.eu/), [HPLT](https://hplt-project.org/),
+an internally created [CA-EN Parallel Corpus](https://huggingface.co/datasets/projecte-aina/CA-EN_Parallel_Corpus), and other sources.
 
 ### Training procedure
 
@@ -59,7 +60,8 @@ The model was trained on a combination of several datasets, including data colle
 
 All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
 This is done using sentence embeddings calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
-The filtered datasets are then concatenated to form a final corpus of 30.023.034 and before training the punctuation
+The filtered datasets are then concatenated to form a final corpus of 30.023.034 parallel sentences, and before training the punctuation
+is normalized using a modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py).
 
 #### Tokenization
 
@@ -100,11 +102,11 @@ The model was trained for a total of 16000 updates. Weights were saved every 100
 
 ### Variable and metrics
 
-We use the BLEU score for evaluation on test sets:
+We use the BLEU score for evaluation on the following test sets:
 [Spanish Constitution (TaCon)](https://elrc-share.eu/repository/browse/tacon-spanish-constitution-mt-test-set/84a96138b98611ec9c1a00155d02670628f3e6857b0f422abd82abc3795ec8c2/),
 [United Nations](https://zenodo.org/record/3888414#.Y33-_tLMIW0),
 [European Commission](https://elrc-share.eu/repository/browse/european-commission-corpus/8a419b1758ea11ed9c1a00155d0267069bd085cae124481589b0858e5b274327/),
-[Flores-
+[Flores-200](https://github.com/facebookresearch/flores),
 [Cybersecurity](https://elrc-share.eu/repository/browse/cyber-mt-test-set/2bd93faab98c11ec9c1a00155d026706b96a490ed3e140f0a29a80a08c46e91e/),
 [wmt19 biomedical test set](http://www.statmt.org/wmt19/biomedical-translation-task.html),
 [wmt13 news test set](https://elrc-share.eu/repository/browse/catalan-wmt2013-machine-translation-shared-task-test-set/84a96139b98611ec9c1a00155d0267061a0aa1b62e2248e89aab4952f3c230fc/).
@@ -120,8 +122,8 @@ Below are the evaluation results on the machine translation from English to Cata
 | Spanish Constitution | 32,6 | 37,8 | **41,2** |
 | United Nations | 39,0 | 40,5 | **41,2** |
 | European Commission | 49,1 | **52,0** | 51 |
-| Flores
-| Flores
+| Flores 200 dev | 41,0 | **45,1** | 43,3 |
+| Flores 200 devtest | 42,1 | **46,0** | 44,1 |
 | Cybersecurity | 42,5 | **48,1** | 45,8 |
 | wmt 19 biomedical | 21,7 | 25,5 | **26,7** |
 | wmt 13 news | 34,9 | **35,7** | 34,0 |
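The cosine-similarity filtering step described in the diff (drop sentence pairs below 0.75) can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: in practice the embeddings would come from LaBSE (e.g. via the sentence-transformers library), while here they are passed in as plain arrays so the filtering logic itself is visible; `filter_pairs` is a hypothetical helper name.

```python
import numpy as np

def filter_pairs(pairs, src_emb, tgt_emb, threshold=0.75):
    """Keep only sentence pairs whose source/target embeddings have
    cosine similarity >= threshold (0.75 in the model card).
    In the real pipeline the embeddings are LaBSE sentence embeddings;
    here they are arbitrary arrays of shape (n_pairs, dim)."""
    # Normalize each row to unit length so the dot product is the cosine.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = (src * tgt).sum(axis=1)  # row-wise cosine similarity
    return [pair for pair, sim in zip(pairs, sims) if sim >= threshold]
```

Deduplication would typically run before this step, e.g. by hashing each normalized pair and keeping the first occurrence.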
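The punctuation normalization mentioned in the diff is done with a modified join-single-file.py from SoftCatalà; as a rough sketch of the idea (the character mapping below is illustrative and not that script's actual rule set), a minimal normalizer might look like:

```python
# Illustrative punctuation normalization for a parallel corpus before
# training. The mapping is an assumption for demonstration, not the
# rules used by SoftCatalà's join-single-file.py.
PUNCT_MAP = {
    "\u201c": '"', "\u201d": '"',  # curly double quotes -> straight
    "\u2018": "'", "\u2019": "'",  # curly single quotes -> straight
    "\u00ab": '"', "\u00bb": '"',  # guillemets -> straight quotes
    "\u2026": "...",               # ellipsis -> three dots
    "\u2013": "-", "\u2014": "-",  # en/em dashes -> hyphen
}

def normalize_punctuation(line: str) -> str:
    for src, dst in PUNCT_MAP.items():
        line = line.replace(src, dst)
    # Collapse runs of whitespace left over after substitution.
    return " ".join(line.split())
```

Normalizing both sides of the corpus this way keeps the tokenizer's vocabulary from splitting over typographic variants of the same punctuation mark.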