Fairseq
English
Catalan
fdelucaf committed on
Commit
e1ecce0
1 Parent(s): 5197d82

Update README.md

  1. README.md +8 -6
README.md CHANGED
@@ -51,7 +51,8 @@ However, we are well aware that our models may be biased. We intend to conduct r

  ### Training data

- The model was trained on a combination of several datasets, including data collected from Opus, HPLT and other sources.

  ### Training procedure

@@ -59,7 +60,8 @@ The model was trained on a combination of several datasets, including data colle

  All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
  This is done using sentence embeddings calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
- The filtered datasets are then concatenated to form a final corpus of 30.023.034 and before training the punctuation is normalized using a modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py)

  #### Tokenization

@@ -100,11 +102,11 @@ The model was trained for a total of 16000 updates. Weights were saved every 100

  ### Variable and metrics

- We use the BLEU score for evaluation on test sets:
  [Spanish Constitution (TaCon)](https://elrc-share.eu/repository/browse/tacon-spanish-constitution-mt-test-set/84a96138b98611ec9c1a00155d02670628f3e6857b0f422abd82abc3795ec8c2/),
  [United Nations](https://zenodo.org/record/3888414#.Y33-_tLMIW0),
  [European Commission](https://elrc-share.eu/repository/browse/european-commission-corpus/8a419b1758ea11ed9c1a00155d0267069bd085cae124481589b0858e5b274327/),
- [Flores-101](https://github.com/facebookresearch/flores),
  [Cybersecurity](https://elrc-share.eu/repository/browse/cyber-mt-test-set/2bd93faab98c11ec9c1a00155d026706b96a490ed3e140f0a29a80a08c46e91e/),
  [wmt19 biomedical test set](http://www.statmt.org/wmt19/biomedical-translation-task.html),
  [wmt13 news test set](https://elrc-share.eu/repository/browse/catalan-wmt2013-machine-translation-shared-task-test-set/84a96139b98611ec9c1a00155d0267061a0aa1b62e2248e89aab4952f3c230fc/).
@@ -120,8 +122,8 @@ Below are the evaluation results on the machine translation from English to Cata
  | Spanish Constitution | 32,6 | 37,8 | **41,2** |
  | United Nations | 39,0 | 40,5 | **41,2** |
  | European Commission | 49,1 | **52,0** | 51 |
- | Flores 101 dev | 41,0 | **45,1** | 43,3 |
- | Flores 101 devtest | 42,1 | **46,0** | 44,1 |
  | Cybersecurity | 42,5 | **48,1** | 45,8 |
  | wmt 19 biomedical | 21,7 | 25,5 | **26,7** |
  | wmt 13 news | 34,9 | **35,7** | 34,0 |
 
@@ -51,7 +51,8 @@

  ### Training data

+ The model was trained on a combination of several datasets, including data collected from [Opus](https://opus.nlpl.eu/), [HPLT](https://hplt-project.org/),
+ an internally created [CA-EN Parallel Corpus](https://huggingface.co/datasets/projecte-aina/CA-EN_Parallel_Corpus), and other sources.

  ### Training procedure

 
@@ -59,7 +60,8 @@

  All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
  This is done using sentence embeddings calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
+ The filtered datasets are then concatenated to form a final corpus of 30.023.034 parallel sentences and before training the punctuation
+ is normalized using a modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py)
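
To make the deduplication and similarity filtering described above concrete, here is a minimal Python sketch. It is not the authors' pipeline: the names `cosine_similarity` and `dedup_and_filter` and the `embed` callback are hypothetical, and exact-match deduplication is an assumption, since the text does not say how duplicates are detected. In practice `embed` might wrap a LaBSE model, for example `SentenceTransformer("sentence-transformers/LaBSE").encode` from the sentence-transformers library; the toy lookup table below only stands in for it so the example runs without downloading a model.

```python
import math
from typing import Callable, Iterable, List, Sequence, Tuple

Pair = Tuple[str, str]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def dedup_and_filter(
    pairs: Iterable[Pair],
    embed: Callable[[str], Sequence[float]],
    threshold: float = 0.75,
) -> List[Pair]:
    """Drop exact duplicate pairs, then drop pairs whose source/target
    embeddings have cosine similarity below `threshold`."""
    seen = set()
    kept: List[Pair] = []
    for src, tgt in pairs:
        key = (src.strip(), tgt.strip())
        if key in seen:  # assumption: duplicates are exact string matches
            continue
        seen.add(key)
        if cosine_similarity(embed(src), embed(tgt)) >= threshold:
            kept.append((src, tgt))
    return kept


# Toy stand-in for a sentence encoder (hypothetical 2-d embeddings).
TOY_EMBEDDINGS = {
    "hello": [1.0, 0.0],
    "hola": [0.9, 0.1],
    "cat": [0.0, 1.0],
}

kept = dedup_and_filter(
    [("hello", "hola"), ("hello", "hola"), ("hello", "cat")],
    embed=lambda sentence: TOY_EMBEDDINGS[sentence],
)
print(kept)
```

The 0.75 threshold is the one stated in the text; pairs at or above it are kept, which matches "remove any sentence pairs with a cosine similarity of less than 0.75".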

  #### Tokenization

 
@@ -100,11 +102,11 @@

  ### Variable and metrics

+ We use the BLEU score for evaluation on the following test sets:
  [Spanish Constitution (TaCon)](https://elrc-share.eu/repository/browse/tacon-spanish-constitution-mt-test-set/84a96138b98611ec9c1a00155d02670628f3e6857b0f422abd82abc3795ec8c2/),
  [United Nations](https://zenodo.org/record/3888414#.Y33-_tLMIW0),
  [European Commission](https://elrc-share.eu/repository/browse/european-commission-corpus/8a419b1758ea11ed9c1a00155d0267069bd085cae124481589b0858e5b274327/),
+ [Flores-200](https://github.com/facebookresearch/flores),
  [Cybersecurity](https://elrc-share.eu/repository/browse/cyber-mt-test-set/2bd93faab98c11ec9c1a00155d026706b96a490ed3e140f0a29a80a08c46e91e/),
  [wmt19 biomedical test set](http://www.statmt.org/wmt19/biomedical-translation-task.html),
  [wmt13 news test set](https://elrc-share.eu/repository/browse/catalan-wmt2013-machine-translation-shared-task-test-set/84a96139b98611ec9c1a00155d0267061a0aa1b62e2248e89aab4952f3c230fc/).
 
@@ -120,8 +122,8 @@
  | Spanish Constitution | 32,6 | 37,8 | **41,2** |
  | United Nations | 39,0 | 40,5 | **41,2** |
  | European Commission | 49,1 | **52,0** | 51 |
+ | Flores 200 dev | 41,0 | **45,1** | 43,3 |
+ | Flores 200 devtest | 42,1 | **46,0** | 44,1 |
  | Cybersecurity | 42,5 | **48,1** | 45,8 |
  | wmt 19 biomedical | 21,7 | 25,5 | **26,7** |
  | wmt 13 news | 34,9 | **35,7** | 34,0 |
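
The table above reports BLEU, but this diff does not show which implementation or tokenization produced the scores. For intuition only, here is a simplified corpus-level BLEU in pure Python, assuming whitespace tokenization, one reference per segment, and no smoothing. The names `ngram_counts` and `simple_corpus_bleu` are hypothetical, and its numbers are not comparable to the table; a standard tool such as sacreBLEU would be the usual choice for reportable scores.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngram_counts(tokens: Sequence[str], n: int) -> Counter:
    """Count the n-grams of order `n` in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_corpus_bleu(
    hypotheses: List[str], references: List[str], max_n: int = 4
) -> float:
    """Corpus-level BLEU on a 0-100 scale (single reference, no smoothing)."""
    matches = [0] * max_n  # clipped n-gram matches, summed over the corpus
    totals = [0] * max_n   # hypothesis n-gram counts, summed over the corpus
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_ngrams = ngram_counts(hyp_tokens, n)
            ref_ngrams = ngram_counts(ref_tokens, n)
            matches[n - 1] += sum(
                min(count, ref_ngrams[gram]) for gram, count in hyp_ngrams.items()
            )
            totals[n - 1] += sum(hyp_ngrams.values())
    # Without smoothing, any order with zero matches gives a score of zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize hypotheses shorter than the references.
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1.0 - ref_len / hyp_len)
    return 100.0 * brevity * math.exp(log_precision)


hyps = ["the cat sat on the mat today"]
refs = ["the cat sat on the mat"]
print(round(simple_corpus_bleu(hyps, refs), 1))
```

The score is the geometric mean of the clipped 1- to 4-gram precisions multiplied by the brevity penalty, which is why a single missing 4-gram match drives the unsmoothed score to zero on very short inputs.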