AudreyVM committed on
Commit dca0aec
1 Parent(s): 7f6c39c

setting backtranslation as default / update README

Files changed (1):
  1. README.md (+21 -11)
README.md CHANGED
@@ -13,8 +13,8 @@ library_name: fairseq
 
 ## Model description
 
- This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of Catalan-German datasets,
- which after filtering and cleaning comprised 6.258.272 sentence pairs. The model was evaluated on the Flores and NTREX evaluation datasets.
+ This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of Catalan-German datasets totalling 100.000.000 sentence pairs.
+ 6.258.272 sentence pairs were parallel data collected from the web, while the remaining 93.741.728 sentence pairs were synthetic parallel data created using the ES-CA translator from [PlanTL](https://huggingface.co/PlanTL-GOB-ES/mt-plantl-es-ca). The model was evaluated on the Flores and NTREX evaluation datasets.
 
 ## Intended uses and limitations
 
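The synthetic pairs added in this hunk come from pivoting: the Spanish side of Spanish-German corpora is machine-translated into Catalan, so each ES-DE pair yields a CA-DE pair. A minimal sketch of the idea, assuming a hypothetical `translate_es_to_ca` helper wrapping the PlanTL ES-CA model (the project's actual pipeline is not shown in this commit):

```python
# Sketch only: translate_es_to_ca is a hypothetical helper wrapping the
# PlanTL ES-CA translator; the real data pipeline is not part of this commit.
def make_synthetic_ca_de(es_de_pairs, translate_es_to_ca):
    """Turn (Spanish, German) pairs into (Catalan, German) pairs by pivoting."""
    synthetic = []
    for es, de in es_de_pairs:
        ca = translate_es_to_ca(es)  # pivot the Spanish side into Catalan
        synthetic.append((ca, de))   # keep the German side unchanged
    return synthetic
```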
 
@@ -36,7 +36,7 @@ import pyonmttok
 from huggingface_hub import snapshot_download
 model_dir = snapshot_download(repo_id="projecte-aina/aina-translator-ca-de", revision="main")
 
- tokenizer=pyonmttok.Tokenizer(mode="none", sp_model_path = model_dir + "/spm.model")
+ tokenizer=pyonmttok.Tokenizer(mode="none", sp_model_path = model_dir + "/spm.50k.model")
 tokenized=tokenizer.tokenize("Benvingut al projecte Aina!")
 
 translator = ctranslate2.Translator(model_dir)
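This hunk shows only part of the README's usage snippet. For context, a complete round trip with these libraries looks roughly like the sketch below; it uses standard CTranslate2 and pyonmttok calls (`translate_batch`, `detokenize`), though the README's full code may differ:

```python
import ctranslate2
import pyonmttok
from huggingface_hub import snapshot_download

model_dir = snapshot_download(repo_id="projecte-aina/aina-translator-ca-de", revision="main")

# Tokenize with the SentencePiece model shipped alongside the CTranslate2 model.
tokenizer = pyonmttok.Tokenizer(mode="none", sp_model_path=model_dir + "/spm.50k.model")
tokens, _ = tokenizer.tokenize("Benvingut al projecte Aina!")

# Translate the token sequence and detokenize the best hypothesis.
translator = ctranslate2.Translator(model_dir)
result = translator.translate_batch([tokens])
print(tokenizer.detokenize(result[0].hypotheses[0]))
```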
@@ -52,7 +52,7 @@ However, we are well aware that our models may be biased. We intend to conduct r
 
 ### Training data
 
- The model was trained on a combination of the following datasets:
+ The Catalan-German data collected from the web was a combination of the following datasets:
 
 | Dataset | Sentences | Sentences after Cleaning |
 |-------------------|----------------|-------------------|
@@ -71,8 +71,19 @@ The model was trained on a combination of the following datasets:
 | **Total** | **7.427.843** | **6.258.272** |
 
 All corpora except Europarl and Tilde were collected from [Opus](https://opus.nlpl.eu/).
- The Europarl and Tilde corpora are synthetic parallel corpora created from the original Spanish-Catalan corpora by [SoftCatalà](https://github.com/Softcatala).
+ The Europarl and Tilde corpora are synthetic parallel corpora created from the original Spanish-German corpora by [SoftCatalà](https://github.com/Softcatala).
 
+ The 93.741.728 sentence pairs of synthetic parallel data were created from the following Spanish-German datasets:
+
+ | Dataset | Sentences before cleaning |
+ |-------------------|----------------|
+ | globalvoices_es-de_20230901 | 70.097 |
+ | multiparacrawl_es-de_20230901 | 56.873.541 |
+ | dgt_es-de_20240129 | 4.899.734 |
+ | eubookshop_es-de_20240129 | 4.750.170 |
+ | nllb_es-de_20240129 | 112.444.838 |
+ | opensubtitles_es-de_20240129 | 18.951.214 |
+ | **Total** | **197.989.594** |
 
 ### Training procedure
 
@@ -80,8 +91,7 @@ The Europarl and Tilde corpora are synthetic parallel corpora created from the o
 
 All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
 This is done using sentence embeddings calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
- The filtered datasets are then concatenated to form a final corpus of 6.159.631 and before training the punctuation is normalized
- using a modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py)
+ The filtered datasets are then concatenated and, before training, the punctuation is normalized using a modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py).
 
 
 #### Tokenization
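As a rough illustration of the LaBSE similarity filtering described in this hunk, here is a minimal sketch using the sentence-transformers package; the project's actual cleaning scripts are not part of this commit:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

def filter_pairs(pairs, threshold=0.75):
    """Keep (ca, de) pairs whose LaBSE embeddings have cosine similarity >= threshold."""
    ca_emb = model.encode([ca for ca, _ in pairs], normalize_embeddings=True)
    de_emb = model.encode([de for _, de in pairs], normalize_embeddings=True)
    # With L2-normalized embeddings, cosine similarity reduces to a dot product.
    return [p for p, a, b in zip(pairs, ca_emb, de_emb) if float(a @ b) >= threshold]
```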
@@ -130,10 +140,10 @@ and [Google Translate](https://translate.google.es/?hl=es):
 
 | Test set | SoftCatalà | Google Translate | aina-translator-ca-de |
 |----------------------|------------|------------------|---------------|
- | Flores 101 dev | 26,2 | **34,8** | 27,5 |
- | Flores 101 devtest | 26,3 | **34,0** | 26,9 |
- | NTREX | 21,7 | **28,8** | 20,4 |
- | Average | 24,7 | **32,5** | 24,9 |
+ | Flores 101 dev | 26,2 | **34,8** | 34,1 |
+ | Flores 101 devtest | 26,3 | **34,0** | 33,3 |
+ | NTREX | 21,7 | **28,8** | 27,8 |
+ | Average | 24,7 | **32,5** | 31,7 |
 
 ## Additional information
 
 