set back-translation as main #1
by AudreyVM · opened

Files changed:
- README.md +14 -21
- config.json +0 -1
- model.bin +2 -2
- shared_vocabulary.json +0 -0
- spm.model +2 -2
README.md
CHANGED

@@ -13,8 +13,8 @@ library_name: fairseq
 
 ## Model description
 
-This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of Catalan-German datasets,
-
+This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of Catalan-German datasets,
+which after filtering and cleaning comprised 6.258.272 sentence pairs. The model was evaluated on the Flores and NTREX evaluation datasets.
 
 ## Intended uses and limitations
 

@@ -36,7 +36,7 @@ import pyonmttok
 from huggingface_hub import snapshot_download
 model_dir = snapshot_download(repo_id="projecte-aina/aina-translator-ca-de", revision="main")
 
-tokenizer=pyonmttok.Tokenizer(mode="none", sp_model_path = model_dir + "/spm.
+tokenizer=pyonmttok.Tokenizer(mode="none", sp_model_path = model_dir + "/spm.model")
 tokenized=tokenizer.tokenize("Benvingut al projecte Aina!")
 
 translator = ctranslate2.Translator(model_dir)
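The hunk above touches only the fragment of the README's usage snippet that changed. For reference, here is a minimal end-to-end sketch of the full pipeline, assuming `ctranslate2`, `pyonmttok`, and `huggingface_hub` are installed; the `translate_batch`/`detokenize` tail follows the usual CTranslate2 pattern rather than anything shown in this diff:

```python
import ctranslate2
import pyonmttok
from huggingface_hub import snapshot_download

# Fetch the CTranslate2 model, SentencePiece model, and shared vocabulary.
model_dir = snapshot_download(repo_id="projecte-aina/aina-translator-ca-de", revision="main")

# Tokenize with the repo's SentencePiece model; tokenize() returns (tokens, features).
tokenizer = pyonmttok.Tokenizer(mode="none", sp_model_path=model_dir + "/spm.model")
tokens, _ = tokenizer.tokenize("Benvingut al projecte Aina!")

# Translate the tokenized batch and detokenize the best hypothesis.
translator = ctranslate2.Translator(model_dir)
results = translator.translate_batch([tokens])
print(tokenizer.detokenize(results[0].hypotheses[0]))
```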
@@ -52,7 +52,7 @@ However, we are well aware that our models may be biased. We intend to conduct r
 
 ### Training data
 
-The Catalan-German data collected from the web was a combination of the following datasets:
+The model was trained on a combination of the following datasets:
 
 | Dataset | Sentences | Sentences after Cleaning|
 |-------------------|----------------|-------------------|
@@ -60,27 +60,19 @@ The Catalan-German data collected from the web was a combination of the following datasets:
 | WikiMatrix | 180.322 | 125.811 |
 | GNOME | 12.333| 1.241|
 | KDE4 | 165.439 | 105.098 |
+| QED | 63.041 | 49.181 |
+| TED2020 v1 | 46.680 | 38.428 |
 | OpenSubtitles | 303.329 | 171.376 |
 | GlobalVoices| 4.636 | 3.578|
 | Tatoeba | 732 | 655 |
 | Books | 4.445 | 2049 |
 | Europarl | 1.734.643 | 1.734.643 |
 | Tilde | 3.434.091 | 3.434.091 |
+| **Total** | **7.427.843** | **6.258.272** |
 
 All corpora except Europarl and Tilde were collected from [Opus](https://opus.nlpl.eu/).
-The Europarl and Tilde corpora are synthetic parallel corpora created from the original Spanish-
+The Europarl and Tilde corpora are synthetic parallel corpora created from the original Spanish-Catalan corpora by [SoftCatalà](https://github.com/Softcatala).
 
-The 93.741.728 sentence pairs of synthetic parallel data were created from the following Spanish-German datasets:
-
-| Dataset | Sentences before cleaning |
-|-------------------|----------------|
-|globalvoices_es-de_20230901 | 70.097 |
-|multiparacrawl_es-de_20230901 | 56.873.541 |
-|dgt_es-de_20240129 | 4.899.734 |
-|eubookshop_es-de_20240129 | 4.750.170 |
-|nllb_es-de_20240129 | 112.444.838 |
-|opensubtitles_es-de_20240129 | 18.951.214 |
-| **Total** | **197.989.594** |
 
 ### Training procedure
@@ -88,7 +80,8 @@ The 93.741.728 sentence pairs of synthetic parallel data were created from the following Spanish-German datasets:
 
 All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
 This is done using sentence embeddings calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
-The filtered datasets are then concatenated
+The filtered datasets are then concatenated to form a final corpus of 6.159.631 and before training the punctuation is normalized
+using a modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py)
 
 
 #### Tokenization
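The cleaning step in this hunk is described only in prose. Below is a minimal sketch of a LaBSE cosine-similarity filter with the 0.75 threshold, assuming the `sentence-transformers` package; the sample pair is hypothetical, and the rest of the pipeline (deduplication, the linked join-single-file.py normalization) is not reproduced here:

```python
from sentence_transformers import SentenceTransformer

# LaBSE produces language-agnostic sentence embeddings; with normalized
# vectors, cosine similarity reduces to a dot product.
model = SentenceTransformer("sentence-transformers/LaBSE")

pairs = [("Benvingut al projecte Aina!", "Willkommen beim Aina-Projekt!")]  # hypothetical pair
ca_emb = model.encode([ca for ca, _ in pairs], normalize_embeddings=True)
de_emb = model.encode([de for _, de in pairs], normalize_embeddings=True)

# Keep only pairs whose embeddings have cosine similarity >= 0.75.
sims = (ca_emb * de_emb).sum(axis=1)
kept = [pair for pair, sim in zip(pairs, sims) if sim >= 0.75]
print(kept)
```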
@@ -137,10 +130,10 @@ and [Google Translate](https://translate.google.es/?hl=es):
 
 | Test set | SoftCatalà | Google Translate | aina-translator-ca-de |
 |----------------------|------------|------------------|---------------|
-| Flores 101 dev | 26,2 | **34,8** |
-| Flores 101 devtest |26,3 | **34,0** |
-| NTREX | 21,7 | **28,8** |
-| Average | 24,7 | **32,5** |
+| Flores 101 dev | 26,2 | **34,8** | 27,5 |
+| Flores 101 devtest |26,3 | **34,0** | 26,9 |
+| NTREX | 21,7 | **28,8** | 20,4 |
+| Average | 24,7 | **32,5** | 24,9 |
 
 ## Additional information
 
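The scores above use a comma as the decimal separator. Assuming they are corpus BLEU on detokenized output (the metric is not restated in this hunk), a comparable number could be computed with `sacrebleu`; the hypothesis and reference below are placeholders:

```python
import sacrebleu

hypotheses = ["Willkommen beim Aina-Projekt!"]    # system output (placeholder)
references = [["Willkommen beim Projekt Aina!"]]  # one reference stream, one ref per segment (placeholder)

# Corpus-level BLEU over all segments.
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")
```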
config.json
CHANGED

@@ -5,6 +5,5 @@
 "decoder_start_token": "</s>",
 "eos_token": "</s>",
 "layer_norm_epsilon": null,
-"multi_query_attention": false,
 "unk_token": "<unk>"
 }
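The only change here is dropping the `multi_query_attention` key. A trivial check of the shipped file against the new state (plain JSON inspection; the key names are taken from the diff above):

```python
import json

with open("config.json") as f:
    cfg = json.load(f)

# Special-token settings kept by this PR.
print(cfg["decoder_start_token"], cfg["eos_token"], cfg["unk_token"])
assert "multi_query_attention" not in cfg  # removed in this change
```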
model.bin
CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:e55f9e37a616e6d7cf7cc6111920e95133be662bbe4792cc6131f2df4ec25788
+size 1860745998
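model.bin is stored via Git LFS, so the diff swaps the pointer's object hash and size rather than the weights themselves. A small sketch to verify a local copy against the new pointer (assumes model.bin sits in the current directory):

```python
import hashlib

# sha256 taken from the new model.bin pointer above.
EXPECTED = "e55f9e37a616e6d7cf7cc6111920e95133be662bbe4792cc6131f2df4ec25788"

digest = hashlib.sha256()
with open("model.bin", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):  # stream in 1 MiB chunks
        digest.update(chunk)

assert digest.hexdigest() == EXPECTED, "model.bin does not match the LFS pointer"
print("model.bin OK")
```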
shared_vocabulary.json
CHANGED

The diff for this file is too large to render. See raw diff.
spm.model
CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:8999682b24246eb4bac8e43e528a47fe5555a7101710f04f4d3780804a703a77
+size 1182213
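Like model.bin, this is an LFS pointer swap. The new SentencePiece model can be sanity-checked directly, independently of pyonmttok; a minimal sketch assuming the `sentencepiece` package:

```python
import sentencepiece as spm

# Load the updated SentencePiece model and show the subword split of a sample sentence.
sp = spm.SentencePieceProcessor(model_file="spm.model")
print(sp.encode("Benvingut al projecte Aina!", out_type=str))
```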