set back-translation as main #1
by AudreyVM · opened

Files changed:
- README.md +14 -21
- config.json +0 -1
- model.bin +2 -2
- shared_vocabulary.json +0 -0
- spm.model +2 -2
README.md
CHANGED

@@ -13,8 +13,8 @@ library_name: fairseq
 
 ## Model description
 
-This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of Catalan-German datasets,
-
+This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of Catalan-German datasets,
+which after filtering and cleaning comprised 6.258.272 sentence pairs. The model was evaluated on the Flores and NTREX evaluation datasets.
 
 ## Intended uses and limitations
 

@@ -36,7 +36,7 @@ import pyonmttok
 from huggingface_hub import snapshot_download
 model_dir = snapshot_download(repo_id="projecte-aina/aina-translator-ca-de", revision="main")
 
-tokenizer=pyonmttok.Tokenizer(mode="none", sp_model_path = model_dir + "/spm.
+tokenizer=pyonmttok.Tokenizer(mode="none", sp_model_path = model_dir + "/spm.model")
 tokenized=tokenizer.tokenize("Benvingut al projecte Aina!")
 
 translator = ctranslate2.Translator(model_dir)
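The hunk above touches only the fragment of the README's usage snippet that changed. For reference, here is a minimal end-to-end sketch of the full pipeline, assuming `ctranslate2`, `pyonmttok`, and `huggingface_hub` are installed; the `translate_batch`/`detokenize` tail follows the usual CTranslate2 pattern rather than anything shown in this diff:

```python
import ctranslate2
import pyonmttok
from huggingface_hub import snapshot_download

# Fetch the CTranslate2 model, SentencePiece model, and shared vocabulary.
model_dir = snapshot_download(repo_id="projecte-aina/aina-translator-ca-de", revision="main")

# Tokenize with the repo's SentencePiece model; tokenize() returns (tokens, features).
tokenizer = pyonmttok.Tokenizer(mode="none", sp_model_path=model_dir + "/spm.model")
tokens, _ = tokenizer.tokenize("Benvingut al projecte Aina!")

# Translate the tokenized batch and detokenize the best hypothesis.
translator = ctranslate2.Translator(model_dir)
results = translator.translate_batch([tokens])
print(tokenizer.detokenize(results[0].hypotheses[0]))
```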
@@ -52,7 +52,7 @@ However, we are well aware that our models may be biased. We intend to conduct r
 
 ### Training data
 
-The Catalan-German data collected from the web was a combination of the following datasets:
+The model was trained on a combination of the following datasets:
 
 | Dataset | Sentences | Sentences after Cleaning|
 |-------------------|----------------|-------------------|
@@ -60,27 +60,19 @@ The Catalan-German data collected from the web was a combination of the following datasets:
 | WikiMatrix | 180.322 | 125.811 |
 | GNOME | 12.333| 1.241|
 | KDE4 | 165.439 | 105.098 |
+| QED | 63.041 | 49.181 |
+| TED2020 v1 | 46.680 | 38.428 |
 | OpenSubtitles | 303.329 | 171.376 |
 | GlobalVoices| 4.636 | 3.578|
 | Tatoeba | 732 | 655 |
 | Books | 4.445 | 2049 |
 | Europarl | 1.734.643 | 1.734.643 |
 | Tilde | 3.434.091 | 3.434.091 |
+| **Total** | **7.427.843** | **6.258.272** |
 
 All corpora except Europarl and Tilde were collected from [Opus](https://opus.nlpl.eu/).
-The Europarl and Tilde corpora are synthetic parallel corpora created from the original Spanish-
+The Europarl and Tilde corpora are synthetic parallel corpora created from the original Spanish-Catalan corpora by [SoftCatalà](https://github.com/Softcatala).
 
-The 93.741.728 sentence pairs of synthetic parallel data were created from the following Spanish-German datasets:
-
-| Dataset | Sentences before cleaning |
-|-------------------|----------------|
-|globalvoices_es-de_20230901 | 70.097 |
-|multiparacrawl_es-de_20230901 | 56.873.541 |
-|dgt_es-de_20240129 | 4.899.734 |
-|eubookshop_es-de_20240129 | 4.750.170 |
-|nllb_es-de_20240129 | 112.444.838 |
-|opensubtitles_es-de_20240129 | 18.951.214 |
-| **Total** | **197.989.594** |
 
 ### Training procedure
@@ -88,7 +80,8 @@ The 93.741.728 sentence pairs of synthetic parallel data were created from the following Spanish-German datasets:
 
 All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
 This is done using sentence embeddings calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
-The filtered datasets are then concatenated
+The filtered datasets are then concatenated to form a final corpus of 6.159.631 and before training the punctuation is normalized
+using a modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py)
 
 
 #### Tokenization
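The cleaning step in this hunk is described only in prose. Below is a minimal sketch of a LaBSE cosine-similarity filter with the 0.75 threshold, assuming the `sentence-transformers` package; the sample pair is hypothetical, and the rest of the pipeline (deduplication, the linked join-single-file.py normalization) is not reproduced here:

```python
from sentence_transformers import SentenceTransformer

# LaBSE produces language-agnostic sentence embeddings; with normalized
# vectors, cosine similarity reduces to a dot product.
model = SentenceTransformer("sentence-transformers/LaBSE")

pairs = [("Benvingut al projecte Aina!", "Willkommen beim Aina-Projekt!")]  # hypothetical pair
ca_emb = model.encode([ca for ca, _ in pairs], normalize_embeddings=True)
de_emb = model.encode([de for _, de in pairs], normalize_embeddings=True)

# Keep only pairs whose embeddings have cosine similarity >= 0.75.
sims = (ca_emb * de_emb).sum(axis=1)
kept = [pair for pair, sim in zip(pairs, sims) if sim >= 0.75]
print(kept)
```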
@@ -137,10 +130,10 @@ and [Google Translate](https://translate.google.es/?hl=es):
 
 | Test set | SoftCatalà | Google Translate | aina-translator-ca-de |
 |----------------------|------------|------------------|---------------|
-| Flores 101 dev | 26,2 | **34,8** |
-| Flores 101 devtest |26,3 | **34,0** |
-| NTREX | 21,7 | **28,8** |
-| Average | 24,7 | **32,5** |
+| Flores 101 dev | 26,2 | **34,8** | 27,5 |
+| Flores 101 devtest |26,3 | **34,0** | 26,9 |
+| NTREX | 21,7 | **28,8** | 20,4 |
+| Average | 24,7 | **32,5** | 24,9 |
 
 ## Additional information
 
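The scores above use a comma as the decimal separator. Assuming they are corpus BLEU on detokenized output (the metric is not restated in this hunk), a comparable number could be computed with `sacrebleu`; the hypothesis and reference below are placeholders:

```python
import sacrebleu

hypotheses = ["Willkommen beim Aina-Projekt!"]    # system output (placeholder)
references = [["Willkommen beim Projekt Aina!"]]  # one reference stream, one ref per segment (placeholder)

# Corpus-level BLEU over all segments.
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")
```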
config.json
CHANGED

@@ -5,6 +5,5 @@
 "decoder_start_token": "</s>",
 "eos_token": "</s>",
 "layer_norm_epsilon": null,
-"multi_query_attention": false,
 "unk_token": "<unk>"
 }
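The only change here is dropping the `multi_query_attention` key. A trivial check of the shipped file against the new state (plain JSON inspection; the key names are taken from the diff above):

```python
import json

with open("config.json") as f:
    cfg = json.load(f)

# Special-token settings kept by this PR.
print(cfg["decoder_start_token"], cfg["eos_token"], cfg["unk_token"])
assert "multi_query_attention" not in cfg  # removed in this change
```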
model.bin
CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:e55f9e37a616e6d7cf7cc6111920e95133be662bbe4792cc6131f2df4ec25788
+size 1860745998
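model.bin is stored via Git LFS, so the diff swaps the pointer's object hash and size rather than the weights themselves. A small sketch to verify a local copy against the new pointer (assumes model.bin sits in the current directory):

```python
import hashlib

# sha256 taken from the new model.bin pointer above.
EXPECTED = "e55f9e37a616e6d7cf7cc6111920e95133be662bbe4792cc6131f2df4ec25788"

digest = hashlib.sha256()
with open("model.bin", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):  # stream in 1 MiB chunks
        digest.update(chunk)

assert digest.hexdigest() == EXPECTED, "model.bin does not match the LFS pointer"
print("model.bin OK")
```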
shared_vocabulary.json
CHANGED

The diff for this file is too large to render. See raw diff.
spm.model
CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:8999682b24246eb4bac8e43e528a47fe5555a7101710f04f4d3780804a703a77
+size 1182213
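Like model.bin, this is an LFS pointer swap. The new SentencePiece model can be sanity-checked directly, independently of pyonmttok; a minimal sketch assuming the `sentencepiece` package:

```python
import sentencepiece as spm

# Load the updated SentencePiece model and show the subword split of a sample sentence.
sp = spm.SentencePieceProcessor(model_file="spm.model")
print(sp.encode("Benvingut al projecte Aina!", out_type=str))
```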