AudreyVM committed on
Commit dca0aec
1 Parent(s): 7f6c39c

setting backtranslation as default / update README

Files changed (1):
  1. README.md (+21 -11)
README.md CHANGED
@@ -13,8 +13,8 @@ library_name: fairseq
 
 ## Model description
 
- This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of Catalan-German datasets,
- which after filtering and cleaning comprised 6.258.272 sentence pairs. The model was evaluated on the Flores and NTREX evaluation datasets.
+ This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of Catalan-German datasets totalling 100.000.000 sentence pairs.
+ 6.258.272 sentence pairs were parallel data collected from the web, while the remaining 93.741.728 sentence pairs were synthetic parallel data created using the ES-CA translator from [PlanTL](https://huggingface.co/PlanTL-GOB-ES/mt-plantl-es-ca). The model was evaluated on the Flores and NTREX evaluation datasets.
 
 ## Intended uses and limitations
 
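The synthetic pairs added in this hunk come from pivoting: the Spanish side of Spanish-German corpora is machine-translated into Catalan, so each ES-DE pair yields a CA-DE pair. A minimal sketch of the idea, assuming a hypothetical `translate_es_to_ca` helper wrapping the PlanTL ES-CA model (the project's actual pipeline is not shown in this commit):

```python
# Sketch only: translate_es_to_ca is a hypothetical helper wrapping the
# PlanTL ES-CA translator; the real data pipeline is not part of this commit.
def make_synthetic_ca_de(es_de_pairs, translate_es_to_ca):
    """Turn (Spanish, German) pairs into (Catalan, German) pairs by pivoting."""
    synthetic = []
    for es, de in es_de_pairs:
        ca = translate_es_to_ca(es)  # pivot the Spanish side into Catalan
        synthetic.append((ca, de))   # keep the German side unchanged
    return synthetic
```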
 
@@ -36,7 +36,7 @@ import pyonmttok
 from huggingface_hub import snapshot_download
 model_dir = snapshot_download(repo_id="projecte-aina/aina-translator-ca-de", revision="main")
 
- tokenizer=pyonmttok.Tokenizer(mode="none", sp_model_path = model_dir + "/spm.model")
+ tokenizer=pyonmttok.Tokenizer(mode="none", sp_model_path = model_dir + "/spm.50k.model")
 tokenized=tokenizer.tokenize("Benvingut al projecte Aina!")
 
 translator = ctranslate2.Translator(model_dir)
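This hunk shows only part of the README's usage snippet. For context, a complete round trip with these libraries looks roughly like the sketch below; it uses standard CTranslate2 and pyonmttok calls (`translate_batch`, `detokenize`), though the README's full code may differ:

```python
import ctranslate2
import pyonmttok
from huggingface_hub import snapshot_download

model_dir = snapshot_download(repo_id="projecte-aina/aina-translator-ca-de", revision="main")

# Tokenize with the SentencePiece model shipped alongside the CTranslate2 model.
tokenizer = pyonmttok.Tokenizer(mode="none", sp_model_path=model_dir + "/spm.50k.model")
tokens, _ = tokenizer.tokenize("Benvingut al projecte Aina!")

# Translate the token sequence and detokenize the best hypothesis.
translator = ctranslate2.Translator(model_dir)
result = translator.translate_batch([tokens])
print(tokenizer.detokenize(result[0].hypotheses[0]))
```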
@@ -52,7 +52,7 @@ However, we are well aware that our models may be biased. We intend to conduct r
 
 ### Training data
 
- The model was trained on a combination of the following datasets:
+ The Catalan-German data collected from the web was a combination of the following datasets:
 
 | Dataset | Sentences | Sentences after Cleaning |
 |-------------------|----------------|-------------------|
@@ -71,8 +71,19 @@ The model was trained on a combination of the following datasets:
 | **Total** | **7.427.843** | **6.258.272** |
 
 All corpora except Europarl and Tilde were collected from [Opus](https://opus.nlpl.eu/).
- The Europarl and Tilde corpora are synthetic parallel corpora created from the original Spanish-Catalan corpora by [SoftCatalà](https://github.com/Softcatala).
+ The Europarl and Tilde corpora are synthetic parallel corpora created from the original Spanish-German corpora by [SoftCatalà](https://github.com/Softcatala).
 
+ The 93.741.728 sentence pairs of synthetic parallel data were created from the following Spanish-German datasets:
+
+ | Dataset | Sentences before cleaning |
+ |-------------------|----------------|
+ | globalvoices_es-de_20230901 | 70.097 |
+ | multiparacrawl_es-de_20230901 | 56.873.541 |
+ | dgt_es-de_20240129 | 4.899.734 |
+ | eubookshop_es-de_20240129 | 4.750.170 |
+ | nllb_es-de_20240129 | 112.444.838 |
+ | opensubtitles_es-de_20240129 | 18.951.214 |
+ | **Total** | **197.989.594** |
 
 ### Training procedure
 
@@ -80,8 +91,7 @@ The Europarl and Tilde corpora are synthetic parallel corpora created from the o
 
 All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
 This is done using sentence embeddings calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
- The filtered datasets are then concatenated to form a final corpus of 6.159.631 and before training the punctuation is normalized
- using a modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py)
+ The filtered datasets are then concatenated and, before training, the punctuation is normalized using a modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py).
 
 
 #### Tokenization
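As a rough illustration of the LaBSE similarity filtering described in this hunk, here is a minimal sketch using the sentence-transformers package; the project's actual cleaning scripts are not part of this commit:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

def filter_pairs(pairs, threshold=0.75):
    """Keep (ca, de) pairs whose LaBSE embeddings have cosine similarity >= threshold."""
    ca_emb = model.encode([ca for ca, _ in pairs], normalize_embeddings=True)
    de_emb = model.encode([de for _, de in pairs], normalize_embeddings=True)
    # With L2-normalized embeddings, cosine similarity reduces to a dot product.
    return [p for p, a, b in zip(pairs, ca_emb, de_emb) if float(a @ b) >= threshold]
```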
@@ -130,10 +140,10 @@ and [Google Translate](https://translate.google.es/?hl=es):
 
 | Test set | SoftCatalà | Google Translate | aina-translator-ca-de |
 |----------------------|------------|------------------|---------------|
- | Flores 101 dev | 26,2 | **34,8** | 27,5 |
- | Flores 101 devtest | 26,3 | **34,0** | 26,9 |
- | NTREX | 21,7 | **28,8** | 20,4 |
- | Average | 24,7 | **32,5** | 24,9 |
+ | Flores 101 dev | 26,2 | **34,8** | 34,1 |
+ | Flores 101 devtest | 26,3 | **34,0** | 33,3 |
+ | NTREX | 21,7 | **28,8** | 27,8 |
+ | Average | 24,7 | **32,5** | 31,7 |
 
 ## Additional information
 
 