Update README.md
README.md
CHANGED
@@ -51,7 +51,8 @@ However, we are well aware that our models may be biased. We intend to conduct r
 
 ### Training data
 
-The model was trained on a combination of several datasets, including data collected from Opus, HPLT
+The model was trained on a combination of several datasets, including data collected from [Opus](https://opus.nlpl.eu/), [HPLT](https://hplt-project.org/),
+an internally created [CA-EN Parallel Corpus](https://huggingface.co/datasets/projecte-aina/CA-EN_Parallel_Corpus), and other sources.
 
 ### Training procedure
 
@@ -59,7 +60,8 @@ The model was trained on a combination of several datasets, including data colle
 
 All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
 This is done using sentence embeddings calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
-The filtered datasets are then concatenated to form a final corpus of 30.023.034 and before training the punctuation
+The filtered datasets are then concatenated to form a final corpus of 30.023.034 parallel sentences, and before training the punctuation
+is normalized using a modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py).
 
 #### Tokenization
 
@@ -100,11 +102,11 @@ The model was trained for a total of 16000 updates. Weights were saved every 100
 
 ### Variable and metrics
 
-We use the BLEU score for evaluation on test sets:
+We use the BLEU score for evaluation on the following test sets:
 [Spanish Constitution (TaCon)](https://elrc-share.eu/repository/browse/tacon-spanish-constitution-mt-test-set/84a96138b98611ec9c1a00155d02670628f3e6857b0f422abd82abc3795ec8c2/),
 [United Nations](https://zenodo.org/record/3888414#.Y33-_tLMIW0),
 [European Commission](https://elrc-share.eu/repository/browse/european-commission-corpus/8a419b1758ea11ed9c1a00155d0267069bd085cae124481589b0858e5b274327/),
-[Flores-
+[Flores-200](https://github.com/facebookresearch/flores),
 [Cybersecurity](https://elrc-share.eu/repository/browse/cyber-mt-test-set/2bd93faab98c11ec9c1a00155d026706b96a490ed3e140f0a29a80a08c46e91e/),
 [wmt19 biomedical test set](http://www.statmt.org/wmt19/biomedical-translation-task.html),
 [wmt13 news test set](https://elrc-share.eu/repository/browse/catalan-wmt2013-machine-translation-shared-task-test-set/84a96139b98611ec9c1a00155d0267061a0aa1b62e2248e89aab4952f3c230fc/).
@@ -120,8 +122,8 @@ Below are the evaluation results on the machine translation from English to Cata
 | Spanish Constitution | 32,6 | 37,8 | **41,2** |
 | United Nations | 39,0 | 40,5 | **41,2** |
 | European Commission | 49,1 | **52,0** | 51 |
-| Flores
-| Flores
+| Flores 200 dev | 41,0 | **45,1** | 43,3 |
+| Flores 200 devtest | 42,1 | **46,0** | 44,1 |
 | Cybersecurity | 42,5 | **48,1** | 45,8 |
 | wmt 19 biomedical | 21,7 | 25,5 | **26,7** |
 | wmt 13 news | 34,9 | **35,7** | 34,0 |
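The cosine-similarity filtering step described in the diff (drop sentence pairs below 0.75) can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: in practice the embeddings would come from LaBSE (e.g. via the sentence-transformers library), while here they are passed in as plain arrays so the filtering logic itself is visible; `filter_pairs` is a hypothetical helper name.

```python
import numpy as np

def filter_pairs(pairs, src_emb, tgt_emb, threshold=0.75):
    """Keep only sentence pairs whose source/target embeddings have
    cosine similarity >= threshold (0.75 in the model card).
    In the real pipeline the embeddings are LaBSE sentence embeddings;
    here they are arbitrary arrays of shape (n_pairs, dim)."""
    # Normalize each row to unit length so the dot product is the cosine.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = (src * tgt).sum(axis=1)  # row-wise cosine similarity
    return [pair for pair, sim in zip(pairs, sims) if sim >= threshold]
```

Deduplication would typically run before this step, e.g. by hashing each normalized pair and keeping the first occurrence.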
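The punctuation normalization mentioned in the diff is done with a modified join-single-file.py from SoftCatalà; as a rough sketch of the idea (the character mapping below is illustrative and not that script's actual rule set), a minimal normalizer might look like:

```python
# Illustrative punctuation normalization for a parallel corpus before
# training. The mapping is an assumption for demonstration, not the
# rules used by SoftCatalà's join-single-file.py.
PUNCT_MAP = {
    "\u201c": '"', "\u201d": '"',  # curly double quotes -> straight
    "\u2018": "'", "\u2019": "'",  # curly single quotes -> straight
    "\u00ab": '"', "\u00bb": '"',  # guillemets -> straight quotes
    "\u2026": "...",               # ellipsis -> three dots
    "\u2013": "-", "\u2014": "-",  # en/em dashes -> hyphen
}

def normalize_punctuation(line: str) -> str:
    for src, dst in PUNCT_MAP.items():
        line = line.replace(src, dst)
    # Collapse runs of whitespace left over after substitution.
    return " ".join(line.split())
```

Normalizing both sides of the corpus this way keeps the tokenizer's vocabulary from splitting over typographic variants of the same punctuation mark.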