Update README.md
README.md CHANGED
@@ -12,8 +12,8 @@ library_name: fairseq
## Model description

This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of Catalan-Spanish datasets,
-up to 92 million sentences. Additionally, the model is evaluated on several public datasets comprising 5 different domains
-biomedical, and news).
+up to 92 million sentences before cleaning and filtering. Additionally, the model is evaluated on several public datasets comprising 5 different domains
+(general, administrative, technology, biomedical, and news).

## Intended uses and limitations

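For context on the model description above, here is a minimal usage sketch, assuming a Fairseq Transformer checkpoint packaged with its dictionaries and a SentencePiece model; every path and tokenization option shown is a placeholder rather than the card's actual configuration:

```python
# Illustrative sketch only: load a Fairseq Transformer checkpoint and translate one sentence.
# Paths, checkpoint name, and the bpe/sentencepiece options are placeholders; the real values
# depend on how this particular checkpoint and its vocabularies are packaged.
from fairseq.models.transformer import TransformerModel

model = TransformerModel.from_pretrained(
    "path/to/model_dir",               # placeholder: directory with the checkpoint and dict files
    checkpoint_file="checkpoint_best.pt",
    data_name_or_path=".",
    bpe="sentencepiece",               # assumption: SentencePiece subword units
    sentencepiece_model="path/to/spm.model",
)

# Direction assumed here to be Catalan -> Spanish.
print(model.translate("El dia és assolellat."))
```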
@@ -51,21 +51,8 @@ However, we are well aware that our models may be biased. We intend to conduct r

### Training data

-The was trained on a combination of
-
-| Dataset           | Sentences      | Tokens            |
-|-------------------|----------------|-------------------|
-| DOGC v2           | 8.472.786      | 188.929.206       |
-| El Periodico      | 6.483.106      | 145.591.906       |
-| EuroParl          | 1.876.669      | 49.212.670        |
-| WikiMatrix        | 1.421.077      | 34.902.039        |
-| Wikimedia         | 335.955        | 8.682.025         |
-| QED               | 71.867         | 1.079.705         |
-| TED2020 v1        | 52.177         | 836.882           |
-| CCMatrix v1       | 56.103.820     | 1.064.182.320     |
-| MultiCCAligned v1 | 2.433.418      | 48.294.144        |
-| ParaCrawl         | 15.327.808     | 334.199.408       |
-| **Total**         | **92.578.683** | **1.875.910.305** |
+The model was trained on a combination of several datasets, totalling around 92 million parallel sentences before filtering and cleaning.
+The training data includes corpora collected from [Opus](https://opus.nlpl.eu/), internally created parallel datasets, and corpora from other sources.

### Training procedure

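The per-corpus figures in the removed table are consistent with the "around 92 million parallel sentences" kept in the new prose; a quick arithmetic check:

```python
# Consistency check of the sentence counts from the table removed above:
# the per-corpus figures sum exactly to the reported total of 92,578,683 sentences.
sentences = {
    "DOGC v2": 8_472_786,
    "El Periodico": 6_483_106,
    "EuroParl": 1_876_669,
    "WikiMatrix": 1_421_077,
    "Wikimedia": 335_955,
    "QED": 71_867,
    "TED2020 v1": 52_177,
    "CCMatrix v1": 56_103_820,
    "MultiCCAligned v1": 2_433_418,
    "ParaCrawl": 15_327_808,
}
assert sum(sentences.values()) == 92_578_683
```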
@@ -75,7 +62,7 @@ The was trained on a combination of the following datasets:
cleaned using the clean-corpus-n.pl script from [moses](https://github.com/moses-smt/mosesdecoder), allowing sentences between 5 and 150 words.

Before training, the punctuation is normalized using a modified version of the join-single-file.py script
-from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py)
+from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py).


#### Tokenization
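The cleaning step referenced in this hunk is essentially a length filter. A rough Python equivalent of the 5-150 word rule is sketched below; the actual Moses script applies further checks (for example a maximum length ratio), and the file names here are placeholders:

```python
# Approximate re-implementation of the length filter described above: keep only
# parallel sentence pairs where both sides have between 5 and 150 whitespace tokens.
def keep_pair(src: str, tgt: str, min_words: int = 5, max_words: int = 150) -> bool:
    return all(min_words <= len(side.split()) <= max_words for side in (src, tgt))

# File names are placeholders for the Catalan/Spanish sides of the parallel corpus.
with open("train.ca", encoding="utf-8") as f_ca, \
     open("train.es", encoding="utf-8") as f_es, \
     open("clean.ca", "w", encoding="utf-8") as o_ca, \
     open("clean.es", "w", encoding="utf-8") as o_es:
    for ca, es in zip(f_ca, f_es):
        if keep_pair(ca, es):
            o_ca.write(ca)
            o_es.write(es)
```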
@@ -116,7 +103,8 @@ Weights were saved every 1000 updates and reported results are the average of th

### Variable and metrics

-We use the BLEU score for evaluation on test sets:
+We use the BLEU score for evaluation on test sets:
+[Flores-101](https://github.com/facebookresearch/flores),
[TaCon](https://elrc-share.eu/repository/browse/tacon-spanish-constitution-mt-test-set/84a96138b98611ec9c1a00155d02670628f3e6857b0f422abd82abc3795ec8c2/),
[United Nations](https://zenodo.org/record/3888414#.Y33-_tLMIW0),
[Cybersecurity](https://elrc-share.eu/repository/browse/cyber-mt-test-set/2bd93faab98c11ec9c1a00155d026706b96a490ed3e140f0a29a80a08c46e91e/),
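The card does not state which BLEU implementation is used; one common option is sacreBLEU, sketched below with placeholder file names:

```python
# Hypothetical evaluation sketch: corpus-level BLEU with sacreBLEU.
# The model card does not say which BLEU tool produced the reported scores.
import sacrebleu

with open("hypotheses.es", encoding="utf-8") as f:   # system translations (placeholder path)
    hyps = [line.rstrip("\n") for line in f]
with open("reference.es", encoding="utf-8") as f:    # reference translations (placeholder path)
    refs = [line.rstrip("\n") for line in f]

bleu = sacrebleu.corpus_bleu(hyps, [refs])
print(f"BLEU = {bleu.score:.2f}")
```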