Update README.md
README.md CHANGED
@@ -51,21 +51,8 @@ However, we are well aware that our models may be biased. We intend to conduct r
 ### Training data

-The model was trained on a combination of
-
-| Dataset | Sentences | Tokens |
-|-----------|-----------|-----------|
-| DOGC v2 | 8.472.786 | 188.929.206 |
-| El Periodico | 6.483.106 | 145.591.906 |
-| EuroParl | 1.876.669 | 49.212.670 |
-| WikiMatrix | 1.421.077 | 34.902.039 |
-| Wikimedia | 335.955 | 8.682.025 |
-| QED | 71.867 | 1.079.705 |
-| TED2020 v1 | 52.177 | 836.882 |
-| CCMatrix v1 | 56.103.820 | 1.064.182.320 |
-| MultiCCAligned v1 | 2.433.418 | 48.294.144 |
-| ParaCrawl | 15.327.808 | 334.199.408 |
-| **Total** | **92.578.683** | **1.875.910.305** |
+The model was trained on a combination of several datasets, totalling around 92 million parallel sentences before filtering and cleaning.
+The training data includes corpora collected from [Opus](https://opus.nlpl.eu/), internally created parallel datasets, and corpora from other sources.

 ### Training procedure
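For reference, the Total row of the removed table can be re-derived from the per-dataset counts, and it matches the "around 92 million parallel sentences" figure in the new prose. Below is a minimal sanity-check sketch (not part of the commit); the table's European-formatted figures (e.g. `8.472.786`) are rewritten as plain integers:

```python
# Hypothetical check: re-derive the "Total" row of the removed
# training-data table from its per-dataset (sentences, tokens) counts.
datasets = {
    "DOGC v2":           (8_472_786,    188_929_206),
    "El Periodico":      (6_483_106,    145_591_906),
    "EuroParl":          (1_876_669,     49_212_670),
    "WikiMatrix":        (1_421_077,     34_902_039),
    "Wikimedia":         (335_955,        8_682_025),
    "QED":               (71_867,         1_079_705),
    "TED2020 v1":        (52_177,           836_882),
    "CCMatrix v1":       (56_103_820, 1_064_182_320),
    "MultiCCAligned v1": (2_433_418,     48_294_144),
    "ParaCrawl":         (15_327_808,   334_199_408),
}

total_sentences = sum(s for s, _ in datasets.values())
total_tokens = sum(t for _, t in datasets.values())

assert total_sentences == 92_578_683        # matches the table's Total row
assert total_tokens == 1_875_910_305        # matches the table's Total row
print(f"{total_sentences:,} sentences, {total_tokens:,} tokens")
```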