Update README.md
README.md CHANGED
@@ -51,21 +51,8 @@ However, we are well aware that our models may be biased. We intend to conduct r
 ### Training data

-The model was trained on a combination of
-
-| Dataset | Sentences | Tokens |
-|-----------|-----------|-----------|
-| DOGC v2 | 8.472.786 | 188.929.206 |
-| El Periodico | 6.483.106 | 145.591.906 |
-| EuroParl | 1.876.669 | 49.212.670 |
-| WikiMatrix | 1.421.077 | 34.902.039 |
-| Wikimedia | 335.955 | 8.682.025 |
-| QED | 71.867 | 1.079.705 |
-| TED2020 v1 | 52.177 | 836.882 |
-| CCMatrix v1 | 56.103.820 | 1.064.182.320 |
-| MultiCCAligned v1 | 2.433.418 | 48.294.144 |
-| ParaCrawl | 15.327.808 | 334.199.408 |
-| **Total** | **92.578.683** | **1.875.910.305** |
+The model was trained on a combination of several datasets, totalling around 92 million parallel sentences before filtering and cleaning.
+The training data includes corpora collected from [Opus](https://opus.nlpl.eu/), internally created parallel datasets, and corpora from other sources.

 ### Training procedure
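For reference, the Total row of the removed table can be re-derived from the per-dataset counts, and it matches the "around 92 million parallel sentences" figure in the new prose. Below is a minimal sanity-check sketch (not part of the commit); the table's European-formatted figures (e.g. `8.472.786`) are rewritten as plain integers:

```python
# Hypothetical check: re-derive the "Total" row of the removed
# training-data table from its per-dataset (sentences, tokens) counts.
datasets = {
    "DOGC v2":           (8_472_786,    188_929_206),
    "El Periodico":      (6_483_106,    145_591_906),
    "EuroParl":          (1_876_669,     49_212_670),
    "WikiMatrix":        (1_421_077,     34_902_039),
    "Wikimedia":         (335_955,        8_682_025),
    "QED":               (71_867,         1_079_705),
    "TED2020 v1":        (52_177,           836_882),
    "CCMatrix v1":       (56_103_820, 1_064_182_320),
    "MultiCCAligned v1": (2_433_418,     48_294_144),
    "ParaCrawl":         (15_327_808,   334_199_408),
}

total_sentences = sum(s for s, _ in datasets.values())
total_tokens = sum(t for _, t in datasets.values())

assert total_sentences == 92_578_683        # matches the table's Total row
assert total_tokens == 1_875_910_305        # matches the table's Total row
print(f"{total_sentences:,} sentences, {total_tokens:,} tokens")
```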