Fairseq
Spanish
Catalan
fdelucaf committed on
Commit
f5d15d3
1 Parent(s): 61c4963

Update README.md

Files changed (1)
  1. README.md +2 -15
README.md CHANGED
@@ -51,21 +51,8 @@ However, we are well aware that our models may be biased. We intend to conduct r
 
 ### Training data
 
-The model was trained on a combination of the following datasets:
-
-| Dataset | Sentences | Tokens |
-|-----------|-----------|-----------|
-| DOGC v2 | 8.472.786 | 188.929.206 |
-| El Periodico | 6.483.106 | 145.591.906 |
-| EuroParl | 1.876.669 | 49.212.670 |
-| WikiMatrix | 1.421.077 | 34.902.039 |
-| Wikimedia | 335.955 | 8.682.025 |
-| QED | 71.867 | 1.079.705 |
-| TED2020 v1 | 52.177 | 836.882 |
-| CCMatrix v1 | 56.103.820 | 1.064.182.320 |
-| MultiCCAligned v1 | 2.433.418 | 48.294.144 |
-| ParaCrawl | 15.327.808 | 334.199.408 |
-| **Total** | **92.578.683** | **1.875.910.305** |
+The model was trained on a combination of several datasets, totalling around 92 million parallel sentences before filtering and cleaning.
+The training data includes corpora collected from [Opus](https://opus.nlpl.eu/), internally created parallel datasets, and corpora from other sources.
 
 ### Training procedure
 
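The "around 92 million parallel sentences" in the new summary matches the totals row of the removed table, which can be checked by summing the per-dataset counts. A minimal sketch (counts copied from the removed table; the European-style "." thousands separators are written as Python underscore separators):

```python
# Sentence and token counts for each corpus, as listed in the removed
# README table (thousands separators converted to underscores).
datasets = {
    "DOGC v2":           (8_472_786,    188_929_206),
    "El Periodico":      (6_483_106,    145_591_906),
    "EuroParl":          (1_876_669,     49_212_670),
    "WikiMatrix":        (1_421_077,     34_902_039),
    "Wikimedia":         (335_955,        8_682_025),
    "QED":               (71_867,         1_079_705),
    "TED2020 v1":        (52_177,           836_882),
    "CCMatrix v1":       (56_103_820, 1_064_182_320),
    "MultiCCAligned v1": (2_433_418,     48_294_144),
    "ParaCrawl":         (15_327_808,   334_199_408),
}

# Sum the two columns independently and compare against the Total row.
total_sentences = sum(s for s, _ in datasets.values())
total_tokens = sum(t for _, t in datasets.values())

print(total_sentences)  # 92578683  (table Total: 92.578.683)
print(total_tokens)     # 1875910305 (table Total: 1.875.910.305)
```

Both column sums reproduce the removed table's **Total** row exactly, so the "+2 -15" change loses the per-corpus breakdown but keeps the aggregate figure accurate.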