Fairseq
Catalan
Spanish
fdelucaf committed on
Commit
5176ee4
1 Parent(s): f968ade

Update README.md

Files changed (1)
  1. README.md +7 -19
README.md CHANGED
@@ -12,8 +12,8 @@ library_name: fairseq
  ## Model description
 
  This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of Catalan-Spanish datasets,
- up to 92 million sentences. Additionally, the model is evaluated on several public datasets comprising 5 different domains (general, administrative, technology,
- biomedical, and news).
 
  ## Intended uses and limitations
 
@@ -51,21 +51,8 @@ However, we are well aware that our models may be biased. We intend to conduct r
 
  ### Training data
 
- The model was trained on a combination of the following datasets:
-
- | Dataset | Sentences | Tokens |
- |-------------------|----------------|-------------------|
- | DOGC v2 | 8.472.786 | 188.929.206 |
- | El Periodico | 6.483.106 | 145.591.906 |
- | EuroParl | 1.876.669 | 49.212.670 |
- | WikiMatrix | 1.421.077 | 34.902.039 |
- | Wikimedia | 335.955 | 8.682.025 |
- | QED | 71.867 | 1.079.705 |
- | TED2020 v1 | 52.177 | 836.882 |
- | CCMatrix v1 | 56.103.820 | 1.064.182.320 |
- | MultiCCAligned v1 | 2.433.418 | 48.294.144 |
- | ParaCrawl | 15.327.808 | 334.199.408 |
- | **Total** | **92.578.683** | **1.875.910.305** |
 
  ### Training procedure
 
@@ -75,7 +62,7 @@ The model was trained on a combination of the following datasets:
  cleaned using the clean-corpus-n.pl script from [moses](https://github.com/moses-smt/mosesdecoder), allowing sentences between 5 and 150 words.
 
  Before training, the punctuation is normalized using a modified version of the join-single-file.py script
- from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py)
 
 
  #### Tokenization
@@ -116,7 +103,8 @@ Weights were saved every 1000 updates and reported results are the average of th
 
  ### Variable and metrics
 
- We use the BLEU score for evaluation on test sets: [Flores-101](https://github.com/facebookresearch/flores),
  [TaCon](https://elrc-share.eu/repository/browse/tacon-spanish-constitution-mt-test-set/84a96138b98611ec9c1a00155d02670628f3e6857b0f422abd82abc3795ec8c2/),
  [United Nations](https://zenodo.org/record/3888414#.Y33-_tLMIW0),
  [Cybersecurity](https://elrc-share.eu/repository/browse/cyber-mt-test-set/2bd93faab98c11ec9c1a00155d026706b96a490ed3e140f0a29a80a08c46e91e/),
 
  ## Model description
 
  This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of Catalan-Spanish datasets,
+ up to 92 million sentences before cleaning and filtering. Additionally, the model is evaluated on several public datasets comprising 5 different domains
+ (general, administrative, technology, biomedical, and news).
 
  ## Intended uses and limitations
 
 
 
  ### Training data
 
+ The model was trained on a combination of several datasets, totalling around 92 million parallel sentences before filtering and cleaning.
+ The training data includes corpora collected from [Opus](https://opus.nlpl.eu/), internally created parallel datasets, and corpora from other sources.
 
  ### Training procedure
 
 
  cleaned using the clean-corpus-n.pl script from [moses](https://github.com/moses-smt/mosesdecoder), allowing sentences between 5 and 150 words.
 
  Before training, the punctuation is normalized using a modified version of the join-single-file.py script
+ from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py).
 
 
  #### Tokenization
 
 
  ### Variable and metrics
 
+ We use the BLEU score for evaluation on test sets:
+ [Flores-101](https://github.com/facebookresearch/flores),
  [TaCon](https://elrc-share.eu/repository/browse/tacon-spanish-constitution-mt-test-set/84a96138b98611ec9c1a00155d02670628f3e6857b0f422abd82abc3795ec8c2/),
  [United Nations](https://zenodo.org/record/3888414#.Y33-_tLMIW0),
  [Cybersecurity](https://elrc-share.eu/repository/browse/cyber-mt-test-set/2bd93faab98c11ec9c1a00155d026706b96a490ed3e140f0a29a80a08c46e91e/),