carlosep93 committed
Commit 3d4bce7 · Parent(s): 3af3027
Update README.md

README.md CHANGED
@@ -26,7 +26,7 @@ license: cc-by-4.0

## Model description

-This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of Catalan-Spanish datasets, up to
+This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of Catalan-Spanish datasets, up to 92 million sentences. Additionally, the model is evaluated on several public datasets comprising 5 different domains (general, administrative, technology, biomedical, and news).

## Intended uses and limitations

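As context for the updated description, the sketch below shows one way a Fairseq translation checkpoint like this can be queried through the toolkit's Python hub interface. The directory layout, checkpoint file name, and SentencePiece settings are assumptions made for illustration; they are not taken from this model card.

```python
# Minimal sketch (not from the model card): load a Fairseq translation
# checkpoint via the Python hub interface and translate one sentence.
# Every path and file name below is an assumption for illustration.
from fairseq.models.transformer import TransformerModel

model = TransformerModel.from_pretrained(
    "checkpoints/",                        # assumed: directory with the trained model
    checkpoint_file="checkpoint_best.pt",  # assumed checkpoint name
    data_name_or_path="data-bin/",         # assumed: binarized data dir with the dictionaries
    bpe="sentencepiece",                   # assumed subword scheme
    sentencepiece_model="spm.model",       # assumed SentencePiece model path
)
model.eval()  # disable dropout for inference

print(model.translate("El dia és assolellat."))
```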
@@ -74,7 +74,6 @@ The model was trained on a combination of the following datasets:
| CCMatrix v1 | 56.103.820 | 1.064.182.320 |
| MultiCCAligned v1 | 2.433.418 | 48.294.144 |
| ParaCrawl | 15.327.808 | 334.199.408 |
-|-------------------|----------------|-------------------|
| **Total** | **92.578.683** | **1.875.910.305** |

### Training procedure
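The table's columns appear to be sentence pairs and token counts per corpus (the total of 92.578.683 matches the "92 million sentences" figure in the updated description). Purely as an illustration of how such counts can be produced when combining parallel corpora, here is a small sketch; the corpus file names and the .ca/.es file layout are assumptions, not part of the card.

```python
# Illustrative only: count sentence pairs and whitespace tokens for a set of
# parallel corpora before concatenating them for training. File names and the
# .ca/.es layout are assumptions, not taken from the model card.
from pathlib import Path

corpora = ["ccmatrix", "multiccaligned", "paracrawl"]  # assumed file stems

total_pairs, total_tokens = 0, 0
for name in corpora:
    src = Path(f"{name}.ca").read_text(encoding="utf-8").splitlines()
    tgt = Path(f"{name}.es").read_text(encoding="utf-8").splitlines()
    assert len(src) == len(tgt), f"{name}: source and target are not parallel"
    pairs = len(src)
    tokens = sum(len(line.split()) for line in src + tgt)
    print(f"{name}: {pairs:,} sentence pairs, {tokens:,} tokens")
    total_pairs += pairs
    total_tokens += tokens

print(f"Total: {total_pairs:,} sentence pairs, {total_tokens:,} tokens")
```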
@@ -122,7 +121,7 @@ The model was trained using shards of 10 million sentences, for a total of 13.00

### Variables and metrics

-We use the BLEU score for evaluation on test sets: [Flores-101](https://github.com/facebookresearch/flores), [
+We use the BLEU score for evaluation on test sets: [Flores-101](https://github.com/facebookresearch/flores), [TaCon](https://elrc-share.eu/repository/browse/tacon-spanish-constitution-mt-test-set/84a96138b98611ec9c1a00155d02670628f3e6857b0f422abd82abc3795ec8c2/), [United Nations](https://zenodo.org/record/3888414#.Y33-_tLMIW0), [Cybersecurity](https://elrc-share.eu/repository/browse/cyber-mt-test-set/2bd93faab98c11ec9c1a00155d026706b96a490ed3e140f0a29a80a08c46e91e/), [wmt19 biomedical test set](), [wmt13 news test set](https://elrc-share.eu/repository/browse/catalan-wmt2013-machine-translation-shared-task-test-set/84a96139b98611ec9c1a00155d0267061a0aa1b62e2248e89aab4952f3c230fc/), [aina aapp]()

### Evaluation results

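Since evaluation is corpus-level BLEU on the test sets listed in the added line, a minimal scoring sketch with sacreBLEU is shown below. The hypothesis and reference file names are invented for illustration; they are not files shipped with the model.

```python
# Minimal BLEU-scoring sketch with sacreBLEU. The hypothesis/reference file
# names are assumptions for illustration, not part of the model card.
import sacrebleu

with open("flores101.hyp", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("flores101.ref", encoding="utf-8") as f:
    references = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])  # single reference set
print(f"BLEU = {bleu.score:.1f}")
```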
@@ -138,16 +137,9 @@ Below are the evaluation results on the machine translation from Catalan to Chin
| Cybersecurity | 73,5 | **76,9** | 75,1 |
| wmt 19 biomedical | 60,0 | 62,7 | **63,0** |
| wmt 13 news | 22,7 | 23,1 | **23,4** |
-|----------------------|------------|------------------|---------------|
| Average | 52,5 | 56,6 | **56,7** |

-
-- [Author](#author)
-- [Licensing information](#licensing-information)
-- [Funding](#funding)
-- [Disclaimer](#disclaimer)
-
## Additional information

### Author