magistermilitum
committed on
Update README.md
README.md
CHANGED
@@ -25,13 +25,22 @@ Several big corpora were cleaned and transformed to be used during the process t

 | dataset | size | Lang | dates |
 | ------------- |:-------------:| -----:|-----:|
-| CC100
-| Corpus Corporum
-| CEMA | 320Mb | la+fro | 9th - 15th |
-| HOME | 38Mb | la+fro | 12th - 15th |
-| BFM | 34Mb | fro | 13th - 15th |
-| AND | 19Mb | fro | 13th - 15th |
-| CODEA | 13Mb | spa | 12th - 16th |
+| CC100 [1] | 3,2Gb | la | 5th BC - 18th |
+| Corpus Corporum [2] | 3,0Gb | la | 5th BC - 16th |
+| CEMA [3] | 320Mb | la+fro | 9th - 15th |
+| HOME-Alcar [4] | 38Mb | la+fro | 12th - 15th |
+| BFM [5] | 34Mb | fro | 13th - 15th |
+| AND [6] | 19Mb | fro | 13th - 15th |
+| CODEA [7] | 13Mb | spa | 12th - 16th |
 | | ~6,5Gb | |
-| | 650M
+| | 650M tokens (4,5Gb) | | |
+
+[1] CC-NET Repository: https://huggingface.co/datasets/cc100
+[2] Repositorium operum latinorum apud universitatem Turicensem: https://mlat.uzh.ch/
+[3] Cartae Europae Medii Aevi (5th-15th c.): https://cema.lamop.fr/
+[4] History of Medieval Europe: https://doi.org/10.5281/zenodo.5600884
+[5] Base du Français Médiéval: https://txm-bfm.huma-num.fr/txm/
+[6] Anglo-Norman Dictionary: https://anglo-norman.net/
+[7] Corpus de Documentos Españoles anteriores a 1900: https://www.corpuscodea.es/
+
