---
license: mit
widget:
- text: Universis presentes [MASK] inspecturis
- text: eandem [MASK] per omnia parati observare
- text: yo [MASK] rey de Galicia, de las Indias
- text: en avant contre les choses [MASK] contenues
datasets:
- cc100
- bigscience-historical-texts/Open_Medieval_French
- latinwikipedia
language:
- la
- fr
- es
---

## Model Details

This is a fine-tuned version of the multilingual BERT model on medieval texts. The model is intended to be used as a foundation for other ML tasks in NLP and HTR environments.
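
As a starting point, the sketch below shows how a masked-language model of this kind could be queried with the `transformers` fill-mask pipeline. This card does not state the model's Hub ID, so `MODEL_ID` is a placeholder pointing at the base multilingual BERT checkpoint (the cased variant is an assumption); swap in the fine-tuned model's ID. The helper names `top_tokens` and `predict_mask` are illustrative only.

```python
from typing import Dict, List

# Assumption: placeholder checkpoint (base multilingual BERT), NOT this model.
# Replace with the Hub ID of the fine-tuned medieval model.
MODEL_ID = "bert-base-multilingual-cased"


def top_tokens(predictions: List[Dict], k: int = 3) -> List[str]:
    """Return the k highest-scoring candidate tokens from fill-mask output."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [p["token_str"] for p in ranked[:k]]


def predict_mask(text: str, model_id: str = MODEL_ID, k: int = 3) -> List[str]:
    """Fill the single [MASK] in `text` and return the top-k candidate tokens."""
    # Imported lazily so top_tokens() works without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=model_id)
    # For a single [MASK], the pipeline returns a list of dicts
    # with "score", "token", "token_str" and "sequence" keys.
    return top_tokens(fill_mask(text, top_k=k), k)
```

Calling `predict_mask("Universis presentes [MASK] inspecturis")` downloads the checkpoint on first use and returns the candidate tokens for the masked position.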

The training dataset comprises 650M tokens drawn from texts in Classical and Medieval Latin, Old French, and Old Spanish, spanning the 5th century BC to the 16th century.

Several large corpora were cleaned and transformed for use in training:

| Dataset | Size | Language | Dates |
| ------------- |:-------------:| -----:|-----:|
| CC100 [1] | 3.2 GB | la | 5th BC - 18th |
| Corpus Corporum [2] | 3.0 GB | la | 5th BC - 16th |
| CEMA [3] | 320 MB | la+fro | 9th - 15th |
| HOME-Alcar [4] | 38 MB | la+fro | 12th - 15th |
| BFM [5] | 34 MB | fro | 13th - 15th |
| AND [6] | 19 MB | fro | 13th - 15th |
| CODEA [7] | 13 MB | spa | 12th - 16th |
| Total | ~6.5 GB | | |
| After cleaning | 650M tokens (4.5 GB)* | | |

* A significant amount of overlapping text was detected across the corpora, especially in the medieval collections. In addition, synthetic filler text ("Lorem ipsum dolor...") was iteratively removed.
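
The cleaning step described above can be sketched as a single pass that drops filler text and repeated lines. This is only an illustration, not the pipeline used to build the training set: it assumes line-level exact matching after case and whitespace normalization, and it assumes filler text is recognisable by the phrase "lorem ipsum". The names `normalize` and `clean_corpus` are hypothetical.

```python
import hashlib
import re
from typing import Iterable, Iterator

# Assumption: filler text can be spotted by the phrase "lorem ipsum".
LOREM = re.compile(r"lorem\s+ipsum", re.IGNORECASE)


def normalize(line: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(line.lower().split())


def clean_corpus(lines: Iterable[str]) -> Iterator[str]:
    """Yield lines with empty lines, filler text and duplicates removed."""
    seen = set()
    for line in lines:
        key_text = normalize(line)
        if not key_text or LOREM.search(key_text):
            continue  # skip empty lines and synthetic filler
        # Store a fixed-size hash rather than the full line to limit memory.
        digest = hashlib.sha1(key_text.encode("utf-8")).digest()
        if digest in seen:
            continue  # same text already seen, possibly in another corpus
        seen.add(digest)
        yield line.strip()
```

Feeding the lines of all corpora through one `clean_corpus` call keeps a shared `seen` set, so text repeated across collections is kept only once.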

[1] CC-NET Repository: https://huggingface.co/datasets/cc100

[2] Repositorium operum Latinorum apud universitatem Turicensem: https://mlat.uzh.ch/

[3] Cartae Europae Medii Aevi (5th-15th c.): https://cema.lamop.fr/

[4] History of Medieval Europe: https://doi.org/10.5281/zenodo.5600884

[5] Base de Français Médiéval: https://txm-bfm.huma-num.fr/txm/

[6] Anglo-Norman Dictionary: https://anglo-norman.net/

[7] Corpus de Documentos Españoles anteriores a 1900: https://www.corpuscodea.es/