---
pipeline_tag: fill-mask
datasets:
- arch-be/brabant-xvii
language:
- nl
- fr
widget:
- text: >-
    by den ontfanger van de exploiten gecollacionneert tegens [MASK] brieven
    by my
  output:
  - label: onsen
    score: 0.326
  - label: onse
    score: 0.247
  - label: doriginale
    score: 0.17
  - label: synen
    score: 0.046
  - label: zynen
    score: 0.011
- text: |-
    [MASK] par la grace de dieu etc savoir faisons a tous presens et avenir
    nous avoir receu lumble supplication de jehannet austin
  output:
  - label: maximilian
    score: 0.958
  - label: philippe
    score: 0.017
  - label: phelippe
    score: 0.016
- text: >-
    [MASK] byder gracien gods roomsch keyser altyt vermeerder tsrycx coninck van
    germanien van castillien van leon van arragon van navarre van napels van secillien
    van maiorque van sardine vanden eylanden van indien vander vaster eerden ende zee
    occeane eertshertoge van oistenrycke hertoge van bourgoingnen van lothric van brabant
    van limborch van luxemborch etc.
  output:
  - label: philips
    score: 0.968
  - label: kaerle
    score: 0.027
  - label: maximiliaen
    score: 0.002
- text: |-
    Cornelia de Ghijs
    Joos de Medraige.
    Gedaen ende alzoo gepasseert inder stadt van Bruessele
    den tweesten dach der maendt van [MASK] int jaer
    duijsent vijffhondert tachtentich
  output:
  - label: julio
    score: 0.428
  - label: augusto
    score: 0.111
  - label: aprille
    score: 0.107
  - label: februario
    score: 0.053
  - label: januario
    score: 0.042
license: mit
---
# Brabant-XVII-From-Scratch

As the name suggests, this model was trained on the eponymous dataset, which comprises transcriptions of archival texts from the Council of Brabant (Raad van Brabant) over a period ranging from the 15th to the 17th century. These transcriptions were made by hand, mostly by volunteers. At the time of writing, the transcribed documents cover letters of pardon and letters of sentence.
## The model architecture

The `brabant-xvii-from-scratch` model itself is a plain BERT model trained from scratch.
### Tokenizer

The tokenizer was trained specifically for this model using the standard BERT tokenization scheme (WordPiece) with a vocabulary size limit of 30,000. All of the available text was provided during tokenizer training, so that the resulting vocabulary sticks as closely as possible to the one actually used in the complete corpus.
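Such a tokenizer can be trained in a few lines with the Hugging Face `tokenizers` library. The sketch below is illustrative, not the actual training script: the corpus file is a tiny stand-in for the full transcriptions, and the `lowercase`/`strip_accents` options mirror the preprocessing described in the next section.

```python
from pathlib import Path
from tokenizers import BertWordPieceTokenizer

# Tiny illustrative corpus stand-in; the real training fed in all of
# the arch-be/brabant-xvii transcriptions.
Path("corpus.txt").write_text(
    "by den ontfanger van de exploiten gecollacionneert tegens onsen brieven\n"
    "maximilian par la grace de dieu savoir faisons a tous presens\n",
    encoding="utf-8",
)

tokenizer = BertWordPieceTokenizer(
    lowercase=True,      # matches the preprocessing: lower-casing
    strip_accents=True,  # matches the preprocessing: accent removal
)
# vocab_size=30000 mirrors the limit described above; this toy corpus
# will of course yield far fewer tokens.
tokenizer.train(files=["corpus.txt"], vocab_size=30000)
print(tokenizer.get_vocab_size())
```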
### Preprocessing steps

The tokenizer of this model applies the following preprocessing steps:
- the text has been normalized to Unicode NFKC form
- all '(', ')', '[' and ']' characters have been removed. Annotators typically used these to mark the likely completion of a word that had been abbreviated in the handwritten text. (This decision was made by the historians in charge of the corpus, since the objective is eventually to make the corpus searchable.)
- sequences of line breaks have been collapsed to at most one
- sequences of blanks have been collapsed to a single space
- all text has been lower-cased
- all accents have been removed
- all control characters have been normalized to a plain space
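The steps above can be sketched as a single pure-Python function. This is a reconstruction from the list, not the project's actual code; in particular, it assumes the single newlines kept by the line-break step survive the control-character step.

```python
import re
import unicodedata

def preprocess(text: str) -> str:
    """Illustrative sketch of the preprocessing steps listed above."""
    # 1. Unicode NFKC normalization
    text = unicodedata.normalize("NFKC", text)
    # 2. Drop the annotators' completion markers
    text = re.sub(r"[()\[\]]", "", text)
    # 3. Collapse runs of line breaks to a single one
    text = re.sub(r"\n+", "\n", text)
    # 4. Collapse runs of blanks to a single space
    text = re.sub(r"[ \t]+", " ", text)
    # 5. Lower-case everything
    text = text.lower()
    # 6. Strip accents: decompose, then drop combining marks
    text = "".join(
        c for c in unicodedata.normalize("NFD", text)
        if not unicodedata.combining(c)
    )
    # 7. Replace remaining control characters with a plain space
    #    (assumption: the newlines kept in step 3 are preserved)
    text = "".join(
        " " if unicodedata.category(c) == "Cc" and c != "\n" else c
        for c in text
    )
    return text

# Two lines out: "gedaen ende" / "alzoo gepasse"
print(preprocess("Ge(daen)   ende\n\n\nALZOO  gepassé"))
```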
## Pretraining of the model

The model was then pretrained on both BERT objectives (masked language modeling, MLM, and next sentence prediction, NSP). To that end, it used the "next_sentence" configuration of the `arch-be/brabant-xvii` dataset, which was crafted specifically for this purpose. As is standard practice in BERT training, 15% of the tokens were masked at training time.
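The 15% masking can be sketched as follows. This assumes the standard BERT 80/10/10 recipe (of the selected tokens, 80% become `[MASK]`, 10% a random token, 10% stay unchanged); the token list and vocabulary are illustrative.

```python
import random

MASK = "[MASK]"
VOCAB = ["onsen", "onse", "brieven", "gracien", "coninck"]  # illustrative

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Standard BERT masking: select ~15% of positions for prediction;
    of those, 80% become [MASK], 10% a random token, 10% stay as-is.
    Returns (corrupted, labels): labels holds the original token at
    selected positions and None elsewhere."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must predict this token
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)
            elif roll < 0.9:
                corrupted.append(rng.choice(VOCAB))
            else:
                corrupted.append(tok)
        else:
            labels.append(None)
            corrupted.append(tok)
    return corrupted, labels

sentence = "by den ontfanger van de exploiten tegens onsen brieven".split()
corrupted, labels = mask_tokens(sentence, seed=42)
print(corrupted)
```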
## Related Work

This model was trained as an experiment to determine what works best on the target corpus (pardons and sentences). Three options were considered for that purpose:
- Pretraining a BERT model from scratch, which can leverage a tokenizer whose vocabulary emerges from the target corpus
- Fine-tuning all layers of a pretrained historical model (emanjavacas/GysBERT-v2) that mostly fits the target languages (though not fully, as brabant-xvii comprises text in both ancient Dutch and ancient French)
- Fine-tuning only the head of the same pretrained historical model (emanjavacas/GysBERT-v2)
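The difference between the last two options boils down to which parameters are allowed to update. The sketch below uses a tiny randomly initialized BERT as a stand-in for GysBERT-v2 (the real experiment would load the pretrained checkpoint, e.g. with `from_pretrained`); the config values are illustrative.

```python
from transformers import BertConfig, BertForMaskedLM

# Tiny randomly initialized model as a stand-in for GysBERT-v2;
# the real experiment loads the pretrained checkpoint instead.
config = BertConfig(vocab_size=128, hidden_size=32, num_hidden_layers=2,
                    num_attention_heads=2, intermediate_size=64)
model = BertForMaskedLM(config)

# Option 2: fine-tune all layers -- every parameter stays trainable.
# Option 3: fine-tune only the head -- freeze the encoder body:
for param in model.bert.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"head-only fine-tuning: {trainable}/{total} trainable parameters")
```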
### Important Note

The fine-tuning of the pretrained historical model (emanjavacas/GysBERT-v2) is fundamentally different from the pretraining of this foundation model. Indeed, this model was pretrained on both the MLM (masked language modeling) and the NSP (next sentence prediction) objectives, whereas the fine-tuning only retrains for the MLM objective (even when the weights of all layers are allowed to update). When pretraining on both MLM and NSP, the intuition is to let the model learn which pairs of lines could plausibly follow one another in the corpus (hence enriching its internal representation of document structure), in addition to teaching it to fill gaps in a text with the words it knows from its vocabulary.
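The NSP pairs the note refers to can be sketched as follows. This is a minimal reconstruction of the usual BERT recipe, not the actual script that built the "next_sentence" dataset configuration; the example lines are illustrative.

```python
import random

def make_nsp_pairs(lines, seed=0):
    """Build NSP training pairs from consecutive corpus lines: with
    probability 0.5 the second segment is the true next line (label 1),
    otherwise a randomly drawn line (label 0).  On a tiny corpus a
    random draw may coincide with the true next line."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(lines) - 1):
        if rng.random() < 0.5:
            pairs.append((lines[i], lines[i + 1], 1))       # true continuation
        else:
            pairs.append((lines[i], rng.choice(lines), 0))  # random segment
    return pairs

lines = [
    "gedaen ende alzoo gepasseert inder stadt van bruessele",
    "den tweesten dach der maendt van julio int jaer",
    "duijsent vijffhondert tachtentich",
    "by den ontfanger van de exploiten",
]
print(make_nsp_pairs(lines, seed=1))
```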
## First Observations

Note: the experiment is not complete yet, so no conclusive results can be provided so far.

Option 1 (training a model from scratch) is what gave rise to this very model. Option 2 (fine-tuning all layers of a pretrained historical model) is what gave rise to