Brabant-XVII-From-Scratch

As the name suggests, this model was trained on the eponymous dataset, which comprises transcriptions of archival texts from the Council of Brabant (Raad van Brabant) over a period ranging from the XV to the XVII century. These transcriptions were made by hand, mostly by volunteers. At the time of writing, the transcribed documents cover letters of pardon and letters of sentence.

The model architecture

The 'brabant-xvii-from-scratch' model itself was trained as a plain old BERT model.

Tokenizer

The tokenizer was trained specifically for this model using the standard BERT tokenization algorithm (WordPiece), with a vocabulary size limit of 30,000. All of the available text was provided during tokenizer training, so that the resulting vocabulary sticks as closely as possible to the vocabulary actually used in the complete corpus.
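To illustrate why a corpus-specific vocabulary matters, here is a minimal sketch of the greedy longest-match-first lookup that WordPiece applies when splitting a word. The two toy vocabularies and the sample word are hypothetical and were chosen for illustration only; they are not taken from the actual tokenizer of this model.

```python
def wordpiece(word, vocab, unk="[UNK]", prefix="##"):
    """Split one word into sub-tokens by greedy longest-match-first lookup."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabularies (assumption, for illustration only).
generic_vocab = {"ge", "##na", "##de", "##br", "##ief"}
corpus_vocab = generic_vocab | {"genade", "##brief"}

print(wordpiece("genadebrief", generic_vocab))
print(wordpiece("genadebrief", corpus_vocab))
```

With the generic vocabulary the word is cut into many short fragments, while the vocabulary that contains corpus-specific pieces yields fewer, more meaningful tokens.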

Preprocessing steps

The tokenizer of this model applies the following preprocessing steps:

text is normalized to Unicode NFKC form
all '(', ')', '[' and ']' are removed from the text. Annotators typically used them to mark the likely completion of a word that was abbreviated in the handwritten text. (That decision was made by the historian in charge of the corpus, since the eventual objective is to make the corpus searchable.)
sequences of line breaks are collapsed to keep at most one
sequences of blanks are collapsed to keep at most one
all text is lowercased
all accents are removed
all control characters are normalized to a plain space
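The steps above can be sketched with the Python standard library alone. This is an approximation and not the tokenizer's actual normalizer: the function name `preprocess` is hypothetical, the order of the steps is an assumption (control characters are handled early so that the collapsing steps see them as blanks), and line breaks are assumed to be exempt from the control-character rule.

```python
import re
import unicodedata


def preprocess(text: str) -> str:
    # Unicode NFKC normalization.
    text = unicodedata.normalize("NFKC", text)
    # Remove the brackets used by annotators to mark expanded abbreviations.
    text = re.sub(r"[()\[\]]", "", text)
    # Control characters (category Cc) become a plain space.
    # Assumption: line breaks are kept, and "\r\n" counts as one line break.
    text = text.replace("\r\n", "\n")
    text = "".join(
        " " if unicodedata.category(c) == "Cc" and c != "\n" else c for c in text
    )
    # Lowercase everything.
    text = text.lower()
    # Strip accents: decompose, drop combining marks (category Mn), recompose.
    text = unicodedata.normalize("NFD", text)
    text = "".join(c for c in text if unicodedata.category(c) != "Mn")
    text = unicodedata.normalize("NFC", text)
    # Keep at most one line break and at most one blank in a row.
    text = re.sub(r"\n+", "\n", text)
    text = re.sub(r"[^\S\n]+", " ", text)
    return text


# Hypothetical sample line, for illustration only.
print(preprocess("Le[ttre]  de\n\n\nR\u00e9mission\t(1601)"))
```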

Pretraining of the model

The model was then pretrained on both BERT objectives (MLM and NSP). To that end, it used the dataset 'arch-be/brabant-xvii' ("next_sentence"), which was crafted specifically for this purpose. As in the standard BERT training recipe, 15% of the tokens were masked at training time.
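As a rough illustration of the 15% masking, the sketch below selects 15% of the non-special tokens and replaces them with '[MASK]'. It is a simplification made under stated assumptions: the function name `mask_tokens` and its parameters are hypothetical, and every selected token is replaced by '[MASK]', whereas the original BERT recipe replaces only 80% of the selected tokens with '[MASK]', swaps 10% for a random token and leaves 10% unchanged. The source does not state which variant was used here.

```python
import random


def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=None):
    """Return (masked_tokens, labels); labels are None where nothing is predicted."""
    rng = random.Random(seed)
    special = {"[CLS]", "[SEP]", "[PAD]"}
    # Only ordinary tokens are candidates for masking.
    candidates = [i for i, tok in enumerate(tokens) if tok not in special]
    n_to_mask = max(1, round(len(candidates) * mask_rate)) if candidates else 0
    chosen = set(rng.sample(candidates, n_to_mask))
    masked = [mask_token if i in chosen else tok for i, tok in enumerate(tokens)]
    # The model is trained to recover the original token at each masked position.
    labels = [tok if i in chosen else None for i, tok in enumerate(tokens)]
    return masked, labels


# Hypothetical token sequence: 20 ordinary tokens between the special tokens.
tokens = ["[CLS]"] + [f"w{i}" for i in range(20)] + ["[SEP]"]
masked, labels = mask_tokens(tokens, seed=0)
print(masked)
```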

Related Work

This model was trained as an experiment to see what would work best on the target corpus (pardons and sentences). Three options were considered for that purpose:

Pretraining a BERT model from scratch that would be able to leverage a tokenizer based on a vocabulary emerging from the target corpus
Fine-tuning all layers of a pretrained historical model (emanjavacas/GysBERT-v2) that mostly fits the target languages (not fully though, as brabant-xvii comprises text in both ancient Dutch and ancient French)
Fine-tuning only the head of a pretrained historical model (emanjavacas/GysBERT-v2) that mostly fits the target languages.

Important Note

The fine-tuning of the pretrained historical model (emanjavacas/GysBERT-v2) is fundamentally different from the pretraining of this foundation model. This model was pretrained on both the MLM (masked language model) and the NSP (next sentence prediction) objectives, whereas the fine-tuning only retrains the MLM objective (even when updating the weights of all layers is allowed). When pretraining on both MLM and NSP, the intuition is to let the model learn which pairs of lines can follow one another in the corpus (hence enriching its internal representation of structure), in addition to teaching it to fill gaps in a text with words from its vocabulary.
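The NSP intuition described above can be sketched as follows. This is not how the 'next_sentence' dataset was actually built, which the text does not describe: the function name `make_nsp_pairs` is hypothetical, and the 50/50 split between true and random successors is an assumption borrowed from the original BERT recipe.

```python
import random


def make_nsp_pairs(lines, seed=None):
    """Build (line_a, line_b, is_next) triples from consecutive corpus lines."""
    if len(lines) < 3:
        raise ValueError("need at least 3 lines to draw a random non-successor")
    rng = random.Random(seed)
    pairs = []
    for i in range(len(lines) - 1):
        if rng.random() < 0.5:
            # Positive example: line_b really follows line_a in the corpus.
            pairs.append((lines[i], lines[i + 1], True))
        else:
            # Negative example: any line except line_a itself and its true successor.
            j = rng.choice([k for k in range(len(lines)) if k not in (i, i + 1)])
            pairs.append((lines[i], lines[j], False))
    return pairs


# Hypothetical placeholder lines, for illustration only.
lines = ["line one", "line two", "line three", "line four", "line five"]
for line_a, line_b, is_next in make_nsp_pairs(lines, seed=1):
    print(is_next, "|", line_a, "->", line_b)
```

The model then has to predict `is_next` for each pair, which is the signal that the MLM-only fine-tuning options do not provide.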

First Observations

Note: The experiment is not complete yet, hence no conclusive results can be provided so far.

Option 1 (training a model from scratch) is what gave rise to this very model. Option 2 (fine-tuning all layers of a pretrained historical model) is what gave rise to

xaviergillard/brabant-xvii-from-scratch