MacBERTh

MacBERTh is a historical language model for English, developed as part of the MacBERTh project.

The architecture follows BERT base uncased, trained with the original BERT pre-training codebase. The training material comes from several sources, including:

  • EEBO
  • ECCO
  • COHA
  • CLMET3.1
  • EVANS
  • Hansard Corpus

for a total of approximately 3.9B tokens.
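Since the model follows the standard BERT base uncased architecture, it can be loaded with the Hugging Face transformers library. The sketch below is a minimal fill-mask example; the model id `emanjavacas/MacBERTh` is an assumption (check the repository name on the model page), and the predictions shown depend on the downloaded weights.

```python
# Minimal sketch: querying MacBERTh via a fill-mask pipeline.
# Assumption: the model is published under the id "emanjavacas/MacBERTh".
from transformers import pipeline

fill = pipeline("fill-mask", model="emanjavacas/MacBERTh")

# A historical-English prompt; [MASK] is BERT's standard mask token.
preds = fill("Thou [MASK] not steal.")
for p in preds:
    print(p["token_str"], round(p["score"], 3))
```

Each prediction is a dict with the filled-in token (`token_str`), its probability (`score`), and the completed sequence.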

Details and evaluation can be found in the accompanying publications.
