---
license: other
language:
- en
- de
- fr
- fi
- sv
- nl
- nb
- nn
- 'no'
---

# hmTEAMS

[![🤗](https://github.com/stefan-it/hmTEAMS/raw/main/logo.jpeg "🤗")](https://github.com/stefan-it/hmTEAMS)

Historic Multilingual and Monolingual [TEAMS](https://aclanthology.org/2021.findings-acl.219/) Models.
The following languages are covered:

* English (British Library Corpus - Books)
* German (Europeana Newspaper)
* French (Europeana Newspaper)
* Finnish (Europeana Newspaper, Digilib)
* Swedish (Europeana Newspaper, Digilib)
* Dutch (Delpher Corpus)
* Norwegian (NCC Corpus)

# Architecture

We pretrain a "Training ELECTRA Augmented with Multi-word Selection"
([TEAMS](https://aclanthology.org/2021.findings-acl.219/)) model:

![hmTEAMS Overview](https://github.com/stefan-it/hmTEAMS/raw/main/hmteams_overview.svg)
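TEAMS keeps ELECTRA's generator/discriminator setup: the generator fills masked positions, and the discriminator predicts for every token whether it was replaced. As a minimal sketch, and assuming the released discriminator checkpoint is compatible with the stock ELECTRA classes in `transformers`, replaced token detection looks like this:

```python
import torch
from transformers import AutoTokenizer, ElectraForPreTraining

# Assumption: the gated hmTEAMS discriminator loads with the standard
# ELECTRA classes. Request access on the Hub and log in before running.
model_id = "hmteams/teams-base-historic-multilingual-discriminator"
tokenizer = AutoTokenizer.from_pretrained(model_id)
discriminator = ElectraForPreTraining.from_pretrained(model_id)

# A sentence where one word has been swapped, as the generator
# would do during pretraining ("eats" instead of "jumps").
fake = "The quick brown fox eats over the lazy dog"
inputs = tokenizer(fake, return_tensors="pt")
with torch.no_grad():
    logits = discriminator(**inputs).logits.squeeze(0)

# One logit per token: sigmoid > 0.5 means "this token was replaced".
for token, score in zip(tokenizer.convert_ids_to_tokens(inputs.input_ids[0]), logits):
    print(f"{token:>10s}  p(replaced)={torch.sigmoid(score):.2f}")
```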

# Results

We perform experiments on various historic NER datasets, such as HIPE-2022 and ICDAR-Europeana.
All details, including hyper-parameters, can be found [here](https://github.com/stefan-it/hmTEAMS/tree/main/bench).
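The exact training configuration lives in the linked repository. As an illustration only, fine-tuning a NER model on one of the HIPE-2022 datasets with the [Flair](https://github.com/flairNLP/flair) library could look like the following sketch; the hyper-parameters and output path are placeholders, not the settings behind the reported numbers:

```python
from flair.datasets import NER_HIPE_2022
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# English AjMC from HIPE-2022 (one of the benchmark datasets below)
corpus = NER_HIPE_2022(dataset_name="ajmc", language="en")
label_dict = corpus.make_label_dictionary(label_type="ner")

# hmTEAMS discriminator as the fine-tunable backbone
embeddings = TransformerWordEmbeddings(
    model="hmteams/teams-base-historic-multilingual-discriminator",
    layers="-1",
    subtoken_pooling="first",
    fine_tune=True,
)
tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=label_dict,
    tag_type="ner",
    use_crf=False,
    use_rnn=False,
    reproject_embeddings=False,
)

# Placeholder hyper-parameters; see the bench repository for the real ones.
trainer = ModelTrainer(tagger, corpus)
trainer.fine_tune(
    "resources/taggers/ajmc-en-hmteams",
    learning_rate=5e-5,
    mini_batch_size=16,
    max_epochs=10,
)
```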

## Small Benchmark

We test our pretrained language models on various datasets from HIPE-2020, HIPE-2022 and Europeana.
The following table gives an overview of the datasets used:

| Language | Dataset                                                                                          | Additional Dataset                                                               |
|----------|--------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------|
| English  | [AjMC](https://github.com/hipe-eval/HIPE-2022-data/blob/main/documentation/README-ajmc.md)       | -                                                                                |
| German   | [AjMC](https://github.com/hipe-eval/HIPE-2022-data/blob/main/documentation/README-ajmc.md)       | -                                                                                |
| French   | [AjMC](https://github.com/hipe-eval/HIPE-2022-data/blob/main/documentation/README-ajmc.md)       | [ICDAR-Europeana](https://github.com/stefan-it/historic-domain-adaptation-icdar) |
| Finnish  | [NewsEye](https://github.com/hipe-eval/HIPE-2022-data/blob/main/documentation/README-newseye.md) | -                                                                                |
| Swedish  | [NewsEye](https://github.com/hipe-eval/HIPE-2022-data/blob/main/documentation/README-newseye.md) | -                                                                                |
| Dutch    | [ICDAR-Europeana](https://github.com/stefan-it/historic-domain-adaptation-icdar)                 | -                                                                                |

## Benchmark Results

| Model                                                                                  | English AjMC | German AjMC  | French AjMC  | Finnish NewsEye | Swedish NewsEye | Dutch ICDAR  | French ICDAR | Avg.      |
|----------------------------------------------------------------------------------------|--------------|--------------|--------------|-----------------|-----------------|--------------|--------------|-----------|
| hmBERT (32k) [Schweter et al.](https://ceur-ws.org/Vol-3180/paper-87.pdf)              | 85.36 ± 0.94 | 89.08 ± 0.09 | 85.10 ± 0.60 | 77.28 ± 0.37    | 82.85 ± 0.83    | 82.11 ± 0.61 | 77.21 ± 0.16 | 82.71     |
| hmTEAMS (Ours)                                                                         | 86.41 ± 0.36 | 88.64 ± 0.42 | 85.41 ± 0.67 | 79.27 ± 1.88    | 82.78 ± 0.60    | 88.21 ± 0.39 | 78.03 ± 0.39 | **84.11** |

# Release

Our pretrained hmTEAMS models can be obtained from the Hugging Face Model Hub. Because of complicated
licensing issues (that still need to be resolved), the models are only available by requesting access on
the Model Hub:

* [hmTEAMS Discriminator](https://huggingface.co/hmteams/teams-base-historic-multilingual-discriminator)
* [hmTEAMS Generator (**this model**)](https://huggingface.co/hmteams/teams-base-historic-multilingual-generator)
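Once access has been granted and you are authenticated (for example via `huggingface-cli login`), the generator, a masked language model, can be smoke-tested with a fill-mask pipeline; the example sentence below is arbitrary:

```python
from transformers import pipeline

# Both repos are gated: request access on the Model Hub and log in first,
# otherwise from_pretrained will fail with an authorization error.
fill_mask = pipeline(
    "fill-mask",
    model="hmteams/teams-base-historic-multilingual-generator",
)
for prediction in fill_mask("In the year 1848 a [MASK] broke out in Paris."):
    print(prediction["token_str"], round(prediction["score"], 3))
```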

# Acknowledgements

We thank [Luisa März](https://github.com/LuisaMaerz), [Katharina Schmid](https://github.com/schmika) and
[Erion Çano](https://github.com/erionc) for their fruitful discussions about Historic Language Models.

Research supported with Cloud TPUs from Google's [TPU Research Cloud](https://sites.research.google/trc/about/) (TRC).
Many thanks for providing access to the TPUs ❤️