---
license: apache-2.0
base_model: google/byt5-small
tags:
- generated_from_trainer
language: de
model-index:
- name: textplus-bbaw/transnormer-19c-beta-v02
  results:
  - task:
      name: Historic Text Normalization
      type: translation
    dataset:
      name: DTA reviEvalCorpus v1
      url: ybracke/dta-reviEvalCorpus-v1
      type: text
      split: test
    metrics:
    - name: Word Accuracy
      type: accuracy
      value: 0.98878
    - name: Word Accuracy (case insensitive)
      type: accuracy
      value: 0.99343
pipeline_tag: text2text-generation
library_name: transformers
datasets:
- textplus-bbaw/dta-reviEvalCorpus-v1
---



# Transnormer 19th century (beta v02)

This model can normalize historical German spellings from the 19th century. 

## Model description

`Transnormer` is a byte-level sequence-to-sequence model for normalizing historical German text. 
This model was trained on text from the 19th and late 18th centuries
by fine-tuning [google/byt5-small](https://huggingface.co/google/byt5-small) on the [DTA reviEvalCorpus](https://huggingface.co/datasets/ybracke/dta-reviEvalCorpus-v1), a modified version of the [DTA EvalCorpus](https://kaskade.dwds.de/~moocow/software/dtaec/) (see section [Training and evaluation data](#training-and-evaluation-data)).

## Uses

This model is intended for users who have digitized historical text and require normalization, 
that is, a version of the historical text that comes closer to modern spelling. 
Historical text typically contains spelling variants and extinct spellings that differ from contemporary text. 
This can be a drawback when working with historical text: 
historical variation can impair the performance of NLP tools (POS taggers, etc.) that were trained on contemporary language, 
and full-text search in historical texts can be tedious due to the many spelling variants.
Historical text normalization can mitigate these problems to some extent.  

Note that this model is intended for the normalization of *historical German text from a specific time period*. 
It is *not intended* for other types of text that may require normalization (e.g. computer-mediated communication), for languages other than German, or for other time periods. 
There may be other models available for these settings on the [Hub](https://huggingface.co/models).

This model can be further fine-tuned to be adapted or improved, as described in the [`transformers` tutorials](https://huggingface.co/docs/transformers/training).

### Demo Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("ybracke/transnormer-19c-beta-v02")
model = AutoModelForSeq2SeqLM.from_pretrained("ybracke/transnormer-19c-beta-v02")
sentence = "Die Königinn ſaß auf des Pallaſtes mittlerer Tribune."
inputs = tokenizer(sentence, return_tensors="pt")
outputs = model.generate(**inputs, num_beams=4, max_length=128)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
# >>> ['Die Königin saß auf des Palastes mittlerer Tribüne.']
```

Or use this model with the [pipeline API](https://huggingface.co/transformers/main_classes/pipelines.html) like this:

```python
from transformers import pipeline

transnormer = pipeline(model='ybracke/transnormer-19c-beta-v02')
sentence = "Die Königinn ſaß auf des Pallaſtes mittlerer Tribune."
print(transnormer(sentence, num_beams=4, max_length=128))
# >>> [{'generated_text': 'Die Königin saß auf des Palastes mittlerer Tribüne.'}]
```

### Recommendations

The model was trained using a maximum input length of 512 bytes (~70 words). 
Inference on longer sequences is possible, but more error-prone than on shorter sequences. 
Moreover, inference on shorter sequences is faster and less computationally expensive. 
Consider splitting long sequences into shorter segments and processing them separately, as in the sketch below. 
([Here](https://github.com/ybracke/transnormer/blob/main/demo/process-text-file.py) is an example implementation.)
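The following is a minimal sketch of this approach using the pipeline API shown above; the regex-based sentence split and the repeated example sentence are only illustrative stand-ins for a proper sentence segmenter and a real long input.

```python
import re

from transformers import pipeline

transnormer = pipeline(model="ybracke/transnormer-19c-beta-v02")

# Stand-in for a long historical passage (the example sentence repeated three times).
long_text = "Die Königinn ſaß auf des Pallaſtes mittlerer Tribune. " * 3

# Naive split at sentence boundaries; a proper sentence segmenter is preferable in practice.
segments = [s for s in re.split(r"(?<=[.!?])\s+", long_text) if s]

# Normalize each segment separately and join the results back together.
normalized = [transnormer(s, num_beams=4, max_length=128)[0]["generated_text"] for s in segments]
print(" ".join(normalized))
```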

The default generation configuration for this model limits the output length to 512 bytes. 
To increase or decrease it, use the `max_new_tokens` parameter for generation. 
For more details on how to customize generation, see the Hugging Face docs on [generation strategies](https://huggingface.co/docs/transformers/v4.45.1/en/generation_strategies).
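For example (a minimal sketch; the value 1024 is only an illustration):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("ybracke/transnormer-19c-beta-v02")
model = AutoModelForSeq2SeqLM.from_pretrained("ybracke/transnormer-19c-beta-v02")

inputs = tokenizer("Die Königinn ſaß auf des Pallaſtes mittlerer Tribune.", return_tensors="pt")
# Override the default output limit of 512 bytes; 1024 is an arbitrary example value.
outputs = model.generate(**inputs, num_beams=4, max_new_tokens=1024)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```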




## Training and evaluation data

The model was fine-tuned and evaluated on the [DTA reviEvalCorpus](https://huggingface.co/datasets/ybracke/dta-reviEvalCorpus-v1).
*DTA reviEvalCorpus* is a parallel corpus of German texts from the period between 1780 and 1899 that aligns sentences in historical spelling with their normalizations.
The training set contains 96 documents with 4.6M source tokens; the dev and test sets contain 13 documents (405K tokens) and 12 documents (381K tokens), respectively. 
For more information, see the [dataset card](https://huggingface.co/datasets/ybracke/dta-reviEvalCorpus-v1) of the corpus.
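To inspect the data, the corpus can be loaded with the `datasets` library, for example (a minimal sketch; the split and column names should be checked against the dataset card):

```python
from datasets import load_dataset

# Load the parallel corpus used for fine-tuning and evaluation.
dataset = load_dataset("ybracke/dta-reviEvalCorpus-v1")

print(dataset)              # available splits and their sizes
print(dataset["train"][0])  # one aligned pair of historical and normalized text
```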

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 10 (published model: 8 epochs)
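
For reference, a minimal sketch of how these settings map onto a `Seq2SeqTrainingArguments` configuration (this is not the authors' actual training script; `output_dir` is a hypothetical path):

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="transnormer-19c",   # hypothetical output path
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=10,
    predict_with_generate=True,     # evaluate with generation, as usual for seq2seq models
)
# The Adam betas and epsilon listed above are the transformers defaults.
```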

### Framework versions

- Transformers 4.31.0
- Pytorch 2.1.0+cu121
- Datasets 2.18.0
- Tokenizers 0.13.3

## Model Card Author

Yannic Bracke, Berlin-Brandenburg Academy of Sciences and Humanities

## Model Card Contact

`textplus (at) bbaw (dot) de`