---
license: mit
language:
- de
metrics:
- bleu
pipeline_tag: text2text-generation
tags:
- ByT5
- historical
- t5
- ocr-correction
---

Fine-tuned version of [hmByT5](https://huggingface.co/hmbyt5-preliminary/byt5-small-historic-multilingual-span20-flax) on the DE1, DE2, DE3, and DE7 parts of the [ICDAR2019-POCR](https://drive.google.com/file/d/1wOhmsoxOVQEPgHSX1QrYWKg5XAdYkzwi/view) dataset, trained to correct OCR mistakes in historical German text. During fine-tuning, `max_length` was set to 350.

## Performance

SacreBLEU on the evaluation data, scoring the raw OCR text ("eval dataset") and the model's corrections ("eval model") against the ground-truth transcriptions:

```
SacreBLEU eval dataset: 10.83
SacreBLEU eval model:   72.35
```

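Scores in this format can be reproduced with the [sacrebleu](https://github.com/mjpost/sacrebleu) package. The snippet below is a minimal sketch rather than the original evaluation script; the sentence pair is a hypothetical placeholder.

```python
import sacrebleu

# Hypothetical placeholder data: ground-truth transcriptions and the texts
# to score (either the raw OCR lines or the model's corrections).
references = [["Anpreisung. Hauptdepot für Wien: in der Stadt."]]
hypotheses = ["Anvpreiſungq. Haupidepot für Wien: in der Stadt."]

# corpus_bleu expects the hypotheses first, then a list of reference streams.
score = sacrebleu.corpus_bleu(hypotheses, references)
print(f"SacreBLEU: {score.score:.2f}")
```
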
## Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# OCR output with typical recognition errors in historical German print
example_sentence = "Anvpreiſungq. Haupidepot für Wien: In der Stadt, obere Bräunerſtraße Nr. 1137 in der Varfüͤmerie-Handlung zur"

tokenizer = AutoTokenizer.from_pretrained("Var3n/hmByT5_anno")
model = AutoModelForSeq2SeqLM.from_pretrained("Var3n/hmByT5_anno")

# ByT5 operates on bytes, so the corrected text is roughly as long as the
# input; cap generation accordingly.
input_ids = tokenizer(example_sentence, return_tensors="pt").input_ids
output = model.generate(input_ids, max_new_tokens=len(input_ids[0]), num_beams=4, do_sample=True)

text = tokenizer.decode(output[0], skip_special_tokens=True)
print(text)
```
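
Since the model was fine-tuned with `max_length` set to 350, longer OCR passages should be corrected in pieces. The helper below is a minimal sketch that reuses the `tokenizer` and `model` loaded above, assuming line-based splitting is acceptable for the input; `correct_page` and the 300-character cap are illustrative, not part of the model's API.

```python
# Minimal sketch: correct a multi-line OCR page line by line. The max_chars
# cap is a hypothetical safety margin below the training max_length of 350.
def correct_page(page: str, max_chars: int = 300) -> str:
    corrected = []
    for line in page.splitlines():
        if not line.strip():
            corrected.append(line)  # keep empty lines as-is
            continue
        # Truncate defensively; lines longer than max_chars would need
        # finer-grained splitting before correction.
        ids = tokenizer(line[:max_chars], return_tensors="pt").input_ids
        out = model.generate(ids, max_new_tokens=len(ids[0]), num_beams=4)
        corrected.append(tokenizer.decode(out[0], skip_special_tokens=True))
    return "\n".join(corrected)

print(correct_page(example_sentence))
```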