---
license: apache-2.0
base_model:
- PleIAs/OCRonos-Vintage
library_name: transformers
language:
- es
pipeline_tag: text-generation
tags:
- OCR
- text-correction
- ocr-correction
- archives
- GPT2
- history
- SLM
- pre-train
- drama
---

**Filiberto 124M Instruct** is a small model specialized in OCR correction of Spanish Golden Age dramas, based on [OCRonos-Vintage](https://hf.co/PleIAs/OCRonos-Vintage), a model for OCR correction of cultural heritage archives.

With only 124 million parameters, Filiberto 124M Instruct runs easily on CPU and can provide correction at scale on GPUs (>10k tokens/second).

## Training
The pre-training data included a collection of individual verses and their corrections taken from the TEXORO corpus, totalling ~5 million tokens.

Pre-training ran for 5 epochs with levanter (500 steps in total, each processing 1024 sequences of 512 tokens) on a TPUv4-32 for 15 minutes.

Tokenization is currently done with the GPT-2 tokenizer.

## Example of OCR correction
Filiberto 124M Instruct has been pre-trained on an instruction dataset with a hard-coded structure: `### Text ###` introduces the OCRized text to be corrected and `### Correction ###` introduces the generated correction.

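The template can be sketched with plain string handling; the `build_prompt` and `extract_correction` helpers below are illustrative (not part of the model's API), and the corrected verse is a hypothetical stand-in for real model output:

```python
def build_prompt(ocr_text: str) -> str:
    # Wrap the raw OCR line in the template the model was trained on
    return f"### Text ###\n{ocr_text}\n\n\n### Correction ###\n"

def extract_correction(generated: str) -> str:
    # Generation echoes the prompt, so keep only the part after the marker
    return generated.split("### Correction ###")[-1].strip()

prompt = build_prompt("Otra vez, Don Iuan, me dad,")
# A full generation is prompt + correction; extract just the correction
full_output = prompt + "Otra vez, Don Juan, me dad,"
print(extract_correction(full_output))  # → Otra vez, Don Juan, me dad,
```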
Filiberto 124M Instruct can be imported like any GPT-2-style model:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the pre-trained model and tokenizer
model_name = "bertin-project/filiberto-124M-instruct"
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# Set the device to GPU if available, otherwise use CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
```

And afterwards inference can be run like this:

```python
def ocr_correction(prompt, max_new_tokens=600):
    # Wrap the OCRized text in the instruction template the model was trained on
    prompt = f"""### Text ###\n{prompt}\n\n\n### Correction ###\n"""
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)

    # Generate the corrected text
    output = model.generate(input_ids,
                            max_new_tokens=max_new_tokens,
                            pad_token_id=tokenizer.eos_token_id,
                            top_k=50)

    # Decode and return only the generated correction
    return tokenizer.decode(output[0], skip_special_tokens=True).split("### Correction ###")[-1].strip()

prompt = "Como venis?"  # the raw OCRized text to correct
ocr_result = ocr_correction(prompt)
print(ocr_result)
```

An example of an OCRized drama:

> Otra vez, Don Iuan, me dad,
> y otras mil vezes los braços.
> Otra, y otras mil sean lazos
> de nuestra antigua amistad.
> Como venis?
> Yo me siento
> tan alegre, tan vfano,
> tan venturoso, tan vano,
> que no podrà el pensamiento
> encareceros jamàs
> las venturas que posseo,
> porque el pensamiento creo

would yield this result:

> Otra vez, Don Iuan, me dad,
> y otras mil vezes los braços.
> Otra, y otras mil sean lazos
> de nuestra antigua amistad.
> Como venis?
> Yo me siento
> tan alegre, tan vfano,
> tan venturoso, tan vano,
> que no podrà el pensamiento
> encareceros jamàs
> las venturas que posseo,
> porque el pensamiento creo
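
Since the model was trained on individual verses, a long OCRized drama can be split into verse-sized groups before correction at scale. A minimal sketch; the `batch_verses` helper is illustrative and not part of the model:

```python
def batch_verses(ocr_text: str, batch_size: int = 4):
    # One verse per line; drop empty lines left over by the OCR layer
    verses = [line.strip() for line in ocr_text.splitlines() if line.strip()]
    # Group verses so each group could be corrected in one generate() call
    return [verses[i:i + batch_size] for i in range(0, len(verses), batch_size)]

sample = """Otra vez, Don Iuan, me dad,
y otras mil vezes los braços.
Otra, y otras mil sean lazos
de nuestra antigua amistad."""

for batch in batch_verses(sample, batch_size=2):
    print(batch)
```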