---
license: apache-2.0
base_model:
- PleIAs/OCRonos-Vintage
library_name: transformers
language:
- es
pipeline_tag: text-generation
tags:
- OCR
- text-correction
- ocr-correction
- archives
- GPT2
- history
- SLM
- pre-train
- drama
---

**Filiberto 124M Instruct** is a small specialized model for OCR correction of Spanish Golden Age dramas, based on the [OCRonos-Vintage](https://hf.co/PleIAs/OCRonos-Vintage) model for OCR correction of cultural heritage archives.

Filiberto 124M Instruct has only 124 million parameters. It runs easily on CPU and can provide correction at scale on GPUs (>10k tokens/second).

## Training
The pre-training data consisted of a collection of individual verses and their corrections, taken from the TEXORO corpus and totalling ~5 million tokens.

Pre-training ran for 5 epochs with levanter (500 steps in total, each processing 1024 sequences of 512 tokens) on a TPUv4-32 and took 15 minutes.

Tokenization is currently done with the GPT-2 tokenizer.

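As a rough illustration of what reusing the English-centric GPT-2 vocabulary means for early-modern Spanish text, the sketch below (an addition to this card, assuming the tokenizer files shipped with this checkpoint) simply counts the BPE pieces produced for one verse from the example further down:

```python
from transformers import GPT2Tokenizer

# Load the GPT-2 tokenizer shipped with this checkpoint
tokenizer = GPT2Tokenizer.from_pretrained("bertin-project/filiberto-124M-instruct")

# Count the BPE pieces produced for a sample verse
verse = "y otras mil vezes los braços."
pieces = tokenizer.tokenize(verse)
print(len(pieces), pieces)  # the English-centric BPE splits many Spanish words into several pieces
```
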
## Example of OCR correction
Filiberto 124M Instruct has been pre-trained on an instruction dataset with a hard-coded structure: `### Text ###` introduces the OCRized text to be corrected and `### Correction ###` introduces the generated correction.

Filiberto 124M Instruct can be loaded like any other GPT-2-style model:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the pre-trained model and tokenizer
model_name = "bertin-project/filiberto-124M-instruct"
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# Set the device to GPU if available, otherwise use CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
```

Inference can then be run like this:

```python
# Function to generate a correction for a piece of OCRized text
def ocr_correction(prompt, max_new_tokens=600):
    # Wrap the input in the instruction format the model was trained on
    prompt = f"""### Text ###\n{prompt}\n\n\n### Correction ###\n"""
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)

    # Generate text
    output = model.generate(input_ids,
                            max_new_tokens=max_new_tokens,
                            pad_token_id=tokenizer.eos_token_id,
                            top_k=50)

    # Decode and return only the text after the correction marker
    return tokenizer.decode(output[0], skip_special_tokens=True).split("### Correction ###")[-1].strip()

# `prompt` holds the OCRized text to correct, e.g. the opening of the drama excerpt below
prompt = """Otra vez, Don Iuan, me dad,
y otras mil vezes los braços."""
ocr_result = ocr_correction(prompt)
print(ocr_result)
```

An example of an OCRized drama:

> Otra vez, Don Iuan, me dad,
> y otras mil vezes los braços.
> Otra, y otras mil sean lazos
> de nuestra antigua amistad.
> Como venis?
> Yo me siento
> tan alegre, tan vfano,
> tan venturoso, tan vano,
> que no podrà el pensamiento
> encareceros jamàs
> las venturas que posseo,
> porque el pensamiento creo

would yield this result:

> Otra vez, Don Iuan, me dad,
> y otras mil vezes los braços.
> Otra, y otras mil sean lazos
> de nuestra antigua amistad.
> Como venis?
> Yo me siento
> tan alegre, tan vfano,
> tan venturoso, tan vano,
> que no podrà el pensamiento
> encareceros jamàs
> las venturas que posseo,
> porque el pensamiento creo
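
For correction at scale on GPUs (the >10k tokens/second figure above), the same prompt format can be applied to batches of verses. The following is only a sketch, not part of the original card: the padding settings (reusing the EOS token as padding and left-padding the prompts) are assumptions needed because GPT-2 defines no padding token.

```python
# Sketch of batched correction; reuses `model`, `tokenizer`, and `device` from the loading example above.
tokenizer.pad_token = tokenizer.eos_token   # GPT-2 has no pad token, so reuse EOS (assumption)
tokenizer.padding_side = "left"             # left-pad so generation continues right after each prompt

def ocr_correction_batch(texts, max_new_tokens=600):
    # Wrap every input in the instruction format and pad to a common length
    prompts = [f"### Text ###\n{t}\n\n\n### Correction ###\n" for t in texts]
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(device)

    # Generate corrections for the whole batch in one call
    output = model.generate(**inputs,
                            max_new_tokens=max_new_tokens,
                            pad_token_id=tokenizer.eos_token_id,
                            top_k=50)

    # Decode and keep only the text after the correction marker for each item
    decoded = tokenizer.batch_decode(output, skip_special_tokens=True)
    return [d.split("### Correction ###")[-1].strip() for d in decoded]

print(ocr_correction_batch(["Otra vez, Don Iuan, me dad,", "y otras mil vezes los braços."]))
```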