Filiberto 124M Instruct is a small specialized model for OCR correction of Spanish Golden Age Dramas OCR, based on the OCRonos-Vintage model for cultural heritage archives OCR correction.

Filiberto 124M Instruct is only 124 million parameters. It can run easily on CPU or provide correction at scale on GPUs (>10k tokens/seconds).

Training

The pre-trained included a collection of individual verses and their correction taken from the TEXORO corpus, via a collaboration with ETSO, totalling ~5 million tokens.

Pre-training ran on 5 epochs with levanter (500 steps total, each processing 1024 sequences of 512 tokens) on a TPUv4-32 for 15 minutes.

Tokenization is currently done with the GPT-2 tokenizer.

Example of OCR correction

Filiberto 124M Instruct has been pre-trained on an instruction dataset with a hard-coded structure: ### Text ### for OCRized text submissiong and ### Correction ### for the generated correction.

Filiberto 124M Instruct can be imported like any GPT-2 like model:

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load pre-trained model and tokenizer
model_name = "bertin-project/filiberto-124M-instruct"
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# Set the device to GPU if available, otherwise use CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

And afterwards inference can be run like this:

# Function to generate text
def ocr_correction(prompt, max_new_tokens=600):

    prompt = f"""### Text ###\n{prompt}\n\n\n### Correction ###\n"""
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)

    # Generate text
    output = model.generate(input_ids,
                            max_new_tokens=max_new_tokens,
                            pad_token_id=tokenizer.eos_token_id,
                            top_k=50)

    # Decode and return the generated text
    return tokenizer.decode(output[0], skip_special_tokens=True).split("### Correction ###")[-1].strip()

ocr_result = ocr_correction(prompt)
print(ocr_result)

An example of an OCRized drama:

Otra vez, Don Iuan, me dad,
y otras mil vezes los braços.
Otra, y otras mil sean lazos
de nuestra antigua amistad.
Como venis?
Yo me siento
tan alegre, tan vfano,
tan venturoso, tan vano,
que no podrà el pensamiento
encareceros jamàs
las venturas que posseo,
porque el pensamiento creo

would yield this result:

Otra vez, Don Iuan, me dad,
y otras mil vezes los braços.
Otra, y otras mil sean lazos
de nuestra antigua amistad.
Como venis?
Yo me siento
tan alegre, tan vfano,
tan venturoso, tan vano,
que no podrà el pensamiento
encareceros jamàs
las venturas que posseo,
porque el pensamiento creo
Downloads last month
27
Safetensors
Model size
124M params
Tensor type
F32
·
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Model tree for bertin-project/filiberto-124M-instruct

Finetuned
(9)
this model