metadata
license: mit
inference: false
base_model: naver-clova-ix/donut-base
tags:
  - donut
  - image-to-text
  - vision
model-index:
  - name: donut-dr-matriculas-ocr
    results:
      - task:
          type: image-to-text
          name: Image to text
        metrics:
          - type: loss
            value: 0.0563
            name: Final loss (50 epochs)
          - type: accuracy
            value: 0.724689
            name: F1 Accuracy (Val)
          - type: accuracy
            value: 0.923603
            name: F1 Accuracy (Train)
          - type: edit distance
            value: 0.914544
            name: ED (Val)
          - type: edit distance
            value: 0.971895
            name: ED (Train)
metrics:
  - accuracy
datasets:
  - propietary/matriculas
pipeline_tag: image-to-text

Donut 🍩 for DR Matriculas (Donut-DR-matriculas-OCR)

The Donut model was introduced in the paper OCR-free Document Understanding Transformer by Geewook Kim et al. and first released in this repository.

=== Matriculas OCR V1 ===

This model is a fine-tune of the Donut base model (naver-clova-ix/donut-base) on a proprietary dataset. Its purpose is to efficiently extract text from official Dominican vehicle registration documents.

The proprietary dataset was manually corrected, and we prepared the teacher-forcing (ground-truth) data from the images and their JSON Lines annotations. The V1 model is released under the MIT license.

It achieves the following results on the evaluation set:

  • Loss: 0.0563
  • Edit distance: 0.914544
  • F1 accuracy: 0.724689
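
The card does not say how the edit-distance score is computed. As a rough illustration only, the sketch below assumes it is a similarity of the form 1 - (Levenshtein distance / length of the longer string), so that 1.0 means an exact match. The function names are hypothetical and not part of the Donut package.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def edit_similarity(prediction: str, reference: str) -> float:
    """Assumed metric: 1 - distance / max length, so 1.0 is a perfect match."""
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(prediction, reference) / longest


# One wrong character out of seven gives a similarity of 6/7.
print(edit_similarity("A123458", "A123456"))
```

Under this assumption, a validation score of 0.914544 would mean that, on average, roughly 8.5% of the characters in a prediction need to be edited to match the ground truth.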

The task_prompt has been changed to <s_matricula> for the V1.
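
The task prompt is the start token handed to the decoder, and Donut-style models then emit each field wrapped in `<s_key>...</s_key>` tags. The sketch below shows how such a flat sequence could be turned into a dictionary. It is a simplified illustration, not the converter shipped with the Donut package, and the field names `placa` and `marca` are hypothetical because the card does not list the actual output schema.

```python
import re


def parse_fields(sequence: str, task_token: str = "<s_matricula>") -> dict:
    """Turn '<s_key>value</s_key>' pairs into a flat dict (nesting is not handled)."""
    # Drop the task prompt and its matching closing tag, if present.
    closing_token = task_token.replace("<s_", "</s_")
    body = sequence.replace(task_token, "").replace(closing_token, "")

    fields = {}
    # \1 forces the closing tag to carry the same key as the opening tag.
    for match in re.finditer(r"<s_(\w+)>(.*?)</s_\1>", body, flags=re.DOTALL):
        fields[match.group(1)] = match.group(2).strip()
    return fields


# Hypothetical decoder output for a registration document.
example = "<s_matricula><s_placa>A123456</s_placa><s_marca>TOYOTA</s_marca></s_matricula>"
print(parse_fields(example))
```

In practice `model.inference` already returns parsed predictions, so a helper like this is only useful for understanding the format or for post-processing raw sequences.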

The focus for the next version will be to collect a better and larger training dataset.

Model description

Donut consists of a vision encoder (Swin Transformer) and a text decoder (BART). Given an image, the encoder first encodes the image into a tensor of embeddings (of shape batch_size, seq_len, hidden_size), after which the decoder autoregressively generates text, conditioned on the encoding of the encoder.
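
To make the embedding shape concrete, the sketch below estimates `(batch_size, seq_len, hidden_size)` for a Swin-style encoder. The defaults are assumptions taken from the donut-base configuration (patch size 4, four stages, embedding dimension 128, input size 2560 x 1920). This fine-tuned model reads its input size from `config.yaml`, which is not shown here, so the real numbers may differ. The function name is hypothetical.

```python
def encoder_output_shape(batch_size: int, height: int, width: int,
                         patch_size: int = 4, num_stages: int = 4,
                         embed_dim: int = 128) -> tuple:
    """Estimate the encoder output shape for a Swin-style hierarchical backbone."""
    # Every stage after the first merges 2x2 patches, halving each spatial side
    # and doubling the channel dimension.
    downsampling = patch_size * 2 ** (num_stages - 1)
    if height % downsampling or width % downsampling:
        raise ValueError(f"height and width must be multiples of {downsampling}")
    seq_len = (height // downsampling) * (width // downsampling)
    hidden_size = embed_dim * 2 ** (num_stages - 1)
    return (batch_size, seq_len, hidden_size)


# Assumed donut-base input size of 2560 x 1920 pixels.
print(encoder_output_shape(1, 2560, 1920))  # (1, 4800, 1024)
```

The decoder attends over this sequence at every generation step, which is why larger input sizes increase memory use and latency.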


How to use

import warnings

import torch
from PIL import Image
from sconf import Config
from transformers import DonutProcessor
from donut import DonutModel

# Silence library warnings to keep the output readable (optional).
warnings.filterwarnings("ignore")

# config.yaml must provide input_size, max_length and align_long_axis.
config = Config(default="./config.yaml")

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
processor = DonutProcessor.from_pretrained("marzanconsulting/donut-dr-matriculas-ocr")

model = DonutModel.from_pretrained(
                "marzanconsulting/donut-dr-matriculas-ocr",
                input_size=config.input_size,
                max_length=config.max_length,
                align_long_axis=config.align_long_axis,
                ignore_mismatched_sizes=True,
            )

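# Donut's inference() casts image tensors to half precision on CUDA,
# so the model weights need to be in half precision there as well.
if device.type == "cuda":
    model.half()
model.to(device)
model.eval()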

def load_and_preprocess_image(image_path: str, processor):
    """
    Load an image and preprocess it for the model.
    """
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(image, return_tensors="pt").pixel_values
    return pixel_values

def generate_text_from_image(model, image_path: str, processor, device):
    """
    Generate text from an image using the trained model.
    """
    # Load and preprocess the image
    pixel_values = load_and_preprocess_image(image_path, processor)
    pixel_values = pixel_values.to(device)

    # Tokenize the task prompt; the tokenizer takes the text as its first argument.
    decoder_input_ids = processor.tokenizer("<s_matricula>",
                                            add_special_tokens=False,
                                            return_tensors="pt").input_ids.to(device)

    decoded_text = model.inference(image_tensors=pixel_values, 
                                   prompt_tensors=decoder_input_ids)["predictions"][0]

    return decoded_text

# Example usage
image_path = "path_to_your_image"  # Replace with your image path
extracted_text = generate_text_from_image(model, image_path, processor, device)
print("Extracted Text:", extracted_text)

Refer to the documentation for more code examples.

Intended uses & limitations

This fine-tuned model is designed specifically for extracting text from Dominican vehicle registration documents (matrículas) and may not perform well on other document types. The training dataset still contains numerous errors, so the model will need to be retrained to improve its performance.

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 3e-05
  • train_batch_size: 5
  • eval_batch_size: 1
  • seed: 2022
  • optimizer: AdamW with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 300
  • num_epochs: 50
  • weight_decay: 0.01
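
To show how the learning rate evolves under these settings, here is a minimal sketch of a linear schedule with warmup. It assumes that "linear" means a linear ramp from 0 to the base rate over the warmup steps, followed by a linear decay to 0, as in the `get_linear_schedule_with_warmup` helper from Transformers. The total step count of 10,000 is a made-up value, since the card does not report the dataset size, and the function name is hypothetical.

```python
def linear_schedule_lr(step: int, total_steps: int,
                       base_lr: float = 3e-5, warmup_steps: int = 300) -> float:
    """Learning rate at a given optimizer step for linear warmup plus linear decay."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


TOTAL_STEPS = 10_000  # hypothetical; depends on dataset size, batch size and epochs
for step in (0, 150, 300, 5_150, 10_000):
    print(step, linear_schedule_lr(step, TOTAL_STEPS))
```

With a batch size of 5 and 50 epochs, the real number of steps is 10 times the number of training samples, so the warmup of 300 steps covers only a small fraction of training unless the dataset is very small.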

Framework versions

  • Transformers 4.25.1
  • Timm 0.6.13
  • Pytorch-lightning 1.6.4
  • Donut 1.0.9

If you want to support me, you can do so here.

BibTeX entry and citation info for DONUT

@article{DBLP:journals/corr/abs-2111-15664,
  author    = {Geewook Kim and
               Teakgyu Hong and
               Moonbin Yim and
               Jinyoung Park and
               Jinyeong Yim and
               Wonseok Hwang and
               Sangdoo Yun and
               Dongyoon Han and
               Seunghyun Park},
  title     = {Donut: Document Understanding Transformer without {OCR}},
  journal   = {CoRR},
  volume    = {abs/2111.15664},
  year      = {2021},
  url       = {https://arxiv.org/abs/2111.15664},
  eprinttype = {arXiv},
  eprint    = {2111.15664},
  timestamp = {Thu, 02 Dec 2021 10:50:44 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2111-15664.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}