---
license: cc-by-nc-4.0
inference: false
base_model: naver-clova-ix/donut-base
tags:
- donut
- image-to-text
- vision
model-index:
- name: donut-receipts-extract
  results:
  - task:
      type: image-to-text
      name: Image to text
    metrics:
    - type: loss
      value: 0.326069
    - type: accuracy
      value: 0.895219
      name: Accuracy
    - type: cer
      value: 0.158358
      name: CER
    - type: wer
      value: 1.673989
      name: WER
    - type: edit distance
      value: 0.145293
      name: Edit_distance
metrics:
- cer
- wer
- accuracy
datasets:
- AdamCodd/donut-receipts
pipeline_tag: image-to-text
---
# Note
This model was forked from [AdamCodd/donut-receipts-extract](https://huggingface.co/AdamCodd/donut-receipts-extract).
# Donut-receipts-extract
The Donut model was introduced in the paper [OCR-free Document Understanding Transformer](https://arxiv.org/abs/2111.15664) by Geewook Kim et al. and first released in [this repository](https://github.com/clovaai/donut).
## === V2 ===
This model has been retrained on an improved version of the [AdamCodd/donut-receipts](https://huggingface.co/datasets/AdamCodd/donut-receipts) dataset (deduplicated and manually corrected). The new license for the V2 model is **cc-by-nc-4.0**. For commercial use rights, please contact me ([email protected]). Meanwhile, the V1 model remains available under the MIT license (on the `v1` branch).
It achieves the following results on the evaluation set:
* Loss: 0.326069
* Edit distance: 0.145293
* CER: 0.158358
* WER: 1.673989
* Mean accuracy: 0.895219
* F1: 0.977897
The `task_prompt` has been changed to ``<s_receipt>`` for V2 (previously ``<s_cord-v2>`` for V1). Two new keys, ``<s_svc>`` and ``<s_discount>``, have been added, and ``<s_telephone>`` has been renamed to ``<s_phone>``.
V2 performs considerably better than V1 because it was trained on receipts at twice the resolution and on a better dataset. It is still not perfect: the training dataset remains small (~1100 receipts) and lacks diversity. Improving that diversity will be the main focus of a future version.
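For code that was written against V1, the prompt and key changes described above can be handled with a small compatibility helper. The following is a minimal sketch, not part of the model's own tooling: the names `TASK_PROMPTS` and `upgrade_v1_keys` are hypothetical, and it assumes the parsed output is a nested structure of dicts and lists whose keys are the tag names without the `<s_...>` wrapper (for example `phone` for `<s_phone>`).

```python
# Task prompts per model version, as stated in this model card.
TASK_PROMPTS = {"v1": "<s_cord-v2>", "v2": "<s_receipt>"}

# Keys renamed between V1 and V2 (hypothetical helper table).
V1_TO_V2_KEYS = {"telephone": "phone"}


def upgrade_v1_keys(node):
    """Recursively rename V1 keys to their V2 names in a parsed output."""
    if isinstance(node, dict):
        return {V1_TO_V2_KEYS.get(key, key): upgrade_v1_keys(value)
                for key, value in node.items()}
    if isinstance(node, list):
        return [upgrade_v1_keys(item) for item in node]
    return node  # leaf value (string), returned unchanged


# Illustrative V1-style output; the field values are made up.
v1_output = {"store_name": "ACME", "telephone": "555-0100"}
print(upgrade_v1_keys(v1_output))
# → {'store_name': 'ACME', 'phone': '555-0100'}
```

The new `svc` and `discount` keys have no V1 counterpart, so downstream code should treat them as optional when handling outputs from both versions.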
## === V1 ===
This model is a finetune of the [donut base model](https://huggingface.co/naver-clova-ix/donut-base/) on the [AdamCodd/donut-receipts](https://huggingface.co/datasets/AdamCodd/donut-receipts) dataset. Its purpose is to efficiently extract text from receipts.
It achieves the following results on the evaluation set:
* Loss: 0.498843
* Edit distance: 0.198315
* CER: 0.213929
* WER: 7.634032
* Mean accuracy: 0.843472
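For readers unfamiliar with the metrics above, CER and WER are conventionally defined as the Levenshtein edit distance between prediction and reference, divided by the reference length in characters or words. The sketch below illustrates those standard definitions in plain Python. Whether the reported numbers were computed exactly this way is an assumption, and the sample strings are made up. It also shows why WER can exceed 1: the prediction may contain more word errors than the reference has words.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate; assumes a non-empty reference."""
    return edit_distance(ref, hyp) / len(ref)


def wer(ref: str, hyp: str) -> float:
    """Word error rate; assumes a non-empty reference."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


print(wer("TOTAL 12.50", "TOTAL 12.80"))  # → 0.5
print(wer("TOTAL", "TO TAL 12"))          # → 3.0
```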
## Model description
Donut consists of a vision encoder (Swin Transformer) and a text decoder (BART). Given an image, the encoder first encodes the image into a tensor of embeddings (of shape `(batch_size, seq_len, hidden_size)`), after which the decoder autoregressively generates text, conditioned on the encoding of the encoder.
![model image](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/donut_architecture.jpg)
### How to use
```python
import torch
import re
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
processor = DonutProcessor.from_pretrained("AdamCodd/donut-receipts-extract")
model = VisionEncoderDecoderModel.from_pretrained("AdamCodd/donut-receipts-extract")
model.to(device)


def load_and_preprocess_image(image_path: str, processor):
    """
    Load an image and preprocess it for the model.
    """
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(image, return_tensors="pt").pixel_values
    return pixel_values


def generate_text_from_image(model, image_path: str, processor, device):
    """
    Generate text from an image using the trained model.
    """
    # Load and preprocess the image
    pixel_values = load_and_preprocess_image(image_path, processor)
    pixel_values = pixel_values.to(device)

    # Generate output using model
    model.eval()
    with torch.no_grad():
        task_prompt = "<s_receipt>"  # <s_cord-v2> for v1
        decoder_input_ids = processor.tokenizer(task_prompt, add_special_tokens=False, return_tensors="pt").input_ids
        decoder_input_ids = decoder_input_ids.to(device)
        generated_outputs = model.generate(
            pixel_values,
            decoder_input_ids=decoder_input_ids,
            max_length=model.decoder.config.max_position_embeddings,
            pad_token_id=processor.tokenizer.pad_token_id,
            eos_token_id=processor.tokenizer.eos_token_id,
            early_stopping=True,
            bad_words_ids=[[processor.tokenizer.unk_token_id]],
            return_dict_in_generate=True
        )

    # Decode generated output
    decoded_text = processor.batch_decode(generated_outputs.sequences)[0]
    decoded_text = decoded_text.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
    decoded_text = re.sub(r"<.*?>", "", decoded_text, count=1).strip()  # remove first task start token
    decoded_text = processor.token2json(decoded_text)
    return decoded_text


# Example usage
image_path = "path_to_your_image"  # Replace with your image path
extracted_text = generate_text_from_image(model, image_path, processor, device)
print("Extracted Text:", extracted_text)
```
Refer to the [documentation](https://huggingface.co/docs/transformers/main/en/model_doc/donut) for more code examples.
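To make the post-processing step in the snippet above easier to follow, here is a minimal standalone sketch of the string clean-up that happens before `processor.token2json` is called. It runs without the model. The `</s>` and `<pad>` token strings, the helper name `strip_special_tokens`, and the field names in the sample string are assumptions made for illustration; in real code the token strings come from `processor.tokenizer`.

```python
import re

# Assumed special-token strings; real code reads them from processor.tokenizer.
EOS_TOKEN = "</s>"
PAD_TOKEN = "<pad>"


def strip_special_tokens(raw: str, eos_token: str = EOS_TOKEN, pad_token: str = PAD_TOKEN) -> str:
    """Drop EOS/PAD tokens, then remove the leading task prompt tag."""
    text = raw.replace(eos_token, "").replace(pad_token, "")
    # count=1 removes only the first tag, i.e. the task prompt such as <s_receipt>.
    return re.sub(r"<.*?>", "", text, count=1).strip()


# Made-up decoder output with hypothetical field names.
raw = "<s_receipt><s_store_name>ACME</s_store_name><s_total>12.50</s_total></s><pad><pad>"
print(strip_special_tokens(raw))
# → <s_store_name>ACME</s_store_name><s_total>12.50</s_total>
```

The remaining tag sequence is what `token2json` then converts into a nested Python structure keyed by the tag names.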
## Intended uses & limitations
This fine-tuned model is specifically designed for extracting text from receipts and may not perform well on other types of documents. The training dataset still contains numerous errors, so the model will need to be retrained at a later date to improve its performance.
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 3e-05
- train_batch_size: 2
- eval_batch_size: 4
- seed: 42
- optimizer: AdamW with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 300
- num_epochs: 35
- weight_decay: 0.01
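As an illustration of what the `linear` scheduler with 300 warmup steps implies for the learning rate, the sketch below computes the rate at a given optimizer step: a linear ramp from 0 to the base rate during warmup, then a linear decay to 0 at the end of training. The function name and the `total_steps` default are hypothetical. The actual number of training steps is not stated in this card.

```python
def linear_lr_with_warmup(step: int,
                          base_lr: float = 3e-05,
                          warmup_steps: int = 300,
                          total_steps: int = 10_000) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay.

    total_steps is a placeholder value, not the real training length.
    """
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Decay: fall linearly from base_lr to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 150, 300, 5_150, 10_000):
    print(step, linear_lr_with_warmup(step))
```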
### Framework versions
- Transformers 4.36.2
- Datasets 2.16.1
- Tokenizers 0.15.0
- Evaluate 0.4.1
If you want to support me, you can do so [here](https://ko-fi.com/adamcodd).
### BibTeX entry and citation info
```bibtex
@article{DBLP:journals/corr/abs-2111-15664,
  author     = {Geewook Kim and
                Teakgyu Hong and
                Moonbin Yim and
                Jinyoung Park and
                Jinyeong Yim and
                Wonseok Hwang and
                Sangdoo Yun and
                Dongyoon Han and
                Seunghyun Park},
  title      = {Donut: Document Understanding Transformer without {OCR}},
  journal    = {CoRR},
  volume     = {abs/2111.15664},
  year       = {2021},
  url        = {https://arxiv.org/abs/2111.15664},
  eprinttype = {arXiv},
  eprint     = {2111.15664},
  timestamp  = {Thu, 02 Dec 2021 10:50:44 +0100},
  biburl     = {https://dblp.org/rec/journals/corr/abs-2111-15664.bib},
  bibsource  = {dblp computer science bibliography, https://dblp.org}
}
```