Image-to-Text
Transformers
PyTorch
donut
vision
Eval Results
Commit 4d03da9 by hmarzan · 1 Parent(s): 8b08100

First version of Donut DR Matriculas

README.md CHANGED
@@ -1,3 +1,180 @@
- ---
- license: mit
- ---
+ # Donut 🍩 for DR Matriculas (Donut-DR-matriculas-OCR)
+
+ ---
+ license: mit
+ inference: false
+ base_model: naver-clova-ix/donut-base
+ tags:
+ - donut
+ - image-to-text
+ - vision
+ model-index:
+ - name: donut-dr-matriculas-ocr
+   results:
+   - task:
+       type: image-to-text
+       name: Image to text
+     metrics:
+     - type: loss
+       value: 0.0563
+       name: Final loss (50 epochs)
+     - type: accuracy
+       value: 0.724689
+       name: F1 Accuracy (Val)
+     - type: accuracy
+       value: 0.923603
+       name: F1 Accuracy (Train)
+     - type: edit distance
+       value: 0.914544
+       name: ED (Val)
+     - type: edit distance
+       value: 0.971895
+       name: ED (Train)
+ metrics:
+ - accuracy
+ datasets:
+ - propietary/matriculas
+ pipeline_tag: image-to-text
+ ---
+
+ # Donut-DR-matriculas-OCR
+
+ The Donut model was introduced in the paper [OCR-free Document Understanding Transformer](https://arxiv.org/abs/2111.15664) by Geewook Kim et al. and first released in [this repository](https://github.com/clovaai/donut).
+
+ ## === Matriculas OCR V1 ===
+
+ This model is a fine-tune of the [donut base model](https://huggingface.co/naver-clova-ix/donut-base/) on a proprietary dataset. Its purpose is to efficiently extract text from official Dominican vehicle registration documents.
+
+ The proprietary dataset was manually corrected, and the ground-truth data used for teacher forcing was prepared as images paired with JSON Lines annotations. The V1 model is released under the MIT license.
+
+ It achieves the following results on the evaluation set:
+
+ * Loss: 0.0563
+ * Edit distance: 0.914544
+ * F1 accuracy: 0.724689
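
The card does not say how the edit-distance score is computed. One common convention, assumed here purely for illustration, is a normalized edit similarity: one minus the Levenshtein distance divided by the length of the longer string, so that 1.0 means an exact match. The helper names and the sample strings below are hypothetical and are not taken from the training code.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def edit_similarity(pred: str, truth: str) -> float:
    """Assumed metric: 1 - normalized edit distance (higher is better)."""
    longest = max(len(pred), len(truth))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(pred, truth) / longest


# Made-up plate strings that differ in one character out of seven.
score = edit_similarity("A123456", "A123466")
print(round(score, 4))
```

Under this convention a score near 0.91 would mean that, on average, roughly one character in eleven differs from the ground truth. Whether the reported value follows this convention is not confirmed by the card.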
+
+ For V1, the task prompt has been changed to `<s_matricula>`.
+
+ The focus for future versions will be to collect a better and larger training dataset.
+
+ ## Model description
+
+ Donut consists of a vision encoder (Swin Transformer) and a text decoder (BART). Given an image, the encoder first encodes it into a tensor of embeddings of shape `(batch_size, seq_len, hidden_size)`, after which the decoder autoregressively generates text conditioned on the encoder output.
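
To make the embedding shape concrete, here is a rough back-of-the-envelope sketch for the 960×1280 input size used by this checkpoint. It assumes, and the card does not state this, a Swin encoder with a 4×4 patch embedding, a 2× patch merge between each of the four stages listed under `encoder_layer` in `config.json`, and a base embedding width of 128 that doubles at every merge. The function name is hypothetical.

```python
def encoder_output_shape(height, width, batch_size=1,
                         patch_size=4, num_stages=4, embed_dim=128):
    """Estimate (batch_size, seq_len, hidden_size) under the stated assumptions."""
    # Patch embedding divides by patch_size; each of the (num_stages - 1)
    # patch-merging steps halves the spatial resolution again.
    downsample = patch_size * 2 ** (num_stages - 1)
    seq_len = (height // downsample) * (width // downsample)
    # The channel width doubles at every patch-merging step.
    hidden_size = embed_dim * 2 ** (num_stages - 1)
    return (batch_size, seq_len, hidden_size)


shape = encoder_output_shape(960, 1280)
print(shape)  # (1, 1200, 1024)
```

If those assumptions hold, the decoder attends over 1200 visual tokens per image, which is one reason the large input size makes training and inference comparatively memory-hungry.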
+
+ ![model image](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/donut_architecture.jpg)
+
+
+ ### How to use
+
+ ```python
+ import torch
+ import re
+ from PIL import Image
+ from transformers import DonutProcessor
+ #from transformers import VisionEncoderDecoderModel
+
+ import warnings
+ warnings.filterwarnings("ignore")
+
+ from sconf import Config
+ from donut import DonutConfig, DonutModel
+
+ # Reads input_size, max_length and align_long_axis from the config.yaml in this repository
+ config = Config(default="./config.yaml")
+
+ device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
+ processor = DonutProcessor.from_pretrained("marzanconsulting/donut-dr-matriculas-ocr")
+
+ model = DonutModel.from_pretrained(
+     "marzanconsulting/donut-dr-matriculas-ocr",
+     input_size=config.input_size,
+     max_length=config.max_length,
+     align_long_axis=config.align_long_axis,
+     ignore_mismatched_sizes=True,
+ )
+
+ model.to(device)
+
+ def load_and_preprocess_image(image_path: str, processor):
+     """
+     Load an image and preprocess it for the model.
+     """
+     image = Image.open(image_path).convert("RGB")
+     pixel_values = processor(image, return_tensors="pt").pixel_values
+     return pixel_values
+
+ def generate_text_from_image(model, image_path: str, processor, device):
+     """
+     Generate text from an image using the trained model.
+     """
+     # Load and preprocess the image
+     pixel_values = load_and_preprocess_image(image_path, processor)
+     pixel_values = pixel_values.to(device)
+
+     # Tokenize the task prompt; the prompt string is the tokenizer's text argument
+     decoder_input_ids = processor.tokenizer("<s_matricula>",
+                                             add_special_tokens=False,
+                                             return_tensors="pt").input_ids
+
+     decoded_text = model.inference(image_tensors=pixel_values,
+                                    prompt_tensors=decoder_input_ids)["predictions"][0]
+
+     return decoded_text
+
+ # Example usage
+ image_path = "path_to_your_image"  # Replace with your image path
+ extracted_text = generate_text_from_image(model, image_path, processor, device)
+ print("Extracted Text:", extracted_text)
+ ```
+
+ Refer to the [documentation](https://huggingface.co/docs/transformers/main/en/model_doc/donut) for more code examples.
+
+ ## Intended uses & limitations
+
+ This fine-tuned model is designed specifically for extracting text from Dominican vehicle registration documents (matrículas) and may not perform well on other document types. The training dataset still contains numerous errors, so the model will need to be retrained later to improve its performance.
+
+ ### Training hyperparameters
+
+ The following hyperparameters were used during training:
+ - learning_rate: 3e-05
+ - train_batch_size: 5
+ - eval_batch_size: 1
+ - seed: 2022
+ - optimizer: AdamW with betas=(0.9,0.999) and epsilon=1e-08
+ - lr_scheduler_type: linear
+ - lr_scheduler_warmup_steps: 300
+ - num_epochs: 50
+ - weight_decay: 0.01
+
+ ### Framework versions
+
+ - Transformers 4.25.1
+ - Timm 0.6.13
+ - Pytorch-lightning 1.6.4
+ - Donut 1.0.9
+
+ If you want to support me, you can do so [here](https://www.marzanconsulting.com/).
+
+ ### BibTeX entry and citation info for DONUT
+
+ ```bibtex
+ @article{DBLP:journals/corr/abs-2111-15664,
+   author     = {Geewook Kim and
+                 Teakgyu Hong and
+                 Moonbin Yim and
+                 Jinyoung Park and
+                 Jinyeong Yim and
+                 Wonseok Hwang and
+                 Sangdoo Yun and
+                 Dongyoon Han and
+                 Seunghyun Park},
+   title      = {Donut: Document Understanding Transformer without {OCR}},
+   journal    = {CoRR},
+   volume     = {abs/2111.15664},
+   year       = {2021},
+   url        = {https://arxiv.org/abs/2111.15664},
+   eprinttype = {arXiv},
+   eprint     = {2111.15664},
+   timestamp  = {Thu, 02 Dec 2021 10:50:44 +0100},
+   biburl     = {https://dblp.org/rec/journals/corr/abs-2111-15664.bib},
+   bibsource  = {dblp computer science bibliography, https://dblp.org}
+ }
+ ```
added_tokens.json ADDED
@@ -0,0 +1,69 @@
+ {
+   "</s_ano_de_fabricacion>": 57553,
+   "</s_cap_carga_ton>": 57565,
+   "</s_cedula_pasaporte_rnc>": 57573,
+   "</s_chasis>": 57539,
+   "</s_cilindros>": 57567,
+   "</s_color>": 57557,
+   "</s_datos_del_propietario>": 57571,
+   "</s_datos_del_vehiculo>": 57533,
+   "</s_datos_dgii>": 57581,
+   "</s_dgii_dato_1>": 57585,
+   "</s_dgii_dato_2>": 57583,
+   "</s_direccion>": 57579,
+   "</s_encabezado>": 57527,
+   "</s_estatus_del_vehiculo>": 57541,
+   "</s_fecha_de_expedicion>": 57529,
+   "</s_fields>": 57525,
+   "</s_fuerza_motriz_hp_cc>": 57563,
+   "</s_marca>": 57549,
+   "</s_modelo>": 57551,
+   "</s_motor_o_no_de_serie>": 57559,
+   "</s_no_de_registro_y_placa>": 57537,
+   "</s_no_de_registro_y_placa_anterior>": 57577,
+   "</s_no_matricula>": 57531,
+   "</s_no_puertas>": 57569,
+   "</s_nombres_razon_social>": 57575,
+   "</s_pasajeros>": 57561,
+   "</s_r1>": 57535,
+   "</s_r2>": 57545,
+   "</s_r3>": 57555,
+   "</s_tipo_de_emision>": 57543,
+   "</s_tipo_de_vehiculo>": 57547,
+   "<s_ano_de_fabricacion>": 57554,
+   "<s_cap_carga_ton>": 57566,
+   "<s_cedula_pasaporte_rnc>": 57574,
+   "<s_chasis>": 57540,
+   "<s_cilindros>": 57568,
+   "<s_color>": 57558,
+   "<s_dataset>": 57587,
+   "<s_datos_del_propietario>": 57572,
+   "<s_datos_del_vehiculo>": 57534,
+   "<s_datos_dgii>": 57582,
+   "<s_dgii_dato_1>": 57586,
+   "<s_dgii_dato_2>": 57584,
+   "<s_direccion>": 57580,
+   "<s_encabezado>": 57528,
+   "<s_estatus_del_vehiculo>": 57542,
+   "<s_fecha_de_expedicion>": 57530,
+   "<s_fields>": 57526,
+   "<s_fuerza_motriz_hp_cc>": 57564,
+   "<s_iitcdip>": 57523,
+   "<s_marca>": 57550,
+   "<s_matricula>": 57588,
+   "<s_modelo>": 57552,
+   "<s_motor_o_no_de_serie>": 57560,
+   "<s_no_de_registro_y_placa>": 57538,
+   "<s_no_de_registro_y_placa_anterior>": 57578,
+   "<s_no_matricula>": 57532,
+   "<s_no_puertas>": 57570,
+   "<s_nombres_razon_social>": 57576,
+   "<s_pasajeros>": 57562,
+   "<s_r1>": 57536,
+   "<s_r2>": 57546,
+   "<s_r3>": 57556,
+   "<s_synthdog>": 57524,
+   "<s_tipo_de_emision>": 57544,
+   "<s_tipo_de_vehiculo>": 57548,
+   "<sep/>": 57522
+ }
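
The paired `<s_field>` and `</s_field>` tokens above delimit field values in the decoder output. As a rough illustration of how such a tagged sequence maps to structured data, the sketch below uses a deliberately simplified, hypothetical parser. It is not the `donut` package's own conversion routine, it ignores `<sep/>` lists, and the sample values and the nesting of `marca` under `datos_del_vehiculo` are made up for the example.

```python
import re

# Matches <s_key>...</s_key>, using a backreference so tags must pair up.
FIELD_PATTERN = re.compile(r"<s_(\w+)>(.*?)</s_\1>", re.DOTALL)


def tagged_to_dict(sequence: str) -> dict:
    """Convert '<s_key>value</s_key>' spans into a (possibly nested) dict."""
    result = {}
    for match in FIELD_PATTERN.finditer(sequence):
        key, content = match.group(1), match.group(2)
        if "<s_" in content:
            # The field wraps further fields: parse them recursively.
            result[key] = tagged_to_dict(content)
        else:
            result[key] = content.strip()
    return result


# Hypothetical decoder output; values are invented for illustration.
sample = (
    "<s_datos_del_vehiculo>"
    "<s_marca>TOYOTA</s_marca><s_modelo>COROLLA</s_modelo>"
    "</s_datos_del_vehiculo>"
    "<s_no_matricula>1234567</s_no_matricula>"
)
parsed = tagged_to_dict(sample)
print(parsed)
```

In practice `DonutModel.inference` can return predictions that are already converted, so a parser like this is mainly useful for understanding the token scheme or for post-processing raw sequences.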
config.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "_name_or_path": "naver-clova-ix/donut-base",
+   "align_long_axis": false,
+   "architectures": [
+     "DonutModel"
+   ],
+   "decoder_layer": 4,
+   "encoder_layer": [
+     2,
+     2,
+     14,
+     2
+   ],
+   "input_size": [
+     960,
+     1280
+   ],
+   "max_length": 868,
+   "max_position_embeddings": 868,
+   "model_type": "donut",
+   "torch_dtype": "float32",
+   "transformers_version": "4.25.1",
+   "window_size": 10
+ }
config.yaml ADDED
@@ -0,0 +1,35 @@
+ resume_from_checkpoint_path: None
+ result_path: 'trainer/result'
+ pretrained_model_name_or_path: 'marzanconsulting/donut-dr-matriculas-ocr'
+ dataset_name_or_paths:
+ - 'trainer/dataset'
+ task_start_tokens:
+ - '<s_matricula>'
+ sort_json_key: False
+ train_batch_sizes:
+ - 5
+ val_batch_sizes:
+ - 1
+ input_size:
+ - 960
+ - 1280
+ max_length: 868
+ max_position_embeddings: 868
+ align_long_axis: False
+ num_nodes: 1
+ seed: 2022
+ lr: 3e-05
+ warmup_steps: 300
+ num_training_samples_per_epoch: 800
+ max_epochs: 50
+ max_steps: -1
+ num_workers: 12
+ val_check_interval: 1.0
+ check_val_every_n_epoch: 3
+ gradient_clip_val: 1.0
+ verbose: True
+ model_dir: ''
+ tensorboard_dir: ''
+ checkpoint_dir:
+ exp_name: 'train_cord_matriculas'
+ exp_version: 'mmc_v1'
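
As a quick sanity check on the schedule implied by these values, the sketch below works out the number of optimizer steps. It assumes a single GPU (only `num_nodes: 1` is given) and that the trainer derives the total step count as epochs × samples per epoch ÷ effective batch size. Both are assumptions, and the function name is hypothetical.

```python
def total_training_steps(max_epochs, samples_per_epoch, batch_size,
                         num_devices=1, num_nodes=1):
    """Total optimizer steps, assuming one step per batch and no accumulation."""
    effective_batch = batch_size * num_devices * num_nodes
    return (max_epochs * samples_per_epoch) // effective_batch


# Values taken from config.yaml above; num_devices=1 is an assumption.
max_epochs = 50
total_steps = total_training_steps(max_epochs, samples_per_epoch=800, batch_size=5)
steps_per_epoch = total_steps // max_epochs
warmup_fraction = 300 / total_steps

print(total_steps)       # 8000
print(steps_per_epoch)   # 160
print(warmup_fraction)   # 0.0375
```

Under these assumptions the 300 warmup steps cover a little under two epochs, after which the learning rate follows the scheduler for the remaining steps.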
preprocessor_config.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "do_align_long_axis": false,
+   "do_normalize": true,
+   "do_pad": true,
+   "do_resize": true,
+   "do_thumbnail": true,
+   "image_processor_type": "DonutImageProcessor",
+   "image_mean": [
+     0.5,
+     0.5,
+     0.5
+   ],
+   "image_std": [
+     0.5,
+     0.5,
+     0.5
+   ],
+   "processor_class": "DonutProcessor",
+   "resample": 2,
+   "size": {
+     "width": 1280,
+     "height": 960
+   }
+ }
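
With `image_mean` and `image_std` both set to 0.5, normalization maps pixel values into the range [-1, 1]. The sketch below shows the arithmetic for a single RGB pixel. It assumes that 8-bit values are first rescaled to [0, 1] by dividing by 255, which this config file does not state explicitly, and the function name is hypothetical.

```python
IMAGE_MEAN = (0.5, 0.5, 0.5)  # from preprocessor_config.json
IMAGE_STD = (0.5, 0.5, 0.5)   # from preprocessor_config.json


def normalize_pixel(rgb):
    """Rescale an 8-bit RGB pixel to [0, 1] (assumed), then apply (x - mean) / std."""
    return tuple(
        ((channel / 255.0) - mean) / std
        for channel, mean, std in zip(rgb, IMAGE_MEAN, IMAGE_STD)
    )


black = normalize_pixel((0, 0, 0))
white = normalize_pixel((255, 255, 255))
print(black)  # (-1.0, -1.0, -1.0)
print(white)  # (1.0, 1.0, 1.0)
```

The `resample` value of 2 corresponds to bilinear interpolation in Pillow's resampling constants.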
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c16610f407d1f75997923ab15f57662352f256a01688a8a76531a87749c5408a
+ size 859005897
sentencepiece.bpe.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cb9e3dce4c326195d08fc3dd0f7e2eee1da8595c847bf4c1a9c78b7a82d47e2d
+ size 1296245
special_tokens_map.json ADDED
@@ -0,0 +1,19 @@
+ {
+   "additional_special_tokens": [
+     "<s_dataset>",
+     "<s_matricula>"
+   ],
+   "bos_token": "<s>",
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "mask_token": {
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "unk_token": "<unk>"
+ }
tokenizer_config.json ADDED
@@ -0,0 +1,23 @@
+ {
+   "bos_token": "<s>",
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "from_slow": true,
+   "mask_token": {
+     "__type": "AddedToken",
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "model_max_length": 1000000000000000019884624838656,
+   "name_or_path": "naver-clova-ix/donut-base",
+   "pad_token": "<pad>",
+   "processor_class": "DonutProcessor",
+   "sep_token": "</s>",
+   "sp_model_kwargs": {},
+   "special_tokens_map_file": null,
+   "tokenizer_class": "XLMRobertaTokenizer",
+   "unk_token": "<unk>"
+ }