---
language:
- en
metrics:
- accuracy
pipeline_tag: image-text-to-text
base_model:
- naver-clova-ix/donut-base-finetuned-cord-v2
tags:
- logistics
- document-parsing
---
๐Ÿ—๏ธ This is a FYP project topic on document parsing of ๐Ÿšš logistics ๐Ÿšš shipping documents for system integration.
- https://huggingface.co/uartimcs/donut-booking-extract/blob/main/FYP.pdf

The module versions have been updated so the program continues to run, since the Donut pretrained model itself has had no recent updates.

**My use case:**
Extract common key data fields from shipping documents generated by ten different shipping lines.
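
A minimal inference sketch with Hugging Face transformers, assuming the checkpoint loads by its repo name; the `<s_cord-v2>` task prompt is inherited from the CORD-v2 base model and is an assumption here:

```python
import re
import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("uartimcs/donut-booking-extract")
model = VisionEncoderDecoderModel.from_pretrained("uartimcs/donut-booking-extract")
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

image = Image.open("booking_page.png").convert("RGB")  # hypothetical file name
pixel_values = processor(image, return_tensors="pt").pixel_values.to(device)

# Donut decodes from a task prompt; "<s_cord-v2>" comes from the base model
# and is an assumption -- check the tokenizer's added tokens for the real one.
decoder_input_ids = processor.tokenizer(
    "<s_cord-v2>", add_special_tokens=False, return_tensors="pt"
).input_ids.to(device)

outputs = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=model.decoder.config.max_position_embeddings,
    pad_token_id=processor.tokenizer.pad_token_id,
    eos_token_id=processor.tokenizer.eos_token_id,
    bad_words_ids=[[processor.tokenizer.unk_token_id]],
    return_dict_in_generate=True,
)

# Strip special tokens and the task prompt, then convert to JSON.
seq = processor.batch_decode(outputs.sequences)[0]
seq = seq.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
print(processor.token2json(re.sub(r"<.*?>", "", seq, count=1).strip()))
```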

**Repo & Datasets**
- donut.zip (original Donut repo + labelled booking dummy datasets with JSONL files + config files; the JSONL layout is sketched below)
- sample-image-to-play.zip (spare dummy images for playing with and testing the model)
- Gradio demo Space: https://huggingface.co/spaces/uartimcs/donut-booking-gradio
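
For orientation, the Donut repo consumes one `metadata.jsonl` per split, where each record's `ground_truth` is itself a JSON string holding a `gt_parse` object. A minimal sketch of a single record; the field names are illustrative guesses, the real labels live in donut.zip:

```python
import json

# One metadata.jsonl record in the format the Donut repo reads per split.
# Field names below are assumptions, not the dataset's actual labels.
record = {
    "file_name": "booking_0001.png",
    "ground_truth": json.dumps({
        "gt_parse": {
            "booking_no": "COSU1234567890",
            "vessel": "EVER ACE",
            "port_of_loading": "HONG KONG",
            "port_of_discharge": "ROTTERDAM",
        }
    }),
}

with open("train/metadata.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```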

**Colab Notebooks**
- donut-booking-train.ipynb (train the model in Colab on a T4 / A100 GPU runtime; see the training sketch below)
- donut-booking-run.ipynb (run the model in Colab behind a Gradio demo on a T4 / A100 GPU runtime)
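
In outline, training goes through the original clovaai/donut repo; a hypothetical pair of Colab cells, with a config file name standing in for whichever config ships in donut.zip:

```python
# Hypothetical Colab cells (not copied from the notebook).
!git clone https://github.com/clovaai/donut.git
%cd donut
!pip install .
# Train with one of the config files from donut.zip; the name is an assumption.
!python train.py --config config/train_booking.yaml
```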

**Size of dataset**

The dataset follows the CORD-v2 split ratio (a split script is sketched after this list):
- train: 800 (80 pics x 10 classes)
- validation: 100 (10 pics x 10 classes)
- test: 100 (10 pics x 10 classes)
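
A small sketch of producing that 80/10/10 per-class split, assuming one folder of 100 images per shipping line (all paths hypothetical):

```python
import random
import shutil
from pathlib import Path

# Shuffle each class folder reproducibly, then copy 80/10/10 into
# train/validation/test -- the directory layout is an assumption.
random.seed(0)
for cls_dir in sorted(Path("images").iterdir()):
    pics = sorted(cls_dir.glob("*.png"))
    random.shuffle(pics)
    for split, part in (("train", pics[:80]),
                        ("validation", pics[80:90]),
                        ("test", pics[90:100])):
        dest = Path("dataset") / split / cls_dir.name
        dest.mkdir(parents=True, exist_ok=True)
        for p in part:
            shutil.copy(p, dest / p.name)
```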