Transformers

ali6parmak committed (verified) · Commit 808ef5b · 1 parent: 95548e4

Update README.md

Files changed (1): README.md (+107 -6)

README.md CHANGED
In this model card, we are providing the non-visual models we use in our pdf-document-layout-analysis service.
This service allows for the segmentation and classification of different parts of PDF pages, identifying elements such as text, titles, pictures, tables, and so on. Additionally, it determines the correct order of these identified elements.

<table>
  <tr>
    <td>
      <img src="https://raw.githubusercontent.com/huridocs/pdf-document-layout-analysis/main/images/vgtexample1.png"/>
    </td>
    <td>
      <img src="https://raw.githubusercontent.com/huridocs/pdf-document-layout-analysis/main/images/vgtexample2.png"/>
    </td>
    <td>
      <img src="https://raw.githubusercontent.com/huridocs/pdf-document-layout-analysis/main/images/vgtexample3.png"/>
    </td>
    <td>
      <img src="https://raw.githubusercontent.com/huridocs/pdf-document-layout-analysis/main/images/vgtexample4.png"/>
    </td>
  </tr>
</table>
## Quick Start

Clone the service:

```
git clone https://github.com/huridocs/pdf-document-layout-analysis
```

Get the segments of a PDF:

```
curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5060/fast
```
To stop the server:

```
make stop
```

## Contents
- [Quick Start](#quick-start)
- [Dependencies](#dependencies)
- [Requirements](#requirements)
- [Models](#models)
- [Data](#data)
- [Usage](#usage)
## Dependencies
* Docker Desktop 4.25.0 [install link](https://www.docker.com/products/docker-desktop/)
* For GPU support: NVIDIA Container Toolkit [install link](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)

## Requirements
* 4 GB of RAM
* 6 GB of GPU memory (if not available, the service runs on CPU)
## Models

There are two kinds of models in this project. The default model is a visual model trained by the Alibaba Research Group; if you would like to take a look at their original project, you can visit [this](https://github.com/AlibabaResearch/AdvancedLiterateMachinery) link. They have published various models, and according to our benchmarks the best-performing one is the model trained on the [DocLayNet](https://github.com/DS4SD/DocLayNet) dataset. This model is therefore the default in our project, and it uses more resources than the models we trained ourselves.

The second kind is the LightGBM models. These are not visual models: they do not "see" the pages, but instead use XML information extracted with [Poppler](https://poppler.freedesktop.org). There are two of them because one predicts the type of each token while the other finds the correct segmentation of the page. By combining both, we segment the pages and classify the type of their content.

Even though the visual model uses more resources than the others, it generally gives better performance, since it "sees" the whole page and has an idea of the full context. The LightGBM models, on the other hand, perform slightly worse, but they are much faster and more resource-friendly: they only require CPU power.
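To make this division of labor concrete, here is a minimal sketch of how two such models could be combined; the function names, model objects, and the majority-vote rule are illustrative assumptions, not the service's actual code:

```python
from collections import Counter

# Illustrative sketch only: combine a token-type model and a segmentation
# model into typed segments. Both model objects are hypothetical.
def analyze_page(tokens, type_model, segmentation_model):
    # Model 1: predict a category (e.g. "Text", "Title") for every token.
    token_types = type_model.predict(tokens)

    # Model 2: group the page's tokens into segments,
    # returned here as lists of token indices.
    segments = segmentation_model.predict(tokens)

    # A segment's type could then be the majority type of its tokens.
    typed_segments = []
    for token_indices in segments:
        type_counts = Counter(token_types[i] for i in token_indices)
        typed_segments.append({
            "tokens": token_indices,
            "type": type_counts.most_common(1)[0][0],
        })
    return typed_segments
```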
## Data

As mentioned above, we use a visual model trained on the [DocLayNet](https://github.com/DS4SD/DocLayNet) dataset, and we also used this dataset to train the LightGBM models. There are 11 categories in this dataset:

    1: "Caption"
    2: "Footnote"
    3: "Formula"
    4: "ListItem"
    5: "PageFooter"
    6: "PageHeader"
    7: "Picture"
    8: "SectionHeader"
    9: "Table"
    10: "Text"
    11: "Title"

For more information about the data, you can visit the link shared above.
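If you need this mapping in code, it is just the dictionary below (a convenience sketch; the ids and labels come directly from the list above):

```python
# DocLayNet category ids and their labels, as listed above.
DOCLAYNET_CATEGORIES = {
    1: "Caption",
    2: "Footnote",
    3: "Formula",
    4: "ListItem",
    5: "PageFooter",
    6: "PageHeader",
    7: "Picture",
    8: "SectionHeader",
    9: "Table",
    10: "Text",
    11: "Title",
}
```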
## Usage

As mentioned in the [Quick Start](#quick-start), you can use the service simply like this:

```
curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5060
```

This command runs the visual model, so you should be prepared for it to use a lot of resources. If you want to use the non-visual models instead, which are the LightGBM models, you can use this command:

```
curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5060/fast
```

The shape of the response is the same for both of these commands.
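The same requests can of course be made from code. Here is a minimal Python sketch using the `requests` library, mirroring the curl commands above (the file path is a placeholder):

```python
import requests

# Post a PDF to the service; use "/fast" for the LightGBM models
# or the root path for the default visual model.
with open("/PATH/TO/PDF/pdf_name.pdf", "rb") as pdf_file:
    response = requests.post(
        "http://localhost:5060/fast",
        files={"file": pdf_file},
    )

response.raise_for_status()
segment_boxes = response.json()  # a list of SegmentBox elements
```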
When the process is done, the output will include a list of SegmentBox elements, and every SegmentBox element has this information:

```
{
    ...
    "height": Height of the segment
    "page_number": Page number which the segment belongs to
    "text": Text inside the segment
    "type": Type of the segment (one of the categories mentioned above)
}
```
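Continuing the sketch above, iterating over the response is straightforward (the field names come from the SegmentBox description):

```python
# Print a short summary of each segment, in the order the service returned.
for segment in segment_boxes:
    print(f'page {segment["page_number"]} [{segment["type"]}]: {segment["text"][:60]}')
```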
And to stop the server, you can simply run:

```
make stop
```
### Order of the Output Elements

When all the processing is done, the service returns the list of SegmentBox elements in a determined order. To figure out this order, we mostly rely on [Poppler](https://poppler.freedesktop.org); in addition, we also get help from the types of the segments.

During the PDF-to-XML conversion, Poppler determines an initial reading order for each token it creates. These tokens are typically lines of text, but this depends on Poppler's heuristics. When we extract a segment, it usually consists of multiple tokens. Therefore, for each segment on the page, we calculate an "average reading order" by averaging the reading orders of the tokens within that segment, and we then sort the segments based on this average reading order. However, this process is not solely dependent on Poppler; we also consider the types of the segments.

First, we place the "header" segments at the beginning and sort them among themselves. Next, we sort the remaining segments, excluding "footers" and "footnotes," which are positioned at the end of the output.

Occasionally, we encounter segments, such as pictures, that might not contain any text. Since Poppler cannot assign a reading order to these non-text segments, we process them after sorting all the segments with content. To determine their reading order, we rely on the reading order of the nearest "non-empty" segment, using distance as the criterion.
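As a rough illustration of this ordering logic, here is a simplified Python sketch. The segment representation (a dict with `type`, `center`, and per-token `token_orders`) and the exact type labels used for headers and footers are assumptions for the example, not the service's real internals:

```python
from statistics import mean

def order_segments(segments):
    """Sort segments by average token reading order, with headers first,
    footers/footnotes last, and empty segments placed by proximity."""
    def avg_order(seg):
        # Average of the Poppler reading orders of the segment's tokens.
        return mean(seg["token_orders"])

    with_text = [s for s in segments if s["token_orders"]]
    headers = sorted((s for s in with_text if s["type"] == "PageHeader"),
                     key=avg_order)
    footers = sorted((s for s in with_text
                      if s["type"] in ("PageFooter", "Footnote")),
                     key=avg_order)
    body = sorted((s for s in with_text
                   if s["type"] not in ("PageHeader", "PageFooter", "Footnote")),
                  key=avg_order)
    ordered = headers + body + footers

    def distance(a, b):
        (ax, ay), (bx, by) = a["center"], b["center"]
        return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5

    # Segments without tokens (e.g. pictures) get no reading order from
    # Poppler, so they inherit the position of the nearest non-empty segment.
    for seg in (s for s in segments if not s["token_orders"]):
        nearest = min(ordered, key=lambda s: distance(s, seg))
        ordered.insert(ordered.index(nearest) + 1, seg)
    return ordered
```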