---
license: openrail
---
<h3 align="center">PDF Document Layout Analysis</h3>
<p align="center">Models for extracting segments alongside with their types from a PDF</p>
In this model card, we provide the non-visual models we use in our pdf-document-layout-analysis service:
https://github.com/huridocs/pdf-document-layout-analysis
This service segments and classifies the different parts of PDF pages, identifying elements such as text, titles, pictures, and tables. It also determines the correct reading order of these elements.
<table>
<tr>
<td>
<img src="https://raw.githubusercontent.com/huridocs/pdf-document-layout-analysis/main/images/vgtexample1.png"/>
</td>
<td>
<img src="https://raw.githubusercontent.com/huridocs/pdf-document-layout-analysis/main/images/vgtexample2.png"/>
</td>
<td>
<img src="https://raw.githubusercontent.com/huridocs/pdf-document-layout-analysis/main/images/vgtexample3.png"/>
</td>
<td>
<img src="https://raw.githubusercontent.com/huridocs/pdf-document-layout-analysis/main/images/vgtexample4.png"/>
</td>
</tr>
</table>
#### Project Links:
- GitHub: [pdf-document-layout-analysis](https://github.com/huridocs/pdf-document-layout-analysis)
- HuggingFace: [pdf-document-layout-analysis](https://huggingface.co/HURIDOCS/pdf-document-layout-analysis)
- DockerHub: [pdf-document-layout-analysis](https://hub.docker.com/r/huridocs/pdf-document-layout-analysis/)
## Quick Start
Run the service:
- With GPU support:
```
docker run --rm --name pdf-document-layout-analysis --gpus '"device=0"' -p 5060:5060 --entrypoint ./start.sh huridocs/pdf-document-layout-analysis:v0.0.21
```
- Without GPU support:
```
docker run --rm --name pdf-document-layout-analysis -p 5060:5060 --entrypoint ./start.sh huridocs/pdf-document-layout-analysis:v0.0.21
```
[OPTIONAL] OCR the PDF. Check the supported languages (`curl localhost:5060/info`):
```
curl -X POST -F 'language=en' -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5060/ocr --output ocr_document.pdf
```
Get the segments from a PDF:
```
curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5060
```
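If you prefer to call the service from code, here is a minimal Python sketch of the same request. It assumes the third-party `requests` package is installed; the endpoint and the `fast` parameter are the ones documented in this card:
```
import requests

# Send a PDF to the service and receive its segments as JSON.
# By default the visual (VGT) model is used; pass data={"fast": "true"}
# to use the lighter LightGBM models instead.
with open("pdf_name.pdf", "rb") as pdf:
    response = requests.post("http://localhost:5060", files={"file": pdf})
response.raise_for_status()

segments = response.json()  # a list of SegmentBox dictionaries
print(f"Found {len(segments)} segments")
```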
To stop the server:
```
docker stop pdf-document-layout-analysis
```
## Contents
- [Quick Start](#quick-start)
- [Build From Source](#build-from-source)
- [Dependencies](#dependencies)
- [Requirements](#requirements)
- [Models](#models)
- [Data](#data)
- [Usage](#usage)
- [Benchmarks](#benchmarks)
- [Related Services](#related-services)
## Build From Source
Start the service:
```
make start
```
[OPTIONAL] OCR the PDF. Check the supported languages (`curl localhost:5060/info`):
```
curl -X POST -F 'language=en' -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5060/ocr --output ocr_document.pdf
```
Get the segments from a PDF:
```
curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5060
```
To stop the server:
```
make stop
```
## Dependencies
* Docker Desktop 4.25.0 [install link](https://www.docker.com/products/docker-desktop/)
* For GPU support [install link](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)
## Requirements
* 4 GB of RAM
* 6 GB of GPU memory (if unavailable, the service runs on CPU)
## Models
There are two kinds of models in the project. The default model is a visual model, the Vision Grid Transformer (VGT), which was trained by the
Alibaba Research Group. If you would like to take a look at their original project, you can visit
[this](https://github.com/AlibabaResearch/AdvancedLiterateMachinery) link. They have published various models,
and according to our benchmarks the best-performing one is the model trained on the [DocLayNet](https://github.com/DS4SD/DocLayNet)
dataset. This is therefore the default model in our project, although it uses more resources than the other models, which we trained ourselves.

The second kind of model is the LightGBM models. These are not visual models: they do not "see" the pages, but instead
work on XML information that we extract with [Poppler](https://poppler.freedesktop.org). There are two of these
models because one predicts the type of each token and the other finds the correct segmentation of the page.
By combining both, we segment the pages and label each segment with the type of its content.

Even though the visual model uses more resources than the others, it generally performs better, since it
"sees" the whole page and therefore has the full context. The LightGBM models perform slightly worse,
but they are much faster and more resource-friendly, requiring only your CPU.
The service converts PDFs to text-searchable PDFs using [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) and [ocrmypdf](https://ocrmypdf.readthedocs.io/en/latest/index.html).
## Data
As mentioned above, the visual model was trained on the [DocLayNet](https://github.com/DS4SD/DocLayNet) dataset,
and we used the same dataset to train the LightGBM models. There are 11 categories in this dataset:
```
1: "Caption"
2: "Footnote"
3: "Formula"
4: "List item"
5: "Page footer"
6: "Page header"
7: "Picture"
8: "Section header"
9: "Table"
10: "Text"
11: "Title"
```
For more information about the data, you can visit the link we shared above.
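If you post-process the service output in code, a small mapping of these category IDs to the type names can be handy. This is a convenience snippet of ours, not something shipped by the service; the `DOCLAYNET_TYPES` name is hypothetical:
```
# DocLayNet category IDs mapped to the segment type names
# returned by the service (the dictionary name is our own).
DOCLAYNET_TYPES = {
    1: "Caption",
    2: "Footnote",
    3: "Formula",
    4: "List item",
    5: "Page footer",
    6: "Page header",
    7: "Picture",
    8: "Section header",
    9: "Table",
    10: "Text",
    11: "Title",
}
```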
## Usage
As we mentioned in the [Quick Start](#quick-start), you can use the service simply like this:
```
curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5060
```
This command runs the visual model, so be prepared for it to use a lot of resources. If you would rather use the
non-visual models, the LightGBM models, use this command instead:
```
curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' -F "fast=true" localhost:5060
```
The shape of the response is the same for both commands.
When the process is done, the output includes a list of SegmentBox elements, and every SegmentBox element contains the following information:
```
{
    "left": Left position of the segment
    "top": Top position of the segment
    "width": Width of the segment
    "height": Height of the segment
    "page_number": Page number to which the segment belongs
    "page_width": Width of the page to which the segment belongs
    "page_height": Height of the page to which the segment belongs
    "text": Text inside the segment
    "type": Type of the segment (one of the categories mentioned above)
}
```
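If you prefer working with typed objects, here is one possible way to load the response in Python. This is a sketch assuming the third-party `requests` package; the `SegmentBox` dataclass below is our own mirror of the fields listed above, not a class shipped by the service, and it assumes the response contains exactly these fields:
```
import requests
from dataclasses import dataclass

@dataclass
class SegmentBox:
    # Mirrors the response fields described above.
    left: float
    top: float
    width: float
    height: float
    page_number: int
    page_width: float
    page_height: float
    text: str
    type: str

with open("pdf_name.pdf", "rb") as pdf:
    response = requests.post("http://localhost:5060", files={"file": pdf})
response.raise_for_status()

segments = [SegmentBox(**item) for item in response.json()]
section_headers = [s.text for s in segments if s.type == "Section header"]
```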
If you want to get the visualizations, you can use this command:
```
curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5060/visualize -o '/PATH/TO/OUTPUT_PDF/pdf_name.pdf'
```
Or with the fast models:
```
curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' -F "fast=true" localhost:5060/visualize -o '/PATH/TO/OUTPUT_PDF/pdf_name.pdf'
```
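The same visualization call from Python, as a sketch (again assuming `requests`; the `/visualize` endpoint is the one documented above):
```
import requests

# Ask the service for a copy of the PDF with the detected
# segments drawn on top of it, and save it to disk.
with open("pdf_name.pdf", "rb") as pdf:
    response = requests.post("http://localhost:5060/visualize", files={"file": pdf})
response.raise_for_status()

with open("visualization.pdf", "wb") as output:
    output.write(response.content)
```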
And to stop the server, you can simply use:
```
make stop
```
### Order of the Output Elements
When all the processes are done, the service returns the list of SegmentBox elements in a determined order. To figure out this order,
we mostly rely on [Poppler](https://poppler.freedesktop.org). In addition, we also use the types of the segments.
During the PDF to XML conversion, Poppler determines an initial reading order for each token it creates. These tokens are typically lines of text,
but it depends on Poppler's heuristics. When we extract a segment, it usually consists of multiple tokens. Therefore, for each segment on the page,
we calculate an "average reading order" by averaging the reading orders of the tokens within that segment. We then sort the segments
based on this average reading order. However, this process does not depend solely on Poppler; we also consider the types of the segments.
First, we place the "header" segments at the beginning and sort them among themselves. Next, we sort the remaining segments,
excluding "footers" and "footnotes," which are positioned at the end of the output.
Occasionally, we encounter segments like pictures that might not contain text. Since Poppler cannot assign a reading order to these non-text segments,
we process them after sorting all segments with content. To determine their reading order, we rely on the reading order of the nearest "non-empty" segment,
using distance as a criterion.
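To make the heuristic concrete, here is a simplified sketch of the average-reading-order step in Python. It only illustrates the idea described above and is not the service's actual implementation; the segment structure (a dict carrying the token reading orders) is assumed for the example:
```
# Simplified sketch of the ordering heuristic described above.
# Data shapes are assumed; this is not the service's actual code.

def average_reading_order(token_orders):
    # token_orders: reading-order integers Poppler assigned to the
    # tokens that make up one segment.
    return sum(token_orders) / len(token_orders)

def sort_segments(segments):
    # segments: list of dicts like {"type": "Text", "token_orders": [3, 4, 5]}.
    # This sketch assumes every segment has token orders; in the service,
    # non-text segments (e.g. pictures) are inserted afterwards using the
    # reading order of the nearest non-empty segment.
    def key(segment):
        return average_reading_order(segment["token_orders"])

    headers = [s for s in segments if s["type"] == "Page header"]
    footers = [s for s in segments if s["type"] in ("Page footer", "Footnote")]
    body = [s for s in segments if s not in headers and s not in footers]
    return sorted(headers, key=key) + sorted(body, key=key) + sorted(footers, key=key)
```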
### Extracting Tables and Formulas
Our service provides a way to extract tables and formulas in different formats.
By default, a formula segment's "text" property contains the formula in LaTeX format.
You can also extract tables in formats like "markdown", "latex", or "html", but this is not the default.
To extract tables in one of these formats, set the "extraction_format" parameter. Some example usages are shown below:
```
curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5060 -F "extraction_format=latex"
```
```
curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' -F "fast=true" localhost:5060 -F "extraction_format=markdown"
```
Be aware that this additional extraction step can make processing take considerably longer, especially if the document contains a large number of tables.
(For table extraction we use [StructEqTable](https://github.com/UniModal4Reasoning/StructEqTable-Deploy),
and for formula extraction we use [RapidLaTeXOCR](https://github.com/RapidAI/RapidLaTeXOCR).)
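For example, here is a Python sketch (again assuming `requests`) that requests Markdown tables and collects them from the response; the `extraction_format` parameter and the "Table" type are the ones documented above:
```
import requests

# Request segments with table content converted to Markdown.
with open("pdf_name.pdf", "rb") as pdf:
    response = requests.post(
        "http://localhost:5060",
        files={"file": pdf},
        data={"extraction_format": "markdown"},
    )
response.raise_for_status()

tables = [s["text"] for s in response.json() if s["type"] == "Table"]
print(f"Extracted {len(tables)} Markdown tables")
```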
## Benchmarks
These are the benchmark results for the VGT model on the PubLayNet dataset:
<table>
<tr>
<th>Overall</th>
<th>Text</th>
<th>Title</th>
<th>List</th>
<th>Table</th>
<th>Figure</th>
</tr>
<tr>
<td>0.962</td>
<td>0.950</td>
<td>0.939</td>
<td>0.968</td>
<td>0.981</td>
<td>0.971</td>
</tr>
</table>
You can check this link to see a comparison with other models:
https://paperswithcode.com/sota/document-layout-analysis-on-publaynet-val
## Related Services
Here are some of our other services built on top of this one:
- [PDF Table Of Contents Extractor](https://github.com/huridocs/pdf-table-of-contents-extractor): This project aims to extract the table of contents from PDF files using the outputs generated
by the pdf-document-layout-analysis service. By leveraging the segmentation and classification capabilities of the underlying analysis tool,
this project automates the process of table-of-contents extraction from PDF files.
- [PDF Text Extraction](https://github.com/huridocs/pdf-text-extraction): This project aims to extract text from PDF files using the outputs generated by the
pdf-document-layout-analysis service. By leveraging the segmentation and classification capabilities of the underlying
analysis tool, this project automates the process of text extraction from PDF files.