ds4sd
/

docling-models

Inference Endpoints

Model card Files Files and versions

docling-models / README.md

PeterStaar's picture

fixing typo

ae5a4ee 2 months ago

|

3.49 kB

	---
	license: cdla-permissive-2.0
	---

	# Docling Models

	This page contains models that power the PDF document converion package [docling](https://github.com/DS4SD/docling).

	## Layout Model

	The layout model will take an image from a poge and apply RT-DETR model in order to find different layout components. It currently detects the labels: Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, Title. As a reference (from the DocLayNet-paper), this is the performance of standard object detection methods on the DocLayNet dataset compared to human evaluation,

	\| \| human \| MRCNN \| MRCNN \| FRCNN \| YOLO \|
	\|----------------\|---------\|---------\|---------\|---------\|--------\|
	\| \| human \| R50 \| R101 \| R101 \| v5x6 \|
	\| Caption \| 84-89 \| 68.4 \| 71.5 \| 70.1 \| 77.7 \|
	\| Footnote \| 83-91 \| 70.9 \| 71.8 \| 73.7 \| 77.2 \|
	\| Formula \| 83-85 \| 60.1 \| 63.4 \| 63.5 \| 66.2 \|
	\| List-item \| 87-88 \| 81.2 \| 80.8 \| 81.0 \| 86.2 \|
	\| Page-footer \| 93-94 \| 61.6 \| 59.3 \| 58.9 \| 61.1 \|
	\| Page-header \| 85-89 \| 71.9 \| 70.0 \| 72.0 \| 67.9 \|
	\| Picture \| 69-71 \| 71.7 \| 72.7 \| 72.0 \| 77.1 \|
	\| Section-header \| 83-84 \| 67.6 \| 69.3 \| 68.4 \| 74.6 \|
	\| Table \| 77-81 \| 82.2 \| 82.9 \| 82.2 \| 86.3 \|
	\| Text \| 84-86 \| 84.6 \| 85.8 \| 85.4 \| 88.1 \|
	\| Title \| 60-72 \| 76.7 \| 80.4 \| 79.9 \| 82.7 \|
	\| All \| 82-83 \| 72.4 \| 73.5 \| 73.4 \| 76.8 \|

	## TableFormer

	The tableformer model will identify the structure of the table, starting from an image of a table. It uses the predicted table regions of the layout model to identify the tables. Tableformer has SOTA table structure identification,

	\| Model (TEDS) \| Simple table \| Complex table \| All tables \|
	\| ------------ \| ------------ \| ------------- \| ---------- \|
	\| Tabula \| 78.0 \| 57.8 \| 67.9 \|
	\| Traprange \| 60.8 \| 49.9 \| 55.4 \|
	\| Camelot \| 80.0 \| 66.0 \| 73.0 \|
	\| Acrobat Pro \| 68.9 \| 61.8 \| 65.3 \|
	\| EDD \| 91.2 \| 85.4 \| 88.3 \|
	\| TableFormer \| 95.4 \| 90.1 \| 93.6 \|

	## References

	```
	@techreport{Docling,
	author = {Deep Search Team},
	month = {8},
	title = {{Docling Technical Report}},
	url={https://arxiv.org/abs/2408.09869},
	eprint={2408.09869},
	doi = "10.48550/arXiv.2408.09869",
	version = {1.0.0},
	year = {2024}
	}

	@article{doclaynet2022,
	title = {DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis},
	doi = {10.1145/3534678.353904},
	url = {https://arxiv.org/abs/2206.01062},
	author = {Pfitzmann, Birgit and Auer, Christoph and Dolfi, Michele and Nassar, Ahmed S and Staar, Peter W J},
	year = {2022}
	}

	@InProceedings{TableFormer2022,
	author = {Nassar, Ahmed and Livathinos, Nikolaos and Lysak, Maksym and Staar, Peter},
	title = {TableFormer: Table Structure Understanding With Transformers},
	booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
	month = {June},
	year = {2022},
	pages = {4614-4623},
	doi = {https://doi.org/10.1109/CVPR52688.2022.00457}
	}
	```