The toolkit comes with an efficient inference setup via vllm that can handle millions of documents at scale.

### Benchmark for single-page parsing

We ship two comprehensive benchmarks to help measure the performance of our OCR system in single-page parsing:

- [OCRFlux-bench-single](https://huggingface.co/datasets/ChatDOC/OCRFlux-bench-single): Contains 2000 PDF pages (1000 English and 1000 Chinese) with their ground-truth Markdown, manually labeled with multi-round checks.
- [OCRFlux-pubtabnet-single](https://huggingface.co/datasets/ChatDOC/OCRFlux-pubtabnet-single): Derived from the public [PubTabNet](https://github.com/ibm-aur-nlp/PubTabNet) benchmark with some format transformation. It contains 9064 HTML table samples, split into simple and complex tables according to whether they contain rowspan or colspan cells.

We emphasize that the released benchmarks are NOT included in our training and evaluation data. The main results are as follows:

1. On [OCRFlux-bench-single](https://huggingface.co/datasets/ChatDOC/OCRFlux-bench-single), we calculate the Edit Distance Similarity (EDS) between the generated Markdown and the ground-truth Markdown as the metric.

<table>
  <thead>
    <tr>
      <th>Language</th>
      <th>Model</th>
      <th>Avg EDS ↑</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td rowspan="4">English</td>
      <td>olmOCR-7B-0225-preview</td>
      <td>0.885</td>
    </tr>
    <tr>
      <td>Nanonets-OCR-s</td>
      <td>0.870</td>
    </tr>
    <tr>
      <td>MonkeyOCR</td>
      <td>0.828</td>
    </tr>
    <tr>
      <td><strong><a href="https://huggingface.co/ChatDOC/OCRFlux-3B">OCRFlux-3B</a></strong></td>
      <td>0.971</td>
    </tr>
    <tr>
      <td rowspan="4">Chinese</td>
      <td>olmOCR-7B-0225-preview</td>
      <td>0.859</td>
    </tr>
    <tr>
      <td>Nanonets-OCR-s</td>
      <td>0.846</td>
    </tr>
    <tr>
      <td>MonkeyOCR</td>
      <td>0.731</td>
    </tr>
    <tr>
      <td><strong><a href="https://huggingface.co/ChatDOC/OCRFlux-3B">OCRFlux-3B</a></strong></td>
      <td>0.962</td>
    </tr>
    <tr>
      <td rowspan="4">Total</td>
      <td>olmOCR-7B-0225-preview</td>
      <td>0.872</td>
    </tr>
    <tr>
      <td>Nanonets-OCR-s</td>
      <td>0.858</td>
    </tr>
    <tr>
      <td>MonkeyOCR</td>
      <td>0.780</td>
    </tr>
    <tr>
      <td><strong><a href="https://huggingface.co/ChatDOC/OCRFlux-3B">OCRFlux-3B</a></strong></td>
      <td>0.967</td>
    </tr>
  </tbody>
</table>
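
As a rough illustration of the metric (the benchmark's exact tokenization and normalization are assumptions here), EDS can be computed as one minus the Levenshtein distance divided by the length of the longer string:

```python
# Minimal sketch of Edit Distance Similarity (EDS) between two Markdown
# strings. Assumed normalization: EDS = 1 - levenshtein(a, b) / max(len(a), len(b)).

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[len(b)]

def eds(a: str, b: str) -> float:
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

print(round(eds("# Title\ntext", "# Title\ntext"), 3))  # identical pages -> 1.0
```
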

2. On [OCRFlux-pubtabnet-single](https://huggingface.co/datasets/ChatDOC/OCRFlux-pubtabnet-single), we calculate the Tree Edit Distance-based Similarity (TEDS) between the generated HTML tables and the ground-truth HTML tables as the metric.

<table>
  <thead>
    <tr>
      <th>Type</th>
      <th>Model</th>
      <th>Avg TEDS ↑</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td rowspan="4">Simple</td>
      <td>olmOCR-7B-0225-preview</td>
      <td>0.810</td>
    </tr>
    <tr>
      <td>Nanonets-OCR-s</td>
      <td>0.882</td>
    </tr>
    <tr>
      <td>MonkeyOCR</td>
      <td>0.880</td>
    </tr>
    <tr>
      <td><strong><a href="https://huggingface.co/ChatDOC/OCRFlux-3B">OCRFlux-3B</a></strong></td>
      <td>0.912</td>
    </tr>
    <tr>
      <td rowspan="4">Complex</td>
      <td>olmOCR-7B-0225-preview</td>
      <td>0.676</td>
    </tr>
    <tr>
      <td>Nanonets-OCR-s</td>
      <td>0.772</td>
    </tr>
    <tr>
      <td><strong>MonkeyOCR</strong></td>
      <td>0.826</td>
    </tr>
    <tr>
      <td><a href="https://huggingface.co/ChatDOC/OCRFlux-3B">OCRFlux-3B</a></td>
      <td>0.807</td>
    </tr>
    <tr>
      <td rowspan="4">Total</td>
      <td>olmOCR-7B-0225-preview</td>
      <td>0.744</td>
    </tr>
    <tr>
      <td>Nanonets-OCR-s</td>
      <td>0.828</td>
    </tr>
    <tr>
      <td>MonkeyOCR</td>
      <td>0.853</td>
    </tr>
    <tr>
      <td><strong><a href="https://huggingface.co/ChatDOC/OCRFlux-3B">OCRFlux-3B</a></strong></td>
      <td>0.861</td>
    </tr>
  </tbody>
</table>
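
For intuition about the metric, the sketch below computes an ordered tree edit distance with the classic forest recurrence and normalizes by the size of the larger tree. It deliberately simplifies real TEDS, which also parses the HTML and compares cell text content (both assumed away here); trees are plain (label, children) tuples:

```python
# Simplified TEDS sketch. Trees: (label, children) with children a tuple of trees.
from functools import lru_cache

def size(t):
    """Number of nodes in a tree."""
    return 1 + sum(size(c) for c in t[1])

@lru_cache(maxsize=None)
def forest_dist(F, G):
    """Edit distance between two ordered forests (tuples of trees)."""
    if not F:
        return sum(size(t) for t in G)
    if not G:
        return sum(size(t) for t in F)
    f, g = F[-1], G[-1]
    return min(
        forest_dist(F[:-1] + f[1], G) + 1,   # delete root of rightmost tree in F
        forest_dist(F, G[:-1] + g[1]) + 1,   # insert root of rightmost tree in G
        forest_dist(F[:-1], G[:-1])          # map the two roots onto each other
        + forest_dist(f[1], g[1])
        + (f[0] != g[0]),                    # relabel cost
    )

def teds(t1, t2):
    """1 - tree edit distance normalized by the larger tree's size."""
    return 1.0 - forest_dist((t1,), (t2,)) / max(size(t1), size(t2))
```

On realistic tables this brute-force recurrence is slow; production implementations use polynomial-time algorithms such as Zhang-Shasha or APTED.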

We also present case studies demonstrating the superiority of our model in the [blog](https://ocrflux.pdfparser.io/#/blog) article.

### Benchmark for cross-page table/paragraph merging

PDF documents are typically paginated, which often results in tables or paragraphs being split across consecutive pages. Accurately detecting and merging such cross-page structures is crucial to avoid generating incomplete or fragmented content.

The detection task can be formulated as follows: given the Markdown of two consecutive pages, each structured as a list of Markdown elements (e.g., paragraphs and tables), the goal is to identify the indexes of the elements that should be merged across the pages.

For the merging task, paragraphs can simply be concatenated, but merging two table fragments is much more challenging. For example, a table spanning multiple pages often repeats the header from the first page on the second page. Another difficult scenario is a cell whose long content spans multiple lines, with the first few lines appearing on the previous page and the remaining lines continuing on the next. We also observe cases where tables with a large number of columns are split vertically and placed on two consecutive pages. More examples of cross-page tables can be found in our [blog](https://ocrflux.pdfparser.io/#/blog) article. To address these issues, we develop an LLM-based model for cross-page table merging. Specifically, this model takes two split table fragments as input and generates a complete, well-structured table as output.

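
The detect-then-merge flow can be sketched as follows. The element encoding and the merge_tables_with_llm stub are hypothetical stand-ins for illustration, not the toolkit's actual API:

```python
# Hypothetical sketch of the cross-page merge step: tables go to the LLM-based
# merger, paragraph fragments are simply concatenated.

def is_table(element: str) -> bool:
    """Assume table elements are serialized as HTML <table> blocks."""
    return element.lstrip().startswith("<table")

def merge_tables_with_llm(frag_a: str, frag_b: str) -> str:
    # Stand-in for the model call that rebuilds one well-structured table
    # from two split fragments (repeated headers, split cells, ...).
    raise NotImplementedError

def merge_pair(prev_elem: str, next_elem: str) -> str:
    """Merge the last element of one page with the first element of the next."""
    if is_table(prev_elem) and is_table(next_elem):
        return merge_tables_with_llm(prev_elem, next_elem)
    # Paragraph fragments can simply be concatenated.
    return prev_elem.rstrip() + " " + next_elem.lstrip()

print(merge_pair("The results show", "a clear improvement."))
# prints "The results show a clear improvement."
```
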
We ship two comprehensive benchmarks to measure the performance of our OCR system on the cross-page table/paragraph detection and merging tasks, respectively:

- [OCRFlux-bench-cross](https://huggingface.co/datasets/ChatDOC/OCRFlux-bench-cross): Contains 1000 samples (500 English and 500 Chinese); each sample provides the Markdown element lists of two consecutive pages, along with the indexes of the elements that need to be merged (manually labeled through multiple rounds of review). If no tables or paragraphs require merging, the indexes in the annotation data are left empty.
- [OCRFlux-pubtabnet-cross](https://huggingface.co/datasets/ChatDOC/OCRFlux-pubtabnet-cross): Contains 9064 pairs of split table fragments, along with their corresponding ground-truth merged versions.

These released benchmarks are likewise NOT included in our training and evaluation data. The main results are as follows:

1. On [OCRFlux-bench-cross](https://huggingface.co/datasets/ChatDOC/OCRFlux-bench-cross), we calculate Accuracy, Precision, Recall, and F1 score as the metrics. Note that a detection result counts as correct only when it accurately judges whether any elements need to be merged across the two pages and outputs the right indexes for them.

| Language | Precision ↑ | Recall ↑ | F1 ↑  | Accuracy ↑ |
|----------|-------------|----------|-------|------------|
| English  | 0.992       | 0.964    | 0.978 | 0.978      |
| Chinese  | 1.000       | 0.988    | 0.994 | 0.994      |
| Total    | 0.996       | 0.976    | 0.986 | 0.986      |
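
Under one plausible reading of this protocol (the exact matching rule is our assumption), a sample is positive when its gold index list is non-empty, and a positive prediction counts as a true positive only when the predicted index pairs exactly equal the gold ones:

```python
# Sketch of the detection metrics under an assumed protocol: "positive" means
# some cross-page merge is annotated; a predicted positive is a true positive
# only when the predicted index pairs exactly equal the gold set.

def detection_metrics(samples):
    """samples: list of (gold_pairs, pred_pairs), each a set of (i, j) indexes."""
    tp = fp = fn = tn = 0
    for gold, pred in samples:
        if pred:
            if gold and pred == gold:
                tp += 1          # correctly found the exact merge indexes
            else:
                fp += 1          # predicted a merge that is absent or wrong
        elif gold:
            fn += 1              # missed an annotated merge
        else:
            tn += 1              # correctly predicted "nothing to merge"
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(samples)
    return precision, recall, f1, accuracy
```
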
2. On [OCRFlux-pubtabnet-cross](https://huggingface.co/datasets/ChatDOC/OCRFlux-pubtabnet-cross), we calculate the Tree Edit Distance-based Similarity (TEDS) between the generated merged table and the ground-truth merged table as the metric.

| Table type | Avg TEDS ↑ |
|------------|------------|
| Simple     | 0.965      |
| Complex    | 0.935      |
| Total      | 0.950      |