Update README.md
README.md

Try the online demo: https://ocrflux.pdfparser.io/

## Key features

- Superior parsing quality on each page
- Native support for cross-page table/paragraph merging (to the best of our knowledge, the first open-source project to support this)
- Based on a 3B-parameter VLM, so it can run even on an RTX 3090 GPU
## Usage

The best way to use this model is via the [OCRFlux toolkit](https://github.com/chatdoc-com/OCRFlux). The toolkit comes with an efficient inference setup via vLLM that can handle millions of documents at scale.

### API for directly calling OCRFlux (New)

You can use the inference API to call OCRFlux directly in your own code, without running an online vLLM server, as follows:

```python
from vllm import LLM
from ocrflux.inference import parse

file_path = 'test.pdf'
# file_path = 'test.png'
llm = LLM(model="model_dir/OCRFlux-3B", gpu_memory_utilization=0.8, max_model_len=8192)
result = parse(llm, file_path)
document_markdown = result['document_text']
with open('test.md', 'w') as f:
    f.write(document_markdown)
```
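
To push many files through a single `LLM` instance, you can simply loop over `parse`. Below is a minimal sketch under one stated assumption: it treats a `None` return value as a parse failure, which is our guess at the failure signal rather than documented behavior.

```python
# Batch sketch over the parse() API shown above.
# Assumption: parse() returns None when a document cannot be parsed.
from pathlib import Path

from vllm import LLM
from ocrflux.inference import parse

llm = LLM(model="model_dir/OCRFlux-3B", gpu_memory_utilization=0.8, max_model_len=8192)

for pdf_path in sorted(Path('test_pdf_dir').glob('*.pdf')):
    result = parse(llm, str(pdf_path))
    if result is None:  # assumed failure signal
        print(f'failed to parse {pdf_path}')
        continue
    # Write one Markdown file next to each input PDF.
    pdf_path.with_suffix('.md').write_text(result['document_text'])
```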

### Docker Usage

Requirements:

- Docker with GPU support [(NVIDIA Toolkit)](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)
- Pre-downloaded model: [OCRFlux-3B](https://huggingface.co/ChatDOC/OCRFlux-3B)

To run OCRFlux in a Docker container, you can use the following example command:

```bash
docker run -it --gpus all \
  -v /path/to/localworkspace:/localworkspace \
  -v /path/to/test_pdf_dir:/test_pdf_dir/ \
  -v /path/to/OCRFlux-3B:/OCRFlux-3B \
  chatdoc/ocrflux:latest /localworkspace --data /test_pdf_dir/* --model /OCRFlux-3B/
```
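
The three `-v` mounts map your output workspace, the input PDF directory, and the model weights into the container; replace the `/path/to/...` host paths with your own directories.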

#### Viewing Results

Generate the final Markdown files by running the following command. The generated Markdown files will be in the `./localworkspace/markdowns/DOCUMENT_NAME` directory.

```bash
python -m ocrflux.jsonl_to_markdown ./localworkspace
```

### Full documentation for the pipeline

```bash
python -m ocrflux.pipeline --help
usage: pipeline.py [-h] [--task {pdf2markdown,merge_pages,merge_tables}] [--data [DATA ...]] [--pages_per_group PAGES_PER_GROUP] [--max_page_retries MAX_PAGE_RETRIES]
                   [--max_page_error_rate MAX_PAGE_ERROR_RATE] [--workers WORKERS] [--model MODEL] [--model_max_context MODEL_MAX_CONTEXT] [--model_chat_template MODEL_CHAT_TEMPLATE]
                   [--target_longest_image_dim TARGET_LONGEST_IMAGE_DIM] [--skip_cross_page_merge] [--port PORT]
                   workspace

Manager for running millions of PDFs through a batch inference pipeline

positional arguments:
  workspace             The filesystem path where work will be stored, can be a local folder

options:
  -h, --help            show this help message and exit
  --data [DATA ...]     List of paths to files to process
  --pages_per_group PAGES_PER_GROUP
                        Aiming for this many pdf pages per work item group
  --max_page_retries MAX_PAGE_RETRIES
                        Max number of times we will retry rendering a page
  --max_page_error_rate MAX_PAGE_ERROR_RATE
                        Rate of allowable failed pages in a document, 1/250 by default
  --workers WORKERS     Number of workers to run at a time
  --model MODEL         The path to the model
  --model_max_context MODEL_MAX_CONTEXT
                        Maximum context length that the model was fine tuned under
  --model_chat_template MODEL_CHAT_TEMPLATE
                        Chat template to pass to vllm server
  --target_longest_image_dim TARGET_LONGEST_IMAGE_DIM
                        Dimension on longest side to use for rendering the pdf pages
  --skip_cross_page_merge
                        Whether to skip cross-page merging
  --port PORT           Port to use for the VLLM server
```
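
As a concrete usage example, the invocation below mirrors the arguments from the Docker command above, run on a host install; the workspace, data, and model paths are placeholders for your own setup:

```bash
# Illustrative invocation; the paths here are placeholders.
python -m ocrflux.pipeline ./localworkspace \
  --data /test_pdf_dir/*.pdf \
  --model /path/to/OCRFlux-3B
```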

## Code overview

There are some nice reusable pieces of the code that may be useful for your own projects:

- Processing millions of PDFs through our released model using vLLM - [pipeline.py](https://github.com/chatdoc-com/OCRFlux/blob/main/ocrflux/pipeline.py)
- Generating final Markdowns from jsonl files - [jsonl_to_markdown.py](https://github.com/chatdoc-com/OCRFlux/blob/main/ocrflux/jsonl_to_markdown.py)
- Evaluating the model on the single-page parsing task - [eval_page_to_markdown.py](https://github.com/chatdoc-com/OCRFlux/blob/main/eval/eval_page_to_markdown.py)
- Evaluating the model on the table parsing task - [eval_table_to_html.py](https://github.com/chatdoc-com/OCRFlux/blob/main/eval/eval_table_to_html.py)
- Evaluating the model on the paragraph/table merge detection task - [eval_element_merge_detect.py](https://github.com/chatdoc-com/OCRFlux/blob/main/eval/eval_element_merge_detect.py)
- Evaluating the model on the table merging task - [eval_html_table_merge.py](https://github.com/chatdoc-com/OCRFlux/blob/main/eval/eval_html_table_merge.py)

### Benchmark for single-page parsing

We ship two comprehensive benchmarks to help measure the performance of our OCR system in single-page parsing: