ChatDOC committed on
Commit fa3dd33 · verified · 1 parent: 4a0d0ef

Update README.md

Files changed (1):
  1. README.md +88 -16
README.md CHANGED
@@ -26,18 +26,6 @@ OCRFlux is a multimodal large language model based toolkit for converting PDFs a
 
  Try the online demo: https://ocrflux.pdfparser.io/
 
- # Functions
-
- ## On each page
-
- Convert into text with a natural reading order, even in the presence of multi-column layouts, figures, and insets
- Support for complicated tables and equations
- Automatically removes headers and footers
-
- ## Cross-page table/paragraph merging
-
- Cross-page table merging
- Cross-page paragraph merging
 
  ## Key features:
  Superior parsing quality on each page
@@ -49,16 +37,100 @@ Native support for cross-page table/paragraph merging (to our best this is the f
  Based on a 3B parameter VLM, so it can run even on GTX 3090 GPU.
 
 
- ## News
- Jun 17, 2025 - v0.1.0 - Initial public launch and demo.
-
-
  ## Usage
 
  The best way to use this model is via the [OCRFlux toolkit](https://github.com/chatdoc-com/OCRFlux).
  The toolkit comes with an efficient inference setup via vllm that can handle millions of documents
  at scale.
+ ### API for directly calling OCRFlux (new)
+ You can use the inference API to call OCRFlux directly in your code, without running an online vllm server, as follows:
+
+ ```python
+ from vllm import LLM
+ from ocrflux.inference import parse
+
+ file_path = 'test.pdf'
+ # file_path = 'test.png'
+ llm = LLM(model="model_dir/OCRFlux-3B", gpu_memory_utilization=0.8, max_model_len=8192)
+ result = parse(llm, file_path)
+ document_markdown = result['document_text']
+ with open('test.md', 'w') as f:
+     f.write(document_markdown)
+ ```
+
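If you are converting many files, it can help to factor the save step into a small helper. A minimal sketch, assuming only that `parse` returns a dict with a `'document_text'` key as in the example above (the helper name and the stand-in dict below are illustrative, not part of the OCRFlux API):

```python
from pathlib import Path

def save_markdown(result: dict, out_path: str) -> str:
    """Write the parsed document text from an OCRFlux result dict to a Markdown file."""
    text = result['document_text']
    Path(out_path).write_text(text, encoding='utf-8')
    return out_path

# Illustrative call with a stand-in result; in practice this comes from parse(llm, file_path).
demo = {'document_text': '# Title\n\nBody.'}
save_markdown(demo, 'test.md')
```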
+ ### Docker Usage
+
+ Requirements:
+
+ - Docker with GPU support [(NVIDIA Toolkit)](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)
+ - Pre-downloaded model: [OCRFlux-3B](https://huggingface.co/ChatDOC/OCRFlux-3B)
+
+ To use OCRFlux in a Docker container, you can use the following example command:
+
+ ```bash
+ docker run -it --gpus all \
+   -v /path/to/localworkspace:/localworkspace \
+   -v /path/to/test_pdf_dir:/test_pdf_dir/ \
+   -v /path/to/OCRFlux-3B:/OCRFlux-3B \
+   chatdoc/ocrflux:latest /localworkspace --data /test_pdf_dir/* --model /OCRFlux-3B/
+ ```
+
+ #### Viewing Results
+ Generate the final Markdown files by running the following command. The generated Markdown files will be in the `./localworkspace/markdowns/DOCUMENT_NAME` directory.
+
+ ```bash
+ python -m ocrflux.jsonl_to_markdown ./localworkspace
+ ```
+
+ ### Full documentation for the pipeline
+
+ ```bash
+ python -m ocrflux.pipeline --help
+ usage: pipeline.py [-h] [--task {pdf2markdown,merge_pages,merge_tables}] [--data [DATA ...]] [--pages_per_group PAGES_PER_GROUP] [--max_page_retries MAX_PAGE_RETRIES]
+                    [--max_page_error_rate MAX_PAGE_ERROR_RATE] [--workers WORKERS] [--model MODEL] [--model_max_context MODEL_MAX_CONTEXT] [--model_chat_template MODEL_CHAT_TEMPLATE]
+                    [--target_longest_image_dim TARGET_LONGEST_IMAGE_DIM] [--skip_cross_page_merge] [--port PORT]
+                    workspace
+
+ Manager for running millions of PDFs through a batch inference pipeline
+
+ positional arguments:
+   workspace             The filesystem path where work will be stored, can be a local folder
+
+ options:
+   -h, --help            show this help message and exit
+   --data [DATA ...]     List of paths to files to process
+   --pages_per_group PAGES_PER_GROUP
+                         Aiming for this many pdf pages per work item group
+   --max_page_retries MAX_PAGE_RETRIES
+                         Max number of times we will retry rendering a page
+   --max_page_error_rate MAX_PAGE_ERROR_RATE
+                         Rate of allowable failed pages in a document, 1/250 by default
+   --workers WORKERS     Number of workers to run at a time
+   --model MODEL         The path to the model
+   --model_max_context MODEL_MAX_CONTEXT
+                         Maximum context length that the model was fine tuned under
+   --model_chat_template MODEL_CHAT_TEMPLATE
+                         Chat template to pass to vllm server
+   --target_longest_image_dim TARGET_LONGEST_IMAGE_DIM
+                         Dimension on longest side to use for rendering the pdf pages
+   --skip_cross_page_merge
+                         Whether to skip cross-page merging
+   --port PORT           Port to use for the VLLM server
+ ```
+
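For example, a typical end-to-end conversion run might look like the following sketch (the paths are placeholders; the flag names come from the help text above):

```bash
python -m ocrflux.pipeline ./localworkspace \
  --task pdf2markdown \
  --data /path/to/pdfs/*.pdf \
  --model /path/to/OCRFlux-3B
```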
+ ## Code overview
+
+ There are some nice reusable pieces of the code that may be useful for your own projects:
+ - Processing millions of PDFs through our released model using VLLM - [pipeline.py](https://github.com/chatdoc-com/OCRFlux/blob/main/ocrflux/pipeline.py)
+ - Generating final Markdown files from jsonl files - [jsonl_to_markdown.py](https://github.com/chatdoc-com/OCRFlux/blob/main/ocrflux/jsonl_to_markdown.py)
+ - Evaluating the model on the single-page parsing task - [eval_page_to_markdown.py](https://github.com/chatdoc-com/OCRFlux/blob/main/eval/eval_page_to_markdown.py)
+ - Evaluating the model on the table parsing task - [eval_table_to_html.py](https://github.com/chatdoc-com/OCRFlux/blob/main/eval/eval_table_to_html.py)
+ - Evaluating the model on the paragraph/table merge detection task - [eval_element_merge_detect.py](https://github.com/chatdoc-com/OCRFlux/blob/main/eval/eval_element_merge_detect.py)
+ - Evaluating the model on the table merging task - [eval_html_table_merge.py](https://github.com/chatdoc-com/OCRFlux/blob/main/eval/eval_html_table_merge.py)
+
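As a rough illustration of the final assembly step, joining per-page Markdown fragments into one document can be sketched as below. This is not the actual `jsonl_to_markdown` implementation, and the jsonl layout (one JSON object per line with a `page_text` field) is a hypothetical stand-in:

```python
import json

def jsonl_to_document(jsonl_text: str) -> str:
    """Join per-page Markdown fragments (one JSON object per line,
    each with a hypothetical 'page_text' field) into one document."""
    pages = [json.loads(line)['page_text']
             for line in jsonl_text.splitlines() if line.strip()]
    return '\n\n'.join(pages)

sample = '\n'.join([
    json.dumps({'page_text': '# Page 1'}),
    json.dumps({'page_text': 'Page 2 body.'}),
])
print(jsonl_to_document(sample))
```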
  ### Benchmark for single-page parsing
 
  We ship two comprehensive benchmarks to help measure the performance of our OCR system in single-page parsing: