Update README.md

---
base_model:
- HuggingFaceTB/SmolVLM-256M-Instruct
language:
- en
library_name: mlx
license: apache-2.0
pipeline_tag: image-text-to-text
tags:
---

# zboyles/SmolDocling-256M-preview-bf16

This model was converted to **MLX format** from [`ds4sd/SmolDocling-256M-preview`](https://huggingface.co/ds4sd/SmolDocling-256M-preview) using mlx-vlm version **0.1.18**.

* Refer to the [**original model card**](https://huggingface.co/ds4sd/SmolDocling-256M-preview) for more details on the model.
* Refer to the [**mlx-vlm repo**](https://github.com/Blaizzy/mlx-vlm) for more examples using `mlx-vlm`.
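
Install the prerequisites from the example below before running it (`docling_core` is only needed for the Docling conversion step):

```bash
pip install -U mlx-vlm
pip install docling_core
```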

## Use SmolDocling-256M-preview with docling and mlx

> **Find Working MLX + Docling Example Code Below**

<div style="display: flex; align-items: center;">
  <img src="https://huggingface.co/ds4sd/SmolDocling-256M-preview/resolve/main/assets/SmolDocling_doctags1.png" alt="SmolDocling" style="width: 200px; height: auto; margin-right: 20px;">
  <div>
    <h3>SmolDocling-256M-preview</h3>
    <p>SmolDocling is a multimodal Image-Text-to-Text model designed for efficient document conversion. It retains Docling's most popular features while ensuring full compatibility with Docling through seamless support for <strong>DoclingDocuments</strong>.</p>
  </div>
</div>

This model was presented in the paper [SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion](https://huggingface.co/papers/2503.11576).

### Features:

- **DocTags for Efficient Tokenization** – Introduces DocTags, an efficient and minimal representation for documents that is fully compatible with **DoclingDocuments**.
- **OCR (Optical Character Recognition)** – Extracts text accurately from images.
- **Layout and Localization** – Preserves document structure and document element **bounding boxes**.
- **Code Recognition** – Detects and formats code blocks, including indentation.
- **Formula Recognition** – Identifies and processes mathematical expressions.
- **Chart Recognition** – Extracts and interprets chart data.
- **Table Recognition** – Supports column and row headers for structured table extraction (see the prompt sketch after this list).
- **Figure Classification** – Differentiates figures and graphical elements.
- **Caption Correspondence** – Links captions to relevant images and figures.
- **List Grouping** – Organizes and structures list elements correctly.
- **Full-Page Conversion** – Processes entire pages for comprehensive document conversion, including all page elements (code, equations, tables, charts, etc.).
- **OCR with Bounding Boxes** – OCR regions specified by a bounding box.
- **General Document Processing** – Trained for both scientific and non-scientific documents.
- **Seamless Docling Integration** – Import into **Docling** and export in multiple formats.
- **Fast inference using vLLM** – Average of 0.35 seconds per page on an A100 GPU.
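
These features are selected at inference time through the instruction prompt; the full list of supported task prompts is documented in the original model card. As a small sketch (the prompt strings below are assumptions based on the upstream `ds4sd/SmolDocling-256M-preview` card, so verify them there), only the text portion of the chat message used in the example below needs to change:

```python
# Sketch: choosing a SmolDocling task by swapping the instruction text.
# The prompt strings are assumptions taken from the upstream model card;
# check ds4sd/SmolDocling-256M-preview for the authoritative list.
def build_messages(instruction: str) -> list:
    """Build the chat message structure used in the MLX example below."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": instruction},
            ],
        },
    ]

full_page_messages = build_messages("Convert this page to docling.")
table_messages = build_messages("Convert this table to OTSL.")
formula_messages = build_messages("Convert formula to latex.")
```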

### *Coming soon!*

- **Better chart recognition**
- **One-shot multi-page inference**
- **Chemical Recognition**
- **Datasets**

## Get started (**MLX** code examples)

You can use **mlx** to perform inference, and [Docling](https://github.com/docling-project/docling) to convert the results to a variety of output formats (md, html, etc.):

<details>
<summary>Single page image inference using MLX via `mlx-vlm`</summary>

```python
# Prerequisites:
# pip install -U mlx-vlm
# pip install docling_core

import sys

from pathlib import Path
from PIL import Image

# Docling types used to turn the generated DocTags into a DoclingDocument
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument

from mlx_vlm import load, apply_chat_template, stream_generate
from mlx_vlm.utils import load_image

# Variables
path_or_hf_repo = "zboyles/SmolDocling-256M-preview-bf16"
output_path = Path("output")
output_path.mkdir(exist_ok=True)

# Model Params
eos = "<end_of_utterance>"
verbose = True
kwargs = {
    "max_tokens": 8000,
    "temperature": 0.0,
}

# Load images
# Note: I manually downloaded the image
# image_src = "https://upload.wikimedia.org/wikipedia/commons/7/76/GazettedeFrance.jpg"
# image = load_image(image_src)
image_src = "images/GazettedeFrance.jpg"
image = Image.open(image_src).convert("RGB")

# Initialize processor and model
model, processor = load(
    path_or_hf_repo=path_or_hf_repo,
    trust_remote_code=True,
)
config = model.config


# Create input messages - Docling Walkthrough Structure
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to docling."}
        ]
    },
]
prompt = apply_chat_template(processor, config, messages, add_generation_prompt=True)

# # Alternatively, supported prompt creation method
# messages = [{"role": "user", "content": "Convert this page to docling."}]
# prompt = apply_chat_template(processor, config, messages, add_generation_prompt=True)


text = ""
last_response = None

for response in stream_generate(
    model=model,
    processor=processor,
    prompt=prompt,
    image=image,
    **kwargs
):
    if verbose:
        print(response.text, end="", flush=True)
    text += response.text
    last_response = response
    if eos in text:
        text = text.split(eos)[0].strip()
        break
print()

if verbose:
    print("\n" + "=" * 10)
    if len(text) == 0:
        print("No text generated for this prompt")
        sys.exit(0)
    print(
        f"Prompt: {last_response.prompt_tokens} tokens, "
        f"{last_response.prompt_tps:.3f} tokens-per-sec"
    )
    print(
        f"Generation: {last_response.generation_tokens} tokens, "
        f"{last_response.generation_tps:.3f} tokens-per-sec"
    )
    print(f"Peak memory: {last_response.peak_memory:.3f} GB")

# To convert to Docling Document, MD, HTML, etc.:
docling_output_path = output_path / Path(image_src).with_suffix(".dt").name
docling_output_path.write_text(text)
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([text], [image])
doc = DoclingDocument(name="Document")
doc.load_from_doctags(doctags_doc)
# export as any format
# HTML
doc.save_as_html(docling_output_path.with_suffix(".html"))
# MD
doc.save_as_markdown(docling_output_path.with_suffix(".md"))
```
</details>
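
For a multi-page document, one approach that stays within the pieces shown above is to run the model once per page image (one-shot multi-page inference is listed under *Coming soon!*) and pass all of the resulting DocTags/image pairs to `DocTagsDocument.from_doctags_and_image_pairs`, which yields a single `DoclingDocument`. This is a minimal sketch under that assumption; the `pages/*.png` and `output/` paths are placeholders, not files shipped with this repo:

```python
# Sketch: page-by-page conversion of a multi-page document into a single
# DoclingDocument. Paths under "pages/" and "output/" are placeholders.
from pathlib import Path

from PIL import Image
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument
from mlx_vlm import load, apply_chat_template, stream_generate

model, processor = load("zboyles/SmolDocling-256M-preview-bf16", trust_remote_code=True)
config = model.config
eos = "<end_of_utterance>"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to docling."},
        ],
    },
]
prompt = apply_chat_template(processor, config, messages, add_generation_prompt=True)

page_images = [Image.open(p).convert("RGB") for p in sorted(Path("pages").glob("*.png"))]
doctags_per_page = []

for page in page_images:
    # Generate DocTags for one page, stopping at the end-of-utterance token.
    text = ""
    for response in stream_generate(
        model=model,
        processor=processor,
        prompt=prompt,
        image=page,
        max_tokens=8000,
        temperature=0.0,
    ):
        text += response.text
        if eos in text:
            text = text.split(eos)[0].strip()
            break
    doctags_per_page.append(text)

# Pair each page's DocTags with its image and build one document.
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs(doctags_per_page, page_images)
doc = DoclingDocument(name="Document")
doc.load_from_doctags(doctags_doc)

output_path = Path("output")
output_path.mkdir(exist_ok=True)
doc.save_as_markdown(output_path / "document.md")
```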
Thanks to [**@Blaizzy**](https://github.com/Blaizzy) for the [code examples](https://github.com/Blaizzy/mlx-vlm/tree/main/examples) that helped me quickly adapt the `docling` example.