Update README.md

---
base_model:
- HuggingFaceTB/SmolVLM-256M-Instruct
language:
- en
library_name: mlx
license: apache-2.0
pipeline_tag: image-text-to-text
tags:
---

# zboyles/SmolDocling-256M-preview-bf16

This model was converted to **MLX format** from [`ds4sd/SmolDocling-256M-preview`](https://huggingface.co/ds4sd/SmolDocling-256M-preview) using mlx-vlm version **0.1.18**.

* Refer to the [**original model card**](https://huggingface.co/ds4sd/SmolDocling-256M-preview) for more details on the model.
* Refer to the [**mlx-vlm repo**](https://github.com/Blaizzy/mlx-vlm) for more examples using `mlx-vlm`.
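
Install the prerequisites from the example below before running it (`docling_core` is only needed for the Docling conversion step):

```bash
pip install -U mlx-vlm
pip install docling_core
```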

## Use SmolDocling-256M-preview with docling and mlx

> **Find Working MLX + Docling Example Code Below**

<div style="display: flex; align-items: center;">
  <img src="https://huggingface.co/ds4sd/SmolDocling-256M-preview/resolve/main/assets/SmolDocling_doctags1.png" alt="SmolDocling" style="width: 200px; height: auto; margin-right: 20px;">
  <div>
    <h3>SmolDocling-256M-preview</h3>
    <p>SmolDocling is a multimodal Image-Text-to-Text model designed for efficient document conversion. It retains Docling's most popular features while ensuring full compatibility with Docling through seamless support for <strong>DoclingDocuments</strong>.</p>
  </div>
</div>

This model was presented in the paper [SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion](https://huggingface.co/papers/2503.11576).

### Features:

- **DocTags for Efficient Tokenization** – Introduces DocTags, an efficient and minimal representation for documents that is fully compatible with **DoclingDocuments**.
- **OCR (Optical Character Recognition)** – Extracts text accurately from images.
- **Layout and Localization** – Preserves document structure and document element **bounding boxes**.
- **Code Recognition** – Detects and formats code blocks, including indentation.
- **Formula Recognition** – Identifies and processes mathematical expressions.
- **Chart Recognition** – Extracts and interprets chart data.
- **Table Recognition** – Supports column and row headers for structured table extraction (see the prompt sketch after this list).
- **Figure Classification** – Differentiates figures and graphical elements.
- **Caption Correspondence** – Links captions to relevant images and figures.
- **List Grouping** – Organizes and structures list elements correctly.
- **Full-Page Conversion** – Processes entire pages for comprehensive document conversion, including all page elements (code, equations, tables, charts, etc.).
- **OCR with Bounding Boxes** – OCR regions specified by a bounding box.
- **General Document Processing** – Trained for both scientific and non-scientific documents.
- **Seamless Docling Integration** – Import into **Docling** and export in multiple formats.
- **Fast inference using vLLM** – Average of 0.35 seconds per page on an A100 GPU.
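
These features are selected at inference time through the instruction prompt; the full list of supported task prompts is documented in the original model card. As a small sketch (the prompt strings below are assumptions based on the upstream `ds4sd/SmolDocling-256M-preview` card, so verify them there), only the text portion of the chat message used in the example below needs to change:

```python
# Sketch: choosing a SmolDocling task by swapping the instruction text.
# The prompt strings are assumptions taken from the upstream model card;
# check ds4sd/SmolDocling-256M-preview for the authoritative list.
def build_messages(instruction: str) -> list:
    """Build the chat message structure used in the MLX example below."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": instruction},
            ],
        },
    ]

full_page_messages = build_messages("Convert this page to docling.")
table_messages = build_messages("Convert this table to OTSL.")
formula_messages = build_messages("Convert formula to latex.")
```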

### *Coming soon!*

- **Better chart recognition**
- **One-shot multi-page inference**
- **Chemical Recognition**
- **Datasets**

## Get started (**MLX** code examples)

You can use **mlx** to perform inference, and [Docling](https://github.com/docling-project/docling) to convert the results to a variety of output formats (md, html, etc.):

<details>
<summary>Single page image inference using MLX via `mlx-vlm`</summary>

```python
# Prerequisites:
# pip install -U mlx-vlm
# pip install docling_core

import sys

from pathlib import Path
from PIL import Image

# Docling types used to turn the generated DocTags into a DoclingDocument
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument

from mlx_vlm import load, apply_chat_template, stream_generate
from mlx_vlm.utils import load_image

# Variables
path_or_hf_repo = "zboyles/SmolDocling-256M-preview-bf16"
output_path = Path("output")
output_path.mkdir(exist_ok=True)

# Model Params
eos = "<end_of_utterance>"
verbose = True
kwargs = {
    "max_tokens": 8000,
    "temperature": 0.0,
}

# Load images
# Note: I manually downloaded the image
# image_src = "https://upload.wikimedia.org/wikipedia/commons/7/76/GazettedeFrance.jpg"
# image = load_image(image_src)
image_src = "images/GazettedeFrance.jpg"
image = Image.open(image_src).convert("RGB")

# Initialize processor and model
model, processor = load(
    path_or_hf_repo=path_or_hf_repo,
    trust_remote_code=True,
)
config = model.config


# Create input messages - Docling Walkthrough Structure
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to docling."}
        ]
    },
]
prompt = apply_chat_template(processor, config, messages, add_generation_prompt=True)

# # Alternatively, supported prompt creation method
# messages = [{"role": "user", "content": "Convert this page to docling."}]
# prompt = apply_chat_template(processor, config, messages, add_generation_prompt=True)


text = ""
last_response = None

for response in stream_generate(
    model=model,
    processor=processor,
    prompt=prompt,
    image=image,
    **kwargs
):
    if verbose:
        print(response.text, end="", flush=True)
    text += response.text
    last_response = response
    if eos in text:
        text = text.split(eos)[0].strip()
        break
print()

if verbose:
    print("\n" + "=" * 10)
    if len(text) == 0:
        print("No text generated for this prompt")
        sys.exit(0)
    print(
        f"Prompt: {last_response.prompt_tokens} tokens, "
        f"{last_response.prompt_tps:.3f} tokens-per-sec"
    )
    print(
        f"Generation: {last_response.generation_tokens} tokens, "
        f"{last_response.generation_tps:.3f} tokens-per-sec"
    )
    print(f"Peak memory: {last_response.peak_memory:.3f} GB")

# To convert to Docling Document, MD, HTML, etc.:
docling_output_path = output_path / Path(image_src).with_suffix(".dt").name
docling_output_path.write_text(text)
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([text], [image])
doc = DoclingDocument(name="Document")
doc.load_from_doctags(doctags_doc)
# export as any format
# HTML
doc.save_as_html(docling_output_path.with_suffix(".html"))
# MD
doc.save_as_markdown(docling_output_path.with_suffix(".md"))
```
</details>
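
For a multi-page document, one approach that stays within the pieces shown above is to run the model once per page image (one-shot multi-page inference is listed under *Coming soon!*) and pass all of the resulting DocTags/image pairs to `DocTagsDocument.from_doctags_and_image_pairs`, which yields a single `DoclingDocument`. This is a minimal sketch under that assumption; the `pages/*.png` and `output/` paths are placeholders, not files shipped with this repo:

```python
# Sketch: page-by-page conversion of a multi-page document into a single
# DoclingDocument. Paths under "pages/" and "output/" are placeholders.
from pathlib import Path

from PIL import Image
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument
from mlx_vlm import load, apply_chat_template, stream_generate

model, processor = load("zboyles/SmolDocling-256M-preview-bf16", trust_remote_code=True)
config = model.config
eos = "<end_of_utterance>"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to docling."},
        ],
    },
]
prompt = apply_chat_template(processor, config, messages, add_generation_prompt=True)

page_images = [Image.open(p).convert("RGB") for p in sorted(Path("pages").glob("*.png"))]
doctags_per_page = []

for page in page_images:
    # Generate DocTags for one page, stopping at the end-of-utterance token.
    text = ""
    for response in stream_generate(
        model=model,
        processor=processor,
        prompt=prompt,
        image=page,
        max_tokens=8000,
        temperature=0.0,
    ):
        text += response.text
        if eos in text:
            text = text.split(eos)[0].strip()
            break
    doctags_per_page.append(text)

# Pair each page's DocTags with its image and build one document.
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs(doctags_per_page, page_images)
doc = DoclingDocument(name="Document")
doc.load_from_doctags(doctags_doc)

output_path = Path("output")
output_path.mkdir(exist_ok=True)
doc.save_as_markdown(output_path / "document.md")
```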
Thanks to [**@Blaizzy**](https://github.com/Blaizzy) for the [code examples](https://github.com/Blaizzy/mlx-vlm/tree/main/examples) that helped me quickly adapt the `docling` example.