Creating Datasets from PDFs and Images with Mistral OCR, LangChain, and Gradio
In the age of data-driven AI, transforming unstructured documents like PDFs and images into structured datasets is a game-changer. Whether you're building a knowledge base, training a language model, or analyzing scanned documents, Optical Character Recognition (OCR) combined with smart text processing can unlock valuable insights. In this post, we'll explore an application that automates this process, using Mistral AI for OCR, LangChain for text chunking, and Hugging Face for dataset storage, all wrapped in an intuitive Gradio interface. Let's dive into how this tool works, why it's useful, and how you can use it to create your own datasets.
What Does the Gradio Application Do?
The application, built as a Gradio app, streamlines the process of converting multiple PDF or image files into a structured dataset. Here's the high-level workflow:
- Upload Files: Upload one or more PDFs or images (PNG, JPG, JPEG, WEBP, BMP) via a web interface.
- Perform OCR: Extract text and images using Mistral AI’s OCR API, producing markdown with embedded base64-encoded images.
- Chunk Text: Split the markdown into manageable chunks based on headers or character limits, preserving images in metadata.
- Save to Hugging Face: Store the chunks as a dataset in a Hugging Face repository, complete with text, metadata, and source file information.
The result? A clean, structured dataset ready for machine learning, search, or archival purposes, accessible via a simple web interface.
Why This Matters
Unstructured documents are everywhere—think scanned books, meeting notes, or archival records. Extracting usable data from them is often tedious, requiring manual transcription or complex preprocessing. This application solves that by:
- Automating OCR: Mistral AI’s OCR handles both text and images, even in complex layouts.
- Structuring Data: LangChain’s text splitters create logical chunks, making the data easier to analyze or feed into models.
- Enabling Collaboration: Hugging Face datasets are shareable, versioned, and accessible for AI workflows.
- Simplifying Interaction: Gradio’s interface makes the tool accessible to non-coders, with options to customize chunking and output.
Use cases include creating training data for LLMs, building searchable document archives, or preparing datasets for research.
How It Works: A Technical Breakdown
The script, named app.py, is a modular, robust application. Let's break down its key components and how they work together.
1. Gradio Interface
Gradio provides a user-friendly web interface where users can:
- Upload multiple files.
- Set chunking options (e.g., max chunk size, overlap, whether to strip headers).
- Specify a Hugging Face repository and token.
- View processing status and dataset links.
One notable detail is a workaround for a Gradio bug: a patched schema-parsing function handles multiple file uploads correctly, keeping the app compatible with recent Gradio versions.
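To make the interface concrete, here's a minimal sketch of a Gradio Blocks layout with the same controls. The component labels, defaults, and the placeholder callback are illustrative, not the app's actual source; the real pipeline function appears later in Code Highlights.

```python
import gradio as gr

def process_file_and_save(files, chunk_size, chunk_overlap, strip_headers, hf_token, repo_name):
    # Placeholder: the real OCR -> chunk -> push pipeline is shown later in the post.
    return f"Would process {len(files or [])} file(s) into {repo_name}"

with gr.Blocks(title="PDF/Image to Dataset") as demo:
    files = gr.File(label="Upload PDFs or images", file_count="multiple",
                    file_types=[".pdf", ".png", ".jpg", ".jpeg", ".webp", ".bmp"])
    chunk_size = gr.Slider(200, 4000, value=1000, step=100, label="Max chunk size (characters)")
    chunk_overlap = gr.Slider(0, 500, value=200, step=50, label="Chunk overlap")
    strip_headers = gr.Checkbox(value=True, label="Strip headers from chunk text")
    repo_name = gr.Textbox(label="Hugging Face dataset repo (user/name)")
    hf_token = gr.Textbox(label="Hugging Face token", type="password")
    run_btn = gr.Button("Process and push")
    status = gr.Markdown()

    run_btn.click(process_file_and_save,
                  inputs=[files, chunk_size, chunk_overlap, strip_headers, hf_token, repo_name],
                  outputs=status)

if __name__ == "__main__":
    demo.launch()
```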
2. Mistral AI OCR
Mistral AI’s OCR API processes PDFs and images, extracting text as markdown and images as base64-encoded data URIs. The script supports:
- PDFs: Uploaded to Mistral’s servers, processed, and deleted after use.
- Images: Encoded as base64 and sent directly to the OCR API.
- Error Handling: Robust checks ensure invalid files or API failures are reported clearly.
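For the PDF path, the flow looks roughly like the sketch below, based on the current mistralai Python client (files.upload, files.get_signed_url, ocr.process). Treat the model name, helper name, and response handling as assumptions to verify against the SDK version you use.

```python
import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

def ocr_pdf(path: str) -> str:
    """Upload a PDF, run Mistral OCR on it, and return the concatenated markdown."""
    # Upload the file for OCR processing (deleted again once OCR is done).
    with open(path, "rb") as f:
        uploaded = client.files.upload(
            file={"file_name": os.path.basename(path), "content": f.read()},
            purpose="ocr",
        )
    signed = client.files.get_signed_url(file_id=uploaded.id)

    # Run OCR; include_image_base64 returns embedded images as data URIs.
    response = client.ocr.process(
        model="mistral-ocr-latest",
        document={"type": "document_url", "document_url": signed.url},
        include_image_base64=True,
    )

    # Each page carries markdown plus any extracted images.
    markdown = "\n\n".join(page.markdown for page in response.pages)

    # Clean up the uploaded file, as described above.
    client.files.delete(file_id=uploaded.id)
    return markdown
```

Images skip the upload step: the app base64-encodes them and sends a data URI with the "image_url" document type instead.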
3. Text Chunking with LangChain
The extracted markdown is split into chunks using LangChain's MarkdownHeaderTextSplitter and RecursiveCharacterTextSplitter:
- Header-Based Splitting: Divides text by markdown headers (e.g., #, ##), with an option to strip headers from content (stored in metadata instead).
- Character-Based Splitting: Further splits chunks to a specified size (e.g., 1000 characters) with overlap for context.
- Image Tracking: Images are extracted, replaced with reference IDs, and stored in chunk metadata as base64 URIs.
This dual approach ensures flexibility—use header-only splitting for structured documents or character-based for dense text.
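A minimal version of that chunking step might look like the following, using the two LangChain splitters named above; the header levels, defaults, and function name are assumptions for illustration.

```python
from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

def chunk_markdown(markdown: str, chunk_size: int = 1000,
                   chunk_overlap: int = 200, strip_headers: bool = True):
    """Split OCR markdown by headers, then cap each chunk's length."""
    # First pass: split on markdown headers; header text is kept in metadata.
    header_splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=[("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")],
        strip_headers=strip_headers,
    )
    docs = header_splitter.split_text(markdown)

    # Second pass: enforce a maximum chunk size with overlap for context.
    if chunk_size > 0:
        char_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size, chunk_overlap=chunk_overlap
        )
        docs = char_splitter.split_documents(docs)
    return docs  # list of Documents with .page_content and .metadata
```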
4. Dataset Creation with Hugging Face
Chunks are organized into a dataset with the following columns:
- chunk_id: Unique identifier (e.g., filename_chunk_1).
- text: The chunk's text content.
- metadata: Headers, image base64 URIs, and other details.
- source_filename: The original file name.
The dataset is pushed to a Hugging Face repository using the datasets library, with authentication via a Hugging Face token. If the repository doesn't exist, it's created automatically.
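The auto-create-and-push step boils down to a couple of calls from the huggingface_hub and datasets libraries; a sketch along these lines, where the function and argument names are placeholders:

```python
from datasets import Dataset
from huggingface_hub import HfApi

def push_chunks(all_data: dict, repo_id: str, hf_token: str) -> str:
    # Create the dataset repo if it doesn't exist yet (no-op when it does).
    HfApi(token=hf_token).create_repo(repo_id, repo_type="dataset", exist_ok=True)

    # Build the dataset from the column dict and upload it.
    dataset = Dataset.from_dict(all_data)
    dataset.push_to_hub(repo_id, token=hf_token)
    return f"https://huggingface.co/datasets/{repo_id}"
```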
Code Highlights
Here’s a glimpse of the main processing function:
from datasets import Dataset  # Hugging Face datasets library

def process_file_and_save(file_objs, chunk_size, chunk_overlap, strip_headers, hf_token, repo_name):
    # One column list per dataset field; every chunk adds one row to each.
    all_data = {"chunk_id": [], "text": [], "metadata": [], "source_filename": []}
    for file_obj in file_objs:
        # OCR the file into markdown (plus an image map stored in chunk metadata).
        markdown, _, img_map = perform_ocr_file(file_obj)
        chunks = chunk_markdown(markdown, chunk_size, chunk_overlap, strip_headers)
        all_data["chunk_id"].extend([f"{file_obj.orig_name}_chunk_{i}" for i in range(len(chunks))])
        all_data["text"].extend([chunk.page_content for chunk in chunks])
        all_data["metadata"].extend([chunk.metadata for chunk in chunks])
        all_data["source_filename"].extend([file_obj.orig_name] * len(chunks))
    # Build the dataset from the aggregated columns and push it to the Hub.
    dataset = Dataset.from_dict(all_data)
    dataset.push_to_hub(repo_name, token=hf_token)
    return f"Success! Dataset saved to: https://huggingface.co/datasets/{repo_name}"
This function orchestrates the pipeline, handling errors and aggregating results for multiple files.
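Once the push succeeds, the resulting dataset can be pulled back down anywhere with the datasets library (replace the repo name with your own):

```python
from datasets import load_dataset

ds = load_dataset("your-username/your-ocr-dataset", split="train")
print(ds.column_names)       # ['chunk_id', 'text', 'metadata', 'source_filename']
print(ds[0]["text"][:200])   # first 200 characters of the first chunk
```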
Tips
- Duplicate the Space: Click this link to duplicate the Space.
- Test Small: Start with a few files to verify setup.
- Check Logs: The script logs detailed info to help troubleshoot.
Customization Ideas
The script is modular, so you can extend it:
- Add Semantic Chunking: Use NLP models to split text by meaning (see the sketch after this list).
- Support More Formats: Extend OCR to handle DOCX or scanned TIFFs.
- Real-Time Progress: Stream updates to the Gradio interface during processing.
- Local Storage: Save chunks to a local database before uploading.
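As an example of the first idea, a semantic splitter could be dropped in place of the character-based pass. A rough sketch using LangChain's experimental SemanticChunker and a local sentence-transformers embedding model (an assumption for illustration, not part of the app) might look like this:

```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_huggingface import HuggingFaceEmbeddings

# Placeholder input; in the app this would be the markdown from the OCR step.
markdown_text = "# Title\n\nSome OCR output about one topic. A sentence about another topic."

# Embed sentences locally and split where the embedding distance jumps.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
semantic_splitter = SemanticChunker(embeddings, breakpoint_threshold_type="percentile")

docs = semantic_splitter.create_documents([markdown_text])
print(len(docs), docs[0].page_content)
```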
Conclusion
This application is a powerful tool for anyone looking to turn PDFs and images into structured, AI-ready datasets. By combining Mistral AI's OCR, LangChain's text processing, and Hugging Face's dataset hosting, it simplifies a complex workflow into a few clicks. Whether you're a researcher, data scientist, or hobbyist, this app can save time and unlock new possibilities for your dataset creation projects or experimentation.
Try it out, and let me know how it goes in the comments. Happy dataset creating!