morisono
metadata
title: camelot-pg
app_file: src/app/run.py
sdk: gradio
sdk_version: 4.32.2

PDF Table Parser

This script extracts tables from PDF files and saves them as CSV files. It provides a command-line interface (CLI) for batch processing and an optional web UI for interactive processing.

Features

  • Multi-page PDF support
  • Progress display per row, per page, and per file
  • CSV output encoded as UTF-8 with BOM
  • Customizable edge and row tolerances for table detection
  • Optional web UI for interactive processing using Gradio
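
The "UTF-8 with BOM" output can be illustrated with a minimal sketch. It uses only the Python standard library rather than the project's own writer, and the function name `write_csv_with_bom` and the sample rows are hypothetical, chosen only for illustration.

```python
import csv
import os
import tempfile


def write_csv_with_bom(path, rows, delimiter=","):
    """Write rows to a CSV file encoded as UTF-8 with a leading BOM."""
    # The "utf-8-sig" codec emits the byte order mark (EF BB BF) before the data.
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding="utf-8-sig", newline="") as handle:
        csv.writer(handle, delimiter=delimiter).writerows(rows)


# Hypothetical sample data written to a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "output.csv")
write_csv_with_bom(demo_path, [["name", "price"], ["café", "3.50"]], delimiter=";")

# Read the first three bytes back to confirm the BOM is present.
with open(demo_path, "rb") as handle:
    first_bytes = handle.read(3)
```

The BOM mainly helps spreadsheet applications such as Excel recognize the file as UTF-8, so non-ASCII cell text is displayed correctly.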

Installation

  1. Clone the repository or download the script.
  2. Install the required dependencies:
    pip install rich camelot-py polars gradio gradio_pdf
    

Usage

Command-Line Interface (CLI)

To run the script via CLI, use the following command:

python src/app/parser.py input1.pdf input2.pdf output1.csv output2.csv

Arguments:

  • input_files: List of input PDF files
  • output_files: List of output CSV files (must match the number of input files)

Optional Arguments:

  • --delimiter: Output file delimiter (default: ,)
  • --edge_tol: Tolerance for the distance between text and table edges (default: 50)
  • --row_tol: Tolerance for the vertical distance used to group text into table rows (default: 10)
  • --webui: Launch the web UI
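
The arguments above could be declared with `argparse` roughly as follows. This is a sketch, not the script's actual parser: collecting all paths in a single `files` positional and splitting the list in half is an assumption made here to match the "inputs first, then outputs" command shape, and the integer types for the tolerances are assumed as well. The names `build_parser` and `split_inputs_outputs` are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Extract tables from PDF files and save them as CSV files."
    )
    # Assumption: every path is collected in one list and split in half later.
    parser.add_argument("files", nargs="*", metavar="FILE",
                        help="input PDF files followed by the same number of output CSV files")
    parser.add_argument("--delimiter", default=",", help="output file delimiter")
    parser.add_argument("--edge_tol", type=int, default=50,
                        help="tolerance between text and table edges")
    parser.add_argument("--row_tol", type=int, default=10,
                        help="tolerance used to group text into rows")
    parser.add_argument("--webui", action="store_true", help="launch the web UI")
    return parser


def split_inputs_outputs(files):
    """Split a flat path list into (inputs, outputs) of equal length."""
    if len(files) % 2 != 0:
        raise ValueError("the number of output files must match the number of input files")
    half = len(files) // 2
    return files[:half], files[half:]


args = build_parser().parse_args(
    ["input1.pdf", "input2.pdf", "output1.csv", "output2.csv", "--edge_tol", "60"]
)
inputs, outputs = split_inputs_outputs(args.files)
```

Splitting one list avoids declaring two greedy positional lists, which `argparse` cannot divide evenly on its own.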

Web UI

To run the script with the web UI, use the following command:

python src/app/run.py

This will launch a Gradio-based web application where you can upload PDFs and view the extracted tables interactively.

Example

CLI Example

python src/app/parser.py data/demo.pdf data/output.csv --delimiter ";" --edge_tol 60 --row_tol 40
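
For orientation, the tolerance values in this command might be passed to Camelot as sketched below. This assumes the script uses Camelot's `stream` flavor (the flavor that accepts `edge_tol` and `row_tol`) and reads all pages; the README does not confirm either, and the helper names `stream_options` and `extract_tables` are hypothetical.

```python
def stream_options(edge_tol=50, row_tol=10):
    """Keyword arguments for camelot.read_pdf (assumed: stream flavor, all pages)."""
    return {
        "flavor": "stream",    # assumption: edge_tol/row_tol apply to this flavor
        "pages": "all",        # assumption: every page is processed
        "edge_tol": edge_tol,
        "row_tol": row_tol,
    }


def extract_tables(pdf_path, **tolerances):
    """Return one pandas DataFrame per table detected in the PDF."""
    import camelot  # imported lazily so the helper above works without camelot-py

    tables = camelot.read_pdf(pdf_path, **stream_options(**tolerances))
    return [table.df for table in tables]


# Options corresponding to "--edge_tol 60 --row_tol 40" in the command above.
options = stream_options(edge_tol=60, row_tol=40)
```

Calling `extract_tables("data/demo.pdf", edge_tol=60, row_tol=40)` would then run the extraction, provided `camelot-py` is installed and the file exists.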

License

This project is licensed under the MIT License.