--- title: alps app_file: app.py sdk: gradio sdk_version: 4.44.0 --- # Alps Pipeline for OCRing PDFs and tables This repository contains different OCR methods using various libraries/models. ## Running gradio: `python app.py` in terminal ## Installation : Build the docker image and run the contianer Clone this repository and Install the required dependencies: ``` pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu117 apt install weasyprint ``` Note: You need a GPU to run this code. ## Example Usage Run python main.py inside the directory. Provide the path to the test file (the file must be placed inside the repository,and the file path should be relative to the repository (alps)). Next, provide the path to save intermediate outputs from the run (draw cell bounding boxes on the table, show table detection results in pdf), and specify which component to run. outputs are printed in terminal ``` usage: main.py [-h] [--test_file TEST_FILE] [--debug_folder DEBUG_FOLDER] [--englishFlag ENGLISHFLAG] [--denoise DENOISE] ocr ``` Description of the component: ### ocr1 ocr1 Input: Path to a PDF file Output: Dictionary of each page and list of line_annotations. List of LineAnnotations contains bboxes for each line and List of its children wordAnnotation. Each wordAnnotation contains bboxes and text inside. What it does: Runs Ragflow textline detector + OCR with DocTR Example: ``` python main.py ocr1 --test_file TestingFiles/OCRTest1German.pdf --debug_folder ./res/ocrdebug1/ python main.py ocr1 --test_file TestingFiles/OCRTest3English.pdf --debug_folder ./res/ocrdebug1/ --englishFlag True ``` ### table1 Input : file path to an image of a cropped table Output: Parsed table in HTML form What it does: Uses Unitable + DocTR ``` python main.py table1 --test_file cropped_table.png --debug_folder ./res/table1/ ``` ### table2 Input: File path to an image of a cropped table Output: Parsed table in HTML form What it does: Uses Unitable ``` python main.py table2 --test_file cropped_table.png --debug_folder ./res/table2/ ``` ### pdftable1 Input: PDF file path Output: Parsed table in HTML form What it does: Uses Unitable + DocTR ``` python main.py pdftable1 --test_file TestingFiles/OCRTest5English.pdf --debug_folder ./res/table_debug1/ python main.py pdftable3 --test_file TestingFiles/TableOCRTestEnglish.pdf --debug_folder ./res/poor_relief2 ``` ### pdftable2 : Input: PDF file path Output: Parsed table in HTML form What it does: Detects table and parses them, Runs Full Unitable Table detection ``` python main.py pdftable2 --test_file TestingFiles/OCRTest5English.pdf --debug_folder ./res/table_debug2/ ``` ### pdftable3 Input: PDF file path Output: Parsed table in HTML form What it does: Detects table with YOLO, Unitable + DocTR ### pdftable4 Input: PDF file path Output: Parsed table in HTML form What it does: Detects table with YOLO, Runs Full doctr Table detection python main.py pdftable4 --test_file TestingFiles/TableOCRTestEasier.pdf --debug_folder ./res/table_debug3/ ## bbox They are ordered as ordered as [xmin,ymin,xmax,ymax] . Cause the coordinates starts from (0,0) of the image which is upper left corner xmin ymim - upper left corner xmax ymax - bottom lower corner ![alt text](image-2.png)