SkyNait committed
Commit 41b09be · 1 parent: 91de769
This view is limited to 50 files because the commit contains too many changes. See the raw diff for the full changeset.
Files changed (50):
  1. __pycache__/contents_extractor_v2.cpython-310.pyc +0 -0
  2. __pycache__/inference_svm_model.cpython-310.pyc +0 -0
  3. __pycache__/mineru_single.cpython-310.pyc +0 -0
  4. __pycache__/mineru_test_local.cpython-310.pyc +0 -0
  5. __pycache__/topic_extraction_upgrade.cpython-310.pyc +0 -0
  6. __pycache__/worker.cpython-310.pyc +0 -0
  7. contents_extractor_v2.py +110 -0
  8. inference_svm_model.py +205 -24
  9. input_output/output/final_output.md +0 -0
  10. input_output/output/images/img_1.png +0 -0
  11. input_output/output/images/img_10.png +0 -0
  12. input_output/output/images/img_10.png_rows/row_0/col_0.png +0 -0
  13. input_output/output/images/img_10.png_rows/row_0/col_2.png +0 -0
  14. input_output/output/images/img_11.png +0 -0
  15. input_output/output/images/img_11.png_rows/row_0/col_0.png +0 -0
  16. input_output/output/images/img_11.png_rows/row_0/col_1.png +0 -0
  17. input_output/output/images/img_11.png_rows/row_1/col_0.png +0 -0
  18. input_output/output/images/img_11.png_rows/row_1/col_1.png +0 -0
  19. input_output/output/images/img_11.png_rows/row_2/col_0.png +0 -0
  20. input_output/output/images/img_11.png_rows/row_3/col_0.png +0 -0
  21. input_output/output/images/img_12.png +0 -0
  22. input_output/output/images/img_12.png_rows/row_0/col_0.png +0 -0
  23. input_output/output/images/img_12.png_rows/row_0/col_1.png +0 -0
  24. input_output/output/images/img_12.png_rows/row_1/col_0.png +0 -0
  25. input_output/output/images/img_12.png_rows/row_1/col_1.png +0 -0
  26. input_output/output/images/img_12.png_rows/row_2/col_0.png +0 -0
  27. input_output/output/images/img_12.png_rows/row_2/col_1.png +0 -0
  28. input_output/output/images/img_12.png_rows/row_3/col_0.png +0 -0
  29. input_output/output/images/img_13.png +0 -0
  30. input_output/output/images/img_13.png_rows/row_0/col_0.png +0 -0
  31. input_output/output/images/img_13.png_rows/row_0/col_1.png +0 -0
  32. input_output/output/images/img_13.png_rows/row_1/col_0.png +0 -0
  33. input_output/output/images/img_13.png_rows/row_1/col_1.png +0 -0
  34. input_output/output/images/img_13.png_rows/row_2/col_0.png +0 -0
  35. input_output/output/images/img_13.png_rows/row_3/col_0.png +0 -0
  36. input_output/output/images/img_13.png_rows/row_3/col_1.png +0 -0
  37. input_output/output/images/img_14.png +0 -0
  38. input_output/output/images/img_14.png_rows/row_0/col_0.png +0 -0
  39. input_output/output/images/img_14.png_rows/row_0/col_1.png +0 -0
  40. input_output/output/images/img_14.png_rows/row_1/col_0.png +0 -0
  41. input_output/output/images/img_14.png_rows/row_1/col_1.png +0 -0
  42. input_output/output/images/img_14.png_rows/row_2/col_0.png +0 -0
  43. input_output/output/images/img_14.png_rows/row_3/col_0.png +0 -0
  44. input_output/output/images/img_14.png_rows/row_4/col_0.png +0 -0
  45. input_output/output/images/img_14.png_rows/row_5/col_0.png +0 -0
  46. input_output/output/images/img_15.png +0 -0
  47. input_output/output/images/img_15.png_rows/row_0/col_0.png +0 -0
  48. input_output/output/images/img_15.png_rows/row_0/col_1.png +0 -0
  49. input_output/output/images/img_15.png_rows/row_1/col_0.png +0 -0
  50. input_output/output/images/img_15.png_rows/row_1/col_1.png +0 -0
__pycache__/contents_extractor_v2.cpython-310.pyc ADDED
Binary file (5.09 kB).
 
__pycache__/inference_svm_model.cpython-310.pyc CHANGED
Binary files a/__pycache__/inference_svm_model.cpython-310.pyc and b/__pycache__/inference_svm_model.cpython-310.pyc differ
 
__pycache__/mineru_single.cpython-310.pyc CHANGED
Binary files a/__pycache__/mineru_single.cpython-310.pyc and b/__pycache__/mineru_single.cpython-310.pyc differ
 
__pycache__/mineru_test_local.cpython-310.pyc ADDED
Binary file (11.9 kB).
 
__pycache__/topic_extraction_upgrade.cpython-310.pyc CHANGED
Binary files a/__pycache__/topic_extraction_upgrade.cpython-310.pyc and b/__pycache__/topic_extraction_upgrade.cpython-310.pyc differ
 
__pycache__/worker.cpython-310.pyc CHANGED
Binary files a/__pycache__/worker.cpython-310.pyc and b/__pycache__/worker.cpython-310.pyc differ
 
contents_extractor_v2.py ADDED
@@ -0,0 +1,110 @@
+ from google import genai
+ from google.genai import types
+ import fitz  # PyMuPDF
+ import requests
+
+ MODEL = "gemini-2.0-flash"
+
+ # TODO: Make sure the last page is always included.
+
+
+ class ContentsExtractor:
+     def __init__(self, api_key: str):
+         self.client = genai.Client(api_key=api_key)
+
+     @staticmethod
+     def extract_first_pages(pdf_path, num_pages=4, is_path_url=False):
+         try:
+             if is_path_url:
+                 r = requests.get(pdf_path)
+                 data = r.content
+                 doc = fitz.open(stream=data, filetype="pdf")
+             else:
+                 doc = fitz.open(pdf_path)
+             total_pages = doc.page_count
+             pages_to_read = min(total_pages, num_pages)
+             all_text = []
+             for page_num in range(pages_to_read):
+                 page = doc[page_num]
+                 page_text = page.get_text()
+                 all_text.append(page_text)
+             doc.close()
+             return "\n".join(all_text)
+         except Exception as e:
+             print(f"Failed to extract text from PDF: {e}")
+             return None
+
+     def extract_contents(self, content):
+         response = self.client.models.generate_content(
+             model=MODEL,
+             contents=[f"""
+ Task:
+ You will be provided with the first pages of an exam board document. Your goal is to extract
+ the main subject-related topics from the "Contents" section and structure them as valid JSON.
+
+ Instructions:
+ 1. Identify the "Contents" section, which lists all topics, subtopics, and their corresponding pages.
+ 2. Extract only the **highest-level, subject-related topics** (ignore organizational or administrative sections).
+ 3. If a topic has subtopics, include the full range of pages from the first to the last subtopic.
+ 4. Return the output in the following JSON format:
+
+ {{
+     "topic_name": [start_page, end_page]
+ }}
+
+ Important Notes:
+ - Ignore non-subject-related sections (e.g., "Introduction", "Exam Guidelines", "Appendices").
+ - If a topic has subtopics, **only extract the main topic**, ensuring the page range covers all subtopics.
+ - The extracted topics should represent major academic areas, not organizational or structural elements.
+ - Only extract main topics without sub-topic numeration. Any topic with additional numbering (e.g., '3.1 Some Topic')
+   should be ignored, as it is a sub-topic rather than a primary subject-related topic.
+ - Make sure that all of a topic's pages are included: the end page should be the start page of the topic
+   that comes immediately after the extracted one in the Contents section.
+
+ Examples:
+ 1. Given this table of contents:
+
+     1. Introduction - 1
+     2. Exam Rules - 4
+     3. Subject content - 8
+         3.1 Algebra - 12
+         3.2 Geometry - 16
+         3.3 Probability - 20
+     4. The topics of subject of physics - 25
+         4.1 Mechanics - 30
+         4.2 Thermodynamics - 35
+     5. Appendices - 40
+
+ The correct output should be:
+
+ {{
+     "3. Subject content": [8, 25],
+     "4. The topics of subject of physics": [25, 40]
+ }}
+
+ 2. Given this table of contents:
+
+     1. Welcome Note - 1
+     2. Exam Overview - 3
+     3. Biology - 5
+         3.1 Cell Biology - 7
+         3.2 Genetics - 12
+         3.3 Ecology - 18
+     4. Chemistry - 22
+         4.1 Organic Chemistry - 25
+         4.2 Inorganic Chemistry - 30
+         4.3 Physical Chemistry - 35
+     5. References - 43
+
+ The correct output should be:
+
+ {{
+     "Biology": [5, 22],
+     "Chemistry": [22, 43]
+ }}
+
+ Now, extract topics from this text: {content}
+ """],
+             config=types.GenerateContentConfig(temperature=0.)
+         )
+         return response.text.strip().replace("```json", "").replace("```", "")
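For reference, a minimal sketch of how the new extractor might be driven end to end. The environment variable, the sample file name "spec.pdf", and the JSON handling are illustrative assumptions, not part of this commit:

    import json
    import os

    from contents_extractor_v2 import ContentsExtractor

    # Assumes GEMINI_API_KEY is exported in the environment and that the
    # document's Contents section falls within its first four pages.
    extractor = ContentsExtractor(api_key=os.environ["GEMINI_API_KEY"])
    pages_text = ContentsExtractor.extract_first_pages("spec.pdf", num_pages=4)
    if pages_text:
        topics = json.loads(extractor.extract_contents(pages_text))
        for topic, (start, end) in topics.items():
            print(f"{topic}: pages {start}-{end}")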
inference_svm_model.py CHANGED
@@ -1,31 +1,212 @@
  #!/usr/bin/env python3
- import cv2
- import numpy as np
  import os
- from joblib import load
-
-
- class SVMModel:
-     def __init__(self):
-         path = os.getenv("SVM_MODEL_PATH", "/home/user/app/model_classification/svm_model.joblib")
-         self.model = load(path)
-
-     def classify_image(
-         self,
-         image_bytes: bytes,
-         image_size=(128, 128)
-     ) -> int:
-         img = cv2.imdecode(np.frombuffer(image_bytes, np.uint8), cv2.IMREAD_COLOR)
-         if img is None:
-             # If the image fails to load, default to "irrelevant" or handle differently
-             return 0
-
-         img = cv2.resize(img, image_size)
-         x = img.flatten().reshape(1, -1)
-         pred = self.model.predict(x)[0]
-         return pred
+ import re
+ import json
+ import logging
+ import fitz  # PyMuPDF
+ from typing import Optional, Tuple, Dict, List
+
+ from contents_extractor_v2 import ContentsExtractor
+ from mineru_test_local import LocalPDFProcessor
+
+ # Configure logging
+ logging.basicConfig(
+     level=logging.INFO,
+     format="%(asctime)s [%(levelname)s] %(name)s - %(message)s",
+     handlers=[
+         logging.StreamHandler(),
+         logging.FileHandler('selective_pdf_extractor.log')
+     ]
+ )
+ logger = logging.getLogger(__name__)
+ logger.setLevel(logging.INFO)
+
+ class SelectivePDFProcessor:
+     """
+     Processes PDF files by extracting only subject-content sections.
+     First checks whether the file is a specification document, then finds the
+     Contents page, extracts the subject-content page range, and processes only
+     those pages.
+     """
+
+     def __init__(self, output_folder: str, api_key: str):
+         self.output_folder = output_folder
+         os.makedirs(self.output_folder, exist_ok=True)
+         self.api_key = api_key
+         self.contents_extractor = ContentsExtractor(api_key=api_key)
+         self.pdf_processor = LocalPDFProcessor(output_folder=output_folder)
+
+     def check_for_specification(self, pdf_path: str) -> bool:
+         """
+         Checks whether the PDF is a specification document by looking for the
+         word 'specification' on the first page.
+         """
+         try:
+             doc = fitz.open(pdf_path)
+             first_page_text = doc[0].get_text().lower()
+             doc.close()
+             return 'specification' in first_page_text
+         except Exception as e:
+             logger.error(f"Error checking for specification: {e}")
+             return False
+
+     def find_contents_page(self, pdf_path: str) -> Optional[int]:
+         """
+         Finds the page number of the Contents section.
+         """
+         try:
+             doc = fitz.open(pdf_path)
+             # Check the first 20 pages for "Contents"
+             # (assuming the Contents section starts within the first 20 pages)
+             max_pages = min(20, doc.page_count)
+
+             for page_num in range(max_pages):
+                 page_text = doc[page_num].get_text()
+                 # Look for "Contents" as a standalone heading
+                 if re.search(r'^\s*Contents\s*$', page_text, re.MULTILINE):
+                     logger.info(f"Found Contents page at page {page_num}")
+                     doc.close()
+                     return page_num
+
+             doc.close()
+             logger.warning("Contents page not found")
+             return None
+         except Exception as e:
+             logger.error(f"Error finding contents page: {e}")
+             return None
+
+     def extract_subject_content_pages(self, pdf_path: str, contents_page: int) -> Optional[Tuple[int, int]]:
+         """
+         Extracts the subject-content page range using the ContentsExtractor.
+         Focuses on the "Subject content" section.
+         """
+         try:
+             doc = fitz.open(pdf_path)
+             contents_text = doc[contents_page].get_text()
+             doc.close()
+
+             # Use the ContentsExtractor to parse the Contents page
+             json_result = self.contents_extractor.extract_contents(contents_text)
+             topics_dict = json.loads(json_result)
+
+             # Look for subject-content topics (allowing for variations in naming)
+             subject_content_key = None
+             for key in topics_dict:
+                 if 'subject content' in key.lower():
+                     subject_content_key = key
+                     break
+
+             if subject_content_key:
+                 start_page, end_page = topics_dict[subject_content_key]
+                 logger.info(f"Found subject content pages: {start_page} to {end_page}")
+                 return start_page, end_page
+             else:
+                 logger.warning("Subject content section not found in contents")
+                 return None
+         except Exception as e:
+             logger.error(f"Error extracting subject content pages: {e}")
+             return None
+
+     def extract_pages_to_new_pdf(self, input_pdf: str, start_page: int, end_page: int) -> str:
+         """
+         Creates a new PDF containing only the specified page range.
+         """
+         try:
+             doc = fitz.open(input_pdf)
+             new_doc = fitz.open()
+
+             # Convert from page numbers in the contents (1-based) to 0-based indices
+             start_idx = start_page - 1
+             end_idx = end_page - 1
+
+             # Clamp to a valid page range
+             start_idx = max(0, start_idx)
+             end_idx = min(doc.page_count - 1, end_idx)
+
+             # Copy pages from the original to the new document
+             for page_num in range(start_idx, end_idx + 1):
+                 new_doc.insert_pdf(doc, from_page=page_num, to_page=page_num)
+
+             # Save the new PDF
+             temp_pdf_path = os.path.join(self.output_folder, "subject_content.pdf")
+             new_doc.save(temp_pdf_path)
+             new_doc.close()
+             doc.close()
+
+             logger.info(f"Created new PDF with pages {start_page} to {end_page} at {temp_pdf_path}")
+             return temp_pdf_path
+         except Exception as e:
+             logger.error(f"Error extracting pages to new PDF: {e}")
+             return input_pdf  # Return the original if extraction fails
+
+     def process(self, pdf_path: str) -> Optional[str]:
+         """
+         Main processing function:
+         1. Check whether the PDF is a specification document
+         2. Find the Contents page
+         3. Extract the subject-content page range
+         4. Create a new PDF with only those pages
+         5. Process the new PDF using the existing PDF processor
+         """
+         try:
+             # Check whether it's a specification document
+             is_spec = self.check_for_specification(pdf_path)
+             if not is_spec:
+                 logger.info(f"Not a specification document, processing entire PDF: {pdf_path}")
+                 return self.pdf_processor.process(pdf_path)
+
+             # Find the Contents page
+             contents_page = self.find_contents_page(pdf_path)
+             if contents_page is None:
+                 logger.warning("Contents page not found, processing entire PDF")
+                 return self.pdf_processor.process(pdf_path)
+
+             # Extract the subject-content page range
+             page_range = self.extract_subject_content_pages(pdf_path, contents_page)
+             if page_range is None:
+                 logger.warning("Subject content section not found, processing entire PDF")
+                 return self.pdf_processor.process(pdf_path)
+
+             start_page, end_page = page_range
+
+             # Create a new PDF with only the subject-content pages
+             subject_content_pdf = self.extract_pages_to_new_pdf(pdf_path, start_page, end_page)
+
+             # Process the new PDF
+             logger.info(f"Processing subject content PDF: {subject_content_pdf}")
+             markdown_result = self.pdf_processor.process(subject_content_pdf)
+
+             # Add metadata about the extraction
+             metadata = (
+                 f"# Extracted Subject Content\n\n"
+                 f"Source document: {os.path.basename(pdf_path)}\n"
+                 f"Pages: {start_page} to {end_page}\n\n"
+                 f"---\n\n"
+             )
+
+             final_markdown = metadata + markdown_result
+
+             # Save the final markdown
+             final_md_path = os.path.join(self.output_folder, "final_output_with_metadata.md")
+             with open(final_md_path, "w", encoding="utf-8") as f:
+                 f.write(final_markdown)
+
+             return final_markdown
+         except Exception as e:
+             logger.error(f"Error in selective processing: {e}")
+             # Fall back to processing the entire PDF
+             return self.pdf_processor.process(pdf_path)
+
  if __name__ == "__main__":
-     model = load_svm_model("/home/user/app/model_classification/svm_model_2.joblib")
-     result = classify_image("test.jpg", model)
-     print("Classification result:", result)
+     # The API key should be stored securely; read it from the environment
+     # rather than hard-coding it in the script.
+     GEMINI_API_KEY = os.environ.get("GEMINI_API_KEY", "")
+
+     input_pdf = "/home/user/app/input_output/a-level-pearson-mathematics-specification.pdf"
+     output_dir = "/home/user/app/input_output/outputs"
+
+     processor = SelectivePDFProcessor(output_folder=output_dir, api_key=GEMINI_API_KEY)
+     result = processor.process(input_pdf)
+
+     if result:
+         logger.info("Processing completed successfully")
+     else:
+         logger.error("Processing failed")
input_output/output/final_output.md CHANGED
The diff for this file is too large to render; see the raw diff.
 
input_output/output/images/img_1.png CHANGED
input_output/output/images/img_10.png CHANGED
input_output/output/images/img_10.png_rows/row_0/col_0.png CHANGED
input_output/output/images/img_10.png_rows/row_0/col_2.png ADDED
input_output/output/images/img_11.png CHANGED
input_output/output/images/img_11.png_rows/row_0/col_0.png CHANGED
input_output/output/images/img_11.png_rows/row_0/col_1.png CHANGED
input_output/output/images/img_11.png_rows/row_1/col_0.png CHANGED
input_output/output/images/img_11.png_rows/row_1/col_1.png ADDED
input_output/output/images/img_11.png_rows/row_2/col_0.png ADDED
input_output/output/images/img_11.png_rows/row_3/col_0.png ADDED
input_output/output/images/img_12.png CHANGED
input_output/output/images/img_12.png_rows/row_0/col_0.png CHANGED
input_output/output/images/img_12.png_rows/row_0/col_1.png CHANGED
input_output/output/images/img_12.png_rows/row_1/col_0.png CHANGED
input_output/output/images/img_12.png_rows/row_1/col_1.png CHANGED
input_output/output/images/img_12.png_rows/row_2/col_0.png ADDED
input_output/output/images/img_12.png_rows/row_2/col_1.png ADDED
input_output/output/images/img_12.png_rows/row_3/col_0.png ADDED
input_output/output/images/img_13.png CHANGED
input_output/output/images/img_13.png_rows/row_0/col_0.png CHANGED
input_output/output/images/img_13.png_rows/row_0/col_1.png CHANGED
input_output/output/images/img_13.png_rows/row_1/col_0.png CHANGED
input_output/output/images/img_13.png_rows/row_1/col_1.png CHANGED
input_output/output/images/img_13.png_rows/row_2/col_0.png CHANGED
input_output/output/images/img_13.png_rows/row_3/col_0.png ADDED
input_output/output/images/img_13.png_rows/row_3/col_1.png ADDED
input_output/output/images/img_14.png CHANGED
input_output/output/images/img_14.png_rows/row_0/col_0.png CHANGED
input_output/output/images/img_14.png_rows/row_0/col_1.png CHANGED
input_output/output/images/img_14.png_rows/row_1/col_0.png CHANGED
input_output/output/images/img_14.png_rows/row_1/col_1.png CHANGED
input_output/output/images/img_14.png_rows/row_2/col_0.png ADDED
input_output/output/images/img_14.png_rows/row_3/col_0.png ADDED
input_output/output/images/img_14.png_rows/row_4/col_0.png ADDED
input_output/output/images/img_14.png_rows/row_5/col_0.png ADDED
input_output/output/images/img_15.png CHANGED
input_output/output/images/img_15.png_rows/row_0/col_0.png CHANGED
input_output/output/images/img_15.png_rows/row_0/col_1.png CHANGED
input_output/output/images/img_15.png_rows/row_1/col_0.png CHANGED
input_output/output/images/img_15.png_rows/row_1/col_1.png ADDED