MinerU

SkyNait committed on Mar 4

Commit

145c342

1 Parent(s): 121f305

correct JSON and filtering

Files changed (9)

__pycache__/inference_svm_model.cpython-310.pyc +0 -0
__pycache__/mineru_single.cpython-310.pyc +0 -0
__pycache__/table_row_extraction.cpython-310.pyc +0 -0
__pycache__/topic_extraction.cpython-310.pyc +0 -0
__pycache__/worker.cpython-310.pyc +0 -0
pearson_json/subtopics.json +914 -0
table_row_extraction.py +167 -149
topic_extr.py +213 -289
topic_extraction.log +311 -0

__pycache__/inference_svm_model.cpython-310.pyc CHANGED

Binary files a/__pycache__/inference_svm_model.cpython-310.pyc and b/__pycache__/inference_svm_model.cpython-310.pyc differ

__pycache__/mineru_single.cpython-310.pyc CHANGED

Binary files a/__pycache__/mineru_single.cpython-310.pyc and b/__pycache__/mineru_single.cpython-310.pyc differ

__pycache__/table_row_extraction.cpython-310.pyc CHANGED

Binary files a/__pycache__/table_row_extraction.cpython-310.pyc and b/__pycache__/table_row_extraction.cpython-310.pyc differ

__pycache__/topic_extraction.cpython-310.pyc CHANGED

Binary files a/__pycache__/topic_extraction.cpython-310.pyc and b/__pycache__/topic_extraction.cpython-310.pyc differ

__pycache__/worker.cpython-310.pyc CHANGED

Binary files a/__pycache__/worker.cpython-310.pyc and b/__pycache__/worker.cpython-310.pyc differ

pearson_json/subtopics.json ADDED

	@@ -0,0 +1,914 @@

+[
+  {
+    "title": "1 Statistical sampling",
+    "contents": [
+      {
+        "type": "image",
+        "key": "/topic-extraction/cells/img_1.jpg_r1_c0.png"
+      },
+      {
+        "type": "image",
+        "key": "/topic-extraction/cells/img_19.jpg_r2_c0.png"
+      }
+    ],
+    "children": [
+      {
+        "title": "1.1",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_1.jpg_r1_c1.png"
+          },
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_19.jpg_r2_c1.png"
+          }
+        ],
+        "children": []
+      }
+    ]
+  },
+  {
+    "title": "2 Data presentation and interpretation",
+    "contents": [
+      {
+        "type": "image",
+        "key": "/topic-extraction/cells/img_2.jpg_r1_c0.png"
+      },
+      {
+        "type": "image",
+        "key": "/topic-extraction/cells/img_3.jpg_r1_c0.png"
+      },
+      {
+        "type": "image",
+        "key": "/topic-extraction/cells/img_4.jpg_r2_c0.png"
+      },
+      {
+        "type": "image",
+        "key": "/topic-extraction/cells/img_5.jpg_r1_c0.png"
+      },
+      {
+        "type": "image",
+        "key": "/topic-extraction/cells/img_6.jpg_r1_c0.png"
+      },
+      {
+        "type": "image",
+        "key": "/topic-extraction/cells/img_20.jpg_r1_c0.png"
+      },
+      {
+        "type": "image",
+        "key": "/topic-extraction/cells/img_21.jpg_r1_c0.png"
+      }
+    ],
+    "children": [
+      {
+        "title": "2.1",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_2.jpg_r1_c1.png"
+          },
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_19.jpg_r3_c1.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "2.2",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_2.jpg_r2_c0.png"
+          },
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_20.jpg_r1_c1.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "2.3",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_2.jpg_r3_c0.png"
+          },
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_20.jpg_r2_c0.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "2.4",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_2.jpg_r4_c0.png"
+          },
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_21.jpg_r1_c1.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "2.5",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_3.jpg_r1_c1.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "2.6",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_3.jpg_r2_c0.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "2.7",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_4.jpg_r2_c1.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "2.8",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_5.jpg_r1_c1.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "2.9",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_5.jpg_r2_c0.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "2.10",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_5.jpg_r3_c0.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "2.11",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_6.jpg_r1_c1.png"
+          }
+        ],
+        "children": []
+      }
+    ]
+  },
+  {
+    "title": "3 Coordinate geometry in the (x, y) plane",
+    "contents": [
+      {
+        "type": "image",
+        "key": "/topic-extraction/cells/img_7.jpg_r1_c0.png"
+      },
+      {
+        "type": "image",
+        "key": "/topic-extraction/cells/img_22.jpg_r1_c0.png"
+      }
+    ],
+    "children": [
+      {
+        "title": "3.1",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_6.jpg_r2_c1.png"
+          },
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_21.jpg_r2_c1.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "3.2",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_6.jpg_r3_c0.png"
+          },
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_21.jpg_r3_c0.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "3.3",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_7.jpg_r1_c1.png"
+          },
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_22.jpg_r1_c1.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "3.4",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_7.jpg_r2_c0.png"
+          }
+        ],
+        "children": []
+      }
+    ]
+  },
+  {
+    "title": "4 Statistical distributions",
+    "contents": [
+      {
+        "type": "image",
+        "key": "/topic-extraction/cells/img_8.jpg_r2_c0.png"
+      },
+      {
+        "type": "image",
+        "key": "/topic-extraction/cells/img_23.jpg_r1_c0.png"
+      }
+    ],
+    "children": [
+      {
+        "title": "4.1",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_7.jpg_r3_c1.png"
+          },
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_22.jpg_r2_c1.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "4.2",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_8.jpg_r2_c1.png"
+          },
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_22.jpg_r3_c0.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "4.3",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_8.jpg_r3_c0.png"
+          },
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_23.jpg_r1_c1.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "4.4",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_8.jpg_r4_c0.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "4.5",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_8.jpg_r5_c0.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "4.6",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_8.jpg_r6_c0.png"
+          }
+        ],
+        "children": []
+      }
+    ]
+  },
+  {
+    "title": "5 Statistical hypothesis testing",
+    "contents": [
+      {
+        "type": "image",
+        "key": "/topic-extraction/cells/img_9.jpg_r1_c0.png"
+      },
+      {
+        "type": "image",
+        "key": "/topic-extraction/cells/img_10.jpg_r1_c0.png"
+      },
+      {
+        "type": "image",
+        "key": "/topic-extraction/cells/img_24.jpg_r2_c0.png"
+      }
+    ],
+    "children": [
+      {
+        "title": "5.1",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_9.jpg_r1_c1.png"
+          },
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_23.jpg_r2_c1.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "5.2",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_9.jpg_r2_c0.png"
+          },
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_24.jpg_r2_c1.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "5.3",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_9.jpg_r3_c0.png"
+          },
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_24.jpg_r3_c0.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "5.4",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_9.jpg_r4_c0.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "5.5",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_10.jpg_r1_c1.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "5.6",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_10.jpg_r2_c0.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "5.7",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_10.jpg_r3_c0.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "5.8",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_10.jpg_r4_c0.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "5.9",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_10.jpg_r5_c0.png"
+          }
+        ],
+        "children": []
+      }
+    ]
+  },
+  {
+    "title": "6 Exponentials and logarithms",
+    "contents": [
+      {
+        "type": "image",
+        "key": "/topic-extraction/cells/img_12.jpg_r2_c0.png"
+      }
+    ],
+    "children": [
+      {
+        "title": "6.1",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_11.jpg_r1_c0.png"
+          },
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_24.jpg_r4_c1.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "6.2",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_11.jpg_r2_c0.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "6.3",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_11.jpg_r3_c0.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "6.4",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_11.jpg_r4_c0.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "6.5",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_11.jpg_r5_c0.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "6.6",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_11.jpg_r6_c0.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "6.7",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_12.jpg_r2_c1.png"
+          }
+        ],
+        "children": []
+      }
+    ]
+  },
+  {
+    "title": "7 Differentiation",
+    "contents": [
+      {
+        "type": "image",
+        "key": "/topic-extraction/cells/img_13.jpg_r2_c0.png"
+      },
+      {
+        "type": "image",
+        "key": "/topic-extraction/cells/img_14.jpg_r1_c0.png"
+      }
+    ],
+    "children": [
+      {
+        "title": "7.1",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_13.jpg_r2_c1.png"
+          },
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_25.jpg_r1_c0.png"
+          },
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_12.jpg_r3_c1.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "7.2",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_13.jpg_r3_c0.png"
+          },
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_25.jpg_r2_c0.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "7.3",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_13.jpg_r5_c0.png"
+          },
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_25.jpg_r3_c0.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "7.4",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_14.jpg_r1_c1.png"
+          },
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_25.jpg_r4_c0.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "7.5",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_14.jpg_r2_c0.png"
+          },
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_25.jpg_r5_c0.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "7.6",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_14.jpg_r3_c0.png"
+          }
+        ],
+        "children": []
+      }
+    ]
+  },
+  {
+    "title": "8 Forces and Newton's laws",
+    "contents": [
+      {
+        "type": "image",
+        "key": "/topic-extraction/cells/img_15.jpg_r1_c0.png"
+      },
+      {
+        "type": "image",
+        "key": "/topic-extraction/cells/img_16.jpg_r2_c0.png"
+      },
+      {
+        "type": "image",
+        "key": "/topic-extraction/cells/img_26.jpg_r1_c0.png"
+      },
+      {
+        "type": "image",
+        "key": "/topic-extraction/cells/img_27.jpg_r1_c0.png"
+      }
+    ],
+    "children": [
+      {
+        "title": "8.1",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_26.jpg_r1_c1.png"
+          },
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_14.jpg_r4_c1.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "8.2",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_26.jpg_r2_c0.png"
+          },
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_14.jpg_r5_c0.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "8.3",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_15.jpg_r1_c1.png"
+          },
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_26.jpg_r3_c0.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "8.4",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_15.jpg_r2_c0.png"
+          },
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_27.jpg_r1_c1.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "8.5",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_15.jpg_r3_c0.png"
+          },
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_27.jpg_r2_c0.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "8.6",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_15.jpg_r4_c0.png"
+          },
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_27.jpg_r3_c0.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "8.7",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_16.jpg_r2_c1.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "8.8",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_16.jpg_r3_c0.png"
+          }
+        ],
+        "children": []
+      }
+    ]
+  },
+  {
+    "title": "9 Numerical methods",
+    "contents": [
+      {
+        "type": "image",
+        "key": "/topic-extraction/cells/img_17.jpg_r1_c0.png"
+      }
+    ],
+    "children": [
+      {
+        "title": "9.1",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_16.jpg_r4_c1.png"
+          },
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_27.jpg_r4_c1.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "9.2",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_16.jpg_r5_c0.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "9.3",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_16.jpg_r6_c0.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "9.4",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_17.jpg_r1_c1.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "9.5",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_17.jpg_r2_c0.png"
+          }
+        ],
+        "children": []
+      }
+    ]
+  },
+  {
+    "title": "10 Vectors",
+    "contents": [
+      {
+        "type": "image",
+        "key": "/topic-extraction/cells/img_18.jpg_r2_c0.png"
+      }
+    ],
+    "children": [
+      {
+        "title": "10.1",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_17.jpg_r3_c1.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "10.2",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_17.jpg_r4_c0.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "10.3",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_17.jpg_r5_c0.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "10.4",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_17.jpg_r6_c0.png"
+          }
+        ],
+        "children": []
+      },
+      {
+        "title": "10.5",
+        "contents": [
+          {
+            "type": "image",
+            "key": "/topic-extraction/cells/img_18.jpg_r2_c1.png"
+          }
+        ],
+        "children": []
+      }
+    ]
+  },
+  {
+    "title": "A01",
+    "contents": [
+      {
+        "type": "image",
+        "key": "/topic-extraction/cells/img_28.jpg_r1_c0.png"
+      }
+    ],
+    "children": []
+  }
+]
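
The added subtopics.json is a list of topic nodes, each with a "title", a "contents" list of image references, and a "children" list of nodes in the same shape. The following is a minimal sketch of how such a file could be read, not code from this commit. The names `SAMPLE`, `KEY_RE`, `parse_key`, and `walk` are hypothetical, and the key pattern `img_<n>.jpg_r<row>_c<col>.png` is an assumption inferred from the keys above; what the three numbers mean to the pipeline is not stated in the diff.

```python
import json
import re

# Trimmed sample in the same shape as pearson_json/subtopics.json.
SAMPLE = """
[
  {
    "title": "1 Statistical sampling",
    "contents": [
      {"type": "image", "key": "/topic-extraction/cells/img_1.jpg_r1_c0.png"},
      {"type": "image", "key": "/topic-extraction/cells/img_19.jpg_r2_c0.png"}
    ],
    "children": [
      {
        "title": "1.1",
        "contents": [
          {"type": "image", "key": "/topic-extraction/cells/img_1.jpg_r1_c1.png"},
          {"type": "image", "key": "/topic-extraction/cells/img_19.jpg_r2_c1.png"}
        ],
        "children": []
      }
    ]
  }
]
"""

# Assumed key pattern (inferred from the data, not documented in the commit).
KEY_RE = re.compile(r"img_(\d+)\.jpg_r(\d+)_c(\d+)\.png$")


def parse_key(key):
    """Return (image_index, row, col) from a key, or None if it does not match."""
    match = KEY_RE.search(key)
    return tuple(int(g) for g in match.groups()) if match else None


def walk(nodes, depth=0):
    """Yield (depth, title, parsed_keys) for every node, depth-first."""
    for node in nodes:
        cells = [parse_key(item["key"]) for item in node.get("contents", [])]
        yield depth, node["title"], cells
        yield from walk(node.get("children", []), depth + 1)


topics = json.loads(SAMPLE)
rows = list(walk(topics))
for depth, title, cells in rows:
    # Indent child topics under their parent.
    print("  " * depth + title, cells)
```

To run this against the real file, replace `json.loads(SAMPLE)` with `json.load(open("pearson_json/subtopics.json"))`. The recursion does not assume a fixed depth, so it should also handle deeper nesting if a later version of the file adds it.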

table_row_extraction.py CHANGED

@@ -1,5 +1,6 @@
 import cv2
 import numpy as np
 import logging
 from pathlib import Path
 from typing import List, Tuple
@@ -10,10 +11,27 @@ logger = logging.getLogger(__name__)
 # if you are working with 3-column tables, change `merge_two_col_rows` and `enable_subtopic_merge` to False
 # otherwise set them to True if you are working with 2-column tables  (currently hardcoded, just test)
 class TableExtractor:
     def __init__(
             self,
-            #preprocessing parameters
             denoise_h: int = 10,
             clahe_clip: float = 3.0,
             clahe_grid: int = 8,
@@ -23,44 +41,40 @@ class TableExtractor:
             thresh_block_size: int = 21,
             thresh_C: int = 7,
-            # Row detection parameters
             horizontal_scale: int = 20,
-            row_morph_iterations: int = 2,
-            min_row_height: int = 30,
             min_row_density: float = 0.01,
-            # Column detection parameters
             vertical_scale: int = 20,
             col_morph_iterations: int = 2,
             min_col_height_ratio: float = 0.5,
             min_col_density: float = 0.01,
-            # Bounding box extraction
             padding: int = 0,
             skip_header: bool = True,
-            # Two-column & subtopic merges
-            merge_two_col_rows: bool = False,
-            enable_subtopic_merge: bool = False,
             subtopic_threshold: float = 0.2,
-            #gray artifact filter
-            std_threshold_for_artifacts: float = 5.0,
-            #parameters for line removal check
-            line_removal_scale: int = 15,
-            line_removal_iterations: int = 1,
-            min_text_ratio_after_line_removal: float = 0.001
     ):
-        """
-        :param merge_two_col_rows: If True, a row with exactly 1 vertical line => merges into 1 bounding box.
-        :param enable_subtopic_merge: If True, a row with 2 vertical lines => 3 columns can become 2 if left is narrow.
-        :param subtopic_threshold: Fraction of row width for subtopic detection.
-        :param std_threshold_for_artifacts: Grayscale std dev < this => skip as artifact.
-        :param line_removal_scale: Larger => more aggressive line detection inside the cell.
-        :param line_removal_iterations: Morphological iterations for line removal.
-        :param min_text_ratio_after_line_removal: If fraction of text after removing lines < this => skip cell.
-        """
         # Preprocessing
         self.denoise_h = denoise_h
         self.clahe_clip = clahe_clip
@@ -75,6 +89,11 @@ class TableExtractor:
         self.min_row_height = min_row_height
         self.min_row_density = min_row_density
         # Column detection
         self.vertical_scale = vertical_scale
         self.col_morph_iterations = col_morph_iterations
@@ -85,28 +104,31 @@ class TableExtractor:
         self.padding = padding
         self.skip_header = skip_header
-        # Two-column / subtopic merges
         self.merge_two_col_rows = merge_two_col_rows
         self.enable_subtopic_merge = enable_subtopic_merge
         self.subtopic_threshold = subtopic_threshold
-        #artifact filtering (gray headers, purple, etc) / currently not working well
-        self.std_threshold_for_artifacts = std_threshold_for_artifacts
-        #line removal inside cell
-        self.line_removal_scale = line_removal_scale
-        self.line_removal_iterations = line_removal_iterations
-        self.min_text_ratio_after_line_removal = min_text_ratio_after_line_removal
     def preprocess(self, img: np.ndarray) -> np.ndarray:
-        """Grayscale, denoise, CLAHE, sharpen, adaptive threshold (binary_inv)."""
         if img.ndim == 3:
             gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
         else:
             gray = img.copy()
         denoised = cv2.fastNlMeansDenoising(gray, h=self.denoise_h)
-        clahe = cv2.createCLAHE(clipLimit=self.clahe_clip, tileGridSize=(self.clahe_grid, self.clahe_grid))
         enhanced = clahe.apply(denoised)
         sharpened = cv2.filter2D(enhanced, -1, self.sharpen_kernel)
@@ -120,75 +142,95 @@ class TableExtractor:
         return binarized
     def detect_full_rows(self, bin_img: np.ndarray) -> List[Tuple[int, int]]:
-        """Find horizontal row boundaries in the binarized image."""
         h_kernel_size = max(1, bin_img.shape[1] // self.horizontal_scale)
         horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (h_kernel_size, 1))
-        horizontal_lines = cv2.morphologyEx(bin_img, cv2.MORPH_OPEN, horizontal_kernel,
-                                            iterations=self.row_morph_iterations)
         row_projection = np.sum(horizontal_lines, axis=1)
         max_val = np.max(row_projection) if len(row_projection) else 0
-        # If no lines, treat entire image as one row (opt)
         if max_val < 1e-5:
             return [(0, bin_img.shape[0])]
-        threshold_val = 0.3 * max_val
         line_indices = np.where(row_projection > threshold_val)[0]
         if len(line_indices) < 2:
             return [(0, bin_img.shape[0])]
-        # Group consecutive indices
         lines = []
-        current = [line_indices[0]]
         for i in range(1, len(line_indices)):
-            if line_indices[i] - line_indices[i - 1] <= 2:
-                current.append(line_indices[i])
             else:
-                lines.append(int(np.mean(current)))
-                current = [line_indices[i]]
-        if current:
-            lines.append(int(np.mean(current)))
-        row_bounds = []
         for i in range(len(lines) - 1):
             y1 = lines[i]
             y2 = lines[i + 1]
-            if (y2 - y1) >= self.min_row_height:
-                row_bounds.append((y1, y2))
-        return row_bounds if row_bounds else [(0, bin_img.shape[0])]
-    def detect_columns_in_row(self, row_img: np.ndarray, y1: int, y2: int) -> List[Tuple[int, int, int, int]]:
-        """
-        Detect up to two vertical lines => up to 3 bounding boxes.
-         - 0 lines => 1 bounding box
-         - 1 line => 2 bounding boxes (unless merge_two_col_rows => 1)
-         - 2 lines => 3 bounding boxes by default
-                      if enable_subtopic_merge => check left box < subtopic_threshold => 2 boxes
-        """
         row_height = (y2 - y1)
         row_width = row_img.shape[1]
-        # Morph kernel for vertical lines
         v_kernel_size = max(1, row_height // self.vertical_scale)
         vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, v_kernel_size))
-        vertical_lines = cv2.morphologyEx(row_img, cv2.MORPH_OPEN, vertical_kernel,
-                                          iterations=self.col_morph_iterations)
-        vertical_lines = cv2.dilate(vertical_lines, np.ones((3, 3), np.uint8), iterations=1)
         # Find contours => x positions
-        contours, _ = cv2.findContours(vertical_lines, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
         x_positions = []
         for c in contours:
-            x, y, w, h = cv2.boundingRect(c)
-            # Must be at least half the row height to be considered a real column divider
             if h >= self.min_col_height_ratio * row_height:
                 x_positions.append(x)
-        x_positions = sorted(set(x_positions))
         # Keep at most 2 vertical lines
         if len(x_positions) > 2:
             x_positions = x_positions[:2]
@@ -209,14 +251,12 @@ class TableExtractor:
                     (0,    y1, x1,             row_height),
                     (x1,   y1, row_width - x1, row_height)
                 ]
         else:
             # 2 lines => normally 3 bounding boxes
             x1, x2 = sorted(x_positions)
             if self.enable_subtopic_merge:
-                # If left bounding box is very narrow => treat as subtopic => 2 bounding boxes
-                left_box_width = x1
-                if left_box_width < (self.subtopic_threshold * row_width):
                     boxes = [
                         (0,  y1, x1,             row_height),
                         (x1, y1, row_width - x1, row_height)
@@ -239,12 +279,12 @@ class TableExtractor:
         for (x, y, w, h) in boxes:
             if w <= 0:
                 continue
-            subregion = row_img[:, x : x + w]
             white_pixels = np.sum(subregion == 255)
             total_pixels = subregion.size
             if total_pixels == 0:
                 continue
-            density = white_pixels / total_pixels
             if density >= self.min_col_density:
                 filtered.append((x, y, w, h))
@@ -253,9 +293,9 @@ class TableExtractor:
     def process_image(self, image_path: str) -> List[List[Tuple[int, int, int, int]]]:
         """
         1) Preprocess => bin_img
-        2) Detect row segments
         3) Filter out rows by density
-            - optionally skip first row (header)
         5) For each row => detect columns => bounding boxes
         """
         img = cv2.imread(image_path)
@@ -273,15 +313,15 @@ class TableExtractor:
             if area == 0:
                 continue
             white_pixels = np.sum(row_region == 255)
-            density = white_pixels / area
             if density >= self.min_row_density:
                 valid_rows.append((y1, y2))
-        # Possibly skip header row
         if self.skip_header and len(valid_rows) > 1:
             valid_rows = valid_rows[1:]
-        # Detect columns in each row
         all_rows_boxes = []
         for (y1, y2) in valid_rows:
             row_img = bin_img[y1:y2, :]
@@ -291,8 +331,12 @@ class TableExtractor:
         return all_rows_boxes
-    def extract_box_image(self, original: np.ndarray, box: Tuple[int, int, int, int]) -> np.ndarray:
-        """Crop bounding box from original with optional padding."""
         x, y, w, h = box
         Y1 = max(0, y - self.padding)
         Y2 = min(original.shape[0], y + h + self.padding)
@@ -300,59 +344,47 @@ class TableExtractor:
         X2 = min(original.shape[1], x + w + self.padding)
         return original[Y1:Y2, X1:X2]
-    def _remove_lines_in_cell(self, gray_bin: np.ndarray) -> np.ndarray:
         """
-        Remove horizontal + vertical lines from a binarized subregion
-        and return the 'text-only' mask.
-        """
-        # 1) horizontal line detection
-        horiz_kernel_size = max(1, gray_bin.shape[1] // self.line_removal_scale)
-        horiz_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (horiz_kernel_size, 1))
-        horizontal = cv2.morphologyEx(gray_bin, cv2.MORPH_OPEN, horiz_kernel, iterations=self.line_removal_iterations)
-        # 2) vertical line detection
-        vert_kernel_size = max(1, gray_bin.shape[0] // self.line_removal_scale)
-        vert_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, vert_kernel_size))
-        vertical = cv2.morphologyEx(gray_bin, cv2.MORPH_OPEN, vert_kernel, iterations=self.line_removal_iterations)
-        # Combine lines
-        lines = cv2.bitwise_or(horizontal, vertical)
-        # Subtract from the original => text-only
-        text_only = cv2.bitwise_and(gray_bin, cv2.bitwise_not(lines))
-        return text_only
-    def is_grey_artifact(self, cell_img: np.ndarray) -> bool:
-        """
-        1) If grayscale std dev < std_threshold_for_artifacts => skip as uniform.
-        2) Otherwise, remove lines from an Otsu-binarized version of the cell
-           and check if there's enough text left. If not, skip as artifact.
         """
         if cell_img.size == 0:
             return True
-        gray = cv2.cvtColor(cell_img, cv2.COLOR_BGR2GRAY)
-        std_val = np.std(gray)
-        if std_val < self.std_threshold_for_artifacts:
             return True
-        # 2) Binarize => remove lines => check leftover text
-        #    Use Otsu threshold for the local cell
-        _, cell_bin = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)
-        text_only = self._remove_lines_in_cell(cell_bin)
-        nonzero_text = cv2.countNonZero(text_only)
-        ratio = nonzero_text / float(cell_bin.size)
-        if ratio < self.min_text_ratio_after_line_removal:
-            # Hardly any text remains => artifact
             return True
         return False
     def save_extracted_cells(
-            self, image_path: str, row_boxes: List[List[Tuple[int, int, int, int]]], output_dir: str
     ):
-        """Save each cell from the original image, skipping uniform/gray artifacts."""
         out_path = Path(output_dir)
         out_path.mkdir(exist_ok=True, parents=True)
@@ -365,14 +397,15 @@ class TableExtractor:
             row_dir.mkdir(exist_ok=True)
             for j, box in enumerate(row):
                 cell_img = self.extract_box_image(original, box)
-                # Skip if uniform or line-only artifact
-                if self.is_grey_artifact(cell_img):
-                    logger.info(f"Skipping artifact cell at row={i}, col={j}. (uniform/grey/line-only)")
                     continue
                 out_file = row_dir / f"col_{j}.png"
                 cv2.imwrite(str(out_file), cell_img)
-                logger.info(f"Saved cell image row={i}, col={j} -> {out_file}")
 class TableExtractorApp:
     def __init__(self, extractor: TableExtractor):
@@ -384,39 +417,24 @@ class TableExtractorApp:
         self.extractor.save_extracted_cells(input_image, row_boxes, output_folder)
         logger.info("Done. Check the output folder for results.")
 if __name__ == "__main__":
-    input_image = "images/test/img_2.png"
-    output_folder = "refined_outp"
     extractor = TableExtractor(
-        denoise_h=10,
-        clahe_clip=3.0,
-        clahe_grid=8,
-        thresh_block_size=21,
-        thresh_C=7,
-        horizontal_scale=20,
-        row_morph_iterations=2,
-        min_row_height=30,
-        min_row_density=0.01,
-        vertical_scale=20,
-        col_morph_iterations=2,
-        min_col_height_ratio=0.5,
-        min_col_density=0.01,
-        padding=1,
-        skip_header=True,
         merge_two_col_rows=True,
         enable_subtopic_merge=True,
         subtopic_threshold=0.2,
-        std_threshold_for_artifacts=10.0,
-        line_removal_scale=20,
-        line_removal_iterations=1,
-        min_text_ratio_after_line_removal=0.001
     )
     app = TableExtractorApp(extractor)

 import cv2
 import numpy as np
+import math
 import logging
 from pathlib import Path
 from typing import List, Tuple
 # For 3-column tables, set `merge_two_col_rows` and `enable_subtopic_merge` to False.
 # For 2-column tables, set both to True (the values are currently hardcoded for testing).
+def color_distance(c1: Tuple[float, float, float],
+                   c2: Tuple[float, float, float]) -> float:
+    """
+    Euclidean distance between two BGR colors c1 and c2.
+    """
+    return math.sqrt((c1[0] - c2[0])**2 + (c1[1] - c2[1])**2 + (c1[2] - c2[2])**2)
+def average_bgr(cell_img: np.ndarray) -> Tuple[float, float, float]:
+    """
+    Return the average BGR color of the entire cell_img.
+    """
+    b_mean = np.mean(cell_img[:, :, 0])
+    g_mean = np.mean(cell_img[:, :, 1])
+    r_mean = np.mean(cell_img[:, :, 2])
+    return (b_mean, g_mean, r_mean)
 class TableExtractor:
     def __init__(
             self,
+            # --- Preprocessing ---
             denoise_h: int = 10,
             clahe_clip: float = 3.0,
             clahe_grid: int = 8,
             thresh_block_size: int = 21,
             thresh_C: int = 7,
+            # --- Row detection ---
             horizontal_scale: int = 20,
+            row_morph_iterations: int = 1,
+            min_row_height: int = 15,
             min_row_density: float = 0.01,
+            # Additional row detection parameters
+            faint_line_threshold_factor: float = 0.1,
+            top_line_grouping_px: int = 8,
+            some_minimum_text_pixels: int = 50,
+            # --- Column detection ---
             vertical_scale: int = 20,
             col_morph_iterations: int = 2,
             min_col_height_ratio: float = 0.5,
             min_col_density: float = 0.01,
+            # --- Bbox extraction ---
             padding: int = 0,
             skip_header: bool = True,
+            # --- Two-column & subtopic merges ---
+            merge_two_col_rows: bool = True,
+            enable_subtopic_merge: bool = True,
             subtopic_threshold: float = 0.2,
+            # --- Color-based artifact filter ---
+            artifact_color_a6: Tuple[int, int, int] = (166, 166, 166),
+            artifact_color_a7: Tuple[int, int, int] = (180, 180, 180),
+            artifact_color_a8: Tuple[int, int, int] = (80, 48, 0),
+            artifact_color_a9: Tuple[int, int, int] = (223, 153, 180),
+            artifact_color_a10: Tuple[int, int, int] = (0, 0, 0),
+            color_tolerance: float = 30.0
     ):
         # Preprocessing
         self.denoise_h = denoise_h
         self.clahe_clip = clahe_clip
         self.min_row_height = min_row_height
         self.min_row_density = min_row_density
+        # Additional row detection
+        self.faint_line_threshold_factor = faint_line_threshold_factor
+        self.top_line_grouping_px = top_line_grouping_px
+        self.some_minimum_text_pixels = some_minimum_text_pixels
         # Column detection
         self.vertical_scale = vertical_scale
         self.col_morph_iterations = col_morph_iterations
         self.padding = padding
         self.skip_header = skip_header
+        # Two-column & subtopic merges
         self.merge_two_col_rows = merge_two_col_rows
         self.enable_subtopic_merge = enable_subtopic_merge
         self.subtopic_threshold = subtopic_threshold
+        # Color-based artifact filter
+        self.artifact_color_a6 = artifact_color_a6
+        self.artifact_color_a7 = artifact_color_a7
+        self.artifact_color_a8 = artifact_color_a8
+        self.artifact_color_a9 = artifact_color_a9
+        self.artifact_color_a10 = artifact_color_a10
+        self.color_tolerance = color_tolerance
     def preprocess(self, img: np.ndarray) -> np.ndarray:
+        """
+        Grayscale, denoise, CLAHE, sharpen, then adaptive threshold (binary_inv).
+        """
         if img.ndim == 3:
             gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
         else:
             gray = img.copy()
         denoised = cv2.fastNlMeansDenoising(gray, h=self.denoise_h)
+        clahe = cv2.createCLAHE(clipLimit=self.clahe_clip,
+                                tileGridSize=(self.clahe_grid, self.clahe_grid))
         enhanced = clahe.apply(denoised)
         sharpened = cv2.filter2D(enhanced, -1, self.sharpen_kernel)
         return binarized
     def detect_full_rows(self, bin_img: np.ndarray) -> List[Tuple[int, int]]:
         h_kernel_size = max(1, bin_img.shape[1] // self.horizontal_scale)
         horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (h_kernel_size, 1))
+        horizontal_lines = cv2.morphologyEx(
+            bin_img, cv2.MORPH_OPEN, horizontal_kernel,
+            iterations=self.row_morph_iterations
+        )
         row_projection = np.sum(horizontal_lines, axis=1)
         max_val = np.max(row_projection) if len(row_projection) else 0
         if max_val < 1e-5:
             return [(0, bin_img.shape[0])]
+        threshold_val = self.faint_line_threshold_factor * max_val
         line_indices = np.where(row_projection > threshold_val)[0]
         if len(line_indices) < 2:
             return [(0, bin_img.shape[0])]
         lines = []
+        group = [line_indices[0]]
         for i in range(1, len(line_indices)):
+            if (line_indices[i] - line_indices[i - 1]) <= self.top_line_grouping_px:
+                group.append(line_indices[i])
             else:
+                lines.append(int(np.mean(group)))
+                group = [line_indices[i]]
+        if group:
+            lines.append(int(np.mean(group)))
+        potential_bounds = []
         for i in range(len(lines) - 1):
             y1 = lines[i]
             y2 = lines[i + 1]
+            if (y2 - y1) > 0:
+                potential_bounds.append((y1, y2))
+        if potential_bounds:
+            if potential_bounds[0][0] > 0:
+                potential_bounds.insert(0, (0, potential_bounds[0][0]))
+            if potential_bounds[-1][1] < bin_img.shape[0]:
+                potential_bounds.append((potential_bounds[-1][1], bin_img.shape[0]))
+        else:
+            potential_bounds = [(0, bin_img.shape[0])]
+        final_rows = []
+        for (y1, y2) in potential_bounds:
+            height = (y2 - y1)
+            region = bin_img[y1:y2, :]
+            white_count = np.sum(region == 255)
+            if height < self.min_row_height:
+                if white_count >= self.some_minimum_text_pixels:
+                    final_rows.append((y1, y2))
+            else:
+                final_rows.append((y1, y2))
+        final_rows = sorted(final_rows, key=lambda x: x[0])
+        return final_rows if final_rows else [(0, bin_img.shape[0])]
+    def detect_columns_in_row(self,
+                              row_img: np.ndarray,
+                              y1: int,
+                              y2: int) -> List[Tuple[int, int, int, int]]:
         row_height = (y2 - y1)
         row_width = row_img.shape[1]
         v_kernel_size = max(1, row_height // self.vertical_scale)
         vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, v_kernel_size))
+        vertical_lines = cv2.morphologyEx(
+            row_img, cv2.MORPH_OPEN, vertical_kernel,
+            iterations=self.col_morph_iterations
+        )
+        vertical_lines = cv2.dilate(vertical_lines,
+                                    np.ones((3, 3), np.uint8),
+                                    iterations=1)
         # Find contours => x positions
+        contours, _ = cv2.findContours(vertical_lines,
+                                       cv2.RETR_EXTERNAL,
+                                       cv2.CHAIN_APPROX_SIMPLE)
         x_positions = []
         for c in contours:
+            x, _, w, h = cv2.boundingRect(c)
+            # Must reach min_col_height_ratio (default 0.5) of the row height to be a real divider
             if h >= self.min_col_height_ratio * row_height:
                 x_positions.append(x)
+        x_positions = sorted(set(x_positions))
         # Keep at most 2 vertical lines
         if len(x_positions) > 2:
             x_positions = x_positions[:2]
                     (0,    y1, x1,             row_height),
                     (x1,   y1, row_width - x1, row_height)
                 ]
         else:
             # 2 lines => normally 3 bounding boxes
             x1, x2 = sorted(x_positions)
             if self.enable_subtopic_merge:
+                # If left bounding box is very narrow => treat as subtopic => 2 boxes
+                if x1 < (self.subtopic_threshold * row_width):
                     boxes = [
                         (0,  y1, x1,             row_height),
                         (x1, y1, row_width - x1, row_height)
         for (x, y, w, h) in boxes:
             if w <= 0:
                 continue
+            subregion = row_img[:, x:x+w]
             white_pixels = np.sum(subregion == 255)
             total_pixels = subregion.size
             if total_pixels == 0:
                 continue
+            density = white_pixels / float(total_pixels)
             if density >= self.min_col_density:
                 filtered.append((x, y, w, h))
     def process_image(self, image_path: str) -> List[List[Tuple[int, int, int, int]]]:
         """
         1) Preprocess => bin_img
+        2) Detect row segments (with faint-line logic)
         3) Filter out rows by density
+        4) Optionally skip the first row (header)
         5) For each row => detect columns => bounding boxes
         """
         img = cv2.imread(image_path)
             if area == 0:
                 continue
             white_pixels = np.sum(row_region == 255)
+            density = white_pixels / float(area)
             if density >= self.min_row_density:
                 valid_rows.append((y1, y2))
+        # Optionally skip the header row
         if self.skip_header and len(valid_rows) > 1:
             valid_rows = valid_rows[1:]
+        # Detect columns in each valid row
         all_rows_boxes = []
         for (y1, y2) in valid_rows:
             row_img = bin_img[y1:y2, :]
         return all_rows_boxes
+    def extract_box_image(self,
+                          original: np.ndarray,
+                          box: Tuple[int, int, int, int]) -> np.ndarray:
+        """
+        Crop bounding box from original with optional padding.
+        """
         x, y, w, h = box
         Y1 = max(0, y - self.padding)
         Y2 = min(original.shape[0], y + h + self.padding)
         X2 = min(original.shape[1], x + w + self.padding)
         return original[Y1:Y2, X1:X2]
+    def is_artifact_by_color(self, cell_img: np.ndarray) -> bool:
         """
+        Color-based artifact check: return True (skip the cell) if its average BGR
+        color is within color_tolerance of any configured artifact color
+        (artifact_color_a6 through artifact_color_a10). Otherwise, keep it.
         """
         if cell_img.size == 0:
             return True
+        avg_col = average_bgr(cell_img)
+        dist_a6 = color_distance(avg_col, self.artifact_color_a6)
+        if dist_a6 < self.color_tolerance:
             return True
+        dist_a7 = color_distance(avg_col, self.artifact_color_a7)
+        if dist_a7 < self.color_tolerance:
+            return True
+        dist_a8 = color_distance(avg_col, self.artifact_color_a8)
+        if dist_a8 < self.color_tolerance:
+            return True
+        dist_a9 = color_distance(avg_col, self.artifact_color_a9)
+        if dist_a9 < self.color_tolerance:
+            return True
+        dist_a10 = color_distance(avg_col, self.artifact_color_a10)
+        if dist_a10 < self.color_tolerance:
             return True
         return False
     def save_extracted_cells(
+        self,
+        image_path: str,
+        row_boxes: List[List[Tuple[int, int, int, int]]],
+        output_dir: str
     ):
+        """
+        Save each cell from the original image, skipping cells whose average color is near a configured artifact color.
+        """
         out_path = Path(output_dir)
         out_path.mkdir(exist_ok=True, parents=True)
             row_dir.mkdir(exist_ok=True)
             for j, box in enumerate(row):
                 cell_img = self.extract_box_image(original, box)
+                # Check color-based artifact
+                if self.is_artifact_by_color(cell_img):
+                    logger.info(f"Skipping artifact cell at row={i}, col={j} (average color near a configured artifact color).")
                     continue
                 out_file = row_dir / f"col_{j}.png"
                 cv2.imwrite(str(out_file), cell_img)
+                logger.info(f"Saved cell row={i}, col={j} -> {out_file}")
 class TableExtractorApp:
     def __init__(self, extractor: TableExtractor):
         self.extractor.save_extracted_cells(input_image, row_boxes, output_folder)
         logger.info("Done. Check the output folder for results.")
 if __name__ == "__main__":
+    input_image = "images/test/img_9.png"
+    output_folder = "combined_outputs"
     extractor = TableExtractor(
+        row_morph_iterations=1,
+        min_row_height=15,
+        skip_header=False,
         merge_two_col_rows=True,
         enable_subtopic_merge=True,
         subtopic_threshold=0.2,
+        faint_line_threshold_factor=0.4,
+        top_line_grouping_px=12,
+        some_minimum_text_pixels=50,
+        color_tolerance=30.0
     )
     app = TableExtractorApp(extractor)
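
The color-based artifact filter added in this file compares a cell's average BGR color against five reference colors, one `if` per color. A minimal standalone sketch of the same check is shown here, written as a loop over a list of colors. The `ARTIFACT_COLORS` list and the free-function form of `is_artifact_by_color` are assumptions made for illustration; the commit itself keeps the colors as separate constructor parameters on `TableExtractor`.

```python
import math
from typing import Iterable, Tuple

import numpy as np

Color = Tuple[float, float, float]

# Hypothetical list form of the five defaults from TableExtractor.__init__ (BGR order).
ARTIFACT_COLORS = [
    (166, 166, 166),
    (180, 180, 180),
    (80, 48, 0),
    (223, 153, 180),
    (0, 0, 0),
]


def color_distance(c1: Color, c2: Color) -> float:
    """Euclidean distance between two BGR colors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))


def average_bgr(cell_img: np.ndarray) -> Color:
    """Average BGR color over every pixel of an H x W x 3 image."""
    b, g, r = cell_img.reshape(-1, 3).mean(axis=0)
    return (float(b), float(g), float(r))


def is_artifact_by_color(cell_img: np.ndarray,
                         artifact_colors: Iterable[Color],
                         tolerance: float = 30.0) -> bool:
    """True if the cell is empty or its average color is near any artifact color."""
    if cell_img.size == 0:
        return True
    avg = average_bgr(cell_img)
    return any(color_distance(avg, c) < tolerance for c in artifact_colors)


# Synthetic cells: a uniform grey block and a blank white block.
grey_cell = np.full((4, 4, 3), 170, dtype=np.uint8)
white_cell = np.full((4, 4, 3), 255, dtype=np.uint8)
empty_cell = np.zeros((0, 0, 3), dtype=np.uint8)

print(is_artifact_by_color(grey_cell, ARTIFACT_COLORS))   # grey is close to (166, 166, 166)
print(is_artifact_by_color(white_cell, ARTIFACT_COLORS))  # white is far from every reference
```

Because the check uses the mean over the whole cell, a cell that mixes text with a colored background can average close to one of the reference colors, so `color_tolerance` probably needs tuning per document set.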

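The new `detect_full_rows` logic clusters nearby line indices into single horizontal lines (using `top_line_grouping_px`) and then turns consecutive lines into row bounds, padding to the top and bottom of the image. The sketch below reproduces just those two steps on plain integers, without OpenCV. The function names `group_line_rows` and `row_bounds` are hypothetical and do not appear in the commit.

```python
from typing import List, Sequence, Tuple


def group_line_rows(line_indices: Sequence[int], grouping_px: int) -> List[int]:
    """Merge indices that are at most grouping_px apart into one line (their integer mean)."""
    if len(line_indices) == 0:
        return []
    lines: List[int] = []
    group = [int(line_indices[0])]
    for idx in line_indices[1:]:
        if int(idx) - group[-1] <= grouping_px:
            group.append(int(idx))
        else:
            lines.append(sum(group) // len(group))
            group = [int(idx)]
    lines.append(sum(group) // len(group))
    return lines


def row_bounds(lines: Sequence[int], image_height: int) -> List[Tuple[int, int]]:
    """Pair consecutive lines into (y1, y2) rows and extend to the image edges."""
    bounds = [(y1, y2) for y1, y2 in zip(lines, lines[1:]) if y2 - y1 > 0]
    if not bounds:
        # No usable line pairs: treat the whole image as one row.
        return [(0, image_height)]
    if bounds[0][0] > 0:
        bounds.insert(0, (0, bounds[0][0]))
    if bounds[-1][1] < image_height:
        bounds.append((bounds[-1][1], image_height))
    return bounds


# Three clusters of detected line pixels in a 150-pixel-tall image.
indices = [10, 11, 12, 50, 51, 120]
lines = group_line_rows(indices, grouping_px=8)
print(lines)                    # [11, 50, 120]
print(row_bounds(lines, 150))   # [(0, 11), (11, 50), (50, 120), (120, 150)]
```

In the commit, the resulting bounds are further filtered: rows shorter than `min_row_height` are kept only if they contain at least `some_minimum_text_pixels` white pixels. That filtering step is not reproduced in this sketch.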
topic_extr.py CHANGED Viewed

@@ -35,6 +35,7 @@ logger.addHandler(file_handler)
 _GEMINI_CLIENT = None
 def unify_whitespace(text: str) -> str:
     return re.sub(r"\s+", " ", text).strip()
@@ -66,6 +67,123 @@ def create_subset_pdf(original_pdf_bytes: bytes, page_indices: List[int]) -> byt
     doc.close()
     return subset_bytes
 class s3Writer:
     def __init__(self, ak: str, sk: str, bucket: str, endpoint_url: str):
         self.bucket = bucket
@@ -114,15 +232,20 @@ def call_gemini_for_table_classification(image_data: bytes, api_key: str, max_re
             prompt = """You are given an image. Determine if it shows a table that has exactly 2 or 3 columns.
 The three-column 'table' image includes such key features:
     - Three columns header
-    - Headers like 'Topics', 'Content', 'Guidelines'
     - Possibly sections (e.g. 8.4, 9.1)
 The two-column 'table' image includes such key features:
     - Two columns
-    - Headers like 'Subject content' and 'Additional information'
-    - Possibly sections (e.g. 2.1, 3.4)
 If the image is a relevant table with 2 columns, respond with 'TWO_COLUMN'.
 If the image is a relevant table with 3 columns, respond with 'THREE_COLUMN'.
-If the image does not show a table at all, respond with 'NO_TABLE'.
 Return only one of these exact labels.
 """
             global _GEMINI_CLIENT
@@ -153,6 +276,8 @@ Return only one of these exact labels.
                     return "THREE_COLUMN"
                 elif "TWO" in classification:
                     return "TWO_COLUMN"
             return "NO_TABLE"
         except Exception as e:
             logger.error(f"Gemini table classification error: {e}")
@@ -172,54 +297,86 @@ def call_gemini_for_subtopic_identification_image(image_data: bytes, api_key: st
     for attempt in range(max_retries + 1):
         try:
             prompt = """
-            You are given an image from an educational curriculum specification. The image may contain either:
 1) A main topic heading in the format: "<number> <Topic Name>", for example "2 Algebra and functions continued".
 2) A subtopic heading in the format "<number>.<number>", for example "2.5", "2.6", or "3.4".
-3) Possibly no relevant text at all.
-Your task:
-1. If the cell shows a main topic, extract the topic name (e.g. "2 Algebra and functions") and place it in the JSON key "title".
-2. If the cell shows one or more subtopic numbers (e.g. "2.5", "2.6"), collect them in the JSON key "subtopics" as an array of strings.
-3. If neither a main topic nor subtopic is detected, return empty values.
-Output only valid JSON in this exact structure, with no extra text or explanation:
-Output only valid JSON in this exact structure, with no extra text or explanation:
-{
-  "title": "...",
-  "subtopics": [...]
-}
-Where:
-- "title" is the recognized main topic (if any). Otherwise, an empty string.
-- "subtopics" is an array of recognized subtopic numbers (e.g. ["2.5", "2.6"]). Otherwise, an empty array.
-Examples:
-1. If the image text is "2 Algebra and functions continued", return:
-{
-  "title": "2 Algebra and functions continued",
-  "subtopics": []
-}
-2. If the image text is "2.5 Solve linear and quadratic inequalities ...", return:
-{
-  "title": "",
-  "subtopics": ["2.5"]
-}
-3. If the image text is "2.6 Manipulate polynomials algebraically ...", return:
-{
-  "title": "",
-  "subtopics": ["2.6"]
-}
-If you cannot recognize any text matching these patterns, or if nothing is found, return:
-{
-  "title": "",
-  "subtopics": []
-}
 """
             global _GEMINI_CLIENT
             if _GEMINI_CLIENT is None:
                 _GEMINI_CLIENT = genai.Client(api_key=api_key)
@@ -242,13 +399,13 @@ If you cannot recognize any text matching these patterns, or if nothing is found
                 ],
                 config=types.GenerateContentConfig(temperature=0.0)
             )
-            # logger.info(f"Gemini subtopic extraction raw response: {resp.text if resp and resp.text else 'None'}")
             if not resp or not resp.text:
                 logger.warning("Gemini returned an empty response for subtopic extraction.")
                 return {"title": "", "subtopics": []}
             raw = resp.text.strip()
             raw = raw.replace("```json", "").replace("```", "").strip()
             data = json.loads(raw)
@@ -310,6 +467,10 @@ class S3ImageWriter(DataWriter):
                 info['final_alt'] = "HAS TO BE PROCESSED - two column table"
             elif cls == "THREE_COLUMN":
                 info['final_alt'] = "HAS TO BE PROCESSED - three column table"
             else:
                 info['final_alt'] = "NO_TABLE image"
             md_content = md_content.replace(f"![]({key}{p})", f"![{info['final_alt']}]({info['s3_path']})")
@@ -445,123 +606,6 @@ class S3ImageWriter(DataWriter):
     def post_process(self, key: str, md_content: str) -> str:
         return asyncio.run(self.post_process_async(key, md_content))
-class LocalImageWriter(DataWriter):
-    def __init__(self, output_folder: str, gemini_api_key: str):
-        self.output_folder = output_folder
-        os.makedirs(self.output_folder, exist_ok=True)
-        self.descriptions = {}
-        self._img_count = 0
-        self.gemini_api_key = gemini_api_key
-        self.extracted_tables = {}
-    def write(self, path: str, data: bytes) -> None:
-        self._img_count += 1
-        unique_id = f"img_{self._img_count}.jpg"
-        self.descriptions[path] = {
-            "data": data,
-            "relative_path": unique_id,
-            "table_classification": "NO_TABLE",
-            "final_alt": ""
-        }
-        image_path = os.path.join(self.output_folder, unique_id)
-        with open(image_path, "wb") as f:
-            f.write(data)
-    async def post_process_async(self, key: str, md_content: str) -> str:
-        logger.info("Classifying images to detect tables.")
-        tasks = []
-        for p, info in self.descriptions.items():
-            tasks.append((p, classify_image_async(info["data"], self.gemini_api_key)))
-        for p, task in tasks:
-            try:
-                classification = await task
-                self.descriptions[p]['table_classification'] = classification
-            except Exception as e:
-                logger.error(f"Table classification error: {e}")
-                self.descriptions[p]['table_classification'] = "NO_TABLE"
-        for p, info in self.descriptions.items():
-            cls = info['table_classification']
-            if cls == "TWO_COLUMN":
-                info['final_alt'] = "HAS TO BE PROCESSED - two column table"
-            elif cls == "THREE_COLUMN":
-                info['final_alt'] = "HAS TO BE PROCESSED - three column table"
-            else:
-                info['final_alt'] = "NO_TABLE image"
-            md_content = md_content.replace(f"![]({key}{p})", f"![{info['final_alt']}]({info['relative_path']})")
-        md_content = self._process_table_images_in_markdown(md_content)
-        final_lines = []
-        for line in md_content.split("\n"):
-            if re.match(r"^\!\[.*\]\(.*\)", line.strip()):
-                final_lines.append(line.strip())
-        return "\n".join(final_lines)
-    def _process_table_images_in_markdown(self, md_content: str) -> str:
-        pat = r"!\[HAS TO BE PROCESSED - (two|three) column table\]\(([^)]+)\)"
-        matches = re.findall(pat, md_content, flags=re.IGNORECASE)
-        if not matches:
-            return md_content
-        for (col_type, image_id) in matches:
-            logger.info(f"Processing table image => {image_id}, columns={col_type}")
-            temp_path = os.path.join(self.output_folder, image_id)
-            desc_item = None
-            for k, val in self.descriptions.items():
-                if val["relative_path"] == image_id:
-                    desc_item = val
-                    break
-            if not desc_item:
-                logger.warning(f"No matching image data for {image_id}, skipping extraction.")
-                continue
-            if not os.path.exists(temp_path):
-                with open(temp_path, "wb") as f:
-                    f.write(desc_item["data"])
-            try:
-                if col_type.lower() == 'two':       #check for table_row_extr script for more details
-                    extractor = TableExtractor(
-                        skip_header=True,
-                        merge_two_col_rows=True,
-                        enable_subtopic_merge=True,
-                        subtopic_threshold=0.2
-                    )
-                else:
-                    extractor = TableExtractor(
-                        skip_header=True,
-                        merge_two_col_rows=False,
-                        enable_subtopic_merge=False,
-                        subtopic_threshold=0.2
-                    )
-                row_boxes = extractor.process_image(temp_path)
-                out_folder = temp_path + "_rows"
-                os.makedirs(out_folder, exist_ok=True)
-                extractor.save_extracted_cells(temp_path, row_boxes, out_folder)
-                # List all extracted cell images relative to the output folder.
-                extracted_cells = []
-                for root, dirs, files in os.walk(out_folder):
-                    for file in files:
-                        rel_path = os.path.relpath(os.path.join(root, file), self.output_folder)
-                        extracted_cells.append(rel_path)
-                # Save mapping for testing.
-                self.extracted_tables[image_id] = extracted_cells
-                snippet = ["**Extracted table cells:**"]
-                for i, row in enumerate(row_boxes):
-                    row_dir = os.path.join(out_folder, f"row_{i}")
-                    for j, _ in enumerate(row):
-                        cell_file = f"col_{j}.jpg"
-                        cell_path = os.path.join(row_dir, cell_file)
-                        relp = os.path.relpath(cell_path, self.output_folder)
-                        snippet.append(f"![Row {i} Col {j}]({relp})")
-                new_snip = "\n".join(snippet)
-                old_line = f"![HAS TO BE PROCESSED - {col_type} column table]({image_id})"
-                md_content = md_content.replace(old_line, new_snip)
-            except Exception as e:
-                logger.error(f"Error processing table image {image_id}: {e}")
-            finally:
-                if os.path.exists(temp_path):
-                    os.remove(temp_path)
-        return md_content
-    def post_process(self, key: str, md_content: str) -> str:
-        return asyncio.run(self.post_process_async(key, md_content))
 class GeminiTopicExtractor:
     def __init__(self, api_key: str = None, num_pages: int = 14):
         self.api_key = api_key or os.getenv("GEMINI_API_KEY", "")
@@ -782,119 +826,6 @@ class MineruNoTextProcessor:
         except Exception as e:
             logger.error(f"Error during GPU cleanup: {e}")
-    def unify_topic_name(raw_title: str, children_subtopics: list) -> str:
-        """
-        Produce a cleaned-up topic name, removing any trailing '... continued'
-        and fixing partial or empty titles if it’s obvious from the subtopic numbering.
-        E.g. 'gonometry' with children '5.1', '5.2' → '5 Trigonometry'
-        """
-        title = raw_title.strip()
-        # Remove trailing " continued"
-        # E.g. "2 Algebra and functions continued" -> "2 Algebra and functions"
-        title = re.sub(r"\s+continued\s*$", "", title, flags=re.IGNORECASE)
-        # If the entire title is missing or obviously broken (like "gonometry"),
-        # guess a fix from the subtopics if they share a leading integer.
-        # e.g. if subtopics start with "5." => rename to "5 Trigonometry".
-        # You can add more sophisticated logic as needed.
-        if not title or title.lower().strip() in {"gonometry"}:
-            # Try to deduce from subtopic numbering
-            # Example: if children are "5.1", "5.2", that suggests a "5 Trigonometry"
-            all_subs = [child["title"] for child in children_subtopics]
-            # We'll parse the integer part from e.g. "5.1", "5.2"
-            # and guess "5 Trigonometry" if they're all "5.xxx".
-            if all_subs:
-                # Grab the first subtopic
-                first_sub = all_subs[0].strip()
-                m = re.match(r"^(\d+)\.", first_sub)
-                if m:
-                    parent_num = m.group(1)
-                    if parent_num == "5":
-                        title = "5 Trigonometry"
-                    elif parent_num == "2":
-                        title = "2 Algebra and functions"
-                    elif parent_num == "3":
-                        title = "3 Coordinate geometry in the (x, y) plane"
-                    elif parent_num == "4":
-                        title = "4 Statistical distributions"
-                    # etc., adapt to your needs
-                    # or leave as e.g. f"{parent_num} ???" if you cannot guess.
-        return title
-    def merge_topics(subtopic_list: list) -> list:
-        """
-        1. Cleans up each topic's title (remove " continued", fix partial titles).
-        2. Merges subtopics under the same cleaned-up parent name.
-        3. Sorts final output in ascending numeric order of the parent's leading number.
-        4. Sorts each parent's children in ascending numeric subtopic order.
-        """
-        # Dictionary keyed by *cleaned* parent title => {"title": "...", "contents": [...], "children": [...]}
-        merged = {}
-        for topic_obj in subtopic_list:
-            raw_title = topic_obj.get("title", "")
-            children = topic_obj.get("children", [])
-            contents = topic_obj.get("contents", [])
-            # Clean up the parent's title
-            new_title = unify_topic_name(raw_title, children)
-            # If we have already seen this (cleaned) title, merge
-            if new_title not in merged:
-                merged[new_title] = {
-                    "title": new_title,
-                    "contents": list(contents),  # copy
-                    "children": list(children),
-                }
-            else:
-                # Merge contents and children
-                merged[new_title]["contents"].extend(contents)
-                merged[new_title]["children"].extend(children)
-        # Next, for each parent's children, we might want to remove duplicates
-        # or unify them more. Here we simply unify if they have the same "title".
-        # If you have no duplicates, you can skip this loop.
-        for par_title, par_info in merged.items():
-            # Turn child list into map for merging
-            child_map = {}
-            for ch in par_info["children"]:
-                ctitle = ch.get("title", "").strip()
-                if ctitle not in child_map:
-                    child_map[ctitle] = ch
-                else:
-                    # Merge the "contents" and "children" if needed
-                    child_map[ctitle]["contents"].extend(ch.get("contents", []))
-                    child_map[ctitle]["children"].extend(ch.get("children", []))
-            # Overwrite the parent's children list with the merged versions
-            par_info["children"] = list(child_map.values())
-        # Sort the top-level topics by leading integer (e.g. "2 Algebra" < "5 Trigonometry")
-        # We'll parse the first integer from the parent's title, or push them last if no integer found.
-        def parse_parent_num(t):
-            match = re.match(r"^(\d+)", t)
-            return int(match.group(1)) if match else 9999
-        # Build the final list
-        final_list = list(merged.values())
-        final_list.sort(key=lambda x: parse_parent_num(x["title"]))
-        # Sort each parent's children by their numeric portion. E.g. "2.1" < "2.2" < "3.1"
-        def parse_subtopic_num(subtitle):
-            # "2.11" => (2, 11), "10.5" => (10, 5)
-            # or just parse all groups of digits
-            digits = re.findall(r"\d+", subtitle)
-            if not digits:
-                return (9999,)  # if no digits, push to end
-            return tuple(int(d) for d in digits)
-        for par_info in final_list:
-            par_info["children"].sort(key=lambda ch: parse_subtopic_num(ch["title"]))
-        return final_list
     def process(self, pdf_path: str) -> Dict[str, Any]:
         logger.info(f"Processing PDF: {pdf_path}")
         try:
@@ -972,9 +903,6 @@ class MineruNoTextProcessor:
             )
             #S3
             writer = S3ImageWriter(self.s3_writer, "/topic-extraction", self.gemini_api_key)
-            #local
-            # writer = LocalImageWriter(self.output_folder, self.gemini_api_key)
             md_prefix = "/topic-extraction/"
             pipe_result = inference.pipe_ocr_mode(writer, lang=self.language)
@@ -984,11 +912,7 @@ class MineruNoTextProcessor:
             subtopic_list = list(writer.extracted_subtopics.values())
             subtopic_list = merge_topics(subtopic_list)
-            # out_path = os.path.join(self.output_folder, "final_subtopics.json")
-            # with open(out_path, "w", encoding="utf-8") as f:
-            #     json.dump(subtopic_list, f, indent=2)
-            # logger.info(f"Final subtopics JSON saved locally at {out_path}")
-            out_path = os.path.join(self.output_folder, "final_subtopics.json")
             with open(out_path, "w", encoding="utf-8") as f:
                 json.dump(subtopic_list, f, indent=2)
             logger.info(f"Final subtopics JSON saved locally at {out_path}")

 _GEMINI_CLIENT = None
+#helper functions, also global
 def unify_whitespace(text: str) -> str:
     return re.sub(r"\s+", " ", text).strip()
     doc.close()
     return subset_bytes
+def unify_topic_name(raw_title: str, children_subtopics: list) -> str:
+    """
+    Clean up a topic title:
+    - Remove any trailing "continued".
+    - If the title does not start with a number but children provide a consistent numeric prefix,
+      then prepend that prefix.
+    """
+    title = raw_title.strip()
+    # Remove trailing "continued"
+    title = re.sub(r"\s+continued\s*$", "", title, flags=re.IGNORECASE)
+    # If title already starts with a number, use it as is.
+    if re.match(r"^\d+", title):
+        return title
+    # Otherwise, try to deduce a numeric prefix from the children.
+    prefixes = []
+    for child in children_subtopics:
+        child_title = child.get("title", "").strip()
+        m = re.match(r"^(\d+)\.", child_title)
+        if m:
+            prefixes.append(m.group(1))
+    # Handle known broken titles first: once a prefix is prepended below,
+    # "gonometry" becomes "5 gonometry" and would no longer match this check.
+    if title.lower() in {"gonometry"} and prefixes and prefixes[0] == "5":
+        return "5 Trigonometry"
+    # If all numeric prefixes in children are the same, use that prefix.
+    if prefixes and all(p == prefixes[0] for p in prefixes):
+        # If title is non-empty, prepend the number; otherwise, use a fallback.
+        title = f"{prefixes[0]} {title}" if title else f"{prefixes[0]} Topic"
+    return title
+def merge_topics(subtopic_list: list) -> list:
+    """
+    Merge topics with an enhanced logic:
+    1. Clean up each topic's title using unify_topic_name.
+    2. Group topics by the parent's numeric prefix (if available). Topics without a numeric prefix use their title.
+    3. Reassign children: for each child whose title (e.g. "3.1") does not match its current parent's numeric prefix,
+       move it to the parent with the matching prefix if available.
+    4. Remove duplicate children by merging contents.
+    5. Sort parent topics and each parent's children by their numeric ordering.
+    """
+    # First, merge topics by parent's numeric prefix.
+    merged = {}
+    for topic_obj in subtopic_list:
+        raw_title = topic_obj.get("title", "")
+        children = topic_obj.get("children", [])
+        contents = topic_obj.get("contents", [])
+        new_title = unify_topic_name(raw_title, children)
+        # Extract parent's numeric prefix, if present.
+        m = re.match(r"^(\d+)", new_title)
+        parent_prefix = m.group(1) if m else None
+        key = parent_prefix if parent_prefix is not None else new_title
+        if key not in merged:
+            merged[key] = {
+                "title": new_title,
+                "contents": list(contents),
+                "children": list(children),
+            }
+        else:
+            # Merge contents and children; choose the longer title.
+            if len(new_title) > len(merged[key]["title"]):
+                merged[key]["title"] = new_title
+            merged[key]["contents"].extend(contents)
+            merged[key]["children"].extend(children)
+    # Build a lookup of merged topics by their numeric prefix.
+    parent_lookup = merged  # keys are numeric prefixes or the full title for non-numeric ones.
+    # Reassign children to the correct parent based on their numeric prefix.
+    for key, topic in merged.items():
+        new_children = []
+        for child in topic["children"]:
+            child_title = child.get("title", "").strip()
+            m_child = re.match(r"^(\d+)\.", child_title)
+            if m_child:
+                child_prefix = m_child.group(1)
+                if key != child_prefix and child_prefix in parent_lookup:
+                    # Reassign this child to the proper parent.
+                    parent_lookup[child_prefix]["children"].append(child)
+                    continue
+            new_children.append(child)
+        topic["children"] = new_children
+    # Remove duplicate children by merging their contents.
+    for topic in merged.values():
+        child_map = {}
+        for child in topic["children"]:
+            ctitle = child.get("title", "").strip()
+            if ctitle not in child_map:
+                child_map[ctitle] = child
+            else:
+                child_map[ctitle]["contents"].extend(child.get("contents", []))
+                child_map[ctitle]["children"].extend(child.get("children", []))
+        topic["children"] = list(child_map.values())
+        # Sort children by full numeric order (e.g. "2.1" < "2.2" < "2.10").
+        def parse_subtopic_num(subtitle):
+            digits = re.findall(r"\d+", subtitle)
+            return tuple(int(d) for d in digits) if digits else (9999,)
+        topic["children"].sort(key=lambda ch: parse_subtopic_num(ch.get("title", "")))
+    # Convert merged topics to a sorted list.
+    def parse_parent_num(topic):
+        m = re.match(r"^(\d+)", topic.get("title", ""))
+        return int(m.group(1)) if m else 9999
+    final_list = list(merged.values())
+    final_list.sort(key=parse_parent_num)
+    return final_list
 class s3Writer:
     def __init__(self, ak: str, sk: str, bucket: str, endpoint_url: str):
         self.bucket = bucket
             prompt = """You are given an image. Determine if it shows a table that has exactly 2 or 3 columns.
 The three-column 'table' image includes such key features:
     - Three columns header
+    - Headers like 'Topics', 'Content', 'Guidelines', 'Amplification', 'Additional guidance notes', 'Area of Study'
     - Possibly sections (e.g. 8.4, 9.1)
 The two-column 'table' image includes such key features:
     - Two columns
+    - Headers like 'Subject content', 'Additional information'
+    - Possibly sections (e.g. 2.1, 3.4, G2, G3)
+The empty image includes such key features:
+    - Does not include anything at all (like a blank white or black image)
+    - Truncated image with words like 'Subject content', 'What students need to learn' with blue background.
+    - Truncated image with words like 'Topics', 'What students need to learn', 'Content' with grey background ((166, 166, 166) or (180,180,180) RGB color code).
+If the image is an empty image, respond with 'EMPTY_IMAGE'.
 If the image is a relevant table with 2 columns, respond with 'TWO_COLUMN'.
 If the image is a relevant table with 3 columns, respond with 'THREE_COLUMN'.
+If the image is non-empty but does not show a table, respond with 'NO_TABLE'.
 Return only one of these exact labels.
 """
             global _GEMINI_CLIENT
                     return "THREE_COLUMN"
                 elif "TWO" in classification:
                     return "TWO_COLUMN"
+                elif "EMPTY" in classification:
+                    return "EMPTY_IMAGE"
             return "NO_TABLE"
         except Exception as e:
             logger.error(f"Gemini table classification error: {e}")
     for attempt in range(max_retries + 1):
         try:
             prompt = """
+You are given an image from an educational curriculum specification. The image may contain:
 1) A main topic heading in the format: "<number> <Topic Name>", for example "2 Algebra and functions continued".
 2) A subtopic heading in the format "<number>.<number>", for example "2.5", "2.6", or "3.4".
+3) A label-like title in the left column of a two-column table, for example "G2", "G3", "Scarcity, choice and opportunity cost", or similar text without explicit numeric patterns (2.1, 3.4, etc.).
+4) Possibly no relevant text at all.
+Your task is to extract:
+- **"title"**: A recognized main topic or heading text.
+- **"subtopics"**: Any recognized subtopic numbers (e.g. "2.5", "2.6", "3.4"), as an array of strings.
+Follow these rules:
+(1) **If the cell shows a main topic in the format "<number> <Topic Name>",** for example "2 Algebra and functions continued", then:
+    - Put that text (without the word "continued") in "title". (e.g. "2 Algebra and functions")
+    - "subtopics" should be an empty array, unless you also see smaller subtopic numbers.
+(2) **If the cell shows one or more subtopic numbers** in the format "<number>.<number>", for example "2.5", "2.6", or "3.4", then:
+    - Collect those exact strings in the JSON key "subtopics" (an array of strings).
+    - "title" in this case should be an empty string if you only detect subtopics.
+      (Example: If text is "2.5 Solve linear inequalities...", then "title" = "", "subtopics" = ["2.5"]).
+(3) **If neither a main topic nor a subtopic is detected,** return empty values:
+    {
+      "title": "",
+      "subtopics": []
+    }
+(4) **If there is no numeric value in the left column** (e.g. "2.1" or "2 <Topic name>" not found) but the left column text appears to be a heading (for instance "Scarcity, choice and opportunity cost"), then:
+    - Use the **left column text** as "title".
+    - "subtopics" remains empty.
+    Example:
+    If the left column is "Scarcity, choice and opportunity cost" and the right column has definitions, your output is:
+    {
+      "title": "Scarcity, choice and opportunity cost",
+      "subtopics": []
+    }
+(5) **If there is a character + digit pattern** in the left column for a two-column table (for example "G2", "G3", "G4", "C1"), treat that as a topic-like label:
+    - Put that label text into "title" (e.g. "G2").
+    - "subtopics" remains empty unless you also see actual subtopic formats like "2.5", "3.4" inside the same cell.
+(6) **Output must be valid JSON** in this exact structure, with no extra text or explanation:
+    {
+      "title": "...",
+      "subtopics": [...]
+    }
+**Examples**:
+- If the image text is `"2 Algebra and functions continued"`, return:
+  {
+    "title": "2 Algebra and functions",
+    "subtopics": []
+  }
+- If the image text is `"2.5 Solve linear and quadratic inequalities ..."`, return:
+  {
+    "title": "",
+    "subtopics": ["2.5"]
+  }
+- If the image text is `"Scarcity, choice and opportunity cost"` (with no numeric patterns at all), return:
+  {
+    "title": "Scarcity, choice and opportunity cost",
+    "subtopics": []
+  }
+- If the left column says `"G2"` and the right column has details, but no subtopic numbers, return:
+  {
+    "title": "G2",
+    "subtopics": []
+  }
+- If you cannot recognize any text matching these patterns, or if nothing is found, return:
+  {
+    "title": "",
+    "subtopics": []
+  }
 """
             global _GEMINI_CLIENT
             if _GEMINI_CLIENT is None:
                 _GEMINI_CLIENT = genai.Client(api_key=api_key)
                 ],
                 config=types.GenerateContentConfig(temperature=0.0)
             )
             if not resp or not resp.text:
                 logger.warning("Gemini returned an empty response for subtopic extraction.")
                 return {"title": "", "subtopics": []}
             raw = resp.text.strip()
+            # Remove any markdown fences if present
             raw = raw.replace("```json", "").replace("```", "").strip()
             data = json.loads(raw)
                 info['final_alt'] = "HAS TO BE PROCESSED - two column table"
             elif cls == "THREE_COLUMN":
                 info['final_alt'] = "HAS TO BE PROCESSED - three column table"
+            elif cls == "EMPTY_IMAGE":
+                md_content = md_content.replace(f"![]({key}{p})", "")
+                del self.descriptions[p]
+                continue
             else:
                 info['final_alt'] = "NO_TABLE image"
             md_content = md_content.replace(f"![]({key}{p})", f"![{info['final_alt']}]({info['s3_path']})")
     def post_process(self, key: str, md_content: str) -> str:
         return asyncio.run(self.post_process_async(key, md_content))
 class GeminiTopicExtractor:
     def __init__(self, api_key: str = None, num_pages: int = 14):
         self.api_key = api_key or os.getenv("GEMINI_API_KEY", "")
         except Exception as e:
             logger.error(f"Error during GPU cleanup: {e}")
     def process(self, pdf_path: str) -> Dict[str, Any]:
         logger.info(f"Processing PDF: {pdf_path}")
         try:
             )
             #S3
             writer = S3ImageWriter(self.s3_writer, "/topic-extraction", self.gemini_api_key)
             md_prefix = "/topic-extraction/"
             pipe_result = inference.pipe_ocr_mode(writer, lang=self.language)
             subtopic_list = list(writer.extracted_subtopics.values())
             subtopic_list = merge_topics(subtopic_list)
+            out_path = os.path.join(self.output_folder, "subtopics.json")
             with open(out_path, "w", encoding="utf-8") as f:
                 json.dump(subtopic_list, f, indent=2)
             logger.info(f"Final subtopics JSON saved locally at {out_path}")
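
The grouping and ordering that `merge_topics` performs before `subtopics.json` is written can be tried in isolation. The following is a minimal sketch, not the commit's code: `group_by_prefix`, `subtopic_sort_key`, `parent_sort_key` and the sample data are hypothetical names introduced here, and the child-reassignment and duplicate-merging steps are left out. It only shows that "continued" pages collapse into one parent and that children sort numerically rather than lexicographically.

```python
import re


def subtopic_sort_key(title: str) -> tuple:
    """Turn "2.10" into (2, 10) so that sorting is numeric, not lexicographic."""
    digits = re.findall(r"\d+", title)
    return tuple(int(d) for d in digits) if digits else (9999,)


def parent_sort_key(title: str) -> int:
    """Sort parents by their leading integer; titles without one go last."""
    m = re.match(r"^(\d+)", title)
    return int(m.group(1)) if m else 9999


def group_by_prefix(topics: list) -> list:
    """Hypothetical, simplified stand-in for merge_topics (assumption, not the original)."""
    merged = {}
    for topic in topics:
        # Drop a trailing "continued" so split pages share one title.
        title = re.sub(r"\s+continued\s*$", "", topic.get("title", "").strip(),
                       flags=re.IGNORECASE)
        m = re.match(r"^(\d+)", title)
        key = m.group(1) if m else title
        entry = merged.setdefault(key, {"title": title, "contents": [], "children": []})
        # Keep the longer title when two pages disagree.
        if len(title) > len(entry["title"]):
            entry["title"] = title
        entry["contents"].extend(topic.get("contents", []))
        entry["children"].extend(topic.get("children", []))
    for entry in merged.values():
        entry["children"].sort(key=lambda ch: subtopic_sort_key(ch.get("title", "")))
    return sorted(merged.values(), key=lambda t: parent_sort_key(t["title"]))


# Made-up sample input, shaped like the entries in subtopics.json.
sample = [
    {"title": "2 Algebra and functions", "contents": [],
     "children": [{"title": "2.10"}, {"title": "2.2"}]},
    {"title": "1 Proof", "contents": [], "children": [{"title": "1.1"}]},
    {"title": "2 Algebra and functions continued", "contents": [],
     "children": [{"title": "2.1"}]},
]
result = group_by_prefix(sample)
for parent in result:
    print(parent["title"], [child["title"] for child in parent["children"]])
```

Comparing tuples of integers is what places "2.2" before "2.10"; a plain string sort would put "2.10" first.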

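
The log that follows records several `Expecting value: line 1 column 1 (char 0)` errors from the subtopic step. That is the message `json.loads` raises when the text it receives is empty or does not begin with a JSON value, so stripping markdown fences alone may not be enough. Below is a hedged sketch of a more tolerant parser. `parse_subtopic_reply` is a hypothetical helper, not part of the commit, and it assumes the reply contains at most one JSON object, possibly surrounded by prose or fences.

```python
import json

# Built at runtime so the literal fence marker never appears in this example.
FENCE = "`" * 3


def empty_result() -> dict:
    """Fresh copy of the fallback the extraction code already returns."""
    return {"title": "", "subtopics": []}


def parse_subtopic_reply(raw: str) -> dict:
    """Hypothetical helper: parse a model reply without raising on bad input."""
    text = (raw or "").replace(FENCE + "json", "").replace(FENCE, "").strip()
    # Keep only the outermost {...} span in case prose surrounds the JSON.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return empty_result()
    try:
        data = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return empty_result()
    if not isinstance(data, dict):
        return empty_result()
    subtopics = data.get("subtopics")
    if not isinstance(subtopics, list):
        subtopics = []
    return {
        "title": str(data.get("title") or "").strip(),
        "subtopics": [str(s) for s in subtopics],
    }


print(parse_subtopic_reply(""))
print(parse_subtopic_reply('Sure: {"title": "G2", "subtopics": []}'))
```

Returning the empty result instead of raising avoids spending a retry on a reply that will never parse; whether retrying is still preferable depends on how often the model recovers on a second attempt, which the log alone does not show.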
topic_extraction.log CHANGED Viewed

@@ -5558,3 +5558,314 @@ and series'. Using page 7.
 2025-03-03 18:09:13,257 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_28.jpg_r2_c0.png
 2025-03-03 18:09:15,022 [INFO] __main__ - GPU memory cleaned up.
 2025-03-03 18:09:15,023 [ERROR] __main__ - Processing failed: name 'merge_topics' is not defined

+2025-03-04 14:56:39,218 [INFO] __main__ - Processing PDF: /home/user/app/input_output/a-level-pearson-mathematics-specification.pdf
+2025-03-04 14:56:40,018 [INFO] __main__ - Gemini returned subtopics: {'Paper 1 and Paper 2: Pure Mathematics': [11, 29], 'Paper 3: Statistics and Mechanics': [30, 40]}
+2025-03-04 14:56:40,019 [INFO] __main__ - Loaded 1135473 bytes from local file '/home/user/app/input_output/a-level-pearson-mathematics-specification.pdf'
+2025-03-04 14:56:40,316 [INFO] __main__ - Computed global offset: 4
+2025-03-04 14:56:40,316 [INFO] __main__ - Processing pages (0-based): [14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43]
+2025-03-04 14:58:48,246 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_1.jpg
+2025-03-04 14:58:50,037 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_2.jpg
+2025-03-04 14:58:50,583 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_3.jpg
+2025-03-04 14:58:51,114 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_4.jpg
+2025-03-04 14:58:51,657 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_5.jpg
+2025-03-04 14:58:52,211 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_6.jpg
+2025-03-04 14:58:52,686 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_7.jpg
+2025-03-04 14:58:53,167 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_8.jpg
+2025-03-04 14:58:53,667 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_9.jpg
+2025-03-04 14:58:54,285 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_10.jpg
+2025-03-04 14:58:54,850 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_11.jpg
+2025-03-04 14:58:55,401 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_12.jpg
+2025-03-04 14:58:55,916 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_13.jpg
+2025-03-04 14:58:56,524 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_14.jpg
+2025-03-04 14:58:56,999 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_15.jpg
+2025-03-04 14:58:57,542 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_16.jpg
+2025-03-04 14:58:58,071 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_17.jpg
+2025-03-04 14:58:58,366 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_18.jpg
+2025-03-04 14:58:58,849 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_19.jpg
+2025-03-04 14:58:59,428 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_20.jpg
+2025-03-04 14:58:59,995 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_21.jpg
+2025-03-04 14:59:00,597 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_22.jpg
+2025-03-04 14:59:01,070 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_23.jpg
+2025-03-04 14:59:01,567 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_24.jpg
+2025-03-04 14:59:02,141 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_25.jpg
+2025-03-04 14:59:02,569 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_26.jpg
+2025-03-04 14:59:03,024 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_27.jpg
+2025-03-04 14:59:03,607 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_28.jpg
+2025-03-04 14:59:04,016 [INFO] __main__ - Classifying images to detect tables.
+2025-03-04 14:59:20,581 [INFO] __main__ - Processing table image: /topic-extraction/img_1.jpg, columns=three
+2025-03-04 14:59:23,252 [WARNING] __main__ - Cell image not found: /tmp/tmpijzc040v.jpg_rows/row_0/col_0.png
+2025-03-04 14:59:23,252 [WARNING] __main__ - Cell image not found: /tmp/tmpijzc040v.jpg_rows/row_0/col_1.png
+2025-03-04 14:59:23,748 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_1.jpg_r1_c0.png
+2025-03-04 14:59:25,146 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_1.jpg_r1_c1.png
+2025-03-04 14:59:26,469 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_1.jpg_r2_c0.png
+2025-03-04 14:59:27,272 [INFO] __main__ - Processing table image: /topic-extraction/img_2.jpg, columns=three
+2025-03-04 14:59:30,158 [WARNING] __main__ - Cell image not found: /tmp/tmplbse6rk2.jpg_rows/row_0/col_0.png
+2025-03-04 14:59:30,158 [WARNING] __main__ - Cell image not found: /tmp/tmplbse6rk2.jpg_rows/row_0/col_1.png
+2025-03-04 14:59:30,420 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_2.jpg_r1_c0.png
+2025-03-04 14:59:31,612 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_2.jpg_r1_c1.png
+2025-03-04 14:59:34,174 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_2.jpg_r2_c0.png
+2025-03-04 14:59:35,585 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_2.jpg_r3_c0.png
+2025-03-04 14:59:36,908 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_2.jpg_r4_c0.png
+2025-03-04 14:59:38,024 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_2.jpg_r5_c0.png
+2025-03-04 14:59:38,783 [INFO] __main__ - Processing table image: /topic-extraction/img_3.jpg, columns=three
+2025-03-04 14:59:41,887 [WARNING] __main__ - Cell image not found: /tmp/tmp9jfrqv6f.jpg_rows/row_0/col_0.png
+2025-03-04 14:59:41,887 [WARNING] __main__ - Cell image not found: /tmp/tmp9jfrqv6f.jpg_rows/row_0/col_1.png
+2025-03-04 14:59:42,148 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_3.jpg_r1_c0.png
+2025-03-04 14:59:43,551 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_3.jpg_r1_c1.png
+2025-03-04 14:59:45,241 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_3.jpg_r2_c0.png
+2025-03-04 14:59:46,499 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_3.jpg_r3_c0.png
+2025-03-04 14:59:47,500 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_3.jpg_r4_c0.png
+2025-03-04 14:59:48,309 [INFO] __main__ - Processing table image: /topic-extraction/img_4.jpg, columns=three
+2025-03-04 14:59:51,311 [WARNING] __main__ - Cell image not found: /tmp/tmpbrv43l7_.jpg_rows/row_0/col_0.png
+2025-03-04 14:59:51,311 [WARNING] __main__ - Cell image not found: /tmp/tmpbrv43l7_.jpg_rows/row_0/col_1.png
+2025-03-04 14:59:51,311 [WARNING] __main__ - Cell image not found: /tmp/tmpbrv43l7_.jpg_rows/row_1/col_0.png
+2025-03-04 14:59:51,311 [WARNING] __main__ - Cell image not found: /tmp/tmpbrv43l7_.jpg_rows/row_1/col_1.png
+2025-03-04 14:59:51,579 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_4.jpg_r2_c0.png
+2025-03-04 14:59:53,042 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_4.jpg_r2_c1.png
+2025-03-04 14:59:54,470 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_4.jpg_r3_c0.png
+2025-03-04 14:59:55,460 [INFO] __main__ - Processing table image: /topic-extraction/img_5.jpg, columns=three
+2025-03-04 14:59:58,401 [WARNING] __main__ - Cell image not found: /tmp/tmpdj8vn5v4.jpg_rows/row_0/col_0.png
+2025-03-04 14:59:58,401 [WARNING] __main__ - Cell image not found: /tmp/tmpdj8vn5v4.jpg_rows/row_0/col_1.png
+2025-03-04 14:59:58,659 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_5.jpg_r1_c0.png
+2025-03-04 15:00:00,036 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_5.jpg_r1_c1.png
+2025-03-04 15:00:01,411 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_5.jpg_r2_c0.png
+2025-03-04 15:00:02,747 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_5.jpg_r3_c0.png
+2025-03-04 15:00:03,656 [INFO] __main__ - Processing table image: /topic-extraction/img_6.jpg, columns=three
+2025-03-04 15:00:06,880 [WARNING] __main__ - Cell image not found: /tmp/tmpw4hdm_vm.jpg_rows/row_0/col_0.png
+2025-03-04 15:00:06,881 [WARNING] __main__ - Cell image not found: /tmp/tmpw4hdm_vm.jpg_rows/row_0/col_1.png
+2025-03-04 15:00:07,144 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_6.jpg_r1_c0.png
+2025-03-04 15:00:08,578 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_6.jpg_r1_c1.png
+2025-03-04 15:00:09,789 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_6.jpg_r2_c0.png
+2025-03-04 15:00:12,763 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_6.jpg_r2_c1.png
+2025-03-04 15:00:14,173 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_6.jpg_r3_c0.png
+2025-03-04 15:00:15,229 [INFO] __main__ - Processing table image: /topic-extraction/img_7.jpg, columns=three
+2025-03-04 15:00:18,336 [WARNING] __main__ - Cell image not found: /tmp/tmpier2e_jn.jpg_rows/row_0/col_0.png
+2025-03-04 15:00:18,336 [WARNING] __main__ - Cell image not found: /tmp/tmpier2e_jn.jpg_rows/row_0/col_1.png
+2025-03-04 15:00:18,607 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_7.jpg_r1_c0.png
+2025-03-04 15:00:19,964 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_7.jpg_r1_c1.png
+2025-03-04 15:00:21,423 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_7.jpg_r2_c0.png
+2025-03-04 15:00:22,514 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_7.jpg_r3_c0.png
+2025-03-04 15:00:23,784 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_7.jpg_r3_c1.png
+2025-03-04 15:00:25,023 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_7.jpg_r4_c0.png
+2025-03-04 15:00:26,014 [INFO] __main__ - Processing table image: /topic-extraction/img_8.jpg, columns=three
+2025-03-04 15:00:30,110 [WARNING] __main__ - Cell image not found: /tmp/tmpwzp5zo9m.jpg_rows/row_0/col_0.png
+2025-03-04 15:00:30,295 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_8.jpg_r0_c1.png
+2025-03-04 15:00:30,957 [WARNING] __main__ - Cell image not found: /tmp/tmpwzp5zo9m.jpg_rows/row_1/col_0.png
+2025-03-04 15:00:30,958 [WARNING] __main__ - Cell image not found: /tmp/tmpwzp5zo9m.jpg_rows/row_1/col_1.png
+2025-03-04 15:00:30,958 [WARNING] __main__ - Cell image not found: /tmp/tmpwzp5zo9m.jpg_rows/row_1/col_2.png
+2025-03-04 15:00:31,219 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_8.jpg_r2_c0.png
+2025-03-04 15:00:32,311 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_8.jpg_r2_c1.png
+2025-03-04 15:00:33,619 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_8.jpg_r2_c2.png
+2025-03-04 15:00:34,694 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_8.jpg_r3_c0.png
+2025-03-04 15:00:35,762 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_8.jpg_r3_c1.png
+2025-03-04 15:00:36,796 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_8.jpg_r4_c0.png
+2025-03-04 15:00:37,972 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_8.jpg_r4_c1.png
+2025-03-04 15:00:39,110 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_8.jpg_r5_c0.png
+2025-03-04 15:00:40,404 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_8.jpg_r5_c1.png
+2025-03-04 15:00:41,716 [ERROR] __main__ - Gemini subtopic identification error on attempt 0: Expecting value: line 1 column 1 (char 0)
+2025-03-04 15:00:43,487 [ERROR] __main__ - Gemini subtopic identification error on attempt 1: Expecting value: line 1 column 1 (char 0)
+2025-03-04 15:00:43,665 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_8.jpg_r6_c0.png
+2025-03-04 15:00:44,879 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_8.jpg_r6_c1.png
+2025-03-04 15:00:45,862 [ERROR] __main__ - Gemini subtopic identification error on attempt 0: Expecting value: line 1 column 1 (char 0)
+2025-03-04 15:00:47,337 [ERROR] __main__ - Gemini subtopic identification error on attempt 1: Expecting value: line 1 column 1 (char 0)
+2025-03-04 15:00:47,338 [WARNING] __main__ - Cell image not found: /tmp/tmpwzp5zo9m.jpg_rows/row_7/col_0.png
+2025-03-04 15:00:47,338 [INFO] __main__ - Processing table image: /topic-extraction/img_9.jpg, columns=three
+2025-03-04 15:00:50,852 [WARNING] __main__ - Cell image not found: /tmp/tmp45kbg898.jpg_rows/row_0/col_0.png
+2025-03-04 15:00:50,853 [WARNING] __main__ - Cell image not found: /tmp/tmp45kbg898.jpg_rows/row_0/col_1.png
+2025-03-04 15:00:50,853 [WARNING] __main__ - Cell image not found: /tmp/tmp45kbg898.jpg_rows/row_0/col_2.png
+2025-03-04 15:00:52,290 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_9.jpg_r1_c0.png
+2025-03-04 15:00:53,354 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_9.jpg_r1_c1.png
+2025-03-04 15:00:54,709 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_9.jpg_r1_c2.png
+2025-03-04 15:00:55,877 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_9.jpg_r2_c0.png
+2025-03-04 15:00:57,178 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_9.jpg_r2_c1.png
+2025-03-04 15:00:58,304 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_9.jpg_r3_c0.png
+2025-03-04 15:00:59,735 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_9.jpg_r3_c1.png
+2025-03-04 15:01:00,944 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_9.jpg_r4_c0.png
+2025-03-04 15:01:02,239 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_9.jpg_r4_c1.png
+2025-03-04 15:01:03,416 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_9.jpg_r5_c0.png
+2025-03-04 15:01:04,618 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_9.jpg_r5_c1.png
+2025-03-04 15:01:05,434 [INFO] __main__ - Processing table image: /topic-extraction/img_10.jpg, columns=three
+2025-03-04 15:01:08,588 [WARNING] __main__ - Cell image not found: /tmp/tmpqskyhmda.jpg_rows/row_0/col_0.png
+2025-03-04 15:01:08,588 [WARNING] __main__ - Cell image not found: /tmp/tmpqskyhmda.jpg_rows/row_0/col_1.png
+2025-03-04 15:01:08,855 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_10.jpg_r1_c0.png
+2025-03-04 15:01:10,100 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_10.jpg_r1_c1.png
+2025-03-04 15:01:11,458 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_10.jpg_r2_c0.png
+2025-03-04 15:01:13,002 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_10.jpg_r3_c0.png
+2025-03-04 15:01:14,421 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_10.jpg_r4_c0.png
+2025-03-04 15:01:15,795 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_10.jpg_r5_c0.png
+2025-03-04 15:01:16,778 [INFO] __main__ - Processing table image: /topic-extraction/img_11.jpg, columns=two
+2025-03-04 15:01:19,849 [WARNING] __main__ - Cell image not found: /tmp/tmpragajvqv.jpg_rows/row_0/col_0.png
+2025-03-04 15:01:20,292 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_11.jpg_r1_c0.png
+2025-03-04 15:01:21,681 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_11.jpg_r2_c0.png
+2025-03-04 15:01:23,001 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_11.jpg_r3_c0.png
+2025-03-04 15:01:24,256 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_11.jpg_r4_c0.png
+2025-03-04 15:01:25,614 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_11.jpg_r5_c0.png
+2025-03-04 15:01:26,879 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_11.jpg_r6_c0.png
+2025-03-04 15:01:28,027 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_11.jpg_r7_c0.png
+2025-03-04 15:01:28,867 [INFO] __main__ - Processing table image: /topic-extraction/img_12.jpg, columns=three
+2025-03-04 15:01:31,707 [WARNING] __main__ - Cell image not found: /tmp/tmptajrb9oq.jpg_rows/row_0/col_0.png
+2025-03-04 15:01:31,708 [WARNING] __main__ - Cell image not found: /tmp/tmptajrb9oq.jpg_rows/row_0/col_1.png
+2025-03-04 15:01:31,708 [WARNING] __main__ - Cell image not found: /tmp/tmptajrb9oq.jpg_rows/row_1/col_0.png
+2025-03-04 15:01:31,708 [WARNING] __main__ - Cell image not found: /tmp/tmptajrb9oq.jpg_rows/row_1/col_1.png
+2025-03-04 15:01:31,968 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_12.jpg_r2_c0.png
+2025-03-04 15:01:33,379 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_12.jpg_r2_c1.png
+2025-03-04 15:01:34,597 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_12.jpg_r3_c0.png
+2025-03-04 15:01:35,923 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_12.jpg_r3_c1.png
+2025-03-04 15:01:37,229 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_12.jpg_r4_c0.png
+2025-03-04 15:01:38,254 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_12.jpg_r5_c0.png
+2025-03-04 15:01:39,166 [INFO] __main__ - Processing table image: /topic-extraction/img_13.jpg, columns=three
+2025-03-04 15:01:42,003 [WARNING] __main__ - Cell image not found: /tmp/tmpzd8rmysx.jpg_rows/row_0/col_0.png
+2025-03-04 15:01:42,004 [WARNING] __main__ - Cell image not found: /tmp/tmpzd8rmysx.jpg_rows/row_0/col_1.png
+2025-03-04 15:01:42,004 [WARNING] __main__ - Cell image not found: /tmp/tmpzd8rmysx.jpg_rows/row_1/col_0.png
+2025-03-04 15:01:42,004 [WARNING] __main__ - Cell image not found: /tmp/tmpzd8rmysx.jpg_rows/row_1/col_1.png
+2025-03-04 15:01:42,258 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_13.jpg_r2_c0.png
+2025-03-04 15:01:43,581 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_13.jpg_r2_c1.png
+2025-03-04 15:01:44,840 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_13.jpg_r3_c0.png
+2025-03-04 15:01:46,192 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_13.jpg_r4_c0.png
+2025-03-04 15:01:47,564 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_13.jpg_r5_c0.png
+2025-03-04 15:01:48,735 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_13.jpg_r6_c0.png
+2025-03-04 15:01:49,480 [INFO] __main__ - Processing table image: /topic-extraction/img_14.jpg, columns=three
+2025-03-04 15:01:53,309 [WARNING] __main__ - Cell image not found: /tmp/tmp6agbobyu.jpg_rows/row_0/col_0.png
+2025-03-04 15:01:53,310 [WARNING] __main__ - Cell image not found: /tmp/tmp6agbobyu.jpg_rows/row_0/col_1.png
+2025-03-04 15:01:53,583 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_14.jpg_r1_c0.png
+2025-03-04 15:01:54,959 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_14.jpg_r1_c1.png
+2025-03-04 15:01:56,286 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_14.jpg_r2_c0.png
+2025-03-04 15:01:57,618 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_14.jpg_r3_c0.png
+2025-03-04 15:01:58,711 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_14.jpg_r4_c0.png
+2025-03-04 15:01:59,972 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_14.jpg_r4_c1.png
+2025-03-04 15:02:01,443 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_14.jpg_r5_c0.png
+2025-03-04 15:02:02,711 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_14.jpg_r6_c0.png
+2025-03-04 15:02:03,674 [INFO] __main__ - Processing table image: /topic-extraction/img_15.jpg, columns=three
+2025-03-04 15:02:06,780 [WARNING] __main__ - Cell image not found: /tmp/tmp3lbuxp25.jpg_rows/row_0/col_0.png
+2025-03-04 15:02:06,781 [WARNING] __main__ - Cell image not found: /tmp/tmp3lbuxp25.jpg_rows/row_0/col_1.png
+2025-03-04 15:02:07,040 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_15.jpg_r1_c0.png
+2025-03-04 15:02:08,455 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_15.jpg_r1_c1.png
+2025-03-04 15:02:09,838 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_15.jpg_r2_c0.png
+2025-03-04 15:02:11,221 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_15.jpg_r3_c0.png
+2025-03-04 15:02:12,570 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_15.jpg_r4_c0.png
+2025-03-04 15:02:13,800 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_15.jpg_r5_c0.png
+2025-03-04 15:02:14,741 [INFO] __main__ - Processing table image: /topic-extraction/img_16.jpg, columns=three
+2025-03-04 15:02:18,051 [WARNING] __main__ - Cell image not found: /tmp/tmpqve047e1.jpg_rows/row_0/col_0.png
+2025-03-04 15:02:18,051 [WARNING] __main__ - Cell image not found: /tmp/tmpqve047e1.jpg_rows/row_0/col_1.png
+2025-03-04 15:02:18,051 [WARNING] __main__ - Cell image not found: /tmp/tmpqve047e1.jpg_rows/row_1/col_0.png
+2025-03-04 15:02:18,052 [WARNING] __main__ - Cell image not found: /tmp/tmpqve047e1.jpg_rows/row_1/col_1.png
+2025-03-04 15:02:18,310 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_16.jpg_r2_c0.png
+2025-03-04 15:02:19,484 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_16.jpg_r2_c1.png
+2025-03-04 15:02:20,750 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_16.jpg_r3_c0.png
+2025-03-04 15:02:21,962 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_16.jpg_r4_c0.png
+2025-03-04 15:02:23,279 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_16.jpg_r4_c1.png
+2025-03-04 15:02:24,677 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_16.jpg_r5_c0.png
+2025-03-04 15:02:25,990 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_16.jpg_r6_c0.png
+2025-03-04 15:02:27,144 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_16.jpg_r7_c0.png
+2025-03-04 15:02:27,953 [INFO] __main__ - Processing table image: /topic-extraction/img_17.jpg, columns=three
+2025-03-04 15:02:31,142 [WARNING] __main__ - Cell image not found: /tmp/tmp580zpmu1.jpg_rows/row_0/col_0.png
+2025-03-04 15:02:31,142 [WARNING] __main__ - Cell image not found: /tmp/tmp580zpmu1.jpg_rows/row_0/col_1.png
+2025-03-04 15:02:31,397 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_17.jpg_r1_c0.png
+2025-03-04 15:02:32,685 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_17.jpg_r1_c1.png
+2025-03-04 15:02:34,235 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_17.jpg_r2_c0.png
+2025-03-04 15:02:35,330 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_17.jpg_r3_c0.png
+2025-03-04 15:02:36,635 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_17.jpg_r3_c1.png
+2025-03-04 15:02:37,985 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_17.jpg_r4_c0.png
+2025-03-04 15:02:39,401 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_17.jpg_r5_c0.png
+2025-03-04 15:02:40,763 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_17.jpg_r6_c0.png
+2025-03-04 15:02:41,985 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_17.jpg_r7_c0.png
+2025-03-04 15:02:42,875 [INFO] __main__ - Processing table image: /topic-extraction/img_18.jpg, columns=three
+2025-03-04 15:02:43,771 [WARNING] __main__ - Cell image not found: /tmp/tmpccm4skpd.jpg_rows/row_0/col_0.png
+2025-03-04 15:02:43,772 [WARNING] __main__ - Cell image not found: /tmp/tmpccm4skpd.jpg_rows/row_0/col_1.png
+2025-03-04 15:02:43,772 [WARNING] __main__ - Cell image not found: /tmp/tmpccm4skpd.jpg_rows/row_1/col_0.png
+2025-03-04 15:02:43,772 [WARNING] __main__ - Cell image not found: /tmp/tmpccm4skpd.jpg_rows/row_1/col_1.png
+2025-03-04 15:02:44,032 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_18.jpg_r2_c0.png
+2025-03-04 15:02:45,366 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_18.jpg_r2_c1.png
+2025-03-04 15:02:46,585 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_18.jpg_r3_c0.png
+2025-03-04 15:02:47,559 [INFO] __main__ - Processing table image: /topic-extraction/img_19.jpg, columns=three
+2025-03-04 15:02:50,123 [WARNING] __main__ - Cell image not found: /tmp/tmpclhr29f1.jpg_rows/row_0/col_0.png
+2025-03-04 15:02:50,124 [WARNING] __main__ - Cell image not found: /tmp/tmpclhr29f1.jpg_rows/row_0/col_1.png
+2025-03-04 15:02:50,124 [WARNING] __main__ - Cell image not found: /tmp/tmpclhr29f1.jpg_rows/row_1/col_0.png
+2025-03-04 15:02:50,124 [WARNING] __main__ - Cell image not found: /tmp/tmpclhr29f1.jpg_rows/row_1/col_1.png
+2025-03-04 15:02:50,378 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_19.jpg_r2_c0.png
+2025-03-04 15:02:51,859 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_19.jpg_r2_c1.png
+2025-03-04 15:02:53,257 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_19.jpg_r3_c0.png
+2025-03-04 15:02:54,584 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_19.jpg_r3_c1.png
+2025-03-04 15:02:55,736 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_19.jpg_r4_c0.png
+2025-03-04 15:02:56,672 [INFO] __main__ - Processing table image: /topic-extraction/img_20.jpg, columns=three
+2025-03-04 15:03:00,454 [WARNING] __main__ - Cell image not found: /tmp/tmptx9dz9xc.jpg_rows/row_0/col_0.png
+2025-03-04 15:03:00,454 [WARNING] __main__ - Cell image not found: /tmp/tmptx9dz9xc.jpg_rows/row_0/col_1.png
+2025-03-04 15:03:00,737 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_20.jpg_r1_c0.png
+2025-03-04 15:03:02,337 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_20.jpg_r1_c1.png
+2025-03-04 15:03:03,839 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_20.jpg_r2_c0.png
+2025-03-04 15:03:04,889 [INFO] __main__ - Processing table image: /topic-extraction/img_21.jpg, columns=three
+2025-03-04 15:03:08,043 [WARNING] __main__ - Cell image not found: /tmp/tmp18_5p4lj.jpg_rows/row_0/col_0.png
+2025-03-04 15:03:08,044 [WARNING] __main__ - Cell image not found: /tmp/tmp18_5p4lj.jpg_rows/row_0/col_1.png
+2025-03-04 15:03:08,322 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_21.jpg_r1_c0.png
+2025-03-04 15:03:09,913 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_21.jpg_r1_c1.png
+2025-03-04 15:03:11,063 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_21.jpg_r2_c0.png
+2025-03-04 15:03:12,387 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_21.jpg_r2_c1.png
+2025-03-04 15:03:13,743 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_21.jpg_r3_c0.png
+2025-03-04 15:03:14,671 [INFO] __main__ - Processing table image: /topic-extraction/img_22.jpg, columns=three
+2025-03-04 15:03:17,999 [WARNING] __main__ - Cell image not found: /tmp/tmppc_cs35e.jpg_rows/row_0/col_0.png
+2025-03-04 15:03:18,000 [WARNING] __main__ - Cell image not found: /tmp/tmppc_cs35e.jpg_rows/row_0/col_1.png
+2025-03-04 15:03:18,271 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_22.jpg_r1_c0.png
+2025-03-04 15:03:19,493 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_22.jpg_r1_c1.png
+2025-03-04 15:03:20,669 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_22.jpg_r2_c0.png
+2025-03-04 15:03:22,038 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_22.jpg_r2_c1.png
+2025-03-04 15:03:23,431 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_22.jpg_r3_c0.png
+2025-03-04 15:03:24,490 [WARNING] __main__ - Cell image not found: /tmp/tmppc_cs35e.jpg_rows/row_4/col_0.png
+2025-03-04 15:03:24,491 [INFO] __main__ - Processing table image: /topic-extraction/img_23.jpg, columns=three
+2025-03-04 15:03:27,293 [WARNING] __main__ - Cell image not found: /tmp/tmpk98o_fpp.jpg_rows/row_0/col_0.png
+2025-03-04 15:03:27,294 [WARNING] __main__ - Cell image not found: /tmp/tmpk98o_fpp.jpg_rows/row_0/col_1.png
+2025-03-04 15:03:27,553 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_23.jpg_r1_c0.png
+2025-03-04 15:03:28,769 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_23.jpg_r1_c1.png
+2025-03-04 15:03:29,940 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_23.jpg_r2_c0.png
+2025-03-04 15:03:31,452 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_23.jpg_r2_c1.png
+2025-03-04 15:03:32,738 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_23.jpg_r3_c0.png
+2025-03-04 15:03:33,643 [INFO] __main__ - Processing table image: /topic-extraction/img_24.jpg, columns=three
+2025-03-04 15:03:36,892 [WARNING] __main__ - Cell image not found: /tmp/tmpsdjidh_w.jpg_rows/row_0/col_0.png
+2025-03-04 15:03:36,892 [WARNING] __main__ - Cell image not found: /tmp/tmpsdjidh_w.jpg_rows/row_0/col_1.png
+2025-03-04 15:03:36,892 [WARNING] __main__ - Cell image not found: /tmp/tmpsdjidh_w.jpg_rows/row_1/col_0.png
+2025-03-04 15:03:36,892 [WARNING] __main__ - Cell image not found: /tmp/tmpsdjidh_w.jpg_rows/row_1/col_1.png
+2025-03-04 15:03:37,188 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_24.jpg_r2_c0.png
+2025-03-04 15:03:38,642 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_24.jpg_r2_c1.png
+2025-03-04 15:03:40,017 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_24.jpg_r3_c0.png
+2025-03-04 15:03:41,095 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_24.jpg_r4_c0.png
+2025-03-04 15:03:42,514 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_24.jpg_r4_c1.png
+2025-03-04 15:03:43,481 [INFO] __main__ - Processing table image: /topic-extraction/img_25.jpg, columns=two
+2025-03-04 15:03:46,397 [WARNING] __main__ - Cell image not found: /tmp/tmpt9roe876.jpg_rows/row_0/col_0.png
+2025-03-04 15:03:46,809 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_25.jpg_r1_c0.png
+2025-03-04 15:03:48,153 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_25.jpg_r2_c0.png
+2025-03-04 15:03:49,855 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_25.jpg_r3_c0.png
+2025-03-04 15:03:51,232 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_25.jpg_r4_c0.png
+2025-03-04 15:03:52,577 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_25.jpg_r5_c0.png
+2025-03-04 15:03:53,542 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_25.jpg_r6_c0.png
+2025-03-04 15:03:54,702 [INFO] __main__ - Processing table image: /topic-extraction/img_26.jpg, columns=three
+2025-03-04 15:03:57,292 [WARNING] __main__ - Cell image not found: /tmp/tmpkt4w7cqg.jpg_rows/row_0/col_0.png
+2025-03-04 15:03:57,292 [WARNING] __main__ - Cell image not found: /tmp/tmpkt4w7cqg.jpg_rows/row_0/col_1.png
+2025-03-04 15:03:57,547 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_26.jpg_r1_c0.png
+2025-03-04 15:03:58,694 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_26.jpg_r1_c1.png
+2025-03-04 15:04:00,096 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_26.jpg_r2_c0.png
+2025-03-04 15:04:01,892 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_26.jpg_r3_c0.png
+2025-03-04 15:04:03,198 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_26.jpg_r4_c0.png
+2025-03-04 15:04:04,066 [INFO] __main__ - Processing table image: /topic-extraction/img_27.jpg, columns=three
+2025-03-04 15:04:06,633 [WARNING] __main__ - Cell image not found: /tmp/tmp1z8ov49i.jpg_rows/row_0/col_0.png
+2025-03-04 15:04:06,633 [WARNING] __main__ - Cell image not found: /tmp/tmp1z8ov49i.jpg_rows/row_0/col_1.png
+2025-03-04 15:04:06,892 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_27.jpg_r1_c0.png
+2025-03-04 15:04:08,314 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_27.jpg_r1_c1.png
+2025-03-04 15:04:09,655 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_27.jpg_r2_c0.png
+2025-03-04 15:04:10,910 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_27.jpg_r3_c0.png
+2025-03-04 15:04:12,042 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_27.jpg_r4_c0.png
+2025-03-04 15:04:13,234 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_27.jpg_r4_c1.png
+2025-03-04 15:04:14,345 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_27.jpg_r5_c0.png
+2025-03-04 15:04:15,180 [INFO] __main__ - Processing table image: /topic-extraction/img_28.jpg, columns=three
+2025-03-04 15:04:18,179 [WARNING] __main__ - Cell image not found: /tmp/tmpsij1nmfi.jpg_rows/row_0/col_0.png
+2025-03-04 15:04:18,179 [WARNING] __main__ - Cell image not found: /tmp/tmpsij1nmfi.jpg_rows/row_0/col_1.png
+2025-03-04 15:04:18,363 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_28.jpg_r1_c0.png
+2025-03-04 15:04:19,871 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_28.jpg_r1_c1.png
+2025-03-04 15:04:21,379 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_28.jpg_r2_c0.png
+2025-03-04 15:04:23,137 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_28.jpg_r2_c1.png
+2025-03-04 15:04:24,801 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_28.jpg_r3_c0.png
+2025-03-04 15:04:26,569 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_28.jpg_r3_c1.png
+2025-03-04 15:04:28,289 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_28.jpg_r4_c0.png
+2025-03-04 15:04:29,718 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_28.jpg_r4_c1.png
+2025-03-04 15:04:31,009 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_28.jpg_r5_c0.png
+2025-03-04 15:04:31,836 [INFO] __main__ - Final subtopics JSON saved locally at /home/user/app/pearson_json/subtopics.json
+2025-03-04 15:04:32,192 [INFO] __main__ - GPU memory cleaned up.
+2025-03-04 15:04:32,199 [INFO] __main__ - Processing completed successfully.