SkyNait committed on
Commit
145c342
·
1 Parent(s): 121f305

correct JSON and filtering

__pycache__/inference_svm_model.cpython-310.pyc CHANGED
Binary files a/__pycache__/inference_svm_model.cpython-310.pyc and b/__pycache__/inference_svm_model.cpython-310.pyc differ
 
__pycache__/mineru_single.cpython-310.pyc CHANGED
Binary files a/__pycache__/mineru_single.cpython-310.pyc and b/__pycache__/mineru_single.cpython-310.pyc differ
 
__pycache__/table_row_extraction.cpython-310.pyc CHANGED
Binary files a/__pycache__/table_row_extraction.cpython-310.pyc and b/__pycache__/table_row_extraction.cpython-310.pyc differ
 
__pycache__/topic_extraction.cpython-310.pyc CHANGED
Binary files a/__pycache__/topic_extraction.cpython-310.pyc and b/__pycache__/topic_extraction.cpython-310.pyc differ
 
__pycache__/worker.cpython-310.pyc CHANGED
Binary files a/__pycache__/worker.cpython-310.pyc and b/__pycache__/worker.cpython-310.pyc differ
 
pearson_json/subtopics.json ADDED
@@ -0,0 +1,914 @@
1
+ [
2
+ {
3
+ "title": "1 Statistical sampling",
4
+ "contents": [
5
+ {
6
+ "type": "image",
7
+ "key": "/topic-extraction/cells/img_1.jpg_r1_c0.png"
8
+ },
9
+ {
10
+ "type": "image",
11
+ "key": "/topic-extraction/cells/img_19.jpg_r2_c0.png"
12
+ }
13
+ ],
14
+ "children": [
15
+ {
16
+ "title": "1.1",
17
+ "contents": [
18
+ {
19
+ "type": "image",
20
+ "key": "/topic-extraction/cells/img_1.jpg_r1_c1.png"
21
+ },
22
+ {
23
+ "type": "image",
24
+ "key": "/topic-extraction/cells/img_19.jpg_r2_c1.png"
25
+ }
26
+ ],
27
+ "children": []
28
+ }
29
+ ]
30
+ },
31
+ {
32
+ "title": "2 Data presentation and interpretation",
33
+ "contents": [
34
+ {
35
+ "type": "image",
36
+ "key": "/topic-extraction/cells/img_2.jpg_r1_c0.png"
37
+ },
38
+ {
39
+ "type": "image",
40
+ "key": "/topic-extraction/cells/img_3.jpg_r1_c0.png"
41
+ },
42
+ {
43
+ "type": "image",
44
+ "key": "/topic-extraction/cells/img_4.jpg_r2_c0.png"
45
+ },
46
+ {
47
+ "type": "image",
48
+ "key": "/topic-extraction/cells/img_5.jpg_r1_c0.png"
49
+ },
50
+ {
51
+ "type": "image",
52
+ "key": "/topic-extraction/cells/img_6.jpg_r1_c0.png"
53
+ },
54
+ {
55
+ "type": "image",
56
+ "key": "/topic-extraction/cells/img_20.jpg_r1_c0.png"
57
+ },
58
+ {
59
+ "type": "image",
60
+ "key": "/topic-extraction/cells/img_21.jpg_r1_c0.png"
61
+ }
62
+ ],
63
+ "children": [
64
+ {
65
+ "title": "2.1",
66
+ "contents": [
67
+ {
68
+ "type": "image",
69
+ "key": "/topic-extraction/cells/img_2.jpg_r1_c1.png"
70
+ },
71
+ {
72
+ "type": "image",
73
+ "key": "/topic-extraction/cells/img_19.jpg_r3_c1.png"
74
+ }
75
+ ],
76
+ "children": []
77
+ },
78
+ {
79
+ "title": "2.2",
80
+ "contents": [
81
+ {
82
+ "type": "image",
83
+ "key": "/topic-extraction/cells/img_2.jpg_r2_c0.png"
84
+ },
85
+ {
86
+ "type": "image",
87
+ "key": "/topic-extraction/cells/img_20.jpg_r1_c1.png"
88
+ }
89
+ ],
90
+ "children": []
91
+ },
92
+ {
93
+ "title": "2.3",
94
+ "contents": [
95
+ {
96
+ "type": "image",
97
+ "key": "/topic-extraction/cells/img_2.jpg_r3_c0.png"
98
+ },
99
+ {
100
+ "type": "image",
101
+ "key": "/topic-extraction/cells/img_20.jpg_r2_c0.png"
102
+ }
103
+ ],
104
+ "children": []
105
+ },
106
+ {
107
+ "title": "2.4",
108
+ "contents": [
109
+ {
110
+ "type": "image",
111
+ "key": "/topic-extraction/cells/img_2.jpg_r4_c0.png"
112
+ },
113
+ {
114
+ "type": "image",
115
+ "key": "/topic-extraction/cells/img_21.jpg_r1_c1.png"
116
+ }
117
+ ],
118
+ "children": []
119
+ },
120
+ {
121
+ "title": "2.5",
122
+ "contents": [
123
+ {
124
+ "type": "image",
125
+ "key": "/topic-extraction/cells/img_3.jpg_r1_c1.png"
126
+ }
127
+ ],
128
+ "children": []
129
+ },
130
+ {
131
+ "title": "2.6",
132
+ "contents": [
133
+ {
134
+ "type": "image",
135
+ "key": "/topic-extraction/cells/img_3.jpg_r2_c0.png"
136
+ }
137
+ ],
138
+ "children": []
139
+ },
140
+ {
141
+ "title": "2.7",
142
+ "contents": [
143
+ {
144
+ "type": "image",
145
+ "key": "/topic-extraction/cells/img_4.jpg_r2_c1.png"
146
+ }
147
+ ],
148
+ "children": []
149
+ },
150
+ {
151
+ "title": "2.8",
152
+ "contents": [
153
+ {
154
+ "type": "image",
155
+ "key": "/topic-extraction/cells/img_5.jpg_r1_c1.png"
156
+ }
157
+ ],
158
+ "children": []
159
+ },
160
+ {
161
+ "title": "2.9",
162
+ "contents": [
163
+ {
164
+ "type": "image",
165
+ "key": "/topic-extraction/cells/img_5.jpg_r2_c0.png"
166
+ }
167
+ ],
168
+ "children": []
169
+ },
170
+ {
171
+ "title": "2.10",
172
+ "contents": [
173
+ {
174
+ "type": "image",
175
+ "key": "/topic-extraction/cells/img_5.jpg_r3_c0.png"
176
+ }
177
+ ],
178
+ "children": []
179
+ },
180
+ {
181
+ "title": "2.11",
182
+ "contents": [
183
+ {
184
+ "type": "image",
185
+ "key": "/topic-extraction/cells/img_6.jpg_r1_c1.png"
186
+ }
187
+ ],
188
+ "children": []
189
+ }
190
+ ]
191
+ },
192
+ {
193
+ "title": "3 Coordinate geometry in the (x, y) plane",
194
+ "contents": [
195
+ {
196
+ "type": "image",
197
+ "key": "/topic-extraction/cells/img_7.jpg_r1_c0.png"
198
+ },
199
+ {
200
+ "type": "image",
201
+ "key": "/topic-extraction/cells/img_22.jpg_r1_c0.png"
202
+ }
203
+ ],
204
+ "children": [
205
+ {
206
+ "title": "3.1",
207
+ "contents": [
208
+ {
209
+ "type": "image",
210
+ "key": "/topic-extraction/cells/img_6.jpg_r2_c1.png"
211
+ },
212
+ {
213
+ "type": "image",
214
+ "key": "/topic-extraction/cells/img_21.jpg_r2_c1.png"
215
+ }
216
+ ],
217
+ "children": []
218
+ },
219
+ {
220
+ "title": "3.2",
221
+ "contents": [
222
+ {
223
+ "type": "image",
224
+ "key": "/topic-extraction/cells/img_6.jpg_r3_c0.png"
225
+ },
226
+ {
227
+ "type": "image",
228
+ "key": "/topic-extraction/cells/img_21.jpg_r3_c0.png"
229
+ }
230
+ ],
231
+ "children": []
232
+ },
233
+ {
234
+ "title": "3.3",
235
+ "contents": [
236
+ {
237
+ "type": "image",
238
+ "key": "/topic-extraction/cells/img_7.jpg_r1_c1.png"
239
+ },
240
+ {
241
+ "type": "image",
242
+ "key": "/topic-extraction/cells/img_22.jpg_r1_c1.png"
243
+ }
244
+ ],
245
+ "children": []
246
+ },
247
+ {
248
+ "title": "3.4",
249
+ "contents": [
250
+ {
251
+ "type": "image",
252
+ "key": "/topic-extraction/cells/img_7.jpg_r2_c0.png"
253
+ }
254
+ ],
255
+ "children": []
256
+ }
257
+ ]
258
+ },
259
+ {
260
+ "title": "4 Statistical distributions",
261
+ "contents": [
262
+ {
263
+ "type": "image",
264
+ "key": "/topic-extraction/cells/img_8.jpg_r2_c0.png"
265
+ },
266
+ {
267
+ "type": "image",
268
+ "key": "/topic-extraction/cells/img_23.jpg_r1_c0.png"
269
+ }
270
+ ],
271
+ "children": [
272
+ {
273
+ "title": "4.1",
274
+ "contents": [
275
+ {
276
+ "type": "image",
277
+ "key": "/topic-extraction/cells/img_7.jpg_r3_c1.png"
278
+ },
279
+ {
280
+ "type": "image",
281
+ "key": "/topic-extraction/cells/img_22.jpg_r2_c1.png"
282
+ }
283
+ ],
284
+ "children": []
285
+ },
286
+ {
287
+ "title": "4.2",
288
+ "contents": [
289
+ {
290
+ "type": "image",
291
+ "key": "/topic-extraction/cells/img_8.jpg_r2_c1.png"
292
+ },
293
+ {
294
+ "type": "image",
295
+ "key": "/topic-extraction/cells/img_22.jpg_r3_c0.png"
296
+ }
297
+ ],
298
+ "children": []
299
+ },
300
+ {
301
+ "title": "4.3",
302
+ "contents": [
303
+ {
304
+ "type": "image",
305
+ "key": "/topic-extraction/cells/img_8.jpg_r3_c0.png"
306
+ },
307
+ {
308
+ "type": "image",
309
+ "key": "/topic-extraction/cells/img_23.jpg_r1_c1.png"
310
+ }
311
+ ],
312
+ "children": []
313
+ },
314
+ {
315
+ "title": "4.4",
316
+ "contents": [
317
+ {
318
+ "type": "image",
319
+ "key": "/topic-extraction/cells/img_8.jpg_r4_c0.png"
320
+ }
321
+ ],
322
+ "children": []
323
+ },
324
+ {
325
+ "title": "4.5",
326
+ "contents": [
327
+ {
328
+ "type": "image",
329
+ "key": "/topic-extraction/cells/img_8.jpg_r5_c0.png"
330
+ }
331
+ ],
332
+ "children": []
333
+ },
334
+ {
335
+ "title": "4.6",
336
+ "contents": [
337
+ {
338
+ "type": "image",
339
+ "key": "/topic-extraction/cells/img_8.jpg_r6_c0.png"
340
+ }
341
+ ],
342
+ "children": []
343
+ }
344
+ ]
345
+ },
346
+ {
347
+ "title": "5 Statistical hypothesis testing",
348
+ "contents": [
349
+ {
350
+ "type": "image",
351
+ "key": "/topic-extraction/cells/img_9.jpg_r1_c0.png"
352
+ },
353
+ {
354
+ "type": "image",
355
+ "key": "/topic-extraction/cells/img_10.jpg_r1_c0.png"
356
+ },
357
+ {
358
+ "type": "image",
359
+ "key": "/topic-extraction/cells/img_24.jpg_r2_c0.png"
360
+ }
361
+ ],
362
+ "children": [
363
+ {
364
+ "title": "5.1",
365
+ "contents": [
366
+ {
367
+ "type": "image",
368
+ "key": "/topic-extraction/cells/img_9.jpg_r1_c1.png"
369
+ },
370
+ {
371
+ "type": "image",
372
+ "key": "/topic-extraction/cells/img_23.jpg_r2_c1.png"
373
+ }
374
+ ],
375
+ "children": []
376
+ },
377
+ {
378
+ "title": "5.2",
379
+ "contents": [
380
+ {
381
+ "type": "image",
382
+ "key": "/topic-extraction/cells/img_9.jpg_r2_c0.png"
383
+ },
384
+ {
385
+ "type": "image",
386
+ "key": "/topic-extraction/cells/img_24.jpg_r2_c1.png"
387
+ }
388
+ ],
389
+ "children": []
390
+ },
391
+ {
392
+ "title": "5.3",
393
+ "contents": [
394
+ {
395
+ "type": "image",
396
+ "key": "/topic-extraction/cells/img_9.jpg_r3_c0.png"
397
+ },
398
+ {
399
+ "type": "image",
400
+ "key": "/topic-extraction/cells/img_24.jpg_r3_c0.png"
401
+ }
402
+ ],
403
+ "children": []
404
+ },
405
+ {
406
+ "title": "5.4",
407
+ "contents": [
408
+ {
409
+ "type": "image",
410
+ "key": "/topic-extraction/cells/img_9.jpg_r4_c0.png"
411
+ }
412
+ ],
413
+ "children": []
414
+ },
415
+ {
416
+ "title": "5.5",
417
+ "contents": [
418
+ {
419
+ "type": "image",
420
+ "key": "/topic-extraction/cells/img_10.jpg_r1_c1.png"
421
+ }
422
+ ],
423
+ "children": []
424
+ },
425
+ {
426
+ "title": "5.6",
427
+ "contents": [
428
+ {
429
+ "type": "image",
430
+ "key": "/topic-extraction/cells/img_10.jpg_r2_c0.png"
431
+ }
432
+ ],
433
+ "children": []
434
+ },
435
+ {
436
+ "title": "5.7",
437
+ "contents": [
438
+ {
439
+ "type": "image",
440
+ "key": "/topic-extraction/cells/img_10.jpg_r3_c0.png"
441
+ }
442
+ ],
443
+ "children": []
444
+ },
445
+ {
446
+ "title": "5.8",
447
+ "contents": [
448
+ {
449
+ "type": "image",
450
+ "key": "/topic-extraction/cells/img_10.jpg_r4_c0.png"
451
+ }
452
+ ],
453
+ "children": []
454
+ },
455
+ {
456
+ "title": "5.9",
457
+ "contents": [
458
+ {
459
+ "type": "image",
460
+ "key": "/topic-extraction/cells/img_10.jpg_r5_c0.png"
461
+ }
462
+ ],
463
+ "children": []
464
+ }
465
+ ]
466
+ },
467
+ {
468
+ "title": "6 Exponentials and logarithms",
469
+ "contents": [
470
+ {
471
+ "type": "image",
472
+ "key": "/topic-extraction/cells/img_12.jpg_r2_c0.png"
473
+ }
474
+ ],
475
+ "children": [
476
+ {
477
+ "title": "6.1",
478
+ "contents": [
479
+ {
480
+ "type": "image",
481
+ "key": "/topic-extraction/cells/img_11.jpg_r1_c0.png"
482
+ },
483
+ {
484
+ "type": "image",
485
+ "key": "/topic-extraction/cells/img_24.jpg_r4_c1.png"
486
+ }
487
+ ],
488
+ "children": []
489
+ },
490
+ {
491
+ "title": "6.2",
492
+ "contents": [
493
+ {
494
+ "type": "image",
495
+ "key": "/topic-extraction/cells/img_11.jpg_r2_c0.png"
496
+ }
497
+ ],
498
+ "children": []
499
+ },
500
+ {
501
+ "title": "6.3",
502
+ "contents": [
503
+ {
504
+ "type": "image",
505
+ "key": "/topic-extraction/cells/img_11.jpg_r3_c0.png"
506
+ }
507
+ ],
508
+ "children": []
509
+ },
510
+ {
511
+ "title": "6.4",
512
+ "contents": [
513
+ {
514
+ "type": "image",
515
+ "key": "/topic-extraction/cells/img_11.jpg_r4_c0.png"
516
+ }
517
+ ],
518
+ "children": []
519
+ },
520
+ {
521
+ "title": "6.5",
522
+ "contents": [
523
+ {
524
+ "type": "image",
525
+ "key": "/topic-extraction/cells/img_11.jpg_r5_c0.png"
526
+ }
527
+ ],
528
+ "children": []
529
+ },
530
+ {
531
+ "title": "6.6",
532
+ "contents": [
533
+ {
534
+ "type": "image",
535
+ "key": "/topic-extraction/cells/img_11.jpg_r6_c0.png"
536
+ }
537
+ ],
538
+ "children": []
539
+ },
540
+ {
541
+ "title": "6.7",
542
+ "contents": [
543
+ {
544
+ "type": "image",
545
+ "key": "/topic-extraction/cells/img_12.jpg_r2_c1.png"
546
+ }
547
+ ],
548
+ "children": []
549
+ }
550
+ ]
551
+ },
552
+ {
553
+ "title": "7 Differentiation",
554
+ "contents": [
555
+ {
556
+ "type": "image",
557
+ "key": "/topic-extraction/cells/img_13.jpg_r2_c0.png"
558
+ },
559
+ {
560
+ "type": "image",
561
+ "key": "/topic-extraction/cells/img_14.jpg_r1_c0.png"
562
+ }
563
+ ],
564
+ "children": [
565
+ {
566
+ "title": "7.1",
567
+ "contents": [
568
+ {
569
+ "type": "image",
570
+ "key": "/topic-extraction/cells/img_13.jpg_r2_c1.png"
571
+ },
572
+ {
573
+ "type": "image",
574
+ "key": "/topic-extraction/cells/img_25.jpg_r1_c0.png"
575
+ },
576
+ {
577
+ "type": "image",
578
+ "key": "/topic-extraction/cells/img_12.jpg_r3_c1.png"
579
+ }
580
+ ],
581
+ "children": []
582
+ },
583
+ {
584
+ "title": "7.2",
585
+ "contents": [
586
+ {
587
+ "type": "image",
588
+ "key": "/topic-extraction/cells/img_13.jpg_r3_c0.png"
589
+ },
590
+ {
591
+ "type": "image",
592
+ "key": "/topic-extraction/cells/img_25.jpg_r2_c0.png"
593
+ }
594
+ ],
595
+ "children": []
596
+ },
597
+ {
598
+ "title": "7.3",
599
+ "contents": [
600
+ {
601
+ "type": "image",
602
+ "key": "/topic-extraction/cells/img_13.jpg_r5_c0.png"
603
+ },
604
+ {
605
+ "type": "image",
606
+ "key": "/topic-extraction/cells/img_25.jpg_r3_c0.png"
607
+ }
608
+ ],
609
+ "children": []
610
+ },
611
+ {
612
+ "title": "7.4",
613
+ "contents": [
614
+ {
615
+ "type": "image",
616
+ "key": "/topic-extraction/cells/img_14.jpg_r1_c1.png"
617
+ },
618
+ {
619
+ "type": "image",
620
+ "key": "/topic-extraction/cells/img_25.jpg_r4_c0.png"
621
+ }
622
+ ],
623
+ "children": []
624
+ },
625
+ {
626
+ "title": "7.5",
627
+ "contents": [
628
+ {
629
+ "type": "image",
630
+ "key": "/topic-extraction/cells/img_14.jpg_r2_c0.png"
631
+ },
632
+ {
633
+ "type": "image",
634
+ "key": "/topic-extraction/cells/img_25.jpg_r5_c0.png"
635
+ }
636
+ ],
637
+ "children": []
638
+ },
639
+ {
640
+ "title": "7.6",
641
+ "contents": [
642
+ {
643
+ "type": "image",
644
+ "key": "/topic-extraction/cells/img_14.jpg_r3_c0.png"
645
+ }
646
+ ],
647
+ "children": []
648
+ }
649
+ ]
650
+ },
651
+ {
652
+ "title": "8 Forces and Newton's laws",
653
+ "contents": [
654
+ {
655
+ "type": "image",
656
+ "key": "/topic-extraction/cells/img_15.jpg_r1_c0.png"
657
+ },
658
+ {
659
+ "type": "image",
660
+ "key": "/topic-extraction/cells/img_16.jpg_r2_c0.png"
661
+ },
662
+ {
663
+ "type": "image",
664
+ "key": "/topic-extraction/cells/img_26.jpg_r1_c0.png"
665
+ },
666
+ {
667
+ "type": "image",
668
+ "key": "/topic-extraction/cells/img_27.jpg_r1_c0.png"
669
+ }
670
+ ],
671
+ "children": [
672
+ {
673
+ "title": "8.1",
674
+ "contents": [
675
+ {
676
+ "type": "image",
677
+ "key": "/topic-extraction/cells/img_26.jpg_r1_c1.png"
678
+ },
679
+ {
680
+ "type": "image",
681
+ "key": "/topic-extraction/cells/img_14.jpg_r4_c1.png"
682
+ }
683
+ ],
684
+ "children": []
685
+ },
686
+ {
687
+ "title": "8.2",
688
+ "contents": [
689
+ {
690
+ "type": "image",
691
+ "key": "/topic-extraction/cells/img_26.jpg_r2_c0.png"
692
+ },
693
+ {
694
+ "type": "image",
695
+ "key": "/topic-extraction/cells/img_14.jpg_r5_c0.png"
696
+ }
697
+ ],
698
+ "children": []
699
+ },
700
+ {
701
+ "title": "8.3",
702
+ "contents": [
703
+ {
704
+ "type": "image",
705
+ "key": "/topic-extraction/cells/img_15.jpg_r1_c1.png"
706
+ },
707
+ {
708
+ "type": "image",
709
+ "key": "/topic-extraction/cells/img_26.jpg_r3_c0.png"
710
+ }
711
+ ],
712
+ "children": []
713
+ },
714
+ {
715
+ "title": "8.4",
716
+ "contents": [
717
+ {
718
+ "type": "image",
719
+ "key": "/topic-extraction/cells/img_15.jpg_r2_c0.png"
720
+ },
721
+ {
722
+ "type": "image",
723
+ "key": "/topic-extraction/cells/img_27.jpg_r1_c1.png"
724
+ }
725
+ ],
726
+ "children": []
727
+ },
728
+ {
729
+ "title": "8.5",
730
+ "contents": [
731
+ {
732
+ "type": "image",
733
+ "key": "/topic-extraction/cells/img_15.jpg_r3_c0.png"
734
+ },
735
+ {
736
+ "type": "image",
737
+ "key": "/topic-extraction/cells/img_27.jpg_r2_c0.png"
738
+ }
739
+ ],
740
+ "children": []
741
+ },
742
+ {
743
+ "title": "8.6",
744
+ "contents": [
745
+ {
746
+ "type": "image",
747
+ "key": "/topic-extraction/cells/img_15.jpg_r4_c0.png"
748
+ },
749
+ {
750
+ "type": "image",
751
+ "key": "/topic-extraction/cells/img_27.jpg_r3_c0.png"
752
+ }
753
+ ],
754
+ "children": []
755
+ },
756
+ {
757
+ "title": "8.7",
758
+ "contents": [
759
+ {
760
+ "type": "image",
761
+ "key": "/topic-extraction/cells/img_16.jpg_r2_c1.png"
762
+ }
763
+ ],
764
+ "children": []
765
+ },
766
+ {
767
+ "title": "8.8",
768
+ "contents": [
769
+ {
770
+ "type": "image",
771
+ "key": "/topic-extraction/cells/img_16.jpg_r3_c0.png"
772
+ }
773
+ ],
774
+ "children": []
775
+ }
776
+ ]
777
+ },
778
+ {
779
+ "title": "9 Numerical methods",
780
+ "contents": [
781
+ {
782
+ "type": "image",
783
+ "key": "/topic-extraction/cells/img_17.jpg_r1_c0.png"
784
+ }
785
+ ],
786
+ "children": [
787
+ {
788
+ "title": "9.1",
789
+ "contents": [
790
+ {
791
+ "type": "image",
792
+ "key": "/topic-extraction/cells/img_16.jpg_r4_c1.png"
793
+ },
794
+ {
795
+ "type": "image",
796
+ "key": "/topic-extraction/cells/img_27.jpg_r4_c1.png"
797
+ }
798
+ ],
799
+ "children": []
800
+ },
801
+ {
802
+ "title": "9.2",
803
+ "contents": [
804
+ {
805
+ "type": "image",
806
+ "key": "/topic-extraction/cells/img_16.jpg_r5_c0.png"
807
+ }
808
+ ],
809
+ "children": []
810
+ },
811
+ {
812
+ "title": "9.3",
813
+ "contents": [
814
+ {
815
+ "type": "image",
816
+ "key": "/topic-extraction/cells/img_16.jpg_r6_c0.png"
817
+ }
818
+ ],
819
+ "children": []
820
+ },
821
+ {
822
+ "title": "9.4",
823
+ "contents": [
824
+ {
825
+ "type": "image",
826
+ "key": "/topic-extraction/cells/img_17.jpg_r1_c1.png"
827
+ }
828
+ ],
829
+ "children": []
830
+ },
831
+ {
832
+ "title": "9.5",
833
+ "contents": [
834
+ {
835
+ "type": "image",
836
+ "key": "/topic-extraction/cells/img_17.jpg_r2_c0.png"
837
+ }
838
+ ],
839
+ "children": []
840
+ }
841
+ ]
842
+ },
843
+ {
844
+ "title": "10 Vectors",
845
+ "contents": [
846
+ {
847
+ "type": "image",
848
+ "key": "/topic-extraction/cells/img_18.jpg_r2_c0.png"
849
+ }
850
+ ],
851
+ "children": [
852
+ {
853
+ "title": "10.1",
854
+ "contents": [
855
+ {
856
+ "type": "image",
857
+ "key": "/topic-extraction/cells/img_17.jpg_r3_c1.png"
858
+ }
859
+ ],
860
+ "children": []
861
+ },
862
+ {
863
+ "title": "10.2",
864
+ "contents": [
865
+ {
866
+ "type": "image",
867
+ "key": "/topic-extraction/cells/img_17.jpg_r4_c0.png"
868
+ }
869
+ ],
870
+ "children": []
871
+ },
872
+ {
873
+ "title": "10.3",
874
+ "contents": [
875
+ {
876
+ "type": "image",
877
+ "key": "/topic-extraction/cells/img_17.jpg_r5_c0.png"
878
+ }
879
+ ],
880
+ "children": []
881
+ },
882
+ {
883
+ "title": "10.4",
884
+ "contents": [
885
+ {
886
+ "type": "image",
887
+ "key": "/topic-extraction/cells/img_17.jpg_r6_c0.png"
888
+ }
889
+ ],
890
+ "children": []
891
+ },
892
+ {
893
+ "title": "10.5",
894
+ "contents": [
895
+ {
896
+ "type": "image",
897
+ "key": "/topic-extraction/cells/img_18.jpg_r2_c1.png"
898
+ }
899
+ ],
900
+ "children": []
901
+ }
902
+ ]
903
+ },
904
+ {
905
+ "title": "A01",
906
+ "contents": [
907
+ {
908
+ "type": "image",
909
+ "key": "/topic-extraction/cells/img_28.jpg_r1_c0.png"
910
+ }
911
+ ],
912
+ "children": []
913
+ }
914
+ ]
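The added `pearson_json/subtopics.json` appears to be a list of topic nodes, each with a `title`, a `contents` list of image entries (`type` and `key`), and a `children` list of nodes with the same shape. The following is a minimal sketch of walking that tree, assuming every node follows this shape; the helper name `iter_topic_images` is hypothetical and not part of the commit.

```python
import json
from typing import Any, Dict, Iterator, List, Tuple


def iter_topic_images(
    nodes: List[Dict[str, Any]], parents: Tuple[str, ...] = ()
) -> Iterator[Tuple[Tuple[str, ...], str]]:
    """Yield (title path, image key) pairs in document order, depth first."""
    for node in nodes:
        path = parents + (node["title"],)
        for item in node.get("contents", []):
            # Only image entries occur in the file shown above.
            if item.get("type") == "image":
                yield path, item["key"]
        yield from iter_topic_images(node.get("children", []), path)


# Excerpt copied from the start of the added file (first topic only).
sample = json.loads("""
[
  {
    "title": "1 Statistical sampling",
    "contents": [
      {"type": "image", "key": "/topic-extraction/cells/img_1.jpg_r1_c0.png"},
      {"type": "image", "key": "/topic-extraction/cells/img_19.jpg_r2_c0.png"}
    ],
    "children": [
      {
        "title": "1.1",
        "contents": [
          {"type": "image", "key": "/topic-extraction/cells/img_1.jpg_r1_c1.png"},
          {"type": "image", "key": "/topic-extraction/cells/img_19.jpg_r2_c1.png"}
        ],
        "children": []
      }
    ]
  }
]
""")

pairs = list(iter_topic_images(sample))
for path, key in pairs:
    print(" > ".join(path), key)
```

Keeping the title path as a tuple makes it straightforward to group keys by topic or subtopic later without re-walking the tree.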
table_row_extraction.py CHANGED
@@ -1,5 +1,6 @@
1
  import cv2
2
  import numpy as np
 
3
  import logging
4
  from pathlib import Path
5
  from typing import List, Tuple
@@ -10,10 +11,27 @@ logger = logging.getLogger(__name__)
10
  # For 3-column tables, set `merge_two_col_rows` and `enable_subtopic_merge` to False.
11
  # For 2-column tables, set both to True (currently hardcoded, for testing only).
12
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
13
  class TableExtractor:
14
  def __init__(
15
  self,
16
- #preprocessing parameters
17
  denoise_h: int = 10,
18
  clahe_clip: float = 3.0,
19
  clahe_grid: int = 8,
@@ -23,44 +41,40 @@ class TableExtractor:
23
  thresh_block_size: int = 21,
24
  thresh_C: int = 7,
25
 
26
- # Row detection parameters
27
  horizontal_scale: int = 20,
28
- row_morph_iterations: int = 2,
29
- min_row_height: int = 30,
30
  min_row_density: float = 0.01,
31
 
32
- # Column detection parameters
 
 
 
 
 
33
  vertical_scale: int = 20,
34
  col_morph_iterations: int = 2,
35
  min_col_height_ratio: float = 0.5,
36
  min_col_density: float = 0.01,
37
 
38
- # Bounding box extraction
39
  padding: int = 0,
40
  skip_header: bool = True,
41
 
42
- # Two-column & subtopic merges
43
- merge_two_col_rows: bool = False,
44
- enable_subtopic_merge: bool = False,
45
  subtopic_threshold: float = 0.2,
46
 
47
- #gray artifact filter
48
- std_threshold_for_artifacts: float = 5.0,
49
-
50
- #parameters for line removal check
51
- line_removal_scale: int = 15,
52
- line_removal_iterations: int = 1,
53
- min_text_ratio_after_line_removal: float = 0.001
54
  ):
55
- """
56
- :param merge_two_col_rows: If True, a row with exactly 1 vertical line => merges into 1 bounding box.
57
- :param enable_subtopic_merge: If True, a row with 2 vertical lines => 3 columns can become 2 if left is narrow.
58
- :param subtopic_threshold: Fraction of row width for subtopic detection.
59
- :param std_threshold_for_artifacts: Grayscale std dev < this => skip as artifact.
60
- :param line_removal_scale: Larger => more aggressive line detection inside the cell.
61
- :param line_removal_iterations: Morphological iterations for line removal.
62
- :param min_text_ratio_after_line_removal: If fraction of text after removing lines < this => skip cell.
63
- """
64
  # Preprocessing
65
  self.denoise_h = denoise_h
66
  self.clahe_clip = clahe_clip
@@ -75,6 +89,11 @@ class TableExtractor:
75
  self.min_row_height = min_row_height
76
  self.min_row_density = min_row_density
77
 
 
 
 
 
 
78
  # Column detection
79
  self.vertical_scale = vertical_scale
80
  self.col_morph_iterations = col_morph_iterations
@@ -85,28 +104,31 @@ class TableExtractor:
85
  self.padding = padding
86
  self.skip_header = skip_header
87
 
88
- # Two-column / subtopic merges
89
  self.merge_two_col_rows = merge_two_col_rows
90
  self.enable_subtopic_merge = enable_subtopic_merge
91
  self.subtopic_threshold = subtopic_threshold
92
 
93
- # artifact filtering (gray headers, purple, etc.) / currently not working well
94
- self.std_threshold_for_artifacts = std_threshold_for_artifacts
95
-
96
- #line removal inside cell
97
- self.line_removal_scale = line_removal_scale
98
- self.line_removal_iterations = line_removal_iterations
99
- self.min_text_ratio_after_line_removal = min_text_ratio_after_line_removal
100
 
101
  def preprocess(self, img: np.ndarray) -> np.ndarray:
102
- """Grayscale, denoise, CLAHE, sharpen, adaptive threshold (binary_inv)."""
 
 
103
  if img.ndim == 3:
104
  gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
105
  else:
106
  gray = img.copy()
107
 
108
  denoised = cv2.fastNlMeansDenoising(gray, h=self.denoise_h)
109
- clahe = cv2.createCLAHE(clipLimit=self.clahe_clip, tileGridSize=(self.clahe_grid, self.clahe_grid))
 
110
  enhanced = clahe.apply(denoised)
111
  sharpened = cv2.filter2D(enhanced, -1, self.sharpen_kernel)
112
 
@@ -120,75 +142,95 @@ class TableExtractor:
120
  return binarized
121
 
122
  def detect_full_rows(self, bin_img: np.ndarray) -> List[Tuple[int, int]]:
123
- """Find horizontal row boundaries in the binarized image."""
124
  h_kernel_size = max(1, bin_img.shape[1] // self.horizontal_scale)
125
  horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (h_kernel_size, 1))
 
 
 
 
126
 
127
- horizontal_lines = cv2.morphologyEx(bin_img, cv2.MORPH_OPEN, horizontal_kernel,
128
- iterations=self.row_morph_iterations)
129
  row_projection = np.sum(horizontal_lines, axis=1)
130
  max_val = np.max(row_projection) if len(row_projection) else 0
131
 
132
- # If no lines, treat entire image as one row (opt)
133
  if max_val < 1e-5:
134
  return [(0, bin_img.shape[0])]
135
 
136
- threshold_val = 0.3 * max_val
137
  line_indices = np.where(row_projection > threshold_val)[0]
138
-
139
  if len(line_indices) < 2:
140
  return [(0, bin_img.shape[0])]
141
 
142
- # Group consecutive indices
143
  lines = []
144
- current = [line_indices[0]]
145
  for i in range(1, len(line_indices)):
146
- if line_indices[i] - line_indices[i - 1] <= 2:
147
- current.append(line_indices[i])
148
  else:
149
- lines.append(int(np.mean(current)))
150
- current = [line_indices[i]]
151
- if current:
152
- lines.append(int(np.mean(current)))
153
 
154
- row_bounds = []
155
  for i in range(len(lines) - 1):
156
  y1 = lines[i]
157
  y2 = lines[i + 1]
158
- if (y2 - y1) >= self.min_row_height:
159
- row_bounds.append((y1, y2))
 
 
 
 
 
 
 
 
160
 
161
- return row_bounds if row_bounds else [(0, bin_img.shape[0])]
 
 
 
 
162
 
163
- def detect_columns_in_row(self, row_img: np.ndarray, y1: int, y2: int) -> List[Tuple[int, int, int, int]]:
164
- """
165
- Detect up to two vertical lines => up to 3 bounding boxes.
166
- - 0 lines => 1 bounding box
167
- - 1 line => 2 bounding boxes (unless merge_two_col_rows => 1)
168
- - 2 lines => 3 bounding boxes by default
169
- if enable_subtopic_merge => check left box < subtopic_threshold => 2 boxes
170
- """
 
 
 
 
 
171
  row_height = (y2 - y1)
172
  row_width = row_img.shape[1]
173
 
174
- # Morph kernel for vertical lines
175
  v_kernel_size = max(1, row_height // self.vertical_scale)
176
  vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, v_kernel_size))
177
 
178
- vertical_lines = cv2.morphologyEx(row_img, cv2.MORPH_OPEN, vertical_kernel,
179
- iterations=self.col_morph_iterations)
180
- vertical_lines = cv2.dilate(vertical_lines, np.ones((3, 3), np.uint8), iterations=1)
 
 
 
 
181
 
182
  # Find contours => x positions
183
- contours, _ = cv2.findContours(vertical_lines, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
 
 
184
  x_positions = []
185
  for c in contours:
186
- x, y, w, h = cv2.boundingRect(c)
187
- # Must be at least half the row height to be considered a real column divider
188
  if h >= self.min_col_height_ratio * row_height:
189
  x_positions.append(x)
190
- x_positions = sorted(set(x_positions))
191
 
 
192
  # Keep at most 2 vertical lines
193
  if len(x_positions) > 2:
194
  x_positions = x_positions[:2]
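The lines removed from `detect_full_rows` in this hunk threshold the row projection at 0.3 of its maximum, group indices that are at most 2 pixels apart into a single line position (their truncated mean), and keep adjacent line pairs that are at least `min_row_height` apart. The sketch below reproduces that logic in plain Python rather than NumPy, so the function names and the list-based projection are assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple


def group_line_indices(indices: Sequence[int], max_gap: int = 2) -> List[int]:
    """Collapse runs of nearby row indices into one line position each."""
    if not indices:
        return []
    lines: List[int] = []
    current = [indices[0]]
    for idx in indices[1:]:
        if idx - current[-1] <= max_gap:
            current.append(idx)
        else:
            # Integer mean of the run, like int(np.mean(current)).
            lines.append(sum(current) // len(current))
            current = [idx]
    lines.append(sum(current) // len(current))
    return lines


def rows_from_projection(
    projection: Sequence[float],
    image_height: int,
    min_row_height: int = 30,
    rel_threshold: float = 0.3,
) -> List[Tuple[int, int]]:
    """Turn a per-row sum of horizontal-line pixels into (y1, y2) row bounds."""
    max_val = max(projection, default=0)
    if max_val < 1e-5:
        # No lines found: treat the whole image as one row.
        return [(0, image_height)]
    threshold = rel_threshold * max_val
    indices = [i for i, v in enumerate(projection) if v > threshold]
    if len(indices) < 2:
        return [(0, image_height)]
    lines = group_line_indices(indices)
    bounds = [
        (y1, y2) for y1, y2 in zip(lines, lines[1:]) if y2 - y1 >= min_row_height
    ]
    return bounds or [(0, image_height)]


# Synthetic projection: rules at rows 0-1, 40-41 and 99 of a 100-row image.
projection = [0] * 100
for r in (0, 1, 40, 41, 99):
    projection[r] = 10
rows = rows_from_projection(projection, image_height=100)
```

Falling back to a single full-height row mirrors the early returns in the removed code, so a table without ruled lines still yields one region to process.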
@@ -209,14 +251,12 @@ class TableExtractor:
209
  (0, y1, x1, row_height),
210
  (x1, y1, row_width - x1, row_height)
211
  ]
212
-
213
  else:
214
  # 2 lines => normally 3 bounding boxes
215
  x1, x2 = sorted(x_positions)
216
  if self.enable_subtopic_merge:
217
- # If left bounding box is very narrow => treat as subtopic => 2 bounding boxes
218
- left_box_width = x1
219
- if left_box_width < (self.subtopic_threshold * row_width):
220
  boxes = [
221
  (0, y1, x1, row_height),
222
  (x1, y1, row_width - x1, row_height)
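The docstring removed from `detect_columns_in_row` maps the number of detected vertical dividers to bounding boxes: no divider gives one box, one divider gives two (or one when `merge_two_col_rows` is set), and two dividers give three unless the subtopic merge applies. A minimal sketch of that mapping follows. The merged single-box and three-box branches are not visible in this diff, so their exact form here is an assumption based on the docstring, and `boxes_for_row` is a hypothetical name.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)


def boxes_for_row(
    x_positions: Sequence[int],
    y1: int,
    row_width: int,
    row_height: int,
    merge_two_col_rows: bool = False,
    enable_subtopic_merge: bool = False,
    subtopic_threshold: float = 0.2,
) -> List[Box]:
    """Split one table row into boxes at the given vertical divider positions."""
    xs = sorted(set(x_positions))[:2]  # keep at most two dividers
    full_row = [(0, y1, row_width, row_height)]
    if not xs:
        return full_row
    if len(xs) == 1:
        if merge_two_col_rows:
            return full_row
        x1 = xs[0]
        return [(0, y1, x1, row_height), (x1, y1, row_width - x1, row_height)]
    x1, x2 = xs
    if enable_subtopic_merge and x1 < subtopic_threshold * row_width:
        # Narrow left column: treat it as a subtopic label and merge the rest.
        return [(0, y1, x1, row_height), (x1, y1, row_width - x1, row_height)]
    return [
        (0, y1, x1, row_height),
        (x1, y1, x2 - x1, row_height),
        (x2, y1, row_width - x2, row_height),
    ]


three = boxes_for_row([10, 60], y1=0, row_width=100, row_height=20)
merged = boxes_for_row([10, 60], y1=0, row_width=100, row_height=20,
                       enable_subtopic_merge=True)
```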
@@ -239,12 +279,12 @@ class TableExtractor:
239
  for (x, y, w, h) in boxes:
240
  if w <= 0:
241
  continue
242
- subregion = row_img[:, x : x + w]
243
  white_pixels = np.sum(subregion == 255)
244
  total_pixels = subregion.size
245
  if total_pixels == 0:
246
  continue
247
- density = white_pixels / total_pixels
248
  if density >= self.min_col_density:
249
  filtered.append((x, y, w, h))
250
 
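The filter in this hunk keeps a box only when the share of white (255) pixels inside it reaches `min_col_density`. The sketch below shows the same check on a nested list instead of a NumPy array; the function name and the list-based image are assumptions for illustration.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)


def filter_boxes_by_density(
    row_img: Sequence[Sequence[int]],
    boxes: Sequence[Box],
    min_col_density: float = 0.01,
) -> List[Box]:
    """Drop boxes whose column slice of the binarized row is almost empty."""
    filtered: List[Box] = []
    for (x, y, w, h) in boxes:
        if w <= 0:
            continue
        # Same slice as row_img[:, x : x + w] on a NumPy array.
        pixels = [v for row in row_img for v in row[x:x + w]]
        if not pixels:
            continue
        density = sum(1 for v in pixels if v == 255) / len(pixels)
        if density >= min_col_density:
            filtered.append((x, y, w, h))
    return filtered


row_img = [
    [255, 0, 0, 0],
    [0, 0, 0, 0],
]
kept = filter_boxes_by_density(row_img, [(0, 0, 2, 2), (2, 0, 2, 2), (4, 0, 0, 2)])
```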
@@ -253,9 +293,9 @@ class TableExtractor:
253
  def process_image(self, image_path: str) -> List[List[Tuple[int, int, int, int]]]:
254
  """
255
  1) Preprocess => bin_img
256
- 2) Detect row segments
257
  3) Filter out rows by density
258
- - optionally skip first row (header)
259
  5) For each row => detect columns => bounding boxes
260
  """
261
  img = cv2.imread(image_path)
@@ -273,15 +313,15 @@ class TableExtractor:
273
  if area == 0:
274
  continue
275
  white_pixels = np.sum(row_region == 255)
276
- density = white_pixels / area
277
  if density >= self.min_row_density:
278
  valid_rows.append((y1, y2))
279
 
280
- # Possibly skip header row
281
  if self.skip_header and len(valid_rows) > 1:
282
  valid_rows = valid_rows[1:]
283
 
284
- # Detect columns in each row
285
  all_rows_boxes = []
286
  for (y1, y2) in valid_rows:
287
  row_img = bin_img[y1:y2, :]
@@ -291,8 +331,12 @@ class TableExtractor:
291
 
292
  return all_rows_boxes
293
 
294
- def extract_box_image(self, original: np.ndarray, box: Tuple[int, int, int, int]) -> np.ndarray:
295
- """Crop bounding box from original with optional padding."""
 
 
 
 
296
  x, y, w, h = box
297
  Y1 = max(0, y - self.padding)
298
  Y2 = min(original.shape[0], y + h + self.padding)
@@ -300,59 +344,47 @@ class TableExtractor:
300
  X2 = min(original.shape[1], x + w + self.padding)
301
  return original[Y1:Y2, X1:X2]
302
 
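`extract_box_image` pads the box on every side and clamps the result to the image bounds before slicing. The sketch below isolates that arithmetic. The `X1` line falls between two hunks and is not shown in this diff, so its form here (`max(0, x - padding)`) is an assumption by symmetry with `Y1`, and `padded_crop_bounds` is a hypothetical name.

```python
from typing import Sequence, Tuple


def padded_crop_bounds(
    box: Tuple[int, int, int, int],
    image_shape: Sequence[int],
    padding: int = 0,
) -> Tuple[int, int, int, int]:
    """Return (y1, y2, x1, x2) slice bounds for a padded box, clamped to the image."""
    x, y, w, h = box
    height, width = image_shape[0], image_shape[1]
    y1 = max(0, y - padding)
    y2 = min(height, y + h + padding)
    x1 = max(0, x - padding)  # assumed; this line is elided in the diff
    x2 = min(width, x + w + padding)
    return y1, y2, x1, x2


# A 12x20 image: the padded box is cut off at the bottom edge.
bounds = padded_crop_bounds((5, 5, 10, 10), image_shape=(12, 20), padding=3)
```

With these bounds, the crop itself is `original[y1:y2, x1:x2]`, as in the unchanged return statement above.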
303
- def _remove_lines_in_cell(self, gray_bin: np.ndarray) -> np.ndarray:
304
  """
305
- Remove horizontal + vertical lines from a binarized subregion
306
- and return the 'text-only' mask.
307
- """
308
- # 1) horizontal line detection
309
- horiz_kernel_size = max(1, gray_bin.shape[1] // self.line_removal_scale)
310
- horiz_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (horiz_kernel_size, 1))
311
- horizontal = cv2.morphologyEx(gray_bin, cv2.MORPH_OPEN, horiz_kernel, iterations=self.line_removal_iterations)
312
-
313
- # 2) vertical line detection
314
- vert_kernel_size = max(1, gray_bin.shape[0] // self.line_removal_scale)
315
- vert_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, vert_kernel_size))
316
- vertical = cv2.morphologyEx(gray_bin, cv2.MORPH_OPEN, vert_kernel, iterations=self.line_removal_iterations)
317
-
318
- # Combine lines
319
- lines = cv2.bitwise_or(horizontal, vertical)
320
- # Subtract from the original => text-only
321
- text_only = cv2.bitwise_and(gray_bin, cv2.bitwise_not(lines))
322
- return text_only
323
-
324
- def is_grey_artifact(self, cell_img: np.ndarray) -> bool:
325
- """
326
- 1) If grayscale std dev < std_threshold_for_artifacts => skip as uniform.
327
- 2) Otherwise, remove lines from an Otsu-binarized version of the cell
328
- and check if there's enough text left. If not, skip as artifact.
329
  """
330
  if cell_img.size == 0:
331
  return True
332
 
333
- gray = cv2.cvtColor(cell_img, cv2.COLOR_BGR2GRAY)
334
- std_val = np.std(gray)
335
- if std_val < self.std_threshold_for_artifacts:
336
  return True
337
 
338
- # 2) Binarize => remove lines => check leftover text
339
- # Use Otsu threshold for the local cell
340
- _, cell_bin = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)
341
 
342
- text_only = self._remove_lines_in_cell(cell_bin)
343
- nonzero_text = cv2.countNonZero(text_only)
344
- ratio = nonzero_text / float(cell_bin.size)
345
 
346
- if ratio < self.min_text_ratio_after_line_removal:
347
- # Hardly any text remains => artifact
 
 
 
 
348
  return True
349
 
350
  return False
351
 
352
  def save_extracted_cells(
353
- self, image_path: str, row_boxes: List[List[Tuple[int, int, int, int]]], output_dir: str
 
 
 
354
  ):
355
- """Save each cell from the original image, skipping uniform/gray artifacts."""
 
 
356
  out_path = Path(output_dir)
357
  out_path.mkdir(exist_ok=True, parents=True)
358
 
@@ -365,14 +397,15 @@ class TableExtractor:
365
  row_dir.mkdir(exist_ok=True)
366
  for j, box in enumerate(row):
367
  cell_img = self.extract_box_image(original, box)
368
- # Skip if uniform or line-only artifact
369
- if self.is_grey_artifact(cell_img):
370
- logger.info(f"Skipping artifact cell at row={i}, col={j}. (uniform/grey/line-only)")
 
371
  continue
372
 
373
  out_file = row_dir / f"col_{j}.png"
374
  cv2.imwrite(str(out_file), cell_img)
375
- logger.info(f"Saved cell image row={i}, col={j} -> {out_file}")
376
 
377
  class TableExtractorApp:
378
  def __init__(self, extractor: TableExtractor):
@@ -384,39 +417,24 @@ class TableExtractorApp:
384
  self.extractor.save_extracted_cells(input_image, row_boxes, output_folder)
385
  logger.info("Done. Check the output folder for results.")
386
 
387
-
388
  if __name__ == "__main__":
389
- input_image = "images/test/img_2.png"
390
- output_folder = "refined_outp"
391
 
392
  extractor = TableExtractor(
393
- denoise_h=10,
394
- clahe_clip=3.0,
395
- clahe_grid=8,
396
- thresh_block_size=21,
397
- thresh_C=7,
398
-
399
- horizontal_scale=20,
400
- row_morph_iterations=2,
401
- min_row_height=30,
402
- min_row_density=0.01,
403
-
404
- vertical_scale=20,
405
- col_morph_iterations=2,
406
- min_col_height_ratio=0.5,
407
- min_col_density=0.01,
408
-
409
- padding=1,
410
- skip_header=True,
411
 
412
  merge_two_col_rows=True,
413
  enable_subtopic_merge=True,
414
  subtopic_threshold=0.2,
415
 
416
- std_threshold_for_artifacts=10.0,
417
- line_removal_scale=20,
418
- line_removal_iterations=1,
419
- min_text_ratio_after_line_removal=0.001
 
420
  )
421
 
422
  app = TableExtractorApp(extractor)
 
1
  import cv2
2
  import numpy as np
3
+ import math
4
  import logging
5
  from pathlib import Path
6
  from typing import List, Tuple
 
11
  # For 3-column tables, set `merge_two_col_rows` and `enable_subtopic_merge` to False.
12
  # For 2-column tables, set both to True (currently hardcoded; for testing only).
13
 
14
+
15
+ def color_distance(c1: Tuple[float, float, float],
16
+ c2: Tuple[float, float, float]) -> float:
17
+ """
18
+ Euclidean distance between two BGR colors c1 and c2.
19
+ """
20
+ return math.sqrt((c1[0] - c2[0])**2 + (c1[1] - c2[1])**2 + (c1[2] - c2[2])**2)
21
+
22
+ def average_bgr(cell_img: np.ndarray) -> Tuple[float, float, float]:
23
+ """
24
+ Return the average BGR color of the entire cell_img.
25
+ """
26
+ b_mean = np.mean(cell_img[:, :, 0])
27
+ g_mean = np.mean(cell_img[:, :, 1])
28
+ r_mean = np.mean(cell_img[:, :, 2])
29
+ return (b_mean, g_mean, r_mean)
30
+
31
  class TableExtractor:
32
  def __init__(
33
  self,
34
+ # --- Preprocessing ---
35
  denoise_h: int = 10,
36
  clahe_clip: float = 3.0,
37
  clahe_grid: int = 8,
 
41
  thresh_block_size: int = 21,
42
  thresh_C: int = 7,
43
 
44
+ # --- Row detection ---
45
  horizontal_scale: int = 20,
46
+ row_morph_iterations: int = 1,
47
+ min_row_height: int = 15,
48
  min_row_density: float = 0.01,
49
 
50
+ # Additional row detection parameters
51
+ faint_line_threshold_factor: float = 0.1,
52
+ top_line_grouping_px: int = 8,
53
+ some_minimum_text_pixels: int = 50,
54
+
55
+ # --- Column detection ---
56
  vertical_scale: int = 20,
57
  col_morph_iterations: int = 2,
58
  min_col_height_ratio: float = 0.5,
59
  min_col_density: float = 0.01,
60
 
61
+ # --- Bbox extraction ---
62
  padding: int = 0,
63
  skip_header: bool = True,
64
 
65
+ # --- Two-column & subtopic merges ---
66
+ merge_two_col_rows: bool = True,
67
+ enable_subtopic_merge: bool = True,
68
  subtopic_threshold: float = 0.2,
69
 
70
+ # --- Color-based artifact filter ---
71
+ artifact_color_a6: Tuple[int, int, int] = (166, 166, 166),
72
+ artifact_color_a7: Tuple[int, int, int] = (180, 180, 180),
73
+ artifact_color_a8: Tuple[int, int, int] = (80, 48, 0),
74
+ artifact_color_a9: Tuple[int, int, int] = (223, 153, 180),
75
+ artifact_color_a10: Tuple[int, int, int] = (0, 0, 0),
76
+ color_tolerance: float = 30.0
77
  ):
78
  # Preprocessing
79
  self.denoise_h = denoise_h
80
  self.clahe_clip = clahe_clip
 
89
  self.min_row_height = min_row_height
90
  self.min_row_density = min_row_density
91
 
92
+ # Additional row detection
93
+ self.faint_line_threshold_factor = faint_line_threshold_factor
94
+ self.top_line_grouping_px = top_line_grouping_px
95
+ self.some_minimum_text_pixels = some_minimum_text_pixels
96
+
97
  # Column detection
98
  self.vertical_scale = vertical_scale
99
  self.col_morph_iterations = col_morph_iterations
 
104
  self.padding = padding
105
  self.skip_header = skip_header
106
 
107
+ # Two-column & subtopic merges
108
  self.merge_two_col_rows = merge_two_col_rows
109
  self.enable_subtopic_merge = enable_subtopic_merge
110
  self.subtopic_threshold = subtopic_threshold
111
 
112
+ # Color-based artifact filter
113
+ self.artifact_color_a6 = artifact_color_a6
114
+ self.artifact_color_a7 = artifact_color_a7
115
+ self.artifact_color_a8 = artifact_color_a8
116
+ self.artifact_color_a9 = artifact_color_a9
117
+ self.artifact_color_a10 = artifact_color_a10
118
+ self.color_tolerance = color_tolerance
119
 
120
  def preprocess(self, img: np.ndarray) -> np.ndarray:
121
+ """
122
+ Grayscale, denoise, CLAHE, sharpen, then adaptive threshold (binary_inv).
123
+ """
124
  if img.ndim == 3:
125
  gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
126
  else:
127
  gray = img.copy()
128
 
129
  denoised = cv2.fastNlMeansDenoising(gray, h=self.denoise_h)
130
+ clahe = cv2.createCLAHE(clipLimit=self.clahe_clip,
131
+ tileGridSize=(self.clahe_grid, self.clahe_grid))
132
  enhanced = clahe.apply(denoised)
133
  sharpened = cv2.filter2D(enhanced, -1, self.sharpen_kernel)
134
 
 
142
  return binarized
143
 
144
  def detect_full_rows(self, bin_img: np.ndarray) -> List[Tuple[int, int]]:
 
145
  h_kernel_size = max(1, bin_img.shape[1] // self.horizontal_scale)
146
  horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (h_kernel_size, 1))
147
+ horizontal_lines = cv2.morphologyEx(
148
+ bin_img, cv2.MORPH_OPEN, horizontal_kernel,
149
+ iterations=self.row_morph_iterations
150
+ )
151
 
 
 
152
  row_projection = np.sum(horizontal_lines, axis=1)
153
  max_val = np.max(row_projection) if len(row_projection) else 0
154
 
 
155
  if max_val < 1e-5:
156
  return [(0, bin_img.shape[0])]
157
 
158
+ threshold_val = self.faint_line_threshold_factor * max_val
159
  line_indices = np.where(row_projection > threshold_val)[0]
 
160
  if len(line_indices) < 2:
161
  return [(0, bin_img.shape[0])]
162
 
 
163
  lines = []
164
+ group = [line_indices[0]]
165
  for i in range(1, len(line_indices)):
166
+ if (line_indices[i] - line_indices[i - 1]) <= self.top_line_grouping_px:
167
+ group.append(line_indices[i])
168
  else:
169
+ lines.append(int(np.mean(group)))
170
+ group = [line_indices[i]]
171
+ if group:
172
+ lines.append(int(np.mean(group)))
173
 
174
+ potential_bounds = []
175
  for i in range(len(lines) - 1):
176
  y1 = lines[i]
177
  y2 = lines[i + 1]
178
+ if (y2 - y1) > 0:
179
+ potential_bounds.append((y1, y2))
180
+
181
+ if potential_bounds:
182
+ if potential_bounds[0][0] > 0:
183
+ potential_bounds.insert(0, (0, potential_bounds[0][0]))
184
+ if potential_bounds[-1][1] < bin_img.shape[0]:
185
+ potential_bounds.append((potential_bounds[-1][1], bin_img.shape[0]))
186
+ else:
187
+ potential_bounds = [(0, bin_img.shape[0])]
188
 
189
+ final_rows = []
190
+ for (y1, y2) in potential_bounds:
191
+ height = (y2 - y1)
192
+ region = bin_img[y1:y2, :]
193
+ white_count = np.sum(region == 255)
194
 
195
+ if height < self.min_row_height:
196
+ if white_count >= self.some_minimum_text_pixels:
197
+ final_rows.append((y1, y2))
198
+ else:
199
+ final_rows.append((y1, y2))
200
+
201
+ final_rows = sorted(final_rows, key=lambda x: x[0])
202
+ return final_rows if final_rows else [(0, bin_img.shape[0])]
203
+
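The line-grouping and row-bounds steps of `detect_full_rows` above can be sketched without OpenCV or NumPy. This is a minimal illustration, not the committed implementation: the helper names `group_line_indices` and `bounds_from_lines` are hypothetical, and plain integer lists stand in for the `row_projection` indices.

```python
from typing import List, Sequence, Tuple


def group_line_indices(indices: Sequence[int], max_gap: int) -> List[int]:
    """Collapse runs of nearby row indices into one line position each.

    Indices at most `max_gap` apart (cf. top_line_grouping_px) belong to the
    same ruling line; each group is represented by its truncated mean.
    """
    if not indices:
        return []
    lines: List[int] = []
    group = [indices[0]]
    for prev, cur in zip(indices, indices[1:]):
        if cur - prev <= max_gap:
            group.append(cur)
        else:
            lines.append(sum(group) // len(group))
            group = [cur]
    lines.append(sum(group) // len(group))
    return lines


def bounds_from_lines(lines: Sequence[int], height: int) -> List[Tuple[int, int]]:
    """Turn line positions into (y1, y2) row bounds covering the full image height."""
    bounds = [(y1, y2) for y1, y2 in zip(lines, lines[1:]) if y2 > y1]
    if not bounds:
        return [(0, height)]
    if bounds[0][0] > 0:
        # Add the strip above the first detected line.
        bounds.insert(0, (0, bounds[0][0]))
    if bounds[-1][1] < height:
        # Add the strip below the last detected line.
        bounds.append((bounds[-1][1], height))
    return bounds


lines = group_line_indices([10, 11, 12, 50, 52, 120], max_gap=8)
rows = bounds_from_lines(lines, height=200)
```

A larger `max_gap` merges thick or doubled ruling lines into a single boundary, at the cost of possibly merging two genuinely close rows.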
204
+ def detect_columns_in_row(self,
205
+ row_img: np.ndarray,
206
+ y1: int,
207
+ y2: int) -> List[Tuple[int, int, int, int]]:
208
  row_height = (y2 - y1)
209
  row_width = row_img.shape[1]
210
 
 
211
  v_kernel_size = max(1, row_height // self.vertical_scale)
212
  vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, v_kernel_size))
213
 
214
+ vertical_lines = cv2.morphologyEx(
215
+ row_img, cv2.MORPH_OPEN, vertical_kernel,
216
+ iterations=self.col_morph_iterations
217
+ )
218
+ vertical_lines = cv2.dilate(vertical_lines,
219
+ np.ones((3, 3), np.uint8),
220
+ iterations=1)
221
 
222
  # Find contours => x positions
223
+ contours, _ = cv2.findContours(vertical_lines,
224
+ cv2.RETR_EXTERNAL,
225
+ cv2.CHAIN_APPROX_SIMPLE)
226
  x_positions = []
227
  for c in contours:
228
+ x, _, w, h = cv2.boundingRect(c)
229
+ # Must span at least min_col_height_ratio (default 0.5) of the row height to count as a divider
230
  if h >= self.min_col_height_ratio * row_height:
231
  x_positions.append(x)
 
232
 
233
+ x_positions = sorted(set(x_positions))
234
  # Keep at most 2 vertical lines
235
  if len(x_positions) > 2:
236
  x_positions = x_positions[:2]
 
251
  (0, y1, x1, row_height),
252
  (x1, y1, row_width - x1, row_height)
253
  ]
 
254
  else:
255
  # 2 lines => normally 3 bounding boxes
256
  x1, x2 = sorted(x_positions)
257
  if self.enable_subtopic_merge:
258
+ # If left bounding box is very narrow => treat as subtopic => 2 boxes
259
+ if x1 < (self.subtopic_threshold * row_width):
 
260
  boxes = [
261
  (0, y1, x1, row_height),
262
  (x1, y1, row_width - x1, row_height)
 
279
  for (x, y, w, h) in boxes:
280
  if w <= 0:
281
  continue
282
+ subregion = row_img[:, x:x+w]
283
  white_pixels = np.sum(subregion == 255)
284
  total_pixels = subregion.size
285
  if total_pixels == 0:
286
  continue
287
+ density = white_pixels / float(total_pixels)
288
  if density >= self.min_col_density:
289
  filtered.append((x, y, w, h))
290
 
 
293
  def process_image(self, image_path: str) -> List[List[Tuple[int, int, int, int]]]:
294
  """
295
  1) Preprocess => bin_img
296
+ 2) Detect row segments (with faint-line logic)
297
  3) Filter out rows by density
298
+ 4) Optionally skip the first row (header)
299
  5) For each row => detect columns => bounding boxes
300
  """
301
  img = cv2.imread(image_path)
 
313
  if area == 0:
314
  continue
315
  white_pixels = np.sum(row_region == 255)
316
+ density = white_pixels / float(area)
317
  if density >= self.min_row_density:
318
  valid_rows.append((y1, y2))
319
 
320
+ # skip header row
321
  if self.skip_header and len(valid_rows) > 1:
322
  valid_rows = valid_rows[1:]
323
 
324
+ # Detect columns in each valid row
325
  all_rows_boxes = []
326
  for (y1, y2) in valid_rows:
327
  row_img = bin_img[y1:y2, :]
 
331
 
332
  return all_rows_boxes
333
 
334
+ def extract_box_image(self,
335
+ original: np.ndarray,
336
+ box: Tuple[int, int, int, int]) -> np.ndarray:
337
+ """
338
+ Crop bounding box from original with optional padding.
339
+ """
340
  x, y, w, h = box
341
  Y1 = max(0, y - self.padding)
342
  Y2 = min(original.shape[0], y + h + self.padding)
 
344
  X2 = min(original.shape[1], x + w + self.padding)
345
  return original[Y1:Y2, X1:X2]
346
 
347
+ def is_artifact_by_color(self, cell_img: np.ndarray) -> bool:
348
  """
349
+ Color-based artifact filter (same approach as the first script):
350
+ 1) If the average BGR color is within color_tolerance of any configured
351
+ artifact color (artifact_color_a6 ... artifact_color_a10), skip it. Otherwise, keep it.
352
  """
353
  if cell_img.size == 0:
354
  return True
355
 
356
+ avg_col = average_bgr(cell_img)
357
+ dist_a6 = color_distance(avg_col, self.artifact_color_a6)
358
+ if dist_a6 < self.color_tolerance:
359
  return True
360
 
361
+ dist_a7 = color_distance(avg_col, self.artifact_color_a7)
362
+ if dist_a7 < self.color_tolerance:
363
+ return True
364
 
365
+ dist_a8 = color_distance(avg_col, self.artifact_color_a8)
366
+ if dist_a8 < self.color_tolerance:
367
+ return True
368
 
369
+ dist_a9 = color_distance(avg_col, self.artifact_color_a9)
370
+ if dist_a9 < self.color_tolerance:
371
+ return True
372
+
373
+ dist_a10 = color_distance(avg_col, self.artifact_color_a10)
374
+ if dist_a10 < self.color_tolerance:
375
  return True
376
 
377
  return False
378
 
379
  def save_extracted_cells(
380
+ self,
381
+ image_path: str,
382
+ row_boxes: List[List[Tuple[int, int, int, int]]],
383
+ output_dir: str
384
  ):
385
+ """
386
+ Save each cell from the original image, skipping cells whose average color is near a configured artifact color.
387
+ """
388
  out_path = Path(output_dir)
389
  out_path.mkdir(exist_ok=True, parents=True)
390
 
 
397
  row_dir.mkdir(exist_ok=True)
398
  for j, box in enumerate(row):
399
  cell_img = self.extract_box_image(original, box)
400
+
401
+ # Check color-based artifact
402
+ if self.is_artifact_by_color(cell_img):
403
+ logger.info(f"Skipping artifact cell at row={i}, col={j} (average color near a configured artifact color).")
404
  continue
405
 
406
  out_file = row_dir / f"col_{j}.png"
407
  cv2.imwrite(str(out_file), cell_img)
408
+ logger.info(f"Saved cell row={i}, col={j} -> {out_file}")
409
 
410
  class TableExtractorApp:
411
  def __init__(self, extractor: TableExtractor):
 
417
  self.extractor.save_extracted_cells(input_image, row_boxes, output_folder)
418
  logger.info("Done. Check the output folder for results.")
419
 
 
420
  if __name__ == "__main__":
421
+ input_image = "images/test/img_9.png"
422
+ output_folder = "combined_outputs"
423
 
424
  extractor = TableExtractor(
425
+ row_morph_iterations=1,
426
+ min_row_height=15,
427
+ skip_header=False,
428
 
429
  merge_two_col_rows=True,
430
  enable_subtopic_merge=True,
431
  subtopic_threshold=0.2,
432
 
433
+ faint_line_threshold_factor=0.4,
434
+ top_line_grouping_px=12,
435
+ some_minimum_text_pixels=50,
436
+
437
+ color_tolerance=30.0
438
  )
439
 
440
  app = TableExtractorApp(extractor)
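The color-based artifact filter introduced in this file compares a cell's average BGR color against a set of reference colors. The sketch below restates that idea in plain Python so it can be run without OpenCV or NumPy. It is an illustration under assumptions: cells are nested lists of `(b, g, r)` tuples rather than `np.ndarray`, and `is_artifact` with its `artifact_colors` list is a hypothetical stand-in for `is_artifact_by_color` and the five `artifact_color_*` attributes.

```python
import math
from typing import List, Sequence, Tuple

Color = Tuple[float, float, float]

# Reference colors copied from the constructor defaults in this commit (BGR order).
ARTIFACT_COLORS: List[Color] = [
    (166, 166, 166),
    (180, 180, 180),
    (80, 48, 0),
    (223, 153, 180),
    (0, 0, 0),
]


def color_distance(c1: Color, c2: Color) -> float:
    """Euclidean distance between two BGR colors."""
    return math.dist(c1, c2)


def average_bgr(pixels: Sequence[Sequence[Color]]) -> Color:
    """Mean BGR color over a 2-D grid of (b, g, r) pixels."""
    flat = [px for row in pixels for px in row]
    n = len(flat)
    return (
        sum(px[0] for px in flat) / n,
        sum(px[1] for px in flat) / n,
        sum(px[2] for px in flat) / n,
    )


def is_artifact(pixels: Sequence[Sequence[Color]],
                artifact_colors: Sequence[Color] = ARTIFACT_COLORS,
                tolerance: float = 30.0) -> bool:
    """True if the cell is empty or its average color is near any reference color."""
    if not pixels or not pixels[0]:
        return True
    avg = average_bgr(pixels)
    return any(color_distance(avg, ref) < tolerance for ref in artifact_colors)


grey_cell = [[(170, 170, 170)] * 2 for _ in range(2)]
white_cell = [[(255, 255, 255)] * 2 for _ in range(2)]
```

Because only the mean color is compared, a cell with dark text on a light background can average out close to a grey reference color, so `tolerance` may need tuning per document.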
topic_extr.py CHANGED
@@ -35,6 +35,7 @@ logger.addHandler(file_handler)
35
 
36
  _GEMINI_CLIENT = None
37
 
 
38
  def unify_whitespace(text: str) -> str:
39
  return re.sub(r"\s+", " ", text).strip()
40
 
@@ -66,6 +67,123 @@ def create_subset_pdf(original_pdf_bytes: bytes, page_indices: List[int]) -> byt
66
  doc.close()
67
  return subset_bytes
68
 
69
  class s3Writer:
70
  def __init__(self, ak: str, sk: str, bucket: str, endpoint_url: str):
71
  self.bucket = bucket
@@ -114,15 +232,20 @@ def call_gemini_for_table_classification(image_data: bytes, api_key: str, max_re
114
  prompt = """You are given an image. Determine if it shows a table that has exactly 2 or 3 columns.
115
  The three-column 'table' image includes such key features:
116
  - Three columns header
117
- - Headers like 'Topics', 'Content', 'Guidelines'
118
  - Possibly sections (e.g. 8.4, 9.1)
119
  The two-column 'table' image includes such key features:
120
  - Two columns
121
- - Headers like 'Subject content' and 'Additional information'
122
- - Possibly sections (e.g. 2.1, 3.4)
 
 
 
 
 
123
  If the image is a relevant table with 2 columns, respond with 'TWO_COLUMN'.
124
  If the image is a relevant table with 3 columns, respond with 'THREE_COLUMN'.
125
- If the image does not show a table at all, respond with 'NO_TABLE'.
126
  Return only one of these exact labels.
127
  """
128
  global _GEMINI_CLIENT
@@ -153,6 +276,8 @@ Return only one of these exact labels.
153
  return "THREE_COLUMN"
154
  elif "TWO" in classification:
155
  return "TWO_COLUMN"
 
 
156
  return "NO_TABLE"
157
  except Exception as e:
158
  logger.error(f"Gemini table classification error: {e}")
@@ -172,54 +297,86 @@ def call_gemini_for_subtopic_identification_image(image_data: bytes, api_key: st
172
  for attempt in range(max_retries + 1):
173
  try:
174
  prompt = """
175
- You are given an image from an educational curriculum specification. The image may contain either:
176
  1) A main topic heading in the format: "<number> <Topic Name>", for example "2 Algebra and functions continued".
177
  2) A subtopic heading in the format "<number>.<number>", for example "2.5", "2.6", or "3.4".
178
- 3) Possibly no relevant text at all.
179
-
180
- Your task:
181
- 1. If the cell shows a main topic, extract the topic name (e.g. "2 Algebra and functions") and place it in the JSON key "title".
182
- 2. If the cell shows one or more subtopic numbers (e.g. "2.5", "2.6"), collect them in the JSON key "subtopics" as an array of strings.
183
- 3. If neither a main topic nor subtopic is detected, return empty values.
184
-
185
- Output only valid JSON in this exact structure, with no extra text or explanation:
186
-
187
- Output only valid JSON in this exact structure, with no extra text or explanation:
188
-
189
- {
190
- "title": "...",
191
- "subtopics": [...]
192
- }
193
-
194
- Where:
195
- - "title" is the recognized main topic (if any). Otherwise, an empty string.
196
- - "subtopics" is an array of recognized subtopic numbers (e.g. ["2.5", "2.6"]). Otherwise, an empty array.
197
-
198
- Examples:
199
- 1. If the image text is "2 Algebra and functions continued", return:
200
- {
201
- "title": "2 Algebra and functions continued",
202
- "subtopics": []
203
- }
204
-
205
- 2. If the image text is "2.5 Solve linear and quadratic inequalities ...", return:
206
- {
207
- "title": "",
208
- "subtopics": ["2.5"]
209
- }
210
-
211
- 3. If the image text is "2.6 Manipulate polynomials algebraically ...", return:
212
- {
213
- "title": "",
214
- "subtopics": ["2.6"]
215
- }
216
-
217
- If you cannot recognize any text matching these patterns, or if nothing is found, return:
218
- {
219
- "title": "",
220
- "subtopics": []
221
- }
222
  """
 
223
  global _GEMINI_CLIENT
224
  if _GEMINI_CLIENT is None:
225
  _GEMINI_CLIENT = genai.Client(api_key=api_key)
@@ -242,13 +399,13 @@ If you cannot recognize any text matching these patterns, or if nothing is found
242
  ],
243
  config=types.GenerateContentConfig(temperature=0.0)
244
  )
245
- # logger.info(f"Gemini subtopic extraction raw response: {resp.text if resp and resp.text else 'None'}")
246
-
247
  if not resp or not resp.text:
248
  logger.warning("Gemini returned an empty response for subtopic extraction.")
249
  return {"title": "", "subtopics": []}
250
 
251
  raw = resp.text.strip()
 
252
  raw = raw.replace("```json", "").replace("```", "").strip()
253
  data = json.loads(raw)
254
 
@@ -310,6 +467,10 @@ class S3ImageWriter(DataWriter):
310
  info['final_alt'] = "HAS TO BE PROCESSED - two column table"
311
  elif cls == "THREE_COLUMN":
312
  info['final_alt'] = "HAS TO BE PROCESSED - three column table"
 
 
 
 
313
  else:
314
  info['final_alt'] = "NO_TABLE image"
315
  md_content = md_content.replace(f"![]({key}{p})", f"![{info['final_alt']}]({info['s3_path']})")
@@ -445,123 +606,6 @@ class S3ImageWriter(DataWriter):
445
  def post_process(self, key: str, md_content: str) -> str:
446
  return asyncio.run(self.post_process_async(key, md_content))
447
 
448
- class LocalImageWriter(DataWriter):
449
- def __init__(self, output_folder: str, gemini_api_key: str):
450
- self.output_folder = output_folder
451
- os.makedirs(self.output_folder, exist_ok=True)
452
- self.descriptions = {}
453
- self._img_count = 0
454
- self.gemini_api_key = gemini_api_key
455
- self.extracted_tables = {}
456
-
457
- def write(self, path: str, data: bytes) -> None:
458
- self._img_count += 1
459
- unique_id = f"img_{self._img_count}.jpg"
460
- self.descriptions[path] = {
461
- "data": data,
462
- "relative_path": unique_id,
463
- "table_classification": "NO_TABLE",
464
- "final_alt": ""
465
- }
466
- image_path = os.path.join(self.output_folder, unique_id)
467
- with open(image_path, "wb") as f:
468
- f.write(data)
469
-
470
- async def post_process_async(self, key: str, md_content: str) -> str:
471
- logger.info("Classifying images to detect tables.")
472
- tasks = []
473
- for p, info in self.descriptions.items():
474
- tasks.append((p, classify_image_async(info["data"], self.gemini_api_key)))
475
- for p, task in tasks:
476
- try:
477
- classification = await task
478
- self.descriptions[p]['table_classification'] = classification
479
- except Exception as e:
480
- logger.error(f"Table classification error: {e}")
481
- self.descriptions[p]['table_classification'] = "NO_TABLE"
482
- for p, info in self.descriptions.items():
483
- cls = info['table_classification']
484
- if cls == "TWO_COLUMN":
485
- info['final_alt'] = "HAS TO BE PROCESSED - two column table"
486
- elif cls == "THREE_COLUMN":
487
- info['final_alt'] = "HAS TO BE PROCESSED - three column table"
488
- else:
489
- info['final_alt'] = "NO_TABLE image"
490
- md_content = md_content.replace(f"![]({key}{p})", f"![{info['final_alt']}]({info['relative_path']})")
491
- md_content = self._process_table_images_in_markdown(md_content)
492
- final_lines = []
493
- for line in md_content.split("\n"):
494
- if re.match(r"^\!\[.*\]\(.*\)", line.strip()):
495
- final_lines.append(line.strip())
496
- return "\n".join(final_lines)
497
-
498
- def _process_table_images_in_markdown(self, md_content: str) -> str:
499
- pat = r"!\[HAS TO BE PROCESSED - (two|three) column table\]\(([^)]+)\)"
500
- matches = re.findall(pat, md_content, flags=re.IGNORECASE)
501
- if not matches:
502
- return md_content
503
- for (col_type, image_id) in matches:
504
- logger.info(f"Processing table image => {image_id}, columns={col_type}")
505
- temp_path = os.path.join(self.output_folder, image_id)
506
- desc_item = None
507
- for k, val in self.descriptions.items():
508
- if val["relative_path"] == image_id:
509
- desc_item = val
510
- break
511
- if not desc_item:
512
- logger.warning(f"No matching image data for {image_id}, skipping extraction.")
513
- continue
514
- if not os.path.exists(temp_path):
515
- with open(temp_path, "wb") as f:
516
- f.write(desc_item["data"])
517
- try:
518
- if col_type.lower() == 'two': #check for table_row_extr script for more details
519
- extractor = TableExtractor(
520
- skip_header=True,
521
- merge_two_col_rows=True,
522
- enable_subtopic_merge=True,
523
- subtopic_threshold=0.2
524
- )
525
- else:
526
- extractor = TableExtractor(
527
- skip_header=True,
528
- merge_two_col_rows=False,
529
- enable_subtopic_merge=False,
530
- subtopic_threshold=0.2
531
- )
532
- row_boxes = extractor.process_image(temp_path)
533
- out_folder = temp_path + "_rows"
534
- os.makedirs(out_folder, exist_ok=True)
535
- extractor.save_extracted_cells(temp_path, row_boxes, out_folder)
536
- # List all extracted cell images relative to the output folder.
537
- extracted_cells = []
538
- for root, dirs, files in os.walk(out_folder):
539
- for file in files:
540
- rel_path = os.path.relpath(os.path.join(root, file), self.output_folder)
541
- extracted_cells.append(rel_path)
542
- # Save mapping for testing.
543
- self.extracted_tables[image_id] = extracted_cells
544
- snippet = ["**Extracted table cells:**"]
545
- for i, row in enumerate(row_boxes):
546
- row_dir = os.path.join(out_folder, f"row_{i}")
547
- for j, _ in enumerate(row):
548
- cell_file = f"col_{j}.jpg"
549
- cell_path = os.path.join(row_dir, cell_file)
550
- relp = os.path.relpath(cell_path, self.output_folder)
551
- snippet.append(f"![Row {i} Col {j}]({relp})")
552
- new_snip = "\n".join(snippet)
553
- old_line = f"![HAS TO BE PROCESSED - {col_type} column table]({image_id})"
554
- md_content = md_content.replace(old_line, new_snip)
555
- except Exception as e:
556
- logger.error(f"Error processing table image {image_id}: {e}")
557
- finally:
558
- if os.path.exists(temp_path):
559
- os.remove(temp_path)
560
- return md_content
561
-
562
- def post_process(self, key: str, md_content: str) -> str:
563
- return asyncio.run(self.post_process_async(key, md_content))
564
-
565
  class GeminiTopicExtractor:
566
  def __init__(self, api_key: str = None, num_pages: int = 14):
567
  self.api_key = api_key or os.getenv("GEMINI_API_KEY", "")
@@ -782,119 +826,6 @@ class MineruNoTextProcessor:
782
  except Exception as e:
783
  logger.error(f"Error during GPU cleanup: {e}")
784
 
785
- def unify_topic_name(raw_title: str, children_subtopics: list) -> str:
786
- """
787
- Produce a cleaned-up topic name, removing any trailing '... continued'
788
- and fixing partial or empty titles if it’s obvious from the subtopic numbering.
789
- E.g. 'gonometry' with children '5.1', '5.2' → '5 Trigonometry'
790
- """
791
- title = raw_title.strip()
792
-
793
- # Remove trailing " continued"
794
- # E.g. "2 Algebra and functions continued" -> "2 Algebra and functions"
795
- title = re.sub(r"\s+continued\s*$", "", title, flags=re.IGNORECASE)
796
-
797
- # If the entire title is missing or obviously broken (like "gonometry"),
798
- # guess a fix from the subtopics if they share a leading integer.
799
- # e.g. if subtopics start with "5." => rename to "5 Trigonometry".
800
- # You can add more sophisticated logic as needed.
801
- if not title or title.lower().strip() in {"gonometry"}:
802
- # Try to deduce from subtopic numbering
803
- # Example: if children are "5.1", "5.2", that suggests a "5 Trigonometry"
804
- all_subs = [child["title"] for child in children_subtopics]
805
- # We'll parse the integer part from e.g. "5.1", "5.2"
806
- # and guess "5 Trigonometry" if they're all "5.xxx".
807
- if all_subs:
808
- # Grab the first subtopic
809
- first_sub = all_subs[0].strip()
810
- m = re.match(r"^(\d+)\.", first_sub)
811
- if m:
812
- parent_num = m.group(1)
813
- if parent_num == "5":
814
- title = "5 Trigonometry"
815
- elif parent_num == "2":
816
- title = "2 Algebra and functions"
817
- elif parent_num == "3":
818
- title = "3 Coordinate geometry in the (x, y) plane"
819
- elif parent_num == "4":
820
- title = "4 Statistical distributions"
821
- # etc., adapt to your needs
822
- # or leave as e.g. f"{parent_num} ???" if you cannot guess.
823
-
824
- return title
825
-
826
-
827
- def merge_topics(subtopic_list: list) -> list:
828
- """
829
- 1. Cleans up each topic's title (remove " continued", fix partial titles).
830
- 2. Merges subtopics under the same cleaned-up parent name.
831
- 3. Sorts final output in ascending numeric order of the parent's leading number.
832
- 4. Sorts each parent's children in ascending numeric subtopic order.
833
- """
834
- # Dictionary keyed by *cleaned* parent title => {"title": "...", "contents": [...], "children": [...]}
835
- merged = {}
836
-
837
- for topic_obj in subtopic_list:
838
- raw_title = topic_obj.get("title", "")
839
- children = topic_obj.get("children", [])
840
- contents = topic_obj.get("contents", [])
841
-
842
- # Clean up the parent's title
843
- new_title = unify_topic_name(raw_title, children)
844
-
845
- # If we have already seen this (cleaned) title, merge
846
- if new_title not in merged:
847
- merged[new_title] = {
848
- "title": new_title,
849
- "contents": list(contents), # copy
850
- "children": list(children),
851
- }
852
- else:
853
- # Merge contents and children
854
- merged[new_title]["contents"].extend(contents)
855
- merged[new_title]["children"].extend(children)
856
-
857
- # Next, for each parent's children, we might want to remove duplicates
858
- # or unify them more. Here we simply unify if they have the same "title".
859
- # If you have no duplicates, you can skip this loop.
860
- for par_title, par_info in merged.items():
861
- # Turn child list into map for merging
862
- child_map = {}
863
- for ch in par_info["children"]:
864
- ctitle = ch.get("title", "").strip()
865
- if ctitle not in child_map:
866
- child_map[ctitle] = ch
867
- else:
868
- # Merge the "contents" and "children" if needed
869
- child_map[ctitle]["contents"].extend(ch.get("contents", []))
870
- child_map[ctitle]["children"].extend(ch.get("children", []))
871
- # Overwrite the parent's children list with the merged versions
872
- par_info["children"] = list(child_map.values())
873
-
874
- # Sort the top-level topics by leading integer (e.g. "2 Algebra" < "5 Trigonometry")
875
- # We'll parse the first integer from the parent's title, or push them last if no integer found.
876
- def parse_parent_num(t):
877
- match = re.match(r"^(\d+)", t)
878
- return int(match.group(1)) if match else 9999
879
-
880
- # Build the final list
881
- final_list = list(merged.values())
882
- final_list.sort(key=lambda x: parse_parent_num(x["title"]))
883
-
884
- # Sort each parent's children by their numeric portion. E.g. "2.1" < "2.2" < "3.1"
885
- def parse_subtopic_num(subtitle):
886
- # "2.11" => (2, 11), "10.5" => (10, 5)
887
- # or just parse all groups of digits
888
- digits = re.findall(r"\d+", subtitle)
889
- if not digits:
890
- return (9999,) # if no digits, push to end
891
- return tuple(int(d) for d in digits)
892
-
893
- for par_info in final_list:
894
- par_info["children"].sort(key=lambda ch: parse_subtopic_num(ch["title"]))
895
-
896
- return final_list
897
-
898
  def process(self, pdf_path: str) -> Dict[str, Any]:
899
  logger.info(f"Processing PDF: {pdf_path}")
900
  try:
@@ -972,9 +903,6 @@ class MineruNoTextProcessor:
972
  )
973
  #S3
974
  writer = S3ImageWriter(self.s3_writer, "/topic-extraction", self.gemini_api_key)
975
-
976
- #local
977
- # writer = LocalImageWriter(self.output_folder, self.gemini_api_key)
978
 
979
  md_prefix = "/topic-extraction/"
980
  pipe_result = inference.pipe_ocr_mode(writer, lang=self.language)
@@ -984,11 +912,7 @@ class MineruNoTextProcessor:
984
  subtopic_list = list(writer.extracted_subtopics.values())
985
  subtopic_list = merge_topics(subtopic_list)
986
 
987
- # out_path = os.path.join(self.output_folder, "final_subtopics.json")
988
- # with open(out_path, "w", encoding="utf-8") as f:
989
- # json.dump(subtopic_list, f, indent=2)
990
- # logger.info(f"Final subtopics JSON saved locally at {out_path}")
991
- out_path = os.path.join(self.output_folder, "final_subtopics.json")
992
  with open(out_path, "w", encoding="utf-8") as f:
993
  json.dump(subtopic_list, f, indent=2)
994
  logger.info(f"Final subtopics JSON saved locally at {out_path}")
 
35
 
36
  _GEMINI_CLIENT = None
37
 
38
+ # Helper functions (module-level)
39
  def unify_whitespace(text: str) -> str:
40
  return re.sub(r"\s+", " ", text).strip()
41
 
 
67
  doc.close()
68
  return subset_bytes
69
 
70
+ def unify_topic_name(raw_title: str, children_subtopics: list) -> str:
71
+ """
72
+ Clean up a topic title:
73
+ - Remove any trailing "continued".
74
+ - If the title does not start with a number but children provide a consistent numeric prefix,
75
+ then prepend that prefix.
76
+ """
77
+ title = raw_title.strip()
78
+ # Remove trailing "continued"
79
+ title = re.sub(r"\s+continued\s*$", "", title, flags=re.IGNORECASE)
80
+
81
+ # If title already starts with a number, use it as is.
82
+ if re.match(r"^\d+", title):
83
+ return title
84
+
85
+ # Otherwise, try to deduce a numeric prefix from the children.
86
+ prefixes = []
87
+ for child in children_subtopics:
88
+ child_title = child.get("title", "").strip()
89
+ m = re.match(r"^(\d+)\.", child_title)
90
+ if m:
91
+ prefixes.append(m.group(1))
92
+ if prefixes:
93
+ # If all numeric prefixes in children are the same, use that prefix.
94
+ if all(p == prefixes[0] for p in prefixes):
95
+ # If title is non-empty, prepend the number; otherwise, use a fallback.
96
+ if title:
97
+ title = f"{prefixes[0]} {title}"
98
+ else:
99
+ title = f"{prefixes[0]} Topic"
100
+ # Optionally, handle known broken titles explicitly, ignoring any numeric prefix prepended above.
101
+ if re.sub(r"^\d+\s+", "", title).lower() in {"gonometry"}:
102
+ # For example, if children indicate "5.X", set to "5 Trigonometry"
103
+ if prefixes and prefixes[0] == "5":
104
+ title = "5 Trigonometry"
105
+ return title
106
+
107
+
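Review note: the two steps in the `unify_topic_name` docstring can be checked in isolation. The sketch below is a minimal illustration, not the committed code; `strip_continued` and `shared_child_prefix` are hypothetical helper names, and the child titles are made-up sample data.

```python
import re

def strip_continued(title: str) -> str:
    """Drop a trailing 'continued' marker, case-insensitively."""
    return re.sub(r"\s+continued\s*$", "", title.strip(), flags=re.IGNORECASE)

def shared_child_prefix(children: list):
    """Return the leading number shared by all numbered children, else None."""
    prefixes = []
    for child in children:
        m = re.match(r"^(\d+)\.", child.get("title", "").strip())
        if m:
            prefixes.append(m.group(1))
    if prefixes and all(p == prefixes[0] for p in prefixes):
        return prefixes[0]
    return None

# Sample data (assumed): a heading that lost its number, with numbered children.
children = [{"title": "5.1 Sine rule"}, {"title": "5.2 Radians"}]
title = strip_continued("Trigonometry continued")
prefix = shared_child_prefix(children)
result = f"{prefix} {title}" if prefix else title
print(result)
```

When the children disagree on their leading number, the helper returns `None` and the title is left without a prefix, which matches the behaviour the docstring describes.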
108
+ def merge_topics(subtopic_list: list) -> list:
109
+ """
110
+ Merge topics with an enhanced logic:
111
+ 1. Clean up each topic's title using unify_topic_name.
112
+ 2. Group topics by the parent's numeric prefix (if available). Topics without a numeric prefix use their title.
113
+ 3. Reassign children: for each child whose title (e.g. "3.1") does not match its current parent's numeric prefix,
114
+ move it to the parent with the matching prefix if available.
115
+ 4. Remove duplicate children by merging contents.
116
+ 5. Sort parent topics and each parent's children by their numeric ordering.
117
+ """
118
+ # First, merge topics by parent's numeric prefix.
119
+ merged = {}
120
+ for topic_obj in subtopic_list:
121
+ raw_title = topic_obj.get("title", "")
122
+ children = topic_obj.get("children", [])
123
+ contents = topic_obj.get("contents", [])
124
+ new_title = unify_topic_name(raw_title, children)
125
+ # Extract parent's numeric prefix, if present.
126
+ m = re.match(r"^(\d+)", new_title)
127
+ parent_prefix = m.group(1) if m else None
128
+ key = parent_prefix if parent_prefix is not None else new_title
129
+
130
+ if key not in merged:
131
+ merged[key] = {
132
+ "title": new_title,
133
+ "contents": list(contents),
134
+ "children": list(children),
135
+ }
136
+ else:
137
+ # Merge contents and children; choose the longer title.
138
+ if len(new_title) > len(merged[key]["title"]):
139
+ merged[key]["title"] = new_title
140
+ merged[key]["contents"].extend(contents)
141
+ merged[key]["children"].extend(children)
142
+
143
+ # Build a lookup of merged topics by their numeric prefix.
144
+ parent_lookup = merged # keys are numeric prefixes or the full title for non-numeric ones.
145
+
146
+ # Reassign children to the correct parent based on their numeric prefix.
147
+ for key, topic in merged.items():
148
+ new_children = []
149
+ for child in topic["children"]:
150
+ child_title = child.get("title", "").strip()
151
+ m_child = re.match(r"^(\d+)\.", child_title)
152
+ if m_child:
153
+ child_prefix = m_child.group(1)
154
+ if key != child_prefix and child_prefix in parent_lookup:
155
+ # Reassign this child to the proper parent.
156
+ parent_lookup[child_prefix]["children"].append(child)
157
+ continue
158
+ new_children.append(child)
159
+ topic["children"] = new_children
160
+
161
+ # Remove duplicate children by merging their contents.
162
+ for topic in merged.values():
163
+ child_map = {}
164
+ for child in topic["children"]:
165
+ ctitle = child.get("title", "").strip()
166
+ if ctitle not in child_map:
167
+ child_map[ctitle] = child
168
+ else:
169
+ child_map[ctitle].setdefault("contents", []).extend(child.get("contents", []))
170
+ child_map[ctitle].setdefault("children", []).extend(child.get("children", []))
171
+ topic["children"] = list(child_map.values())
172
+
173
+ # Sort children by full numeric order (e.g. "2.1" < "2.2" < "2.10").
174
+ def parse_subtopic_num(subtitle):
175
+ digits = re.findall(r"\d+", subtitle)
176
+ return tuple(int(d) for d in digits) if digits else (9999,)
177
+ topic["children"].sort(key=lambda ch: parse_subtopic_num(ch.get("title", "")))
178
+
179
+ # Convert merged topics to a sorted list.
180
+ def parse_parent_num(topic):
181
+ m = re.match(r"^(\d+)", topic.get("title", ""))
182
+ return int(m.group(1)) if m else 9999
183
+ final_list = list(merged.values())
184
+ final_list.sort(key=lambda topic: parse_parent_num(topic))
185
+ return final_list
186
+
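Review note: the child sort key in `merge_topics` compares tuples of integers, so ordering is numeric rather than lexicographic. The snippet below copies `parse_subtopic_num` from the hunk and runs it on made-up titles to show the resulting order; the sample titles are assumptions.

```python
import re

def parse_subtopic_num(subtitle: str) -> tuple:
    # Same logic as in the diff: every run of digits becomes one tuple element.
    digits = re.findall(r"\d+", subtitle)
    return tuple(int(d) for d in digits) if digits else (9999,)

titles = ["2.10", "2.2", "Notes", "2.1"]
ordered = sorted(titles, key=parse_subtopic_num)
print(ordered)
```

One caveat worth keeping in mind: `re.findall` scans the whole title, so digits that appear later in the text also enter the key (for example, `"2.5 Solve 3 equations"` yields `(2, 5, 3)`). Titles without any digits sort last because of the `(9999,)` fallback.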
187
  class s3Writer:
188
  def __init__(self, ak: str, sk: str, bucket: str, endpoint_url: str):
189
  self.bucket = bucket
 
232
  prompt = """You are given an image. Determine if it shows a table that has exactly 2 or 3 columns.
233
  The three-column 'table' image includes such key features:
234
  - Three columns header
235
+ - Headers like 'Topics', 'Content', 'Guidelines', 'Amplification', 'Additional guidance notes', 'Area of Study'
236
  - Possibly sections (e.g. 8.4, 9.1)
237
  The two-column 'table' image includes such key features:
238
  - Two columns
239
+ - Headers like 'Subject content', 'Additional information'
240
+ - Possibly sections (e.g. 2.1, 3.4, G2, G3)
241
+ The empty image includes such key features:
242
+ - Does not include anything at all (like a blank white or black image)
243
+ - Truncated image with words like 'Subject content', 'What students need to learn' with blue background.
244
+ - Truncated image with words like 'Topics', 'What students need to learn', 'Content' with grey background (RGB (166, 166, 166) or (180, 180, 180)).
245
+ If the image is an empty image, respond with 'EMPTY_IMAGE'.
246
  If the image is a relevant table with 2 columns, respond with 'TWO_COLUMN'.
247
  If the image is a relevant table with 3 columns, respond with 'THREE_COLUMN'.
248
+ If the image is non-empty but does not show a table, respond with 'NO_TABLE'.
249
  Return only one of these exact labels.
250
  """
251
  global _GEMINI_CLIENT
 
276
  return "THREE_COLUMN"
277
  elif "TWO" in classification:
278
  return "TWO_COLUMN"
279
+ elif "EMPTY" in classification:
280
+ return "EMPTY_IMAGE"
281
  return "NO_TABLE"
282
  except Exception as e:
283
  logger.error(f"Gemini table classification error: {e}")
 
297
  for attempt in range(max_retries + 1):
298
  try:
299
  prompt = """
300
+ You are given an image from an educational curriculum specification. The image may contain:
301
  1) A main topic heading in the format: "<number> <Topic Name>", for example "2 Algebra and functions continued".
302
  2) A subtopic heading in the format "<number>.<number>", for example "2.5", "2.6", or "3.4".
303
+ 3) A label-like title in the left column of a two-column table, for example "G2", "G3", "Scarcity, choice and opportunity cost", or similar text without explicit numeric patterns (2.1, 3.4, etc.).
304
+ 4) Possibly no relevant text at all.
305
+
306
+ Your task is to extract:
307
+ - **"title"**: A recognized main topic or heading text.
308
+ - **"subtopics"**: Any recognized subtopic numbers (e.g. "2.5", "2.6", "3.4"), as an array of strings.
309
+
310
+ Follow these rules:
311
+
312
+ (1) **If the cell shows a main topic in the format "<number> <Topic Name>",** for example "2 Algebra and functions continued", then:
313
+ - Put that text (without the word "continued") in "title". (e.g. "2 Algebra and functions")
314
+ - "subtopics" should be an empty array, unless you also see smaller subtopic numbers.
315
+
316
+ (2) **If the cell shows one or more subtopic numbers** in the format "<number>.<number>", for example "2.5", "2.6", or "3.4", then:
317
+ - Collect those exact strings in the JSON key "subtopics" (an array of strings).
318
+ - "title" in this case should be an empty string if you only detect subtopics.
319
+ (Example: If text is "2.5 Solve linear inequalities...", then "title" = "", "subtopics" = ["2.5"]).
320
+
321
+ (3) **If neither a main topic nor a subtopic is detected,** return empty values:
322
+ {
323
+ "title": "",
324
+ "subtopics": []
325
+ }
326
+
327
+ (4) **If there is no numeric value in the left column** (e.g. "2.1" or "2 <Topic name>" not found) but the left column text appears to be a heading (for instance "Scarcity, choice and opportunity cost"), then:
328
+ - Use the **left column text** as "title".
329
+ - "subtopics" remains empty.
330
+ Example:
331
+ If the left column is "Scarcity, choice and opportunity cost" and the right column has definitions, your output is:
332
+ {
333
+ "title": "Scarcity, choice and opportunity cost",
334
+ "subtopics": []
335
+ }
336
+
337
+ (5) **If there is a character + digit pattern** in the left column for a two-column table (for example "G2", "G3", "G4", "C1"), treat that as a topic-like label:
338
+ - Put that label text into "title" (e.g. "G2").
339
+ - "subtopics" remains empty unless you also see actual subtopic formats like "2.5", "3.4" inside the same cell.
340
+
341
+ (6) **Output must be valid JSON** in this exact structure, with no extra text or explanation:
342
+ {
343
+ "title": "...",
344
+ "subtopics": [...]
345
+ }
346
+
347
+ **Examples**:
348
+
349
+ - If the image text is `"2 Algebra and functions continued"`, return:
350
+ {
351
+ "title": "2 Algebra and functions",
352
+ "subtopics": []
353
+ }
354
+
355
+ - If the image text is `"2.5 Solve linear and quadratic inequalities ..."`, return:
356
+ {
357
+ "title": "",
358
+ "subtopics": ["2.5"]
359
+ }
360
+
361
+ - If the image text is `"Scarcity, choice and opportunity cost"` (with no numeric patterns at all), return:
362
+ {
363
+ "title": "Scarcity, choice and opportunity cost",
364
+ "subtopics": []
365
+ }
366
+
367
+ - If the left column says `"G2"` and the right column has details, but no subtopic numbers, return:
368
+ {
369
+ "title": "G2",
370
+ "subtopics": []
371
+ }
372
+
373
+ - If you cannot recognize any text matching these patterns, or if nothing is found, return:
374
+ {
375
+ "title": "",
376
+ "subtopics": []
377
+ }
378
  """
379
+
380
  global _GEMINI_CLIENT
381
  if _GEMINI_CLIENT is None:
382
  _GEMINI_CLIENT = genai.Client(api_key=api_key)
 
399
  ],
400
  config=types.GenerateContentConfig(temperature=0.0)
401
  )
402
+
 
403
  if not resp or not resp.text:
404
  logger.warning("Gemini returned an empty response for subtopic extraction.")
405
  return {"title": "", "subtopics": []}
406
 
407
  raw = resp.text.strip()
408
+ # Remove any markdown fences if present
409
  raw = raw.replace("```json", "").replace("```", "").strip()
410
  data = json.loads(raw)
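Review note: the log in this commit records `Expecting value: line 1 column 1 (char 0)` from the subtopic step, which is what `json.loads` raises when the text does not start with a JSON value. A possible hardening, sketched under the assumption that the reply sometimes wraps or replaces the JSON with prose, is to extract the outermost `{...}` span and fall back to the empty result. `parse_subtopic_json` and `empty_result` are hypothetical names, not part of the committed code.

```python
import json
import re

FENCE = "`" * 3  # markdown code fence, built here to avoid literal fences in the example

def empty_result() -> dict:
    """Fresh copy of the 'nothing found' result used elsewhere in the diff."""
    return {"title": "", "subtopics": []}

def parse_subtopic_json(raw: str) -> dict:
    """Best-effort parse of a model reply that should contain one JSON object."""
    text = raw.replace(FENCE + "json", "").replace(FENCE, "").strip()
    # Keep only the outermost {...} span so surrounding prose is ignored.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        return empty_result()
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return empty_result()

reply = "Here is the result:\n" + FENCE + 'json\n{"title": "G2", "subtopics": []}\n' + FENCE
print(parse_subtopic_json(reply))
print(parse_subtopic_json("I could not read the image."))
```

Returning the empty result instead of raising would avoid spending a retry on a reply that simply contains no JSON; whether that trade-off suits this pipeline is a judgment call for the author.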
411
 
 
467
  info['final_alt'] = "HAS TO BE PROCESSED - two column table"
468
  elif cls == "THREE_COLUMN":
469
  info['final_alt'] = "HAS TO BE PROCESSED - three column table"
470
+ elif cls == "EMPTY_IMAGE":
471
+ md_content = md_content.replace(f"![]({key}{p})", "")
472
+ del self.descriptions[p]
473
+ continue
474
  else:
475
  info['final_alt'] = "NO_TABLE image"
476
  md_content = md_content.replace(f"![]({key}{p})", f"![{info['final_alt']}]({info['s3_path']})")
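Review note: the new `EMPTY_IMAGE` branch deletes from `self.descriptions` inside a loop. The loop header is not part of this hunk, so this is only a precaution: if the loop iterates over the dictionary itself, Python raises `RuntimeError: dictionary changed size during iteration`. Iterating over a snapshot avoids that. The data below is made up to stand in for `self.descriptions` and `md_content`.

```python
# Hypothetical sample data standing in for self.descriptions and md_content.
descriptions = {
    "img_1.jpg": {"cls": "EMPTY_IMAGE"},
    "img_2.jpg": {"cls": "TWO_COLUMN"},
}
key = "/topic-extraction/"
md_content = "![](/topic-extraction/img_1.jpg)\n![](/topic-extraction/img_2.jpg)"

# Iterate over a snapshot so entries can be deleted from the dict safely.
for p, info in list(descriptions.items()):
    if info["cls"] == "EMPTY_IMAGE":
        md_content = md_content.replace(f"![]({key}{p})", "")
        del descriptions[p]

print(sorted(descriptions))
print(md_content.strip())
```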
 
606
  def post_process(self, key: str, md_content: str) -> str:
607
  return asyncio.run(self.post_process_async(key, md_content))
608
 
609
  class GeminiTopicExtractor:
610
  def __init__(self, api_key: str = None, num_pages: int = 14):
611
  self.api_key = api_key or os.getenv("GEMINI_API_KEY", "")
 
826
  except Exception as e:
827
  logger.error(f"Error during GPU cleanup: {e}")
828
 
829
  def process(self, pdf_path: str) -> Dict[str, Any]:
830
  logger.info(f"Processing PDF: {pdf_path}")
831
  try:
 
903
  )
904
  #S3
905
  writer = S3ImageWriter(self.s3_writer, "/topic-extraction", self.gemini_api_key)
906
 
907
  md_prefix = "/topic-extraction/"
908
  pipe_result = inference.pipe_ocr_mode(writer, lang=self.language)
 
912
  subtopic_list = list(writer.extracted_subtopics.values())
913
  subtopic_list = merge_topics(subtopic_list)
914
 
915
+ out_path = os.path.join(self.output_folder, "subtopics.json")
916
  with open(out_path, "w", encoding="utf-8") as f:
917
  json.dump(subtopic_list, f, indent=2)
918
  logger.info(f"Final subtopics JSON saved locally at {out_path}")
topic_extraction.log CHANGED
@@ -5558,3 +5558,314 @@ and series'. Using page 7.
5558
  2025-03-03 18:09:13,257 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_28.jpg_r2_c0.png
5559
  2025-03-03 18:09:15,022 [INFO] __main__ - GPU memory cleaned up.
5560
  2025-03-03 18:09:15,023 [ERROR] __main__ - Processing failed: name 'merge_topics' is not defined
5561
+ 2025-03-04 14:56:39,218 [INFO] __main__ - Processing PDF: /home/user/app/input_output/a-level-pearson-mathematics-specification.pdf
5562
+ 2025-03-04 14:56:40,018 [INFO] __main__ - Gemini returned subtopics: {'Paper 1 and Paper 2: Pure Mathematics': [11, 29], 'Paper 3: Statistics and Mechanics': [30, 40]}
5563
+ 2025-03-04 14:56:40,019 [INFO] __main__ - Loaded 1135473 bytes from local file '/home/user/app/input_output/a-level-pearson-mathematics-specification.pdf'
5564
+ 2025-03-04 14:56:40,316 [INFO] __main__ - Computed global offset: 4
5565
+ 2025-03-04 14:56:40,316 [INFO] __main__ - Processing pages (0-based): [14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43]
5566
+ 2025-03-04 14:58:48,246 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_1.jpg
5567
+ 2025-03-04 14:58:50,037 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_2.jpg
5568
+ 2025-03-04 14:58:50,583 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_3.jpg
5569
+ 2025-03-04 14:58:51,114 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_4.jpg
5570
+ 2025-03-04 14:58:51,657 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_5.jpg
5571
+ 2025-03-04 14:58:52,211 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_6.jpg
5572
+ 2025-03-04 14:58:52,686 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_7.jpg
5573
+ 2025-03-04 14:58:53,167 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_8.jpg
5574
+ 2025-03-04 14:58:53,667 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_9.jpg
5575
+ 2025-03-04 14:58:54,285 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_10.jpg
5576
+ 2025-03-04 14:58:54,850 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_11.jpg
5577
+ 2025-03-04 14:58:55,401 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_12.jpg
5578
+ 2025-03-04 14:58:55,916 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_13.jpg
5579
+ 2025-03-04 14:58:56,524 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_14.jpg
5580
+ 2025-03-04 14:58:56,999 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_15.jpg
5581
+ 2025-03-04 14:58:57,542 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_16.jpg
5582
+ 2025-03-04 14:58:58,071 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_17.jpg
5583
+ 2025-03-04 14:58:58,366 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_18.jpg
5584
+ 2025-03-04 14:58:58,849 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_19.jpg
5585
+ 2025-03-04 14:58:59,428 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_20.jpg
5586
+ 2025-03-04 14:58:59,995 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_21.jpg
5587
+ 2025-03-04 14:59:00,597 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_22.jpg
5588
+ 2025-03-04 14:59:01,070 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_23.jpg
5589
+ 2025-03-04 14:59:01,567 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_24.jpg
5590
+ 2025-03-04 14:59:02,141 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_25.jpg
5591
+ 2025-03-04 14:59:02,569 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_26.jpg
5592
+ 2025-03-04 14:59:03,024 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_27.jpg
5593
+ 2025-03-04 14:59:03,607 [INFO] __main__ - Uploaded to S3: /topic-extraction/img_28.jpg
5594
+ 2025-03-04 14:59:04,016 [INFO] __main__ - Classifying images to detect tables.
5595
+ 2025-03-04 14:59:20,581 [INFO] __main__ - Processing table image: /topic-extraction/img_1.jpg, columns=three
5596
+ 2025-03-04 14:59:23,252 [WARNING] __main__ - Cell image not found: /tmp/tmpijzc040v.jpg_rows/row_0/col_0.png
5597
+ 2025-03-04 14:59:23,252 [WARNING] __main__ - Cell image not found: /tmp/tmpijzc040v.jpg_rows/row_0/col_1.png
5598
+ 2025-03-04 14:59:23,748 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_1.jpg_r1_c0.png
5599
+ 2025-03-04 14:59:25,146 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_1.jpg_r1_c1.png
5600
+ 2025-03-04 14:59:26,469 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_1.jpg_r2_c0.png
5601
+ 2025-03-04 14:59:27,272 [INFO] __main__ - Processing table image: /topic-extraction/img_2.jpg, columns=three
5602
+ 2025-03-04 14:59:30,158 [WARNING] __main__ - Cell image not found: /tmp/tmplbse6rk2.jpg_rows/row_0/col_0.png
5603
+ 2025-03-04 14:59:30,158 [WARNING] __main__ - Cell image not found: /tmp/tmplbse6rk2.jpg_rows/row_0/col_1.png
5604
+ 2025-03-04 14:59:30,420 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_2.jpg_r1_c0.png
5605
+ 2025-03-04 14:59:31,612 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_2.jpg_r1_c1.png
5606
+ 2025-03-04 14:59:34,174 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_2.jpg_r2_c0.png
5607
+ 2025-03-04 14:59:35,585 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_2.jpg_r3_c0.png
5608
+ 2025-03-04 14:59:36,908 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_2.jpg_r4_c0.png
5609
+ 2025-03-04 14:59:38,024 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_2.jpg_r5_c0.png
5610
+ 2025-03-04 14:59:38,783 [INFO] __main__ - Processing table image: /topic-extraction/img_3.jpg, columns=three
5611
+ 2025-03-04 14:59:41,887 [WARNING] __main__ - Cell image not found: /tmp/tmp9jfrqv6f.jpg_rows/row_0/col_0.png
5612
+ 2025-03-04 14:59:41,887 [WARNING] __main__ - Cell image not found: /tmp/tmp9jfrqv6f.jpg_rows/row_0/col_1.png
5613
+ 2025-03-04 14:59:42,148 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_3.jpg_r1_c0.png
5614
+ 2025-03-04 14:59:43,551 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_3.jpg_r1_c1.png
5615
+ 2025-03-04 14:59:45,241 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_3.jpg_r2_c0.png
5616
+ 2025-03-04 14:59:46,499 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_3.jpg_r3_c0.png
5617
+ 2025-03-04 14:59:47,500 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_3.jpg_r4_c0.png
5618
+ 2025-03-04 14:59:48,309 [INFO] __main__ - Processing table image: /topic-extraction/img_4.jpg, columns=three
5619
+ 2025-03-04 14:59:51,311 [WARNING] __main__ - Cell image not found: /tmp/tmpbrv43l7_.jpg_rows/row_0/col_0.png
5620
+ 2025-03-04 14:59:51,311 [WARNING] __main__ - Cell image not found: /tmp/tmpbrv43l7_.jpg_rows/row_0/col_1.png
5621
+ 2025-03-04 14:59:51,311 [WARNING] __main__ - Cell image not found: /tmp/tmpbrv43l7_.jpg_rows/row_1/col_0.png
5622
+ 2025-03-04 14:59:51,311 [WARNING] __main__ - Cell image not found: /tmp/tmpbrv43l7_.jpg_rows/row_1/col_1.png
5623
+ 2025-03-04 14:59:51,579 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_4.jpg_r2_c0.png
5624
+ 2025-03-04 14:59:53,042 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_4.jpg_r2_c1.png
5625
+ 2025-03-04 14:59:54,470 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_4.jpg_r3_c0.png
5626
+ 2025-03-04 14:59:55,460 [INFO] __main__ - Processing table image: /topic-extraction/img_5.jpg, columns=three
5627
+ 2025-03-04 14:59:58,401 [WARNING] __main__ - Cell image not found: /tmp/tmpdj8vn5v4.jpg_rows/row_0/col_0.png
5628
+ 2025-03-04 14:59:58,401 [WARNING] __main__ - Cell image not found: /tmp/tmpdj8vn5v4.jpg_rows/row_0/col_1.png
5629
+ 2025-03-04 14:59:58,659 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_5.jpg_r1_c0.png
5630
+ 2025-03-04 15:00:00,036 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_5.jpg_r1_c1.png
5631
+ 2025-03-04 15:00:01,411 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_5.jpg_r2_c0.png
5632
+ 2025-03-04 15:00:02,747 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_5.jpg_r3_c0.png
5633
+ 2025-03-04 15:00:03,656 [INFO] __main__ - Processing table image: /topic-extraction/img_6.jpg, columns=three
5634
+ 2025-03-04 15:00:06,880 [WARNING] __main__ - Cell image not found: /tmp/tmpw4hdm_vm.jpg_rows/row_0/col_0.png
5635
+ 2025-03-04 15:00:06,881 [WARNING] __main__ - Cell image not found: /tmp/tmpw4hdm_vm.jpg_rows/row_0/col_1.png
5636
+ 2025-03-04 15:00:07,144 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_6.jpg_r1_c0.png
5637
+ 2025-03-04 15:00:08,578 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_6.jpg_r1_c1.png
5638
+ 2025-03-04 15:00:09,789 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_6.jpg_r2_c0.png
5639
+ 2025-03-04 15:00:12,763 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_6.jpg_r2_c1.png
5640
+ 2025-03-04 15:00:14,173 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_6.jpg_r3_c0.png
5641
+ 2025-03-04 15:00:15,229 [INFO] __main__ - Processing table image: /topic-extraction/img_7.jpg, columns=three
5642
+ 2025-03-04 15:00:18,336 [WARNING] __main__ - Cell image not found: /tmp/tmpier2e_jn.jpg_rows/row_0/col_0.png
5643
+ 2025-03-04 15:00:18,336 [WARNING] __main__ - Cell image not found: /tmp/tmpier2e_jn.jpg_rows/row_0/col_1.png
5644
+ 2025-03-04 15:00:18,607 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_7.jpg_r1_c0.png
5645
+ 2025-03-04 15:00:19,964 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_7.jpg_r1_c1.png
5646
+ 2025-03-04 15:00:21,423 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_7.jpg_r2_c0.png
5647
+ 2025-03-04 15:00:22,514 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_7.jpg_r3_c0.png
5648
+ 2025-03-04 15:00:23,784 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_7.jpg_r3_c1.png
5649
+ 2025-03-04 15:00:25,023 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_7.jpg_r4_c0.png
5650
+ 2025-03-04 15:00:26,014 [INFO] __main__ - Processing table image: /topic-extraction/img_8.jpg, columns=three
5651
+ 2025-03-04 15:00:30,110 [WARNING] __main__ - Cell image not found: /tmp/tmpwzp5zo9m.jpg_rows/row_0/col_0.png
5652
+ 2025-03-04 15:00:30,295 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_8.jpg_r0_c1.png
5653
+ 2025-03-04 15:00:30,957 [WARNING] __main__ - Cell image not found: /tmp/tmpwzp5zo9m.jpg_rows/row_1/col_0.png
5654
+ 2025-03-04 15:00:30,958 [WARNING] __main__ - Cell image not found: /tmp/tmpwzp5zo9m.jpg_rows/row_1/col_1.png
5655
+ 2025-03-04 15:00:30,958 [WARNING] __main__ - Cell image not found: /tmp/tmpwzp5zo9m.jpg_rows/row_1/col_2.png
5656
+ 2025-03-04 15:00:31,219 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_8.jpg_r2_c0.png
5657
+ 2025-03-04 15:00:32,311 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_8.jpg_r2_c1.png
5658
+ 2025-03-04 15:00:33,619 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_8.jpg_r2_c2.png
5659
+ 2025-03-04 15:00:34,694 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_8.jpg_r3_c0.png
5660
+ 2025-03-04 15:00:35,762 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_8.jpg_r3_c1.png
5661
+ 2025-03-04 15:00:36,796 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_8.jpg_r4_c0.png
5662
+ 2025-03-04 15:00:37,972 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_8.jpg_r4_c1.png
5663
+ 2025-03-04 15:00:39,110 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_8.jpg_r5_c0.png
5664
+ 2025-03-04 15:00:40,404 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_8.jpg_r5_c1.png
5665
+ 2025-03-04 15:00:41,716 [ERROR] __main__ - Gemini subtopic identification error on attempt 0: Expecting value: line 1 column 1 (char 0)
5666
+ 2025-03-04 15:00:43,487 [ERROR] __main__ - Gemini subtopic identification error on attempt 1: Expecting value: line 1 column 1 (char 0)
5667
+ 2025-03-04 15:00:43,665 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_8.jpg_r6_c0.png
5668
+ 2025-03-04 15:00:44,879 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_8.jpg_r6_c1.png
5669
+ 2025-03-04 15:00:45,862 [ERROR] __main__ - Gemini subtopic identification error on attempt 0: Expecting value: line 1 column 1 (char 0)
5670
+ 2025-03-04 15:00:47,337 [ERROR] __main__ - Gemini subtopic identification error on attempt 1: Expecting value: line 1 column 1 (char 0)
5671
+ 2025-03-04 15:00:47,338 [WARNING] __main__ - Cell image not found: /tmp/tmpwzp5zo9m.jpg_rows/row_7/col_0.png
5672
+ 2025-03-04 15:00:47,338 [INFO] __main__ - Processing table image: /topic-extraction/img_9.jpg, columns=three
5673
+ 2025-03-04 15:00:50,852 [WARNING] __main__ - Cell image not found: /tmp/tmp45kbg898.jpg_rows/row_0/col_0.png
5674
+ 2025-03-04 15:00:50,853 [WARNING] __main__ - Cell image not found: /tmp/tmp45kbg898.jpg_rows/row_0/col_1.png
5675
+ 2025-03-04 15:00:50,853 [WARNING] __main__ - Cell image not found: /tmp/tmp45kbg898.jpg_rows/row_0/col_2.png
5676
+ 2025-03-04 15:00:52,290 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_9.jpg_r1_c0.png
5677
+ 2025-03-04 15:00:53,354 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_9.jpg_r1_c1.png
5678
+ 2025-03-04 15:00:54,709 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_9.jpg_r1_c2.png
5679
+ 2025-03-04 15:00:55,877 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_9.jpg_r2_c0.png
5680
+ 2025-03-04 15:00:57,178 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_9.jpg_r2_c1.png
5681
+ 2025-03-04 15:00:58,304 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_9.jpg_r3_c0.png
5682
+ 2025-03-04 15:00:59,735 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_9.jpg_r3_c1.png
5683
+ 2025-03-04 15:01:00,944 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_9.jpg_r4_c0.png
5684
+ 2025-03-04 15:01:02,239 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_9.jpg_r4_c1.png
5685
+ 2025-03-04 15:01:03,416 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_9.jpg_r5_c0.png
5686
+ 2025-03-04 15:01:04,618 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_9.jpg_r5_c1.png
5687
+ 2025-03-04 15:01:05,434 [INFO] __main__ - Processing table image: /topic-extraction/img_10.jpg, columns=three
5688
+ 2025-03-04 15:01:08,588 [WARNING] __main__ - Cell image not found: /tmp/tmpqskyhmda.jpg_rows/row_0/col_0.png
5689
+ 2025-03-04 15:01:08,588 [WARNING] __main__ - Cell image not found: /tmp/tmpqskyhmda.jpg_rows/row_0/col_1.png
5690
+ 2025-03-04 15:01:08,855 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_10.jpg_r1_c0.png
5691
+ 2025-03-04 15:01:10,100 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_10.jpg_r1_c1.png
5692
+ 2025-03-04 15:01:11,458 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_10.jpg_r2_c0.png
5693
+ 2025-03-04 15:01:13,002 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_10.jpg_r3_c0.png
5694
+ 2025-03-04 15:01:14,421 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_10.jpg_r4_c0.png
5695
+ 2025-03-04 15:01:15,795 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_10.jpg_r5_c0.png
5696
+ 2025-03-04 15:01:16,778 [INFO] __main__ - Processing table image: /topic-extraction/img_11.jpg, columns=two
5697
+ 2025-03-04 15:01:19,849 [WARNING] __main__ - Cell image not found: /tmp/tmpragajvqv.jpg_rows/row_0/col_0.png
5698
+ 2025-03-04 15:01:20,292 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_11.jpg_r1_c0.png
5699
+ 2025-03-04 15:01:21,681 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_11.jpg_r2_c0.png
5700
+ 2025-03-04 15:01:23,001 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_11.jpg_r3_c0.png
5701
+ 2025-03-04 15:01:24,256 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_11.jpg_r4_c0.png
5702
+ 2025-03-04 15:01:25,614 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_11.jpg_r5_c0.png
5703
+ 2025-03-04 15:01:26,879 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_11.jpg_r6_c0.png
5704
+ 2025-03-04 15:01:28,027 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_11.jpg_r7_c0.png
5705
+ 2025-03-04 15:01:28,867 [INFO] __main__ - Processing table image: /topic-extraction/img_12.jpg, columns=three
5706
+ 2025-03-04 15:01:31,707 [WARNING] __main__ - Cell image not found: /tmp/tmptajrb9oq.jpg_rows/row_0/col_0.png
5707
+ 2025-03-04 15:01:31,708 [WARNING] __main__ - Cell image not found: /tmp/tmptajrb9oq.jpg_rows/row_0/col_1.png
5708
+ 2025-03-04 15:01:31,708 [WARNING] __main__ - Cell image not found: /tmp/tmptajrb9oq.jpg_rows/row_1/col_0.png
5709
+ 2025-03-04 15:01:31,708 [WARNING] __main__ - Cell image not found: /tmp/tmptajrb9oq.jpg_rows/row_1/col_1.png
5710
+ 2025-03-04 15:01:31,968 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_12.jpg_r2_c0.png
5711
+ 2025-03-04 15:01:33,379 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_12.jpg_r2_c1.png
5712
+ 2025-03-04 15:01:34,597 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_12.jpg_r3_c0.png
5713
+ 2025-03-04 15:01:35,923 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_12.jpg_r3_c1.png
5714
+ 2025-03-04 15:01:37,229 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_12.jpg_r4_c0.png
5715
+ 2025-03-04 15:01:38,254 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_12.jpg_r5_c0.png
5716
+ 2025-03-04 15:01:39,166 [INFO] __main__ - Processing table image: /topic-extraction/img_13.jpg, columns=three
5717
+ 2025-03-04 15:01:42,003 [WARNING] __main__ - Cell image not found: /tmp/tmpzd8rmysx.jpg_rows/row_0/col_0.png
5718
+ 2025-03-04 15:01:42,004 [WARNING] __main__ - Cell image not found: /tmp/tmpzd8rmysx.jpg_rows/row_0/col_1.png
5719
+ 2025-03-04 15:01:42,004 [WARNING] __main__ - Cell image not found: /tmp/tmpzd8rmysx.jpg_rows/row_1/col_0.png
5720
+ 2025-03-04 15:01:42,004 [WARNING] __main__ - Cell image not found: /tmp/tmpzd8rmysx.jpg_rows/row_1/col_1.png
5721
+ 2025-03-04 15:01:42,258 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_13.jpg_r2_c0.png
5722
+ 2025-03-04 15:01:43,581 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_13.jpg_r2_c1.png
5723
+ 2025-03-04 15:01:44,840 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_13.jpg_r3_c0.png
5724
+ 2025-03-04 15:01:46,192 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_13.jpg_r4_c0.png
5725
+ 2025-03-04 15:01:47,564 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_13.jpg_r5_c0.png
5726
+ 2025-03-04 15:01:48,735 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_13.jpg_r6_c0.png
5727
+ 2025-03-04 15:01:49,480 [INFO] __main__ - Processing table image: /topic-extraction/img_14.jpg, columns=three
5728
+ 2025-03-04 15:01:53,309 [WARNING] __main__ - Cell image not found: /tmp/tmp6agbobyu.jpg_rows/row_0/col_0.png
+ 2025-03-04 15:01:53,310 [WARNING] __main__ - Cell image not found: /tmp/tmp6agbobyu.jpg_rows/row_0/col_1.png
+ 2025-03-04 15:01:53,583 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_14.jpg_r1_c0.png
+ 2025-03-04 15:01:54,959 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_14.jpg_r1_c1.png
+ 2025-03-04 15:01:56,286 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_14.jpg_r2_c0.png
+ 2025-03-04 15:01:57,618 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_14.jpg_r3_c0.png
+ 2025-03-04 15:01:58,711 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_14.jpg_r4_c0.png
+ 2025-03-04 15:01:59,972 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_14.jpg_r4_c1.png
+ 2025-03-04 15:02:01,443 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_14.jpg_r5_c0.png
+ 2025-03-04 15:02:02,711 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_14.jpg_r6_c0.png
+ 2025-03-04 15:02:03,674 [INFO] __main__ - Processing table image: /topic-extraction/img_15.jpg, columns=three
+ 2025-03-04 15:02:06,780 [WARNING] __main__ - Cell image not found: /tmp/tmp3lbuxp25.jpg_rows/row_0/col_0.png
+ 2025-03-04 15:02:06,781 [WARNING] __main__ - Cell image not found: /tmp/tmp3lbuxp25.jpg_rows/row_0/col_1.png
+ 2025-03-04 15:02:07,040 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_15.jpg_r1_c0.png
+ 2025-03-04 15:02:08,455 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_15.jpg_r1_c1.png
+ 2025-03-04 15:02:09,838 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_15.jpg_r2_c0.png
+ 2025-03-04 15:02:11,221 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_15.jpg_r3_c0.png
+ 2025-03-04 15:02:12,570 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_15.jpg_r4_c0.png
+ 2025-03-04 15:02:13,800 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_15.jpg_r5_c0.png
+ 2025-03-04 15:02:14,741 [INFO] __main__ - Processing table image: /topic-extraction/img_16.jpg, columns=three
+ 2025-03-04 15:02:18,051 [WARNING] __main__ - Cell image not found: /tmp/tmpqve047e1.jpg_rows/row_0/col_0.png
+ 2025-03-04 15:02:18,051 [WARNING] __main__ - Cell image not found: /tmp/tmpqve047e1.jpg_rows/row_0/col_1.png
+ 2025-03-04 15:02:18,051 [WARNING] __main__ - Cell image not found: /tmp/tmpqve047e1.jpg_rows/row_1/col_0.png
+ 2025-03-04 15:02:18,052 [WARNING] __main__ - Cell image not found: /tmp/tmpqve047e1.jpg_rows/row_1/col_1.png
+ 2025-03-04 15:02:18,310 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_16.jpg_r2_c0.png
+ 2025-03-04 15:02:19,484 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_16.jpg_r2_c1.png
+ 2025-03-04 15:02:20,750 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_16.jpg_r3_c0.png
+ 2025-03-04 15:02:21,962 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_16.jpg_r4_c0.png
+ 2025-03-04 15:02:23,279 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_16.jpg_r4_c1.png
+ 2025-03-04 15:02:24,677 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_16.jpg_r5_c0.png
+ 2025-03-04 15:02:25,990 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_16.jpg_r6_c0.png
+ 2025-03-04 15:02:27,144 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_16.jpg_r7_c0.png
+ 2025-03-04 15:02:27,953 [INFO] __main__ - Processing table image: /topic-extraction/img_17.jpg, columns=three
+ 2025-03-04 15:02:31,142 [WARNING] __main__ - Cell image not found: /tmp/tmp580zpmu1.jpg_rows/row_0/col_0.png
+ 2025-03-04 15:02:31,142 [WARNING] __main__ - Cell image not found: /tmp/tmp580zpmu1.jpg_rows/row_0/col_1.png
+ 2025-03-04 15:02:31,397 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_17.jpg_r1_c0.png
+ 2025-03-04 15:02:32,685 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_17.jpg_r1_c1.png
+ 2025-03-04 15:02:34,235 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_17.jpg_r2_c0.png
+ 2025-03-04 15:02:35,330 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_17.jpg_r3_c0.png
+ 2025-03-04 15:02:36,635 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_17.jpg_r3_c1.png
+ 2025-03-04 15:02:37,985 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_17.jpg_r4_c0.png
+ 2025-03-04 15:02:39,401 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_17.jpg_r5_c0.png
+ 2025-03-04 15:02:40,763 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_17.jpg_r6_c0.png
+ 2025-03-04 15:02:41,985 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_17.jpg_r7_c0.png
+ 2025-03-04 15:02:42,875 [INFO] __main__ - Processing table image: /topic-extraction/img_18.jpg, columns=three
+ 2025-03-04 15:02:43,771 [WARNING] __main__ - Cell image not found: /tmp/tmpccm4skpd.jpg_rows/row_0/col_0.png
+ 2025-03-04 15:02:43,772 [WARNING] __main__ - Cell image not found: /tmp/tmpccm4skpd.jpg_rows/row_0/col_1.png
+ 2025-03-04 15:02:43,772 [WARNING] __main__ - Cell image not found: /tmp/tmpccm4skpd.jpg_rows/row_1/col_0.png
+ 2025-03-04 15:02:43,772 [WARNING] __main__ - Cell image not found: /tmp/tmpccm4skpd.jpg_rows/row_1/col_1.png
+ 2025-03-04 15:02:44,032 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_18.jpg_r2_c0.png
+ 2025-03-04 15:02:45,366 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_18.jpg_r2_c1.png
+ 2025-03-04 15:02:46,585 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_18.jpg_r3_c0.png
+ 2025-03-04 15:02:47,559 [INFO] __main__ - Processing table image: /topic-extraction/img_19.jpg, columns=three
+ 2025-03-04 15:02:50,123 [WARNING] __main__ - Cell image not found: /tmp/tmpclhr29f1.jpg_rows/row_0/col_0.png
+ 2025-03-04 15:02:50,124 [WARNING] __main__ - Cell image not found: /tmp/tmpclhr29f1.jpg_rows/row_0/col_1.png
+ 2025-03-04 15:02:50,124 [WARNING] __main__ - Cell image not found: /tmp/tmpclhr29f1.jpg_rows/row_1/col_0.png
+ 2025-03-04 15:02:50,124 [WARNING] __main__ - Cell image not found: /tmp/tmpclhr29f1.jpg_rows/row_1/col_1.png
+ 2025-03-04 15:02:50,378 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_19.jpg_r2_c0.png
+ 2025-03-04 15:02:51,859 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_19.jpg_r2_c1.png
+ 2025-03-04 15:02:53,257 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_19.jpg_r3_c0.png
+ 2025-03-04 15:02:54,584 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_19.jpg_r3_c1.png
+ 2025-03-04 15:02:55,736 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_19.jpg_r4_c0.png
+ 2025-03-04 15:02:56,672 [INFO] __main__ - Processing table image: /topic-extraction/img_20.jpg, columns=three
+ 2025-03-04 15:03:00,454 [WARNING] __main__ - Cell image not found: /tmp/tmptx9dz9xc.jpg_rows/row_0/col_0.png
+ 2025-03-04 15:03:00,454 [WARNING] __main__ - Cell image not found: /tmp/tmptx9dz9xc.jpg_rows/row_0/col_1.png
+ 2025-03-04 15:03:00,737 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_20.jpg_r1_c0.png
+ 2025-03-04 15:03:02,337 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_20.jpg_r1_c1.png
+ 2025-03-04 15:03:03,839 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_20.jpg_r2_c0.png
+ 2025-03-04 15:03:04,889 [INFO] __main__ - Processing table image: /topic-extraction/img_21.jpg, columns=three
+ 2025-03-04 15:03:08,043 [WARNING] __main__ - Cell image not found: /tmp/tmp18_5p4lj.jpg_rows/row_0/col_0.png
+ 2025-03-04 15:03:08,044 [WARNING] __main__ - Cell image not found: /tmp/tmp18_5p4lj.jpg_rows/row_0/col_1.png
+ 2025-03-04 15:03:08,322 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_21.jpg_r1_c0.png
+ 2025-03-04 15:03:09,913 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_21.jpg_r1_c1.png
+ 2025-03-04 15:03:11,063 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_21.jpg_r2_c0.png
+ 2025-03-04 15:03:12,387 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_21.jpg_r2_c1.png
+ 2025-03-04 15:03:13,743 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_21.jpg_r3_c0.png
+ 2025-03-04 15:03:14,671 [INFO] __main__ - Processing table image: /topic-extraction/img_22.jpg, columns=three
+ 2025-03-04 15:03:17,999 [WARNING] __main__ - Cell image not found: /tmp/tmppc_cs35e.jpg_rows/row_0/col_0.png
+ 2025-03-04 15:03:18,000 [WARNING] __main__ - Cell image not found: /tmp/tmppc_cs35e.jpg_rows/row_0/col_1.png
+ 2025-03-04 15:03:18,271 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_22.jpg_r1_c0.png
+ 2025-03-04 15:03:19,493 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_22.jpg_r1_c1.png
+ 2025-03-04 15:03:20,669 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_22.jpg_r2_c0.png
+ 2025-03-04 15:03:22,038 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_22.jpg_r2_c1.png
+ 2025-03-04 15:03:23,431 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_22.jpg_r3_c0.png
+ 2025-03-04 15:03:24,490 [WARNING] __main__ - Cell image not found: /tmp/tmppc_cs35e.jpg_rows/row_4/col_0.png
+ 2025-03-04 15:03:24,491 [INFO] __main__ - Processing table image: /topic-extraction/img_23.jpg, columns=three
+ 2025-03-04 15:03:27,293 [WARNING] __main__ - Cell image not found: /tmp/tmpk98o_fpp.jpg_rows/row_0/col_0.png
+ 2025-03-04 15:03:27,294 [WARNING] __main__ - Cell image not found: /tmp/tmpk98o_fpp.jpg_rows/row_0/col_1.png
+ 2025-03-04 15:03:27,553 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_23.jpg_r1_c0.png
+ 2025-03-04 15:03:28,769 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_23.jpg_r1_c1.png
+ 2025-03-04 15:03:29,940 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_23.jpg_r2_c0.png
+ 2025-03-04 15:03:31,452 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_23.jpg_r2_c1.png
+ 2025-03-04 15:03:32,738 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_23.jpg_r3_c0.png
+ 2025-03-04 15:03:33,643 [INFO] __main__ - Processing table image: /topic-extraction/img_24.jpg, columns=three
+ 2025-03-04 15:03:36,892 [WARNING] __main__ - Cell image not found: /tmp/tmpsdjidh_w.jpg_rows/row_0/col_0.png
+ 2025-03-04 15:03:36,892 [WARNING] __main__ - Cell image not found: /tmp/tmpsdjidh_w.jpg_rows/row_0/col_1.png
+ 2025-03-04 15:03:36,892 [WARNING] __main__ - Cell image not found: /tmp/tmpsdjidh_w.jpg_rows/row_1/col_0.png
+ 2025-03-04 15:03:36,892 [WARNING] __main__ - Cell image not found: /tmp/tmpsdjidh_w.jpg_rows/row_1/col_1.png
+ 2025-03-04 15:03:37,188 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_24.jpg_r2_c0.png
+ 2025-03-04 15:03:38,642 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_24.jpg_r2_c1.png
+ 2025-03-04 15:03:40,017 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_24.jpg_r3_c0.png
+ 2025-03-04 15:03:41,095 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_24.jpg_r4_c0.png
+ 2025-03-04 15:03:42,514 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_24.jpg_r4_c1.png
+ 2025-03-04 15:03:43,481 [INFO] __main__ - Processing table image: /topic-extraction/img_25.jpg, columns=two
+ 2025-03-04 15:03:46,397 [WARNING] __main__ - Cell image not found: /tmp/tmpt9roe876.jpg_rows/row_0/col_0.png
+ 2025-03-04 15:03:46,809 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_25.jpg_r1_c0.png
+ 2025-03-04 15:03:48,153 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_25.jpg_r2_c0.png
+ 2025-03-04 15:03:49,855 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_25.jpg_r3_c0.png
+ 2025-03-04 15:03:51,232 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_25.jpg_r4_c0.png
+ 2025-03-04 15:03:52,577 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_25.jpg_r5_c0.png
+ 2025-03-04 15:03:53,542 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_25.jpg_r6_c0.png
+ 2025-03-04 15:03:54,702 [INFO] __main__ - Processing table image: /topic-extraction/img_26.jpg, columns=three
+ 2025-03-04 15:03:57,292 [WARNING] __main__ - Cell image not found: /tmp/tmpkt4w7cqg.jpg_rows/row_0/col_0.png
+ 2025-03-04 15:03:57,292 [WARNING] __main__ - Cell image not found: /tmp/tmpkt4w7cqg.jpg_rows/row_0/col_1.png
+ 2025-03-04 15:03:57,547 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_26.jpg_r1_c0.png
+ 2025-03-04 15:03:58,694 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_26.jpg_r1_c1.png
+ 2025-03-04 15:04:00,096 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_26.jpg_r2_c0.png
+ 2025-03-04 15:04:01,892 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_26.jpg_r3_c0.png
+ 2025-03-04 15:04:03,198 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_26.jpg_r4_c0.png
+ 2025-03-04 15:04:04,066 [INFO] __main__ - Processing table image: /topic-extraction/img_27.jpg, columns=three
+ 2025-03-04 15:04:06,633 [WARNING] __main__ - Cell image not found: /tmp/tmp1z8ov49i.jpg_rows/row_0/col_0.png
+ 2025-03-04 15:04:06,633 [WARNING] __main__ - Cell image not found: /tmp/tmp1z8ov49i.jpg_rows/row_0/col_1.png
+ 2025-03-04 15:04:06,892 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_27.jpg_r1_c0.png
+ 2025-03-04 15:04:08,314 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_27.jpg_r1_c1.png
+ 2025-03-04 15:04:09,655 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_27.jpg_r2_c0.png
+ 2025-03-04 15:04:10,910 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_27.jpg_r3_c0.png
+ 2025-03-04 15:04:12,042 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_27.jpg_r4_c0.png
+ 2025-03-04 15:04:13,234 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_27.jpg_r4_c1.png
+ 2025-03-04 15:04:14,345 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_27.jpg_r5_c0.png
+ 2025-03-04 15:04:15,180 [INFO] __main__ - Processing table image: /topic-extraction/img_28.jpg, columns=three
+ 2025-03-04 15:04:18,179 [WARNING] __main__ - Cell image not found: /tmp/tmpsij1nmfi.jpg_rows/row_0/col_0.png
+ 2025-03-04 15:04:18,179 [WARNING] __main__ - Cell image not found: /tmp/tmpsij1nmfi.jpg_rows/row_0/col_1.png
+ 2025-03-04 15:04:18,363 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_28.jpg_r1_c0.png
+ 2025-03-04 15:04:19,871 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_28.jpg_r1_c1.png
+ 2025-03-04 15:04:21,379 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_28.jpg_r2_c0.png
+ 2025-03-04 15:04:23,137 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_28.jpg_r2_c1.png
+ 2025-03-04 15:04:24,801 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_28.jpg_r3_c0.png
+ 2025-03-04 15:04:26,569 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_28.jpg_r3_c1.png
+ 2025-03-04 15:04:28,289 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_28.jpg_r4_c0.png
+ 2025-03-04 15:04:29,718 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_28.jpg_r4_c1.png
+ 2025-03-04 15:04:31,009 [INFO] __main__ - Uploaded to S3: /topic-extraction/cells/img_28.jpg_r5_c0.png
+ 2025-03-04 15:04:31,836 [INFO] __main__ - Final subtopics JSON saved locally at /home/user/app/pearson_json/subtopics.json
+ 2025-03-04 15:04:32,192 [INFO] __main__ - GPU memory cleaned up.
+ 2025-03-04 15:04:32,199 [INFO] __main__ - Processing completed successfully.