dedoc/paragraph_classifier

nastyboget committed on Aug 7, 2024

Commit 0bb59f5 · verified · 1 Parent(s): 97c4b78

Update README.md

Files changed (1)

README.md +24 -2


@@ -1,7 +1,29 @@
 ---
 license: apache-2.0
 ---
-Training data are available at [link](https://huggingface.co/datasets/dedoc/paragraph_dataset/tree/main)
-Training script is [here](https://github.com/ispras/dedoc/blob/master/scripts/train/train_paragraph_classifier.py)

 ---
 license: apache-2.0
+datasets:
+- dedoc/paragraph_dataset
+language:
+- ru
+- en
+metrics:
+- f1
+- accuracy
 ---
+# Paragraph classifier
+The classifier performs binary classification of text lines in PDF or scanned documents.
+For each document line, it determines whether:
+ * the line is the beginning of a new paragraph, or
+ * the line is a continuation of the previous paragraph.
+For each line, a feature vector is built from the line's text and formatting; see
+`dedoc/structure_extractors/feature_extractors/paragraph_feature_extractor.py` in [dedoc](https://github.com/ispras/dedoc).
+* Training data are available at [the link](https://huggingface.co/datasets/dedoc/paragraph_dataset).
+* Training script is [here](https://github.com/ispras/dedoc/blob/master/scripts/train/train_paragraph_classifier.py).
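
The per-line decision described in the README can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Line` type, the three features, the weights, and the threshold are hypothetical and are not taken from dedoc's `paragraph_feature_extractor.py` or from the trained model, which learns its decision from the dataset linked above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Line:
    """Hypothetical stand-in for a document line with minimal formatting info."""
    text: str
    indent: float  # left offset of the line, arbitrary units


def line_features(line: Line, prev: Optional[Line]) -> List[float]:
    """Build a toy feature vector from the line's text and formatting."""
    # Formatting feature: is the line indented further than the previous one?
    indented = 1.0 if prev is not None and line.indent > prev.indent else 0.0
    # Text feature: does the line start with an uppercase letter?
    starts_upper = 1.0 if line.text[:1].isupper() else 0.0
    # Context feature: did the previous line end a sentence (or is there none)?
    prev_ended = 1.0 if prev is None or prev.text.rstrip().endswith((".", "!", "?")) else 0.0
    return [indented, starts_upper, prev_ended]


def is_paragraph_start(features: List[float]) -> bool:
    """Toy linear rule standing in for a trained binary classifier."""
    weights = [1.0, 0.5, 0.5]  # hand-picked for the example, not learned
    score = sum(w * f for w, f in zip(weights, features))
    return score >= 1.0


def classify_lines(lines: List[Line]) -> List[bool]:
    """Return True for each line that begins a new paragraph."""
    labels = []
    prev = None
    for line in lines:
        labels.append(is_paragraph_start(line_features(line, prev)))
        prev = line
    return labels


example_lines = [
    Line("Paragraph classification is a", indent=4),
    Line("binary task over document lines.", indent=0),
    Line("Each line gets a feature vector", indent=4),
    Line("built from text and formatting.", indent=0),
]
labels = classify_lines(example_lines)
```

Here `labels` marks the first and third lines as paragraph beginnings and the other two as continuations. A real model would replace the hand-written rule in `is_paragraph_start` with a classifier fitted on labeled lines.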