Commit 0bb59f5 by nastyboget (verified) · 1 Parent(s): 97c4b78

Update README.md

Files changed (1): README.md (+24 -2)
@@ -1,7 +1,29 @@
 ---
 license: apache-2.0
+datasets:
+- dedoc/paragraph_dataset
+language:
+- ru
+- en
+metrics:
+- f1
+- accuracy
 ---
 
-Training data are available at [link](https://huggingface.co/datasets/dedoc/paragraph_dataset/tree/main)
+# Paragraph classifier
 
-Training script is [here](https://github.com/ispras/dedoc/blob/master/scripts/train/train_paragraph_classifier.py)
+The classifier is used for binary classification of text lines in PDF or scanned documents.
+
+For each document line, it determines:
+
+* line is a beginning of a new paragraph or
+
+* line is a continuation of the previous paragraph
+
+For each line, feature vector is formed based on line's text and formatting, please see
+`dedoc/structure_extractors/feature_extractors/paragraph_feature_extractor.py` in [dedoc](https://github.com/ispras/dedoc).
+
+
+* Training data are available at [the link](https://huggingface.co/datasets/dedoc/paragraph_dataset).
+
+* Training script is [here](https://github.com/ispras/dedoc/blob/master/scripts/train/train_paragraph_classifier.py).
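
The updated README describes a per-line binary decision: each line either begins a new paragraph or continues the previous one. As an illustration only, the sketch below shows how such labels could be turned into paragraphs. The helper `group_lines_into_paragraphs` is hypothetical and not part of dedoc, and the labels are written by hand rather than produced by the classifier.

```python
from typing import List


def group_lines_into_paragraphs(lines: List[str], starts_paragraph: List[bool]) -> List[str]:
    """Merge text lines into paragraphs using per-line binary labels.

    Hypothetical helper, not part of dedoc: starts_paragraph[i] is True when
    line i begins a new paragraph and False when it continues the previous one.
    """
    if len(lines) != len(starts_paragraph):
        raise ValueError("lines and starts_paragraph must have the same length")

    paragraphs: List[List[str]] = []
    for line, is_start in zip(lines, starts_paragraph):
        # The first line always opens a paragraph, whatever its label says.
        if is_start or not paragraphs:
            paragraphs.append([line.strip()])
        else:
            paragraphs[-1].append(line.strip())
    # Join the lines of each paragraph with single spaces.
    return [" ".join(parts) for parts in paragraphs]


lines = [
    "The classifier is used for binary",
    "classification of text lines.",
    "For each line, a feature vector",
    "is formed.",
]
labels = [True, False, True, False]  # hand-written labels, not model output
paragraphs = group_lines_into_paragraphs(lines, labels)
print(paragraphs)
```

In a real pipeline the labels would presumably come from the trained classifier applied to the feature vectors mentioned in the README; that step is omitted here because its API is not shown in this commit.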