---
license: apache-2.0
datasets:
- dedoc/paragraph_dataset
language:
- ru
- en
metrics:
- f1
- accuracy
---
# Paragraph classifier
The classifier performs binary classification of text lines in PDF or scanned documents.
For each document line, it determines whether the line:
* begins a new paragraph, or
* continues the previous paragraph.
For each line, a feature vector is built from the line's text and formatting; for details, see
`dedoc/structure_extractors/feature_extractors/paragraph_feature_extractor.py` in [dedoc](https://github.com/ispras/dedoc).
* Training data is available in the [dedoc/paragraph_dataset](https://huggingface.co/datasets/dedoc/paragraph_dataset) dataset.
* The training script is [train_paragraph_classifier.py](https://github.com/ispras/dedoc/blob/master/scripts/train/train_paragraph_classifier.py).
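
To make the per-line setup concrete, here is a minimal sketch of the idea in plain Python. It is not the dedoc implementation: the `Line` type, the three features, and the hand-written scoring rule are all hypothetical stand-ins chosen for illustration. The real features live in `paragraph_feature_extractor.py`, and the real decision is made by the trained model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    """Hypothetical line record: the text plus one layout attribute."""
    text: str
    indent: float  # left offset of the line, e.g. in pixels (assumed field)


def line_features(lines: List[Line], i: int) -> List[float]:
    """Build a toy feature vector for line i from its text and formatting."""
    line = lines[i]
    prev = lines[i - 1] if i > 0 else None
    stripped = line.text.strip()
    # Text feature: does the line start with an uppercase letter?
    starts_upper = float(stripped[:1].isupper())
    # Context feature: did the previous line end a sentence (or is there none)?
    prev_ends_sentence = float(
        prev is None or prev.text.rstrip().endswith((".", "!", "?", ":"))
    )
    # Formatting feature: indentation change relative to the previous line.
    indent_delta = line.indent - prev.indent if prev is not None else 0.0
    return [starts_upper, prev_ends_sentence, indent_delta]


def is_paragraph_start(features: List[float]) -> bool:
    """Hand-written stand-in for a trained binary classifier (an assumption)."""
    starts_upper, prev_ends_sentence, indent_delta = features
    score = starts_upper + prev_ends_sentence + (1.0 if indent_delta > 0 else 0.0)
    return score >= 2.0


lines = [
    Line("The classifier labels every line", indent=20.0),
    Line("of a document as one of two classes.", indent=0.0),
    Line("A new paragraph usually starts with", indent=20.0),
    Line("an indented, capitalized line.", indent=0.0),
]
# True marks the beginning of a new paragraph, False a continuation.
labels = [is_paragraph_start(line_features(lines, i)) for i in range(len(lines))]
```

In this toy input, the first and third lines are labeled as paragraph beginnings and the other two as continuations. A trained model replaces the fixed scoring rule with weights learned from the dataset linked above.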