license: apache-2.0
datasets:
  - dedoc/paragraph_dataset
language:
  - ru
  - en
metrics:
  - f1
  - accuracy

Paragraph classifier

The classifier is used for binary classification of text lines in PDF or scanned documents.

For each document line, it determines whether the line:

  • begins a new paragraph, or

  • continues the previous paragraph.

For each line, a feature vector is formed from the line's text and formatting; see dedoc/structure_extractors/feature_extractors/paragraph_feature_extractor.py in dedoc.
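To make the idea of a per-line feature vector concrete, here is a minimal sketch in Python. It is not the dedoc implementation: the `Line` class, the `line_features` function, and the chosen features are all hypothetical and only illustrate how text and formatting cues might be combined. The actual features are defined in paragraph_feature_extractor.py.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Line:
    """Hypothetical line representation; dedoc's own classes differ."""
    text: str
    indent: float      # left offset of the line, arbitrary units
    font_size: float


def line_features(cur: Line, prev: Optional[Line]) -> List[float]:
    """Build an illustrative feature vector from a line's text and formatting."""
    stripped = cur.text.strip()
    if prev is None:
        # First line of the document: no previous line to compare against.
        prev_ended, indent_delta, font_delta = 1.0, 0.0, 0.0
    else:
        prev_ended = float(prev.text.strip().endswith((".", "!", "?", ":")))
        indent_delta = cur.indent - prev.indent
        font_delta = cur.font_size - prev.font_size
    return [
        float(stripped[:1].isupper()),  # line starts with a capital letter
        prev_ended,                     # previous line ended a sentence
        indent_delta,                   # change in indentation
        font_delta,                     # change in font size
        float(len(stripped)),           # line length in characters
    ]


lines = [
    Line("Paragraph classifier input is a sequence", 20.0, 12.0),
    Line("of lines with formatting.", 0.0, 12.0),
    Line("A new paragraph starts here.", 20.0, 12.0),
]
features = [
    line_features(cur, lines[i - 1] if i > 0 else None)
    for i, cur in enumerate(lines)
]
```

Vectors of this kind would then be passed to a binary classifier that labels each line as either the beginning of a paragraph or a continuation of the previous one.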

  • Training data are available in the dedoc/paragraph_dataset dataset.

  • The training script is here.