metadata

license: apache-2.0
datasets:
  - dedoc/paragraph_dataset
language:
  - ru
  - en
metrics:
  - f1
  - accuracy

Paragraph classifier

The classifier is used for binary classification of text lines in PDF or scanned documents.

For each document line, it determines whether:

the line is the beginning of a new paragraph, or
the line is a continuation of the previous paragraph.
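
As an illustration of how such per-line binary labels might be consumed, here is a minimal sketch that merges lines into paragraphs. The function name `group_into_paragraphs` and its inputs are hypothetical and are not part of dedoc's API; the labels are assumed to have been produced by the classifier beforehand.

```python
from typing import List


def group_into_paragraphs(lines: List[str], starts_paragraph: List[bool]) -> List[str]:
    """Merge text lines into paragraphs using one binary label per line.

    Hypothetical helper: starts_paragraph[i] is True when line i begins a new
    paragraph and False when it continues the previous one.
    """
    if len(lines) != len(starts_paragraph):
        raise ValueError("lines and starts_paragraph must have the same length")

    paragraphs: List[List[str]] = []
    for text, is_start in zip(lines, starts_paragraph):
        # The first line always opens a paragraph, whatever its label says.
        if is_start or not paragraphs:
            paragraphs.append([text])
        else:
            paragraphs[-1].append(text)

    # Join the lines of each paragraph with single spaces.
    return [" ".join(parts) for parts in paragraphs]
```

Joining with a single space is a simplification; a real pipeline would likely also handle hyphenated line breaks.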

For each line, a feature vector is formed from the line's text and formatting. See dedoc/structure_extractors/feature_extractors/paragraph_feature_extractor.py in dedoc.
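
To make the idea of a per-line feature vector concrete, the sketch below builds a toy one from a line's text and formatting. Everything in it is an assumption for illustration: the `Line` class, the `line_features` function, and the specific features chosen are hypothetical and do not reproduce the actual features computed in paragraph_feature_extractor.py.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Line:
    """Hypothetical container for one document line (not a dedoc class)."""
    text: str
    indent: float     # left offset of the line, in arbitrary units
    font_size: float


def line_features(line: Line, prev: Optional[Line]) -> List[float]:
    """Build an illustrative feature vector from text and formatting cues."""
    stripped = line.text.strip()
    prev_text = prev.text.strip() if prev is not None else ""
    return [
        # Text cue: does the line start with an uppercase letter?
        float(stripped[:1].isupper()),
        # Text cue: did the previous line end a sentence (or is there none)?
        float(prev is None or prev_text.endswith((".", "!", "?", ":"))),
        # Formatting cue: indentation change relative to the previous line.
        line.indent - prev.indent if prev is not None else 0.0,
        # Formatting cue: font size change relative to the previous line.
        line.font_size - prev.font_size if prev is not None else 0.0,
        # Text cue: line length in characters.
        float(len(stripped)),
    ]
```

A vector like this would then be passed to a binary classifier, one vector per line.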

Training data are available in the dedoc/paragraph_dataset dataset.
Training script is here.