---
license: apache-2.0
datasets:
- dedoc/paragraph_dataset
language:
- ru
- en
metrics:
- f1
- accuracy
---

# Paragraph classifier

The classifier performs binary classification of text lines in PDF or scanned documents. For each document line, it determines whether:

* the line is the beginning of a new paragraph, or
* the line is a continuation of the previous paragraph.

For each line, a feature vector is built from the line's text and formatting; see `dedoc/structure_extractors/feature_extractors/paragraph_feature_extractor.py` in [dedoc](https://github.com/ispras/dedoc).

* The training data are available in the [dedoc/paragraph_dataset](https://huggingface.co/datasets/dedoc/paragraph_dataset) dataset.
* The training script is [train_paragraph_classifier.py](https://github.com/ispras/dedoc/blob/master/scripts/train/train_paragraph_classifier.py).
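
To make the per-line setup concrete, here is a minimal, self-contained sketch of the idea. It is not dedoc's implementation: the `Line` type, the four features, and the hand-written decision rule are hypothetical stand-ins for the real feature extractor and the trained model. It only illustrates how each line is turned into a feature vector, labelled as "new paragraph" (1) or "continuation" (0), and scored with the two metrics listed above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Line:
    """Hypothetical line record: text plus left indent in arbitrary units."""
    text: str
    indent: float


def line_features(line: Line, prev: Optional[Line]) -> List[float]:
    """Build a toy feature vector from text and formatting (assumed features)."""
    is_first = 1.0 if prev is None else 0.0
    indent_delta = 0.0 if prev is None else line.indent - prev.indent
    starts_upper = 1.0 if line.text.strip()[:1].isupper() else 0.0
    prev_ended = 1.0 if prev is None or prev.text.rstrip().endswith((".", "!", "?", ":")) else 0.0
    return [is_first, indent_delta, starts_upper, prev_ended]


def predict(features: List[float]) -> int:
    """Stand-in rule for a trained model: 1 = new paragraph, 0 = continuation."""
    is_first, indent_delta, starts_upper, prev_ended = features
    if is_first:
        return 1
    return int(indent_delta > 0 and starts_upper == 1.0 and prev_ended == 1.0)


def accuracy(y_true: List[int], y_pred: List[int]) -> float:
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1(y_true: List[int], y_pred: List[int]) -> float:
    """F1 for the positive class (new paragraph)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Made-up example document with hand-assigned labels.
lines = [
    Line("Document structure matters.", 4.0),
    Line("It helps downstream tools", 0.0),
    Line("find headings and lists.", 0.0),
    Line("Paragraphs are one such unit.", 4.0),
    Line("Each starts with an indent.", 0.0),
    Line("No indent here, yet a new paragraph.", 0.0),
]
labels = [1, 0, 0, 1, 0, 1]

predictions = [
    predict(line_features(line, lines[i - 1] if i > 0 else None))
    for i, line in enumerate(lines)
]

print("predictions:", predictions)
print("accuracy:", round(accuracy(labels, predictions), 3))
print("f1:", round(f1(labels, predictions), 3))
```

The last line of the example is a new paragraph without an indent, so the toy rule misses it. A trained classifier over a richer feature vector is meant to handle cases like this.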