LemiSt/code-segmentor-distilbert

This is a distilbert-base-multilingual-cased model fine-tuned with a NER objective to tag each token as belonging to either a code block or natural-language text. The dataset of 78,210 examples was generated by randomly combining code and text blocks from other permissively licensed datasets; some examples contain only code and some contain only regular text.
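
As a rough illustration of how the token tags could be turned into code and text segments, here is a minimal sketch. It assumes the model loads through the standard `transformers` token-classification pipeline with `aggregation_strategy="simple"`. The helper names `load_segmentor` and `merge_adjacent` are hypothetical, and the label names `CODE` and `TEXT` in the sample data are placeholders; the real labels should be read from the model's `id2label` config.

```python
def load_segmentor(model_id="LemiSt/code-segmentor-distilbert"):
    """Build a token-classification pipeline (assumes `transformers` is installed)."""
    from transformers import pipeline  # imported lazily so the helper below has no dependencies
    return pipeline("token-classification", model=model_id, aggregation_strategy="simple")


def merge_adjacent(text, entities, max_gap=1):
    """Merge consecutive spans that share a label and are at most `max_gap` characters apart.

    `entities` is a list of dicts with `entity_group`, `start`, and `end` keys,
    which is the shape the pipeline returns when an aggregation strategy is set.
    """
    merged = []
    for ent in sorted(entities, key=lambda e: e["start"]):
        label, start, end = ent["entity_group"], ent["start"], ent["end"]
        if merged and merged[-1][0] == label and start - merged[-1][2] <= max_gap:
            # Same label and (nearly) contiguous: extend the previous span.
            merged[-1] = (label, merged[-1][1], end)
        else:
            merged.append((label, start, end))
    return [(label, text[s:e]) for label, s, e in merged]


# Hand-written sample in the pipeline's output shape; the labels are placeholders.
sample_text = "Run this: print(1)"
sample_entities = [
    {"entity_group": "TEXT", "start": 0, "end": 9},
    {"entity_group": "CODE", "start": 10, "end": 15},
    {"entity_group": "CODE", "start": 15, "end": 18},
]
segments = merge_adjacent(sample_text, sample_entities)
```

With a real model, the same helper would be called as `merge_adjacent(text, load_segmentor()(text))`. Note that the pipeline drops tokens labeled `O` by default, so if the model uses `O` for plain text, only the code spans will appear in the output.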

The model achieves the following results on the validation set:

Metric	Value
Loss	0.0788
F1 Score	0.8619
Precision	0.8362
Recall	0.8893
Accuracy	0.9792
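
As a quick sanity check on the table, the reported F1 score agrees with the harmonic mean of the reported precision and recall. The sketch below assumes F1 was computed in that standard way; the function name `f1_from` is only illustrative.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (the standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values taken from the validation metrics above.
f1 = f1_from(0.8362, 0.8893)
print(round(f1, 4))
```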