
	---
	license: apache-2.0
	datasets:
	- TurkuNLP/finerweb-10bt
	language:
	- en
	base_model:
	- microsoft/deberta-v3-base
	---
	# Model Card for FinerWeb Line Quality Classifier

	This model is a DeBERTa-v3-base classifier trained to distinguish high- and low-quality content in web text at the line level. It was developed as part of the FinerWeb-10BT project to improve the quality of training data for language models.

	## Model Details

	### Model Description

	- Developed by: University of Turku (Erik Henriksson, Otto Tarkka, Filip Ginter) (*Equal contribution.)
	- Model type: Line-level text quality classifier
	- Language(s) (NLP): English
	- License: apache-2.0
	- Finetuned from model: microsoft/deberta-v3-base

	### Model Sources
	- Paper: https://arxiv.org/abs/2501.07314
	- Repository: https://github.com/TurkuNLP/finerweb-10bt

	## Uses

	### Direct Use

	The model classifies each text line as either Clean (high-quality) or as belonging to one of eight low-quality categories. It outputs a quality score between 0 and 1 for each input line; scores closer to 1 indicate higher-quality content.
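
A minimal sketch of line-level scoring with the Hugging Face `transformers` library is shown below. It rests on several assumptions that this card does not confirm: that the repository id is `TurkuNLP/finerweb-quality-classifier` and also hosts the tokenizer, that the label for high-quality lines is named `Clean` in the model config, and that the quality score is the softmax probability of that label. The helper names and the 0.5 threshold are hypothetical.

```python
import math

# Assumed repository id; adjust if the model is hosted elsewhere.
MODEL_ID = "TurkuNLP/finerweb-quality-classifier"


def clean_probability(logits, id2label, clean_label="Clean"):
    """Apply a softmax to one row of logits and return the Clean-class probability.

    Assumption: the quality score equals the probability of the label
    named `clean_label` in the model config.
    """
    matches = [i for i, name in id2label.items() if name == clean_label]
    if not matches:
        raise ValueError(f"label {clean_label!r} not found in id2label")
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    return exps[matches[0]] / sum(exps)


def keep_clean_lines(lines, scores, threshold=0.5):
    """Keep lines whose score reaches a (hypothetical) threshold."""
    return [line for line, score in zip(lines, scores) if score >= threshold]


def score_lines(lines, model_id=MODEL_ID):
    """Score each line with the classifier; downloads the model on first use."""
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()
    batch = tokenizer(lines, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**batch).logits
    id2label = model.config.id2label
    return [clean_probability(row.tolist(), id2label) for row in logits]
```

Calling `score_lines(document.splitlines())` and passing the result to `keep_clean_lines` would give a simple line filter. The right threshold depends on the downstream use and is not specified here.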

	### Out-of-Scope Use

	The model is specifically trained on English web text and may not perform well on other languages or specialized domains.

	## Training Details

	### Training Data

	The model was trained on a labeled dataset of 328,472 lines from 20,000 documents sampled from FineWeb. The data preparation involved:
	1. Initial line-level labeling by GPT-4o mini, which generated 547 unique descriptive labels
	2. Label refinement and grouping into 9 broader categories using OpenAI's o1-preview model
	3. Manual verification of a small sample only (50 documents, 726 lines) to assess agreement between human annotators and the LLM-generated labels

	The final dataset consisted of 86.24% Clean lines and 13.76% lines distributed across 8 low-quality categories.
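
As a rough check on these figures, the percentages can be converted back into approximate line counts. The counts below are derived from the rounded percentages reported above, so they may differ by a few lines from the true totals. The variable names are illustrative.

```python
# Figures reported in this section.
TOTAL_LINES = 328_472
TOTAL_DOCS = 20_000
CLEAN_SHARE = 0.8624
LOW_QUALITY_SHARE = 0.1376

# Approximate counts reconstructed from the rounded percentages.
clean_lines = round(TOTAL_LINES * CLEAN_SHARE)
low_quality_lines = round(TOTAL_LINES * LOW_QUALITY_SHARE)
avg_lines_per_doc = TOTAL_LINES / TOTAL_DOCS

print(f"Clean lines:        ~{clean_lines:,}")
print(f"Low-quality lines:  ~{low_quality_lines:,}")
print(f"Lines per document: {avg_lines_per_doc:.1f} on average")
```

The imbalance is worth keeping in mind when reading the evaluation scores: the eight low-quality categories together account for roughly one line in seven.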

	### Training Procedure

	#### Training Hyperparameters

	- Training regime: bfloat16 precision
	- Learning rate: 1e-5
	- Batch size: 16
	- Early stopping: Applied with patience of 5 based on evaluation loss
	- Maximum epochs: 5
	- Label smoothing: 0.1 applied to cross-entropy loss
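
To illustrate two of the settings above, the sketch below implements label-smoothed cross-entropy and a patience-based early-stopping check in plain Python. It follows the common formulation in which the smoothing mass is spread uniformly over all classes. The project's actual training code may differ, and the function names are hypothetical.

```python
import math


def label_smoothed_cross_entropy(logits, target, smoothing=0.1):
    """Cross-entropy against a smoothed target distribution.

    The target puts (1 - smoothing) on the true class and spreads
    `smoothing` uniformly over all K classes.
    """
    k = len(logits)
    shift = max(logits)  # log-sum-exp with a shift for numerical stability
    log_z = shift + math.log(sum(math.exp(x - shift) for x in logits))
    loss = 0.0
    for i, logit in enumerate(logits):
        q = smoothing / k + ((1.0 - smoothing) if i == target else 0.0)
        loss -= q * (logit - log_z)
    return loss


def should_stop(eval_losses, patience=5):
    """Return True once the best evaluation loss is `patience` evaluations old."""
    if not eval_losses:
        return False
    best_idx = min(range(len(eval_losses)), key=eval_losses.__getitem__)
    return len(eval_losses) - 1 - best_idx >= patience


# Example with 9 classes (Clean plus 8 low-quality categories).
logits = [4.0] + [0.0] * 8
print(label_smoothed_cross_entropy(logits, target=0, smoothing=0.1))
print(should_stop([0.60, 0.52, 0.53, 0.55, 0.54, 0.56, 0.57]))
```

With smoothing, a confident correct prediction still incurs a small loss, which discourages overconfident logits. This can help when the labels are LLM-generated and therefore somewhat noisy.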

	## Evaluation

	### Testing Data, Factors & Metrics

	#### Metrics

	The model achieved the following scores:
	- Micro F1 score: 0.81
	- Macro F1 score: 0.66
	- Clean class metrics:
	  - Precision: 0.88
	  - Recall: 0.91
	  - F1: 0.90
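
The sketch below shows how micro and macro F1 differ, using made-up labels rather than the model's actual predictions. Micro F1 pools true and false positives across all classes, so in single-label classification it equals accuracy. Macro F1 averages the per-class F1 scores, so each minority class counts as much as Clean. The label name `LowQuality` and the helper names are hypothetical.

```python
from collections import Counter


def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def ratio(numerator, denominator):
    """Safe division that returns 0.0 for an empty denominator."""
    return numerator / denominator if denominator else 0.0


def f1_report(y_true, y_pred):
    """Return (per-class F1, micro F1, macro F1) for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    labels = sorted(set(y_true) | set(y_pred))
    per_class = {
        c: f1(ratio(tp[c], tp[c] + fp[c]), ratio(tp[c], tp[c] + fn[c]))
        for c in labels
    }
    total_tp, total_fp, total_fn = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro = f1(ratio(total_tp, total_tp + total_fp), ratio(total_tp, total_tp + total_fn))
    macro = ratio(sum(per_class.values()), len(per_class))
    return per_class, micro, macro


# Toy data: an imbalanced set with one missed low-quality line.
y_true = ["Clean"] * 6 + ["LowQuality"] * 2
y_pred = ["Clean"] * 7 + ["LowQuality"]
per_class, micro, macro = f1_report(y_true, y_pred)
print(per_class, micro, macro)
```

Plugging the rounded Clean precision (0.88) and recall (0.91) into `f1` gives about 0.895, which is consistent with the reported 0.90 once rounding of the inputs is taken into account. The gap between the micro and macro scores suggests that some low-quality categories are classified less reliably than Clean.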

	## Technical Specifications

	### Compute Infrastructure

	#### Hardware
	Computational resources for this study were provided by CSC — IT Center for Science. Training was performed on a single A100 GPU.