---
language:
- en
base_model:
- microsoft/deberta-v3-base
---
# Model Card for FinerWeb Line Quality Classifier

This model is a DeBERTa-v3-base classifier trained to identify high- and low-quality content in web text at the line level. It was developed as part of the FinerWeb-10BT project to enhance training data quality for language models.

## Model Details

### Model Description

- **Developed by:** University of Turku (Erik Henriksson*, Otto Tarkka*, Filip Ginter) (*Equal contribution.)
- **Model type:** Line-level text quality classifier
- **Language(s) (NLP):** English
- **License:** apache-2.0
- **Finetuned from model:** microsoft/deberta-v3-base

### Model Sources

- **Paper:** [arXiv:2501.07314](https://arxiv.org/abs/2501.07314) [cs.CL]
- **Repository:** https://github.com/TurkuNLP/finerweb-10bt

## Uses

### Direct Use

The model classifies text lines as either Clean (high-quality) or one of several low-quality categories. For each input line it outputs a quality score between 0 and 1, where scores closer to 1 indicate higher-quality content.
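
A minimal usage sketch with the Transformers `text-classification` pipeline. The model ID below is a placeholder for this repository's Hub ID, and the exact label strings (e.g. `Clean`) should be checked against the model's `config.json`:

```python
# Hedged usage sketch: score individual lines with the Transformers
# text-classification pipeline. Replace the placeholder model ID with
# this repository's actual Hub ID.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="TurkuNLP/finerweb-line-quality",  # placeholder ID
    top_k=None,  # return scores for every quality category
)

lines = [
    "The committee approved the proposal after a short debate.",
    "Click here >>> FREE DOWNLOAD <<<",
]
for line, scores in zip(lines, classifier(lines)):
    # Use the probability of the Clean class as the line's quality score.
    clean = next(s["score"] for s in scores if s["label"] == "Clean")
    print(f"{clean:.2f}  {line}")
```
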
### Out-of-Scope Use

The model is specifically trained on English web text and may not perform well on other languages or specialized domains. It should not be used as the sole determinant of text quality without human oversight.

## Training Details

### Training Data

The model was trained on a labeled dataset of 328,472 lines from 20,000 documents sampled from FineWeb. Data preparation involved three steps:

1. Initial line-level labeling by GPT-4o mini, which generated 547 unique descriptive labels (sketched below)
2. Refinement and grouping of these labels into 9 broader categories using OpenAI's o1-preview model
3. Manual verification on a small sample (50 documents / 726 lines) to assess agreement between human annotators and the LLM-generated labels

The final dataset consisted of 86.24% Clean lines, with the remaining 13.76% distributed across 8 low-quality categories.
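
A minimal sketch of the labeling step above, assuming the OpenAI Python client; the prompt wording, batching, and response parsing are illustrative assumptions, not the exact setup from the paper:

```python
# Hedged sketch of step 1: asking GPT-4o mini for a short descriptive
# quality label per line. Prompt text and parsing are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def label_lines(lines: list[str]) -> list[str]:
    numbered = "\n".join(f"{i}: {line}" for i, line in enumerate(lines))
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "For each numbered line of this web document, reply with "
                "'<number>: <short quality label>' (for example 'Clean' "
                "or 'Menu Item'), one pair per line:\n" + numbered
            ),
        }],
    )
    text = response.choices[0].message.content
    # Naive parsing: one "<number>: <label>" pair per output line.
    return [part.split(":", 1)[1].strip() for part in text.splitlines()]
```
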
### Training Procedure

#### Training Hyperparameters

- **Training regime:** bfloat16 precision
- **Learning rate:** 1e-5
- **Batch size:** 16
- **Early stopping:** applied with a patience of 5, based on evaluation loss
- **Maximum epochs:** 5
- **Label smoothing:** 0.1 applied to the cross-entropy loss
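
Assuming the Hugging Face `Trainer` was used (the repository holds the authoritative training script), the hyperparameters above translate roughly to:

```python
# Hedged sketch: the listed hyperparameters as Hugging Face Trainer
# configuration. Model and dataset setup are omitted; the output
# directory name is a placeholder.
from transformers import TrainingArguments, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="finerweb-line-classifier",  # placeholder
    learning_rate=1e-5,
    per_device_train_batch_size=16,
    num_train_epochs=5,
    bf16=True,                    # bfloat16 precision
    label_smoothing_factor=0.1,   # smoothed cross-entropy loss
    eval_strategy="epoch",        # assumption: evaluate once per epoch
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
early_stopping = EarlyStoppingCallback(early_stopping_patience=5)
# Pass both to Trainer(..., args=training_args, callbacks=[early_stopping]).
```
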
## Evaluation

### Testing Data, Factors & Metrics

#### Metrics

The model was evaluated using:

- Micro F1 score: 0.81
- Macro F1 score: 0.66
- Clean class metrics:
  - Precision: 0.88
  - Recall: 0.91
  - F1: 0.90
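
For reference, these aggregate scores can be computed from per-line gold labels and predictions with scikit-learn (a sketch with toy labels; the real category names come from the dataset):

```python
# Hedged sketch: micro/macro F1 plus per-class metrics for the Clean
# class via scikit-learn. The labels below are toy examples only.
from sklearn.metrics import f1_score, precision_recall_fscore_support

y_true = ["Clean", "Clean", "Boilerplate", "Clean"]        # gold labels
y_pred = ["Clean", "Boilerplate", "Boilerplate", "Clean"]  # predictions

micro = f1_score(y_true, y_pred, average="micro")
macro = f1_score(y_true, y_pred, average="macro")
p, r, f, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=["Clean"], average=None
)
print(f"micro={micro:.2f} macro={macro:.2f} "
      f"Clean P={p[0]:.2f} R={r[0]:.2f} F1={f[0]:.2f}")
```
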
## Technical Specifications

### Compute Infrastructure

#### Hardware

Computational resources for this study were provided by CSC — IT Center for Science. Training was performed on a single A100 GPU.