---
license: apache-2.0
datasets:
- TurkuNLP/finerweb-10bt
language:
- en
base_model:
- microsoft/deberta-v3-base
---

# Model Card for FinerWeb Line Quality Classifier

This model is a DeBERTa-v3-base classifier trained to identify high- and low-quality content in web text at the line level. It was developed as part of the FinerWeb-10BT project to improve the quality of training data for language models.

## Model Details

### Model Description

- **Developed by:** University of Turku (Erik Henriksson*, Otto Tarkka*, Filip Ginter) (*Equal contribution.)
- **Model type:** Line-level text quality classifier
- **Language(s) (NLP):** English
- **License:** apache-2.0
- **Finetuned from model:** microsoft/deberta-v3-base

### Model Sources

- **Paper:** https://arxiv.org/abs/2501.07314
- **Repository:** https://github.com/TurkuNLP/finerweb-10bt

## Uses

### Direct Use

The model classifies each line of text as either Clean (high-quality) or as belonging to one of several low-quality categories. It outputs a quality score between 0 and 1 for each input line, where scores closer to 1 indicate higher-quality content.

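The following is a minimal sketch of how line scoring and filtering might look with the `transformers` library. Several things here are assumptions rather than facts from this card: `MODEL_ID` is a placeholder to be replaced with this model's Hub ID, the quality score is taken to be the softmax probability of a class named `Clean` in the model config, and the 512-token limit, batch size, and 0.5 threshold are illustrative values.

```python
import math

# Placeholder: replace with the Hugging Face Hub ID of this classifier.
MODEL_ID = "<hub-id-of-this-model>"
# Assumed name of the high-quality class in the model's label2id mapping.
CLEAN_LABEL = "Clean"


def clean_probability(logits, clean_index):
    """Softmax probability of the Clean class for one line's logits."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    return exps[clean_index] / sum(exps)


def filter_lines(lines, scores, threshold=0.5):
    """Keep lines whose quality score is at least `threshold` (illustrative value)."""
    return [line for line, score in zip(lines, scores) if score >= threshold]


def score_lines(lines, model_id=MODEL_ID, batch_size=32):
    """Return one quality score in [0, 1] per input line."""
    # Imported lazily so the helpers above work without these packages.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()
    clean_index = model.config.label2id[CLEAN_LABEL]

    scores = []
    for start in range(0, len(lines), batch_size):
        batch = lines[start:start + batch_size]
        inputs = tokenizer(
            batch, padding=True, truncation=True, max_length=512, return_tensors="pt"
        )
        with torch.no_grad():
            logits = model(**inputs).logits
        scores.extend(clean_probability(row, clean_index) for row in logits.tolist())
    return scores
```

With a real model ID in place, a document could be cleaned with `filter_lines(lines, score_lines(lines))`, where `lines` is the document split on newlines. A suitable threshold depends on how much data loss is acceptable and should be chosen empirically.
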
### Out-of-Scope Use

The model was trained only on English web text and may perform poorly on other languages or specialized domains.

## Training Details

### Training Data

The model was trained on a labeled dataset of 328,472 lines from 20,000 documents sampled from FineWeb. The data preparation involved:

1. Initial line-level labeling by GPT-4o mini, which generated 547 unique descriptive labels
2. Label refinement and grouping into 9 broader categories using OpenAI's o1-preview model
3. Manual verification on a small sample only (50 documents, 726 lines) to assess inter-annotator agreement between human annotators and the LLM-generated labels

The final dataset consisted of 86.24% Clean lines and 13.76% lines distributed across 8 low-quality categories.

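To make the class imbalance concrete, the short calculation below derives approximate line counts from the figures above. The counts are estimates only, since they are reconstructed from percentages rounded to two decimals.

```python
TOTAL_LINES = 328_472
TOTAL_DOCS = 20_000
CLEAN_SHARE = 0.8624  # rounded percentage reported in the card

# Approximate counts reconstructed from the rounded share.
clean_lines = round(TOTAL_LINES * CLEAN_SHARE)
low_quality_lines = TOTAL_LINES - clean_lines
lines_per_doc = TOTAL_LINES / TOTAL_DOCS

print(f"Clean lines:        ~{clean_lines:,}")
print(f"Low-quality lines:  ~{low_quality_lines:,}")
print(f"Lines per document: {lines_per_doc:.1f}")
print(f"Clean to low-quality ratio: about {clean_lines / low_quality_lines:.1f} to 1")
```

The roughly 45,000 low-quality lines are further split across 8 categories, so individual low-quality classes have far fewer training examples than the Clean class.
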
### Training Procedure

#### Training Hyperparameters

- **Training regime:** bfloat16 precision
- **Learning rate:** 1e-5
- **Batch size:** 16
- **Early stopping:** Applied with patience of 5 based on evaluation loss
- **Maximum epochs:** 5
- **Label smoothing:** 0.1 applied to cross-entropy loss

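As a rough sketch, the hyperparameters above could be written as keyword arguments for `transformers.TrainingArguments`. Using the Hugging Face `Trainer` is an assumption here, as is treating the batch size as per-device; the card lists only the values. The evaluation interval is not stated, so it is left out. The helper `smoothed_cross_entropy` is a hypothetical plain-Python illustration of what label smoothing of 0.1 does to the loss, using the common definition that mixes the one-hot target with a uniform distribution.

```python
import math

# Keyword arguments named after `transformers.TrainingArguments` fields.
# Assumption: the Trainer API was used; the card only reports the values.
training_kwargs = {
    "learning_rate": 1e-5,
    "per_device_train_batch_size": 16,  # assumed per-device (single GPU)
    "num_train_epochs": 5,
    "bf16": True,
    "label_smoothing_factor": 0.1,
    "load_best_model_at_end": True,  # required by EarlyStoppingCallback
    "metric_for_best_model": "eval_loss",
    "greater_is_better": False,
}
# Would be passed as EarlyStoppingCallback(early_stopping_patience=5).
EARLY_STOPPING_PATIENCE = 5


def smoothed_cross_entropy(logits, target, smoothing=0.1):
    """Cross-entropy against a target mixed with a uniform distribution."""
    peak = max(logits)  # log-sum-exp with max subtraction for stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    log_probs = [x - log_norm for x in logits]

    uniform = smoothing / len(logits)
    loss = 0.0
    for k, log_p in enumerate(log_probs):
        # The true class gets 1 - smoothing plus its uniform share.
        weight = uniform + (1.0 - smoothing if k == target else 0.0)
        loss -= weight * log_p
    return loss


# Example: 9 classes, as in this model's label set.
example_logits = [4.0] + [0.0] * 8
print(smoothed_cross_entropy(example_logits, target=0, smoothing=0.0))
print(smoothed_cross_entropy(example_logits, target=0, smoothing=0.1))
```

With smoothing enabled, a confident correct prediction still incurs some loss, which discourages the classifier from becoming overconfident on noisy LLM-generated labels.
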
## Evaluation

### Testing Data, Factors & Metrics

#### Metrics

The model was evaluated using the following metrics:

- Micro F1 score: 0.81
- Macro F1 score: 0.66
- Clean class metrics:
  - Precision: 0.88
  - Recall: 0.91
  - F1: 0.90

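The gap between micro and macro F1 is typical for imbalanced label sets: micro F1 pools all decisions and is dominated by the large Clean class, while macro F1 averages per-class scores so that small classes count equally. The sketch below illustrates this on made-up toy labels; the data and the `f1_scores` helper are hypothetical and are not the evaluation code behind the numbers above.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro F1, macro F1, per-class F1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    per_class = {label: f1(tp[label], fp[label], fn[label]) for label in labels}
    macro = sum(per_class.values()) / len(labels)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro, per_class


# Toy data: 8 Clean lines and 2 lines from one low-quality class.
y_true = ["Clean"] * 8 + ["Other"] * 2
y_pred = ["Clean"] * 7 + ["Other"] + ["Other", "Clean"]

micro, macro, per_class = f1_scores(y_true, y_pred)
print(f"micro F1 = {micro:.4f}")  # 0.8000
print(f"macro F1 = {macro:.4f}")  # 0.6875
print(per_class)
```

In the toy example the minority class has an F1 of 0.5, which pulls the macro average well below the micro score even though most lines are classified correctly.
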
## Technical Specifications

### Compute Infrastructure

#### Hardware

Computational resources for this study were provided by CSC – IT Center for Science. Training was performed on a single A100 GPU.