language:
- en
base_model:
- microsoft/deberta-v3-base
---
# Model Card for FinerWeb Line Quality Classifier

This model is a DeBERTa-v3-base classifier trained to identify high- and low-quality content in web text at the line level. It was developed as part of the FinerWeb-10BT project to improve training data quality for language models.

## Model Details

### Model Description

- **Developed by:** University of Turku (Erik Henriksson*, Otto Tarkka*, Filip Ginter) (*Equal contribution.)
- **Model type:** Line-level text quality classifier
- **Language(s) (NLP):** English
- **License:** apache-2.0
- **Finetuned from model:** microsoft/deberta-v3-base

### Model Sources

- **Paper:** arXiv:2501.07314 [cs.CL]
- **Repository:** https://github.com/TurkuNLP/finerweb-10bt

## Uses

### Direct Use

The model classifies each text line as either Clean (high-quality) or as belonging to one of eight low-quality categories. It outputs a quality score between 0 and 1 for each input line, where scores closer to 1 indicate higher-quality content.

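One natural way to derive such a score from a multi-class classifier is to take the softmax probability assigned to the Clean class. The sketch below illustrates this reading; the category names and logits are hypothetical, not the model's actual label set or outputs.

```python
import math

# Hypothetical label set: "Clean" plus eight placeholder low-quality
# categories (the model's real category names are not reproduced here).
LABELS = ["Clean"] + [f"LowQuality-{i}" for i in range(1, 9)]

def softmax(logits):
    """Numerically stable softmax over a list of raw class scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def quality_score(logits, clean_index=0):
    """Quality score in [0, 1]: the probability mass on the Clean class."""
    return softmax(logits)[clean_index]

# Example: logits strongly favouring the Clean class yield a score near 1
score = quality_score([4.0, -1.0, -1.0, 0.5, -2.0, -1.5, 0.0, -0.5, -1.0])
print(round(score, 3))
```

Lines scoring below a chosen threshold can then be filtered out of a pretraining corpus.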
### Out-of-Scope Use

The model is trained specifically on English web text and may not perform well on other languages or specialized domains. It should not be used as the sole determinant of text quality without human oversight.

## Training Details

### Training Data

The model was trained on a labeled dataset of 328,472 lines from 20,000 documents sampled from FineWeb. Data preparation involved:

1. Initial line-level labeling by GPT-4o mini, which generated 547 unique descriptive labels
2. Refinement and grouping of these labels into 9 broader categories using OpenAI's o1-preview model
3. Manual verification on a small sample (50 documents / 726 lines) to assess inter-annotator agreement between human annotators and the LLM-generated labels

The final dataset consisted of 86.24% Clean lines, with the remaining 13.76% distributed across 8 low-quality categories.

### Training Procedure

#### Training Hyperparameters

- **Training regime:** bfloat16 precision
- **Learning rate:** 1e-5
- **Batch size:** 16
- **Early stopping:** patience of 5, based on evaluation loss
- **Maximum epochs:** 5
- **Label smoothing:** 0.1 applied to the cross-entropy loss

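Label smoothing with ε = 0.1 replaces the one-hot target with a softened distribution before computing cross-entropy, which discourages overconfident predictions. A minimal sketch of the idea, using the common convention of mixing the one-hot target with a uniform distribution over all classes (as in PyTorch's `label_smoothing` option); the exact convention used in training is not specified in this card:

```python
import math

def smoothed_cross_entropy(logits, target, eps=0.1):
    """Cross-entropy against a label-smoothed target distribution:
    q = (1 - eps) * one_hot(target) + eps * uniform."""
    n = len(logits)
    # Log-sum-exp for numerically stable log-softmax
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]
    # Smoothed target: mix the one-hot target with a uniform distribution
    q = [eps / n + (1.0 - eps) * (1.0 if i == target else 0.0) for i in range(n)]
    return -sum(qi * lp for qi, lp in zip(q, log_probs))
```

With eps = 0 this reduces to the ordinary cross-entropy; with eps > 0 the loss stays bounded away from zero even for perfectly confident correct predictions.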
## Evaluation

### Testing Data, Factors & Metrics

#### Metrics

The model was evaluated using:

- Micro F1 score: 0.81
- Macro F1 score: 0.66
- Clean class metrics:
  - Precision: 0.88
  - Recall: 0.91
  - F1: 0.90

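The gap between micro F1 (0.81) and macro F1 (0.66) reflects the class imbalance noted above: for single-label multi-class classification, micro F1 reduces to plain accuracy and is dominated by the majority Clean class, while macro F1 weights all nine classes equally. A small self-contained illustration with hypothetical labels and predictions:

```python
from collections import Counter

def f1_scores(y_true, y_pred, labels):
    """Per-class F1 plus micro- and macro-averaged F1."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    per_class_f1 = {}
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        per_class_f1[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    macro = sum(per_class_f1.values()) / len(labels)
    # With exactly one predicted label per example, micro F1 equals accuracy
    micro = sum(tp.values()) / len(y_true)
    return micro, macro, per_class_f1

# Imbalanced toy data: errors on the rare class barely move micro F1
y_true = ["Clean"] * 8 + ["Boilerplate"] * 2
y_pred = ["Clean"] * 8 + ["Clean", "Boilerplate"]
micro, macro, _ = f1_scores(y_true, y_pred, ["Clean", "Boilerplate"])
```

Here micro F1 exceeds macro F1, mirroring the pattern in the table above.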
## Technical Specifications

### Compute Infrastructure

#### Hardware

Computational resources for this study were provided by CSC – IT Center for Science. Training was performed on a single A100 GPU.