tshasan committed ef6b410 (verified) · Parent: 9bada10 · Create README.md
---
license: cc-by-4.0
tags:
- multi-label-classification
- text-classification
- onnx
- web-classification
- firefox-ai
- preview
language:
- multilingual
datasets:
- tshasan/multi-label-web-classification
metrics:
- f1
- roc-auc
- hamming-loss
- precision
- recall
- jaccard
base_model: nomic-ai/modernbert-embed-base
pipeline_tag: text-classification
---

# modernBERT-URLTITLE-classifier-preview

## Model Overview

This is a **preview version** of a multi-label web classification model fine-tuned from `nomic-ai/modernbert-embed-base`. It classifies websites into multiple categories based on their URLs and titles. The model supports 10 labels: `News`, `Entertainment`, `Shop`, `Chat`, `Education`, `Government`, `Health`, `Technology`, `Work`, and `Travel`. It is designed to handle imbalanced datasets, treating "Uncategorized" samples as empty (no categories assigned).

- **Developed by**: Taimur Hasan
- **Model Type**: Multi-label Text Classification
- **Base Model**: `nomic-ai/modernbert-embed-base`
- **Language**: Multilingual
- **License**: CC BY 4.0
- **Status**: Preview (under active development)

## Intended Use

This model is designed to categorize websites using minimal input (URL and title).
Potential applications include:
- Web content categorization
- Recommendation systems
- Data enrichment for web scraping

### Limitations
- **Preview Status**: Performance metrics are preliminary and may improve with further development.
- **Rare Labels**: Classification of underrepresented categories may require additional tuning.

## Model Details

### Architecture
- **Base Model**: `nomic-ai/modernbert-embed-base`
- **Fine-tuning**: Unfroze the last 4 encoder layers and the pooler
- **Problem Type**: Multi-label classification
- **Output Labels**: 10 (`News`, `Entertainment`, `Shop`, `Chat`, `Education`, `Government`, `Health`, `Technology`, `Work`, `Travel`)
- **Input Format**: Concatenated string: `"URL: {url} Title: {title}"`

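The input formatting and multi-label decoding can be sketched as follows. This is a minimal illustration, not the model's actual inference code: the example logits are made up, and the 0.5 decision threshold is an assumption (the card does not state the threshold used).

```python
import math

LABELS = ["News", "Entertainment", "Shop", "Chat", "Education",
          "Government", "Health", "Technology", "Work", "Travel"]

def build_input(url: str, title: str) -> str:
    # Input format used at training time, per the card.
    return f"URL: {url} Title: {title}"

def decode(logits, threshold=0.5):
    # Multi-label: each logit goes through an independent sigmoid, and every
    # label whose probability reaches the (assumed) threshold is assigned.
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return [label for label, p in zip(LABELS, probs) if p >= threshold]

text = build_input("https://example.com/news/article", "Breaking: markets rally")
preds = decode([2.1, -1.3, -2.0, -3.0, -1.0, -2.5, -2.2, 0.7, -1.8, -2.9])
# preds -> ["News", "Technology"]
```

In practice the logits would come from the model itself (e.g. via `transformers` or ONNX Runtime); `decode` then turns them into a label set, and an empty set corresponds to "Uncategorized".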

### Training Data
- **Dataset**: `tshasan/multi-label-web-classification`
- **Preprocessing**:
  - "Uncategorized" labels mapped to empty (all labels set to 0)
  - Rare labels (<5% occurrence) augmented using synonym replacement via `nlpaug`
  - Oversampling applied to rare-label samples to address imbalance

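The oversampling step can be sketched as below. This is a toy illustration, not the actual preprocessing script: the samples and duplication factor are made up, and the rarity cutoff is raised from the card's 5% to 30% so the tiny example actually contains rare labels.

```python
from collections import Counter

# Toy dataset: (input text, set of assigned labels). All values are illustrative.
samples = [
    ("URL: https://a.example Title: Daily headlines", {"News"}),
    ("URL: https://b.example Title: Gadget review", {"News", "Technology"}),
    ("URL: https://c.example Title: City travel guide", {"Travel"}),
    ("URL: https://d.example Title: Morning briefing", {"News"}),
]

def oversample_rare(samples, rare_frac, factor=3):
    # A label is "rare" if it appears in fewer than rare_frac of all samples.
    counts = Counter(lbl for _, labels in samples for lbl in labels)
    rare = {lbl for lbl, c in counts.items() if c / len(samples) < rare_frac}
    # Duplicate every sample that carries at least one rare label.
    out = list(samples)
    for sample in samples:
        if sample[1] & rare:
            out.extend([sample] * (factor - 1))
    return out

balanced = oversample_rare(samples, rare_frac=0.3)
# "Travel" and "Technology" samples are now tripled; "News"-only samples are untouched.
```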

### Training Procedure
- **Framework**: Hugging Face Transformers
- **Optimizer**: AdamW (learning rate: 2e-5)
- **Batch Size**: 32 (for both training and evaluation)
- **Epochs**: 2 (with early stopping, patience=3)
- **Loss Function**: BCEWithLogitsLoss with class weights to handle imbalance
- **Hardware**: GPU (CUDA-enabled)
- **Metrics**: F1 (micro/macro), ROC-AUC, Hamming Loss, Precision, Recall, etc.
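The class-weighted loss can be sketched in plain Python as below. This mirrors the semantics of `torch.nn.BCEWithLogitsLoss(pos_weight=...)` but is not the training code; the card does not say how the class weights were derived, so the weights here are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def weighted_bce_with_logits(logits, targets, pos_weight):
    # Per-label binary cross-entropy for one sample; pos_weight > 1 up-weights
    # the positive term for rare labels (same semantics as
    # torch.nn.BCEWithLogitsLoss(pos_weight=...)).
    per_label = []
    for z, y, w in zip(logits, targets, pos_weight):
        p = sigmoid(z)
        per_label.append(-(w * y * math.log(p) + (1 - y) * math.log(1.0 - p)))
    return sum(per_label) / len(per_label)

# A confident miss on a heavily weighted rare positive label costs
# proportionally more than the same miss with weight 1.
low_w = weighted_bce_with_logits([-2.0], [1], [1.0])
high_w = weighted_bce_with_logits([-2.0], [1], [5.0])
```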

## Evaluation

### Metrics
Below are the performance metrics on the validation dataset:

| Metric               | Value   |
|----------------------|---------|
| Loss                 | 0.276   |
| Hamming Loss         | 0.097   |
| Exact Match          | 0.390   |
| Precision (Micro)    | 0.903   |
| Recall (Micro)       | 0.903   |
| F1 (Micro)           | 0.903   |
| Precision (Macro)    | 0.583   |
| Recall (Macro)       | 0.490   |
| F1 (Macro)           | 0.498   |
| Precision (Weighted) | 0.631   |
| Recall (Weighted)    | 0.496   |
| F1 (Weighted)        | 0.537   |
| ROC-AUC (Micro)      | 0.859   |
| ROC-AUC (Macro)      | 0.849   |
| PR-AUC (Micro)       | 0.550   |
| PR-AUC (Macro)       | 0.539   |
| Jaccard (Micro)      | 0.822   |
| Jaccard (Macro)      | 0.335   |
| Runtime (seconds)    | 3.855   |
| Samples per Second   | 739.794 |
| Steps per Second     | 23.346  |
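The gap between the micro-averaged scores (dominated by frequent labels) and the macro-averaged scores (which weight all 10 labels equally) points to the rare labels as the weak spot. A minimal sketch of how two of these multi-label metrics are computed, on made-up predictions rather than the model's outputs:

```python
# Multi-label metrics over 0/1 indicator matrices (rows = samples, cols = labels).
# y_true / y_pred are toy values for illustration only.
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1]]

def micro_f1(y_true, y_pred):
    # Pool true/false positives and false negatives over every (sample, label) cell.
    cells = [(t, p) for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr)]
    tp = sum(t & p for t, p in cells)
    fp = sum((1 - t) & p for t, p in cells)
    fn = sum(t & (1 - p) for t, p in cells)
    return 2 * tp / (2 * tp + fp + fn)

def hamming_loss(y_true, y_pred):
    # Fraction of individual label decisions that are wrong.
    cells = [(t, p) for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr)]
    return sum(t != p for t, p in cells) / len(cells)
```

Macro averaging instead computes F1 per label and takes the unweighted mean, which is why a few poorly predicted rare labels can drag F1 (Macro) to 0.498 while F1 (Micro) stays at 0.903.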