tshasan committed ef6b410 (verified) · Parent: 9bada10 · Create README.md
---
license: cc-by-4.0
tags:
- multi-label-classification
- text-classification
- onnx
- web-classification
- firefox-ai
- preview
language:
- multilingual
datasets:
- tshasan/multi-label-web-classification
metrics:
- f1
- roc-auc
- hamming-loss
- precision
- recall
- jaccard
base_model: nomic-ai/modernbert-embed-base
pipeline_tag: text-classification
---

# modernBERT-URLTITLE-classifier-preview

## Model Overview

This is a **preview version** of a multi-label web classification model fine-tuned from `nomic-ai/modernbert-embed-base`. It classifies websites into multiple categories based on their URLs and titles. The model supports 10 labels: `News`, `Entertainment`, `Shop`, `Chat`, `Education`, `Government`, `Health`, `Technology`, `Work`, and `Travel`. It is designed to handle imbalanced datasets, treating "Uncategorized" samples as empty (no categories assigned).

- **Developed by**: Taimur Hasan
- **Model Type**: Multi-label Text Classification
- **Base Model**: `nomic-ai/modernbert-embed-base`
- **Language**: Multilingual
- **License**: CC BY 4.0
- **Status**: Preview (under active development)

## Intended Use

This model is designed to categorize websites using minimal input (URL and title).
Potential applications include:
- Web content categorization
- Recommendation systems
- Data enrichment for web scraping

### Limitations
- **Preview Status**: Performance metrics are preliminary and may improve with further development.
- **Rare Labels**: Classification of underrepresented categories may require additional tuning.

## Model Details

### Architecture
- **Base Model**: `nomic-ai/modernbert-embed-base`
- **Fine-tuning**: Unfroze the last 4 encoder layers and the pooler
- **Problem Type**: Multi-label classification
- **Output Labels**: 10 (`News`, `Entertainment`, `Shop`, `Chat`, `Education`, `Government`, `Health`, `Technology`, `Work`, `Travel`)
- **Input Format**: Concatenated string: `"URL: {url} Title: {title}"`

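The input formatting and multi-label decoding can be sketched as follows. This is a minimal illustration, not the model's actual inference code: the example logits are made up, and the 0.5 decision threshold is an assumption (the card does not state the threshold used).

```python
import math

LABELS = ["News", "Entertainment", "Shop", "Chat", "Education",
          "Government", "Health", "Technology", "Work", "Travel"]

def build_input(url: str, title: str) -> str:
    # Input format used at training time, per the card.
    return f"URL: {url} Title: {title}"

def decode(logits, threshold=0.5):
    # Multi-label: each logit goes through an independent sigmoid, and every
    # label whose probability reaches the (assumed) threshold is assigned.
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return [label for label, p in zip(LABELS, probs) if p >= threshold]

text = build_input("https://example.com/news/article", "Breaking: markets rally")
preds = decode([2.1, -1.3, -2.0, -3.0, -1.0, -2.5, -2.2, 0.7, -1.8, -2.9])
# preds -> ["News", "Technology"]
```

In practice the logits would come from the model itself (e.g. via `transformers` or ONNX Runtime); `decode` then turns them into a label set, and an empty set corresponds to "Uncategorized".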

### Training Data
- **Dataset**: `tshasan/multi-label-web-classification`
- **Preprocessing**:
  - "Uncategorized" labels mapped to empty (all labels set to 0)
  - Rare labels (<5% occurrence) augmented using synonym replacement via `nlpaug`
  - Oversampling applied to rare-label samples to address imbalance

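The oversampling step can be sketched as below. This is a toy illustration, not the actual preprocessing script: the samples and duplication factor are made up, and the rarity cutoff is raised from the card's 5% to 30% so the tiny example actually contains rare labels.

```python
from collections import Counter

# Toy dataset: (input text, set of assigned labels). All values are illustrative.
samples = [
    ("URL: https://a.example Title: Daily headlines", {"News"}),
    ("URL: https://b.example Title: Gadget review", {"News", "Technology"}),
    ("URL: https://c.example Title: City travel guide", {"Travel"}),
    ("URL: https://d.example Title: Morning briefing", {"News"}),
]

def oversample_rare(samples, rare_frac, factor=3):
    # A label is "rare" if it appears in fewer than rare_frac of all samples.
    counts = Counter(lbl for _, labels in samples for lbl in labels)
    rare = {lbl for lbl, c in counts.items() if c / len(samples) < rare_frac}
    # Duplicate every sample that carries at least one rare label.
    out = list(samples)
    for sample in samples:
        if sample[1] & rare:
            out.extend([sample] * (factor - 1))
    return out

balanced = oversample_rare(samples, rare_frac=0.3)
# "Travel" and "Technology" samples are now tripled; "News"-only samples are untouched.
```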

### Training Procedure
- **Framework**: Hugging Face Transformers
- **Optimizer**: AdamW (learning rate: 2e-5)
- **Batch Size**: 32 (for both training and evaluation)
- **Epochs**: 2 (with early stopping, patience=3)
- **Loss Function**: BCEWithLogitsLoss with class weights to handle imbalance
- **Hardware**: GPU (CUDA-enabled)
- **Metrics**: F1 (micro/macro), ROC-AUC, Hamming Loss, Precision, Recall, etc.
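The class-weighted loss can be sketched in plain Python as below. This mirrors the semantics of `torch.nn.BCEWithLogitsLoss(pos_weight=...)` but is not the training code; the card does not say how the class weights were derived, so the weights here are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def weighted_bce_with_logits(logits, targets, pos_weight):
    # Per-label binary cross-entropy for one sample; pos_weight > 1 up-weights
    # the positive term for rare labels (same semantics as
    # torch.nn.BCEWithLogitsLoss(pos_weight=...)).
    per_label = []
    for z, y, w in zip(logits, targets, pos_weight):
        p = sigmoid(z)
        per_label.append(-(w * y * math.log(p) + (1 - y) * math.log(1.0 - p)))
    return sum(per_label) / len(per_label)

# A confident miss on a heavily weighted rare positive label costs
# proportionally more than the same miss with weight 1.
low_w = weighted_bce_with_logits([-2.0], [1], [1.0])
high_w = weighted_bce_with_logits([-2.0], [1], [5.0])
```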

## Evaluation

### Metrics
Below are the performance metrics on the validation dataset:

| Metric               | Value   |
|----------------------|---------|
| Loss                 | 0.276   |
| Hamming Loss         | 0.097   |
| Exact Match          | 0.390   |
| Precision (Micro)    | 0.903   |
| Recall (Micro)       | 0.903   |
| F1 (Micro)           | 0.903   |
| Precision (Macro)    | 0.583   |
| Recall (Macro)       | 0.490   |
| F1 (Macro)           | 0.498   |
| Precision (Weighted) | 0.631   |
| Recall (Weighted)    | 0.496   |
| F1 (Weighted)        | 0.537   |
| ROC-AUC (Micro)      | 0.859   |
| ROC-AUC (Macro)      | 0.849   |
| PR-AUC (Micro)       | 0.550   |
| PR-AUC (Macro)       | 0.539   |
| Jaccard (Micro)      | 0.822   |
| Jaccard (Macro)      | 0.335   |
| Runtime (seconds)    | 3.855   |
| Samples per Second   | 739.794 |
| Steps per Second     | 23.346  |
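The gap between the micro-averaged scores (dominated by frequent labels) and the macro-averaged scores (which weight all 10 labels equally) points to the rare labels as the weak spot. A minimal sketch of how two of these multi-label metrics are computed, on made-up predictions rather than the model's outputs:

```python
# Multi-label metrics over 0/1 indicator matrices (rows = samples, cols = labels).
# y_true / y_pred are toy values for illustration only.
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1]]

def micro_f1(y_true, y_pred):
    # Pool true/false positives and false negatives over every (sample, label) cell.
    cells = [(t, p) for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr)]
    tp = sum(t & p for t, p in cells)
    fp = sum((1 - t) & p for t, p in cells)
    fn = sum(t & (1 - p) for t, p in cells)
    return 2 * tp / (2 * tp + fp + fn)

def hamming_loss(y_true, y_pred):
    # Fraction of individual label decisions that are wrong.
    cells = [(t, p) for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr)]
    return sum(t != p for t, p in cells) / len(cells)
```

Macro averaging instead computes F1 per label and takes the unweighted mean, which is why a few poorly predicted rare labels can drag F1 (Macro) to 0.498 while F1 (Micro) stays at 0.903.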