DistilBERT-Based Quantized Model for Spam Detection

This repository hosts a quantized version of the DistilBERT model, fine-tuned for spam detection tasks. The model is optimized for efficient deployment, making it suitable for resource-constrained environments while maintaining high accuracy.

Model Details

  • Model Architecture: DistilBERT Base Uncased
  • Task: Binary Spam Detection
  • Dataset: Custom Spam Dataset (CSV format)
  • Quantization: Float16
  • Fine-tuning Framework: Hugging Face Transformers

Usage

Installation

pip install transformers torch datasets scikit-learn

Loading the Model

from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification
import torch

# Load quantized model
model_path = "quantized-model"
quantized_model = DistilBertForSequenceClassification.from_pretrained(model_path)
tokenizer = DistilBertTokenizerFast.from_pretrained(model_path)

quantized_model.eval()
quantized_model.half()

# Example inference
text = "Congratulations! You've won a $1000 Walmart gift card. Go to http://bit.ly/123456 to claim now."
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)

with torch.no_grad():
    outputs = quantized_model(**inputs)
    predicted_class = torch.argmax(outputs.logits, dim=1).item()

label_map = {0: "Not Spam", 1: "Spam"}
print(f"Predicted Label: {label_map[predicted_class]}")

Performance Metrics

  • Accuracy: ~0.97
  • F1 Score: optimized via early stopping and a best-model selection strategy
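
For reference, below is a minimal sketch of how these metrics could be reproduced with scikit-learn. It reuses the quantized_model and tokenizer loaded above; the test_texts and test_labels variables are hypothetical placeholders for your own held-out data.

from sklearn.metrics import accuracy_score, f1_score
import torch

# Hypothetical held-out data: message strings and 0/1 labels (1 = spam)
test_texts = ["Win a free prize now!", "Are we still meeting at 3pm?"]
test_labels = [1, 0]

inputs = tokenizer(test_texts, return_tensors="pt", truncation=True, padding=True, max_length=128)
with torch.no_grad():
    logits = quantized_model(**inputs).logits
preds = torch.argmax(logits, dim=1).tolist()

print("Accuracy:", accuracy_score(test_labels, preds))
print("F1 Score:", f1_score(test_labels, preds))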

Fine-Tuning Details

Dataset

The dataset consists of SMS/email messages labeled as spam or not spam, stored in CSV format.
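
The exact CSV schema is not included in this card. As an illustration only, assuming a file named spam.csv with text and label columns, the data could be loaded and tokenized like this:

from datasets import load_dataset
from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

# Hypothetical file and column names ("text", "label"), for illustration only
dataset = load_dataset("csv", data_files="spam.csv")["train"]
dataset = dataset.train_test_split(test_size=0.2, seed=42)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

tokenized = dataset.map(tokenize, batched=True)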

Training Configuration

  • Epochs: 2
  • Batch size: 16 (train), 64 (eval)
  • Learning rate: 3e-5
  • Evaluation strategy: per epoch
  • Early stopping: enabled
  • Mixed precision (fp16): enabled on GPU
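
The training script itself is not shipped with the card; the following is a sketch of how the settings above could map onto Hugging Face TrainingArguments and Trainer, reusing the tokenized splits from the dataset sketch. The output path, early-stopping patience, and best-model metric are assumptions.

import torch
from transformers import (DistilBertForSequenceClassification, Trainer,
                          TrainingArguments, EarlyStoppingCallback)

model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

training_args = TrainingArguments(
    output_dir="spam_model",               # assumed output path
    num_train_epochs=2,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    learning_rate=3e-5,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,           # best-model selection
    metric_for_best_model="eval_loss",     # assumed selection metric
    fp16=torch.cuda.is_available(),        # mixed precision only on GPU
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    callbacks=[EarlyStoppingCallback(early_stopping_patience=1)],  # assumed patience
)
trainer.train()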

Quantization

Post-training quantization was applied using PyTorch to reduce model size and improve inference speed.
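
The quantization script is not part of the card. A minimal sketch of float16 post-training quantization with PyTorch, assuming the fine-tuned checkpoint lives in spam_model/, is:

from transformers import DistilBertForSequenceClassification, DistilBertTokenizerFast

# Load the fine-tuned full-precision model (path assumed from the repository layout below)
model = DistilBertForSequenceClassification.from_pretrained("spam_model")
tokenizer = DistilBertTokenizerFast.from_pretrained("spam_model")

# Cast all weights to float16 and save the smaller checkpoint for deployment
model.half()
model.save_pretrained("quantized-model")
tokenizer.save_pretrained("quantized-model")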

Repository Structure

.
├── spam_model/          # Original trained model
├── quantized-model/     # Quantized model for deployment
├── tokenizer/           # Tokenizer files
└── README.md            # Documentation

Limitations

  • May not generalize well to messages with formats unseen during training.
  • Quantization might slightly impact accuracy.