# BERT Base Uncased Quantized Model for Spam Detection

This repository hosts a quantized version of the BERT model, fine-tuned for spam detection. The model has been optimized for efficient deployment while maintaining high accuracy, making it suitable for resource-constrained environments.

## Model Details

- **Model Architecture:** BERT Base Uncased
- **Task:** Spam Email Detection
- **Dataset:** Hugging Face's `mail_spam_ham_dataset` and `spam-mail`
- **Quantization:** Float16
- **Fine-tuning Framework:** Hugging Face Transformers

## Usage

### Installation

```sh
pip install transformers torch
```

### Loading the Model

```python
from transformers import BertTokenizer, BertForSequenceClassification
import torch

model_name = "AventIQ-AI/bert-spam-detection"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)

# Move the model to GPU if one is available and switch to inference mode
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

def predict_spam_quantized(text):
    """Predicts whether a given text is spam (1) or ham (0) using the quantized BERT model."""
    # Tokenize input text
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)

    # Move inputs to the same device as the model
    inputs = {key: value.to(device) for key, value in inputs.items()}

    # Perform inference
    with torch.no_grad():
        outputs = model(**inputs)

    # Get predicted label (0 = ham, 1 = spam)
    prediction = torch.argmax(outputs.logits, dim=1).item()

    return "Spam" if prediction == 1 else "Ham"

# Sample test messages
print(predict_spam_quantized("WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only."))
# Expected output: Spam

print(predict_spam_quantized("Hi, are we still meeting for lunch tomorrow at noon?"))
# Expected output: Ham
```

## 📊 Classification Report (Quantized Model - float16)

| Metric        | Class 0 (Non-Spam) | Class 1 (Spam) | Macro Avg | Weighted Avg |
|---------------|--------------------|----------------|-----------|--------------|
| **Precision** | 1.00               | 0.98           | 0.99      | 0.99         |
| **Recall**    | 0.99               | 0.99           | 0.99      | 0.99         |
| **F1-Score**  | 0.99               | 0.99           | 0.99      | 0.99         |
| **Accuracy**  | **99%**            | **99%**        | **99%**   | **99%**      |

### 🔍 **Observations**

✅ **Precision:** High (1.00 for non-spam, 0.98 for spam) → **few false positives**

✅ **Recall:** High (0.99 for both classes) → **few false negatives**

✅ **F1-Score:** **Near-perfect balance** between precision and recall

## Fine-Tuning Details

### Dataset

The Hugging Face `spam-mail` and `mail_spam_ham_dataset` datasets were combined and used for fine-tuning; together they contain both spam and ham (non-spam) examples. Illustrative sketches of the data preparation, training, evaluation, and quantization steps appear at the end of this README.

### Training

- Number of epochs: 3
- Batch size: 8
- Evaluation strategy: epoch
- Learning rate: 2e-5

### Quantization

Post-training quantization to float16 was applied using PyTorch's built-in half-precision support to reduce the model size and improve inference efficiency.

## Repository Structure

```
.
├── model/               # Contains the quantized model files
├── tokenizer_config/    # Tokenizer configuration and vocabulary files
├── model.safetensors    # Fine-tuned model weights
├── README.md            # Model documentation
```

## Limitations

- The model may not generalize well to domains outside the fine-tuning dataset.
- Quantization may result in minor accuracy degradation compared to full-precision models.

## Contributing

Contributions are welcome!
Feel free to open an issue or submit a pull request if you have suggestions or improvements.
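
## Example Sketches

The snippets below are minimal, illustrative sketches of the fine-tuning pipeline described above; they are not the exact scripts used to produce this model. Dataset Hub identifiers, column names (`text`, `label`), and output paths are assumptions and may need to be adapted.

The first sketch combines the two datasets listed under **Model Details** into a single train/evaluation split, assuming both expose compatible `text`/`label` columns.

```python
from datasets import load_dataset, concatenate_datasets

# Hypothetical Hub identifiers; substitute the actual dataset paths.
ham_spam = load_dataset("mail_spam_ham_dataset", split="train")
spam_mail = load_dataset("spam-mail", split="train")

# concatenate_datasets requires identical features, so rename or remap
# columns first if the real schemas differ (assumed: "text" and "label",
# with 0 = ham and 1 = spam).
combined = concatenate_datasets([ham_spam, spam_mail]).shuffle(seed=42)
splits = combined.train_test_split(test_size=0.2, seed=42)
train_ds, eval_ds = splits["train"], splits["test"]
```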
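
The next sketch fine-tunes `bert-base-uncased` with the Hugging Face `Trainer`, using the hyperparameters listed under **Training** (3 epochs, batch size 8, learning rate 2e-5, per-epoch evaluation). It continues from the dataset sketch above and uses the `train_ds` / `eval_ds` splits defined there.

```python
from transformers import (
    BertForSequenceClassification,
    BertTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

def tokenize(batch):
    # Assumes a "text" column in the combined dataset
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)

train_tok = train_ds.map(tokenize, batched=True)
eval_tok = eval_ds.map(tokenize, batched=True)

training_args = TrainingArguments(
    output_dir="./bert-spam-detection",  # hypothetical output path
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=2e-5,
    eval_strategy="epoch",  # older transformers versions call this `evaluation_strategy`
    save_strategy="epoch",
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tok,
    eval_dataset=eval_tok,
)
trainer.train()
```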
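
A classification report like the one shown above can be produced from the held-out split with scikit-learn (`pip install scikit-learn`). This continues from the training sketch and is illustrative only.

```python
import numpy as np
from sklearn.metrics import classification_report

# Predict on the evaluation split and compare against the reference labels
pred_output = trainer.predict(eval_tok)
preds = np.argmax(pred_output.predictions, axis=-1)

print(classification_report(
    pred_output.label_ids,
    preds,
    target_names=["Ham (0)", "Spam (1)"],
    digits=2,
))
```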
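
Finally, one simple way to obtain the float16 weights described under **Quantization** is to cast the fine-tuned checkpoint to half precision and re-save it. The exact procedure used for this repository may differ, and the local paths below are assumptions.

```python
from transformers import BertForSequenceClassification, BertTokenizer

# Load the fine-tuned full-precision checkpoint (hypothetical local path)
model = BertForSequenceClassification.from_pretrained("./bert-spam-detection")
tokenizer = BertTokenizer.from_pretrained("./bert-spam-detection")

# Cast all weights to float16 as a simple post-training quantization step
model = model.half()

# Save the quantized model and tokenizer for upload or deployment
model.save_pretrained("./bert-spam-detection-fp16")
tokenizer.save_pretrained("./bert-spam-detection-fp16")

# Note: float16 inference is best run on GPU; on CPU, cast back to float32
# with model.float() before running predictions.
```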