# DistilBERT Base Uncased Quantized Model for Spam Detection

This repository hosts a quantized version of the DistilBERT model, fine-tuned for spam detection. The model has been optimized for efficient deployment while maintaining high accuracy, making it suitable for resource-constrained environments.

## Model Details

- **Model Architecture:** DistilBERT Base Uncased
- **Task:** Spam Detection
- **Dataset:** Hugging Face `sms_spam`
- **Quantization:** bfloat16 (Brain Floating Point)
- **Fine-tuning Framework:** Hugging Face Transformers

## Usage

### Installation

```sh
pip install transformers torch
```

### Loading the Model

```python
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
import torch

model_name = "AventIQ-AI/distilbert-spam-detection"
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = DistilBertForSequenceClassification.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)  # Move the model to the same device as the inputs

def predict_spam(text, model, tokenizer, device):
    model.eval()  # Set to evaluation mode
    inputs = tokenizer(text, return_tensors="pt", padding="max_length", truncation=True, max_length=128).to(device)

    with torch.no_grad():
        outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)
    pred_class = torch.argmax(probs).item()
    return "Spam" if pred_class == 1 else "Not Spam"

# Sample test messages
test_messages = [
    "Congratulations! You have won a lottery of $1,000,000. Claim now!",        # Spam
    "Hey, are we still meeting for dinner tonight?",                            # Not Spam
    "URGENT: Your bank account is at risk! Click this link to secure it now.",  # Spam
    "Let's catch up this weekend. It's been a while!",                          # Not Spam
    "Exclusive offer! Get 50% off on your next purchase. Limited time only!",   # Spam
]

# Run inference on test messages
for i, msg in enumerate(test_messages):
    prediction = predict_spam(msg, model, tokenizer, device)
    print(f"Sample {i+1}: {msg} -> Prediction: {prediction}")
```

## 📊 Classification Report (Quantized Model - bfloat16)

| Metric        | Class 0 (Non-Spam) | Class 1 (Spam) | Macro Avg | Weighted Avg |
|---------------|--------------------|----------------|-----------|--------------|
| **Precision** | 1.00               | 0.98           | 0.99      | 0.99         |
| **Recall**    | 0.99               | 0.99           | 0.99      | 0.99         |
| **F1-Score**  | 0.99               | 0.99           | 0.99      | 0.99         |
| **Accuracy**  | **99%**            | **99%**        | **99%**   | **99%**      |

### 🔍 Observations

- ✅ **Precision:** High (1.00 for non-spam, 0.98 for spam) → **few false positives**
- ✅ **Recall:** High (0.99 for both classes) → **few false negatives**
- ✅ **F1-Score:** **Near-perfect balance** between precision and recall

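To re-check numbers like these, the hedged evaluation sketch below can be used. It reuses `predict_spam`, `model`, `tokenizer`, and `device` from the Usage section, and it assumes the `datasets` and `scikit-learn` packages are installed; the 80/20 held-out split and seed are assumptions, since the exact evaluation split for the report above is not documented in this repository.

```python
# Hypothetical evaluation sketch: reuses predict_spam/model/tokenizer/device from
# the Usage section above. Requires `pip install datasets scikit-learn`.
# The 80/20 split and seed are assumptions, not the documented evaluation setup.
from datasets import load_dataset
from sklearn.metrics import classification_report

test_split = load_dataset("sms_spam", split="train").train_test_split(test_size=0.2, seed=42)["test"]

labels, preds = [], []
for example in test_split:
    labels.append(example["label"])  # 0 = ham (not spam), 1 = spam
    prediction = predict_spam(example["sms"], model, tokenizer, device)
    preds.append(1 if prediction == "Spam" else 0)

print(classification_report(labels, preds, target_names=["Not Spam", "Spam"]))
```
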
## Fine-Tuning Details

### Dataset

The Hugging Face `sms_spam` dataset was used; it contains both spam and ham (non-spam) SMS messages.

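As a quick orientation, the sketch below loads the dataset and prints its columns and class balance. It assumes the `datasets` package is installed (`pip install datasets`), which is not listed in the Installation section above.

```python
# Quick inspection sketch for the `sms_spam` dataset (requires `pip install datasets`).
# The dataset ships as a single "train" split with "sms" (text) and "label"
# (0 = ham, 1 = spam) columns.
from collections import Counter

from datasets import load_dataset

dataset = load_dataset("sms_spam", split="train")
print(dataset)                    # column names and number of rows
print(Counter(dataset["label"]))  # class balance: ham (0) vs. spam (1)
print(dataset[0])                 # e.g. {'sms': '...', 'label': 0}
```
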
### Training

- Number of epochs: 7
- Batch size: 16
- Evaluation strategy: epoch
- Learning rate: 5e-6

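For reference, a minimal sketch of how these hyperparameters could be wired into the Transformers `Trainer` is given below; the base checkpoint, train/test split, tokenization helper, and `output_dir` are illustrative assumptions rather than the exact training script used for this model.

```python
# Hypothetical fine-tuning sketch built around the hyperparameters listed above.
# The train/test split, tokenize helper, and output_dir are illustrative assumptions.
from datasets import load_dataset
from transformers import (DistilBertForSequenceClassification, DistilBertTokenizer,
                          Trainer, TrainingArguments)

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

# sms_spam has a single "train" split; carve out a held-out test set (assumed 80/20 split)
dataset = load_dataset("sms_spam", split="train").train_test_split(test_size=0.2, seed=42)

def tokenize(batch):
    return tokenizer(batch["sms"], padding="max_length", truncation=True, max_length=128)

dataset = dataset.map(tokenize, batched=True)

training_args = TrainingArguments(
    output_dir="./distilbert-spam-detection",  # assumed output path
    num_train_epochs=7,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=5e-6,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()
```
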
### Quantization

Post-training quantization was applied with PyTorch by converting the fine-tuned model's weights to bfloat16, reducing the model size and improving inference efficiency.

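A minimal sketch of that conversion, assuming a full-precision fine-tuned checkpoint on disk (the paths are illustrative, not files in this repository):

```python
# Hypothetical bfloat16 conversion sketch. "./distilbert-spam-detection" and
# "./distilbert-spam-detection-bf16" are illustrative paths, not part of this repo.
import torch
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer

model = DistilBertForSequenceClassification.from_pretrained("./distilbert-spam-detection")
model = model.to(torch.bfloat16)  # cast every weight tensor to bfloat16

model.save_pretrained("./distilbert-spam-detection-bf16")  # saved checkpoint keeps the bfloat16 dtype

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
tokenizer.save_pretrained("./distilbert-spam-detection-bf16")
```

bfloat16 keeps the dynamic range of float32 while halving weight storage, which is why the accuracy loss from this kind of cast is typically small.
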
## Repository Structure

```
.
├── model/               # Contains the quantized model files
├── tokenizer_config/    # Tokenizer configuration and vocabulary files
├── pytorch_model.bin    # Fine-tuned model weights
└── README.md            # Model documentation
```

## Limitations

- The model may not generalize well to domains outside the fine-tuning dataset.
- Quantization may result in minor accuracy degradation compared to full-precision models.

## Contributing

Contributions are welcome! Feel free to open an issue or submit a pull request if you have suggestions or improvements.