# BERT Base Uncased Quantized Model for Spam Detection

This repository hosts a quantized version of the BERT model, fine-tuned for spam detection tasks. The model has been optimized for efficient deployment while maintaining high accuracy, making it suitable for resource-constrained environments.

## Model Details

- **Model Architecture:** BERT Base Uncased
- **Task:** Spam Email Detection
- **Dataset:** Hugging Face's `mail_spam_ham_dataset` and `spam-mail`
- **Quantization:** Float16
- **Fine-tuning Framework:** Hugging Face Transformers

## Usage

### Installation

```sh
pip install transformers torch
```

### Loading the Model

```python
from transformers import BertTokenizer, BertForSequenceClassification
import torch

model_name = "AventIQ-AI/bert-spam-detection"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)

# Move the model to GPU (if available) and switch to inference mode
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

def predict_spam_quantized(text):
    """Predicts whether a given text is spam (1) or ham (0) using the quantized BERT model."""
    # Tokenize input text
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)

    # Move inputs to the same device as the model
    inputs = {key: value.to(device) for key, value in inputs.items()}

    # Perform inference
    with torch.no_grad():
        outputs = model(**inputs)

    # Get predicted label (0 = ham, 1 = spam)
    prediction = torch.argmax(outputs.logits, dim=1).item()

    return "Spam" if prediction == 1 else "Ham"


# Sample test messages
print(predict_spam_quantized("WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only."))
# Expected output: Spam

print(predict_spam_quantized("Hi, are we still on for lunch tomorrow at noon?"))
# Expected output: Ham
```

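If you prefer the high-level API, the same checkpoint can also be loaded with the Transformers `pipeline` helper. This is a minimal sketch rather than part of the original instructions; the label names it returns (e.g. `LABEL_0` / `LABEL_1`) depend on the `id2label` mapping stored in the model config.

```python
from transformers import pipeline

# Text-classification pipeline over the same checkpoint; pass device=0 to run on the first GPU
spam_classifier = pipeline("text-classification", model="AventIQ-AI/bert-spam-detection")

result = spam_classifier("Congratulations! You have been selected for a free cruise. Reply YES to claim.")
print(result)  # e.g. [{'label': 'LABEL_1', 'score': 0.99}], where label index 1 corresponds to spam
```
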
## 📊 Classification Report (Quantized Model - float16)

| Metric        | Class 0 (Non-Spam) | Class 1 (Spam) | Macro Avg | Weighted Avg |
|---------------|--------------------|----------------|-----------|--------------|
| **Precision** | 1.00               | 0.98           | 0.99      | 0.99         |
| **Recall**    | 0.99               | 0.99           | 0.99      | 0.99         |
| **F1-Score**  | 0.99               | 0.99           | 0.99      | 0.99         |

**Overall accuracy:** 99%

### 🔍 Observations

- ✅ **Precision:** High (1.00 for non-spam, 0.98 for spam) → **few false positives**
- ✅ **Recall:** High (0.99 for both classes) → **few false negatives**
- ✅ **F1-Score:** **Near-perfect balance** between precision and recall

## Fine-Tuning Details

### Dataset

The Hugging Face `spam-mail` and `mail_spam_ham_dataset` datasets were combined for fine-tuning; together they contain both spam and ham (non-spam) examples.

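The exact preprocessing script is not included in this repository. The sketch below shows one way the two datasets could be combined with the `datasets` library; the Hub paths and column names are placeholders, not the actual identifiers used for training.

```python
from datasets import load_dataset, concatenate_datasets

# Hypothetical Hub IDs: the README names the datasets but not their full Hub paths,
# so replace these with the actual `spam-mail` and `mail_spam_ham_dataset` repos.
spam_mail = load_dataset("your-org/spam-mail", split="train")
spam_ham = load_dataset("your-org/mail_spam_ham_dataset", split="train")

# Assumes both datasets already share a "text" column and a binary "label" column
# (0 = ham, 1 = spam); rename or remap columns first if the schemas differ.
combined = concatenate_datasets([spam_mail, spam_ham]).shuffle(seed=42)
dataset = combined.train_test_split(test_size=0.2, seed=42)
print(dataset)
```
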
### Training

Fine-tuning used the following hyperparameters (a training sketch using them follows the list):

- Number of epochs: 3
- Batch size: 8
- Evaluation strategy: epoch
- Learning rate: 2e-5

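The original training script is not shown here; the following is a minimal sketch of how these hyperparameters map onto the Hugging Face `Trainer` API, assuming the combined `dataset` with `text`/`label` columns from the sketch above. The output path and column names are assumptions.

```python
from transformers import (BertTokenizer, BertForSequenceClassification,
                          TrainingArguments, Trainer)

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    # Assumes the combined dataset exposes a "text" column
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)

tokenized = dataset.map(tokenize, batched=True)  # `dataset` from the dataset sketch above

args = TrainingArguments(
    output_dir="bert-spam-detection",   # placeholder output path
    num_train_epochs=3,                 # Number of epochs: 3
    per_device_train_batch_size=8,      # Batch size: 8
    per_device_eval_batch_size=8,
    evaluation_strategy="epoch",        # Evaluation strategy: epoch
    learning_rate=2e-5,                 # Learning rate: 2e-5
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
)
trainer.train()
```
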
### Quantization

Post-training quantization to float16 was applied with PyTorch to reduce the model size and improve inference efficiency.

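The precise conversion step is not shown in this repository; below is a minimal sketch of a float16 post-training conversion with PyTorch, assuming the fine-tuned checkpoint is available locally (the paths are placeholders).

```python
import torch
from transformers import BertForSequenceClassification

# Load the fine-tuned full-precision checkpoint (placeholder path)
model = BertForSequenceClassification.from_pretrained("bert-spam-detection")

# Convert all weights to float16 (half precision) and save the smaller checkpoint
model = model.half()
model.save_pretrained("bert-spam-detection-fp16")

# Rough size check: float16 weights take half the memory of float32
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.1f}M parameters, ~{num_params * 2 / 1e6:.0f} MB in float16")
```

Float16 weights are mainly useful for GPU inference; on CPU they are typically cast back to float32 before running the model.
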
## Repository Structure

```
.
├── model/               # Contains the quantized model files
├── tokenizer_config/    # Tokenizer configuration and vocabulary files
├── model.safetensors    # Fine-tuned model weights
└── README.md            # Model documentation
```

## Limitations

- The model may not generalize well to domains outside the fine-tuning dataset.
- Quantization may result in minor accuracy degradation compared to full-precision models.

## Contributing

Contributions are welcome! Feel free to open an issue or submit a pull request if you have suggestions or improvements.