### **SecureBERT Phishing Detection Model**

This repository hosts a **SecureBERT-based** model fine-tuned for **phishing URL detection** on a cybersecurity dataset. The model classifies each URL as either **phishing (malicious)** or **safe (benign)**.

---

## **Model Details**

- **Model Architecture**: SecureBERT (Based on BERT)
- **Task**: Binary Classification (Phishing vs. Safe)
- **Dataset**: shashwatwork/web-page-phishing-detection-dataset (11,431 URLs, 88 features)
- **Framework**: PyTorch & Hugging Face Transformers
- **Input Data**: URL strings & extracted numerical features
- **Number of Classes**: 2 (**Phishing, Safe**)
- **Quantization**: FP16 (for efficiency)

---

## **Usage**

### **Installation**

```bash
pip install torch transformers scikit-learn pandas
```

### **Loading the Model**

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the fine-tuned model and tokenizer from a local directory
model_path = "./fine_tuned_SecureBERT"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)
model.eval()  # Set model to evaluation mode

print("SecureBERT model loaded successfully and ready for inference!")
```

---

### **Perform Phishing Detection**

```python
def predict_url(url):
    # Tokenize input
    encoding = tokenizer(url, truncation=True, padding=True, max_length=512, return_tensors="pt")

    # Perform inference without tracking gradients
    with torch.no_grad():
        output = model(**encoding)

    # Get predicted class (index of the largest logit)
    predicted_class = torch.argmax(output.logits, dim=1).item()

    # Map label: class 1 is phishing, class 0 is safe
    label = "Phishing" if predicted_class == 1 else "Safe"
    return label

# Example usage
custom_url = "http://example.com/free-gift"
prediction = predict_url(custom_url)
print(f"Predicted label: {prediction}")
```

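`predict_url` returns only the arg-max label. When a confidence score or an adjustable decision threshold is useful, the two logits can be turned into probabilities with a softmax. The following is a minimal sketch in plain Python, so it runs without the model. The helper names `softmax` and `label_with_threshold`, the example logits, and the default threshold of 0.5 are illustrative assumptions; only the index convention (1 = phishing) is taken from `predict_url`. Raising the threshold would be one way to trade some recall for fewer false positives.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_with_threshold(logits, threshold=0.5):
    """Return (label, phishing_probability) for a pair of class logits.

    Assumes index 1 is the phishing class, as in predict_url.
    """
    phishing_prob = softmax(logits)[1]
    label = "Phishing" if phishing_prob >= threshold else "Safe"
    return label, phishing_prob

# HYPOTHETICAL logits, e.g. what output.logits[0].tolist() might return
example_logits = [0.3, 2.1]

label, prob = label_with_threshold(example_logits)
print(f"{label} (p = {prob:.3f})")

# A stricter threshold flags fewer URLs as phishing
strict_label, _ = label_with_threshold(example_logits, threshold=0.9)
print(f"With threshold 0.9: {strict_label}")
```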
---

## **Evaluation Results**

After fine-tuning, the model was evaluated on a **test set**, achieving the following performance:

| **Metric** | **Score** |
|------------------|-----------|
| **Accuracy** | 97.2% |
| **Precision** | 96.8% |
| **Recall** | 97.5% |
| **F1-Score** | 97.1% |
| **Inference Speed** | Fast (Optimized with FP16) |

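As a quick consistency check, F1 is the harmonic mean of precision and recall, so the reported F1 score can be recomputed from the precision and recall rows of the table. The short sketch below does that; the helper name `f1_from_precision_recall` is illustrative and not part of any library.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (same units in and out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported in the table, in percent
reported_precision = 96.8
reported_recall = 97.5

f1 = f1_from_precision_recall(reported_precision, reported_recall)
print(f"F1 = {f1:.1f}%")  # prints "F1 = 97.1%", matching the table
```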
---

## **Fine-Tuning Details**

### **Dataset**

The model was trained on the **shashwatwork/web-page-phishing-detection-dataset**, which consists of **11,431 URLs**, each labeled as either **phishing** or **safe**. Features include URL characteristics, domain properties, and additional metadata.

### **Training Configuration**

- **Number of epochs**: 5
- **Batch size**: 16
- **Optimizer**: AdamW
- **Learning rate**: 2e-5
- **Loss Function**: Cross-Entropy
- **Evaluation Strategy**: Validation at each epoch

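For a sense of scale, the epoch count and batch size listed here imply a few thousand optimizer steps. The sketch below counts them under stated assumptions: the 80/20 train/test split is a guess (the split actually used is not documented here), one optimizer step is taken per batch with no gradient accumulation, and `optimizer_steps` is an illustrative helper rather than a library function.

```python
import math

def optimizer_steps(n_train, batch_size, epochs):
    """Total optimizer steps, counting a final partial batch in each epoch."""
    steps_per_epoch = math.ceil(n_train / batch_size)
    return steps_per_epoch * epochs

n_total = 11_431              # dataset size quoted in this README
n_train = int(n_total * 0.8)  # ASSUMPTION: 80/20 split, not confirmed by the source

total = optimizer_steps(n_train, batch_size=16, epochs=5)
print(f"{n_train} training URLs -> {total} optimizer steps")
```

With a different split or with gradient accumulation, the count changes accordingly.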
### **Quantization**

The model weights were converted to **FP16 (half precision)**, which reduces memory usage and, on hardware with native FP16 support, latency, while maintaining high accuracy.

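To make the memory saving concrete: FP16 stores each weight in 2 bytes instead of the 4 bytes used by FP32, so weight storage roughly halves, at the cost of some rounding. The sketch below shows the arithmetic with Python's standard `struct` module. The parameter count is a placeholder chosen for illustration, not a figure taken from this model, and the helper names are hypothetical.

```python
import struct

BYTES_FP32 = struct.calcsize("f")  # 4 bytes, IEEE 754 single precision
BYTES_FP16 = struct.calcsize("e")  # 2 bytes, IEEE 754 half precision

def weight_mebibytes(n_params, bytes_per_param):
    """Approximate storage for the weights alone, in MiB."""
    return n_params * bytes_per_param / (1024 ** 2)

def to_fp16(value):
    """Round-trip a Python float through half precision to expose rounding."""
    return struct.unpack("e", struct.pack("e", value))[0]

n_params = 100_000_000  # PLACEHOLDER parameter count, for illustration only

print(f"FP32 weights: {weight_mebibytes(n_params, BYTES_FP32):.0f} MiB")
print(f"FP16 weights: {weight_mebibytes(n_params, BYTES_FP16):.0f} MiB")
print(f"0.1 stored as FP16: {to_fp16(0.1)!r}")
```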
---

## **Limitations**

- **Evasion Techniques**: Attackers constantly evolve phishing techniques, which may reduce model effectiveness.
- **Dataset Bias**: The model was trained on a specific dataset; new phishing tactics may require retraining.
- **False Positives**: Some legitimate but unusual URLs might be classified as phishing.

---

**Use this fine-tuned SecureBERT model for accurate and efficient phishing detection!**