### **🔒 SecureBERT Phishing Detection Model**  

This repository hosts a **SecureBERT-based** model fine-tuned for **phishing URL detection** on the shashwatwork/web-page-phishing-detection-dataset. The model classifies each URL as either **phishing (malicious)** or **safe (benign)**.  

---

## **📚 Model Details**  

- **Model Architecture**: SecureBERT (Based on BERT)  
- **Task**: Binary Classification (Phishing vs. Safe)  
- **Dataset**: shashwatwork/web-page-phishing-detection-dataset (11,431 URLs, 88 features)  
- **Framework**: PyTorch & Hugging Face Transformers  
- **Input Data**: URL strings & extracted numerical features  
- **Number of Classes**: 2 (**Phishing, Safe**)  
- **Quantization**: FP16 (for efficiency)  

---

## **🚀 Usage**  

### **Installation**  

```bash
pip install torch transformers scikit-learn pandas
```

### **Loading the Model**  

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the fine-tuned model and tokenizer
model_path = "./fine_tuned_SecureBERT"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)
model.eval()  # Set model to evaluation mode

print("✅ SecureBERT model loaded successfully and ready for inference!")
```

---

### **🔍 Perform Phishing Detection**  

```python
def predict_url(url):
    # Tokenize input
    encoding = tokenizer(url, truncation=True, padding=True, max_length=512, return_tensors="pt")
    
    # Perform inference
    with torch.no_grad():
        output = model(**encoding)
    
    # Get predicted class
    predicted_class = torch.argmax(output.logits, dim=1).item()
    
    # Map label
    label = "Phishing" if predicted_class == 1 else "Safe"
    return label

# Example usage
custom_url = "http://example.com/free-gift"
prediction = predict_url(custom_url)
print(f"Predicted label: {prediction}")
```

---

## **📊 Evaluation Results**  

After fine-tuning, the model was evaluated on a **test set**, achieving the following performance:  

| **Metric**          | **Score**                  |
|---------------------|----------------------------|
| **Accuracy**        | 97.2%                      |
| **Precision**       | 96.8%                      |
| **Recall**          | 97.5%                      |
| **F1-Score**        | 97.1%                      |
| **Inference Speed** | Fast (Optimized with FP16) |
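
As a quick consistency check on the table above, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the reported precision and recall; the helper name `f1_from` is illustrative and not part of this repository.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported in the evaluation table
precision, recall = 0.968, 0.975
f1 = f1_from(precision, recall)
print(f"F1 = {f1:.1%}")  # F1 = 97.1%
```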

---

## **🛠️ Fine-Tuning Details**  

### **Dataset**  
The model was trained on the **shashwatwork/web-page-phishing-detection-dataset**, which consists of **11,431 URLs** labeled as either **phishing** or **safe**. Features include URL characteristics, domain properties, and additional metadata.  

### **Training Configuration**  

- **Number of epochs**: 5  
- **Batch size**: 16  
- **Optimizer**: AdamW  
- **Learning rate**: 2e-5  
- **Loss Function**: Cross-Entropy  
- **Evaluation Strategy**: Validation at each epoch  
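
To make the loss concrete, here is a minimal pure-Python sketch of cross-entropy for a single two-class example, with the hyperparameters listed above gathered in a dictionary. The names `TRAINING_CONFIG`, `softmax`, and `cross_entropy` are illustrative; the actual training script is not included in this card. In practice, PyTorch's `torch.nn.CrossEntropyLoss` computes the same quantity over batches of logits.

```python
import math

# Hyperparameters as listed above (the dict itself is only illustrative)
TRAINING_CONFIG = {
    "epochs": 5,
    "batch_size": 16,
    "optimizer": "AdamW",
    "learning_rate": 2e-5,
}

def softmax(logits):
    """Convert raw logits to probabilities, subtracting the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, true_class):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[true_class])

# Class 1 = phishing, class 0 = safe (same mapping as predict_url)
confident_right = cross_entropy([-2.0, 3.0], true_class=1)
confident_wrong = cross_entropy([3.0, -2.0], true_class=1)
print(f"loss when right: {confident_right:.4f}, when wrong: {confident_wrong:.4f}")
```

The loss is small when the model puts most of its probability on the correct class and grows quickly when it is confidently wrong, which is what drives the weight updates during fine-tuning.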

### **Quantization**  
The model was quantized to **FP16 (half precision)**, which halves the memory needed for the weights relative to FP32 and reduces latency while maintaining high accuracy.  
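
The memory effect of FP16 can be illustrated with the standard library alone: a half-precision value takes 2 bytes instead of 4, at the cost of some rounding. The parameter count below is a hypothetical placeholder, not a figure from this card. With PyTorch, a model is typically cast using `model.half()`, though the exact procedure used for this checkpoint is not documented here.

```python
import struct

FP32_BYTES = struct.calcsize("f")  # size of one single-precision value
FP16_BYTES = struct.calcsize("e")  # size of one half-precision value

def weight_megabytes(n_params: int, bytes_per_param: int) -> float:
    """Approximate storage for the weights alone, in MB (10**6 bytes)."""
    return n_params * bytes_per_param / 1e6

# Hypothetical parameter count, used only to illustrate the ratio
n_params = 100_000_000
print(f"FP32: {weight_megabytes(n_params, FP32_BYTES):.0f} MB")
print(f"FP16: {weight_megabytes(n_params, FP16_BYTES):.0f} MB")

# Round-tripping a value through FP16 shows the precision that is given up
as_fp16 = struct.unpack("e", struct.pack("e", 0.1))[0]
print(f"0.1 stored as FP16: {as_fp16!r}")
```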

---

## **⚠️ Limitations**  

- **Evasion Techniques**: Attackers constantly evolve phishing techniques, which may reduce model effectiveness.  
- **Dataset Bias**: The model was trained on a specific dataset; new phishing tactics may require retraining.  
- **False Positives**: Some legitimate but unusual URLs might be classified as phishing.  
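
One common way to trade false positives against missed phishing URLs is to threshold the phishing probability instead of taking the argmax. The sketch below assumes two logits ordered as [safe, phishing], matching the label mapping in `predict_url`; the function names and the 0.9 threshold are hypothetical, and any threshold would need tuning on validation data.

```python
import math

def phishing_probability(logits):
    """Softmax probability of class 1 (phishing) from two raw logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    return exps[1] / sum(exps)

def label_with_threshold(logits, threshold=0.5):
    """Flag as phishing only when the phishing probability reaches the threshold."""
    return "Phishing" if phishing_probability(logits) >= threshold else "Safe"

# Hypothetical logits for a borderline URL: [safe, phishing]
logits = [0.2, 1.0]
print(label_with_threshold(logits))                 # default threshold of 0.5
print(label_with_threshold(logits, threshold=0.9))  # stricter threshold
```

With the model loaded as shown earlier, the logits for a single URL could be obtained with `output.logits[0].tolist()` inside `predict_url`. A higher threshold reduces false positives but lets more phishing URLs through, so the choice depends on which error is costlier in a given deployment.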

---

✅ **Use this fine-tuned SecureBERT model for accurate and efficient phishing detection!** 🔒🚀