### **BERT Phishing Detection Model**
This repository hosts a fine-tuned **BERT-based** model optimized for **phishing URL detection** using a cybersecurity dataset. The model classifies URLs as either **phishing (malicious)** or **safe (benign)**.
---
## **Model Details**
- **Model Architecture**: BERT with a sequence-classification head (local checkpoint directory: `fine_tuned_SecureBERT`)
- **Task**: Binary Classification (Phishing vs. Safe)
- **Dataset**: Custom cybersecurity dataset (11,431 URLs, 89 features)
- **Framework**: PyTorch & Hugging Face Transformers
- **Input Data**: URL strings & extracted numerical features
- **Number of Classes**: 2 (**Phishing, Safe**)
- **Quantization**: FP16 (half precision, for efficiency)
---
## **Usage**
### **Installation**
```bash
pip install torch transformers scikit-learn pandas
```
### **Loading the Model**
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
# Load the fine-tuned model and tokenizer
model_path = "./fine_tuned_SecureBERT"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)
model.eval() # Set model to evaluation mode
print("BERT model loaded successfully and ready for inference!")
```
---
### **Perform Phishing Detection**
```python
def predict_url(url):
    """Classify a single URL string as "Phishing" or "Safe"."""
    # Tokenize input (URLs longer than 512 tokens are truncated)
    encoding = tokenizer(url, truncation=True, padding=True, max_length=512, return_tensors="pt")

    # Perform inference without tracking gradients
    with torch.no_grad():
        output = model(**encoding)

    # Get predicted class (index of the largest logit)
    predicted_class = torch.argmax(output.logits, dim=1).item()

    # Map label: class 1 is treated as phishing, class 0 as safe
    label = "Phishing" if predicted_class == 1 else "Safe"
    return label


# Example usage
custom_url = "http://example.com/free-gift"
prediction = predict_url(custom_url)
print(f"Predicted label: {prediction}")
```
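`predict_url` returns only a hard label. When a confidence score is useful, the logits can be passed through a softmax first. The sketch below is one possible way to do that for a batch of URLs; the helper names `phishing_probabilities` and `score_urls` are hypothetical, and it assumes class index 1 means phishing, as in the mapping used by `predict_url`.

```python
import torch

# Assumption: index 1 is the phishing class, matching the mapping in predict_url.
PHISHING_CLASS_INDEX = 1


def phishing_probabilities(logits: torch.Tensor) -> list[float]:
    """Convert raw logits of shape (batch, 2) into P(phishing) for each row."""
    probs = torch.softmax(logits.float(), dim=1)
    return probs[:, PHISHING_CLASS_INDEX].tolist()


def score_urls(urls, tokenizer, model, max_length=512):
    """Score a list of URLs; returns (url, P(phishing)) pairs.

    `tokenizer` and `model` are the objects loaded in "Loading the Model".
    """
    encoding = tokenizer(
        urls, truncation=True, padding=True, max_length=max_length, return_tensors="pt"
    )
    with torch.no_grad():
        logits = model(**encoding).logits
    return list(zip(urls, phishing_probabilities(logits)))
```

Calling `score_urls(["http://example.com/free-gift"], tokenizer, model)` would return a list with one `(url, probability)` pair, which can then be compared against whatever cut-off suits the deployment.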
---
## **Evaluation Results**
After fine-tuning, the model was evaluated on a **test set**, achieving the following performance:
| **Metric** | **Score** |
|------------------|-----------|
| **Accuracy** | 97.2% |
| **Precision** | 96.8% |
| **Recall** | 97.5% |
| **F1-Score** | 97.1% |
| **Inference Speed** | Fast (Optimized with FP16) |
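
The four classification metrics in the table can be computed from test-set labels and predictions with scikit-learn. The snippet below is a minimal sketch of that calculation, not the evaluation script used for this model: the helper `binary_report` is hypothetical, and the labels are toy values (1 = phishing, 0 = safe) chosen only to show the calls.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def binary_report(y_true, y_pred, positive_label=1):
    """Accuracy, precision, recall and F1 with phishing as the positive class."""
    accuracy = accuracy_score(y_true, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", pos_label=positive_label, zero_division=0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels only; these are not the model's real predictions.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
report = binary_report(y_true, y_pred)
print(report)
```

In the toy data there are 3 true positives, 1 false positive, and 1 false negative out of 8 samples, so all four values come out to 0.75.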
---
## **Fine-Tuning Details**
### **Dataset**
The model was trained on Kaggle's **shashwatwork/web-page-phishing-detection-dataset**, consisting of **11,431 URLs** labeled as either **phishing** or **safe**. Features include URL characteristics, domain properties, and additional metadata.
### **Training Configuration**
- **Number of epochs**: 5
- **Batch size**: 16
- **Optimizer**: AdamW
- **Learning rate**: 2e-5
- **Loss Function**: Cross-Entropy
- **Evaluation Strategy**: Validation at each epoch
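
The configuration above can be illustrated with a plain PyTorch loop. This is only a sketch of how those settings fit together, not the original training script: random 8-dimensional features and a linear layer stand in for the tokenized URLs and the BERT model (both are assumptions made so the example runs in seconds).

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Hyperparameters taken from the configuration list
EPOCHS, BATCH_SIZE, LEARNING_RATE = 5, 16, 2e-5

torch.manual_seed(0)
# Stand-in data (assumption): random features with binary labels
train_ds = TensorDataset(torch.randn(64, 8), torch.randint(0, 2, (64,)))
val_ds = TensorDataset(torch.randn(32, 8), torch.randint(0, 2, (32,)))
train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=BATCH_SIZE)

# Stand-in model (assumption): a linear classifier with two output classes
model = nn.Linear(8, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)
loss_fn = nn.CrossEntropyLoss()

val_history = []
for epoch in range(EPOCHS):
    model.train()
    for features, labels in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(features), labels)
        loss.backward()
        optimizer.step()

    # Validation at the end of each epoch
    model.eval()
    correct = 0
    with torch.no_grad():
        for features, labels in val_loader:
            correct += (model(features).argmax(dim=1) == labels).sum().item()
    val_history.append(correct / len(val_ds))
    print(f"epoch {epoch + 1}: val_accuracy={val_history[-1]:.3f}")
```

With a real checkpoint, the stand-in model would be replaced by the sequence classifier and the loaders by batches of tokenized URLs; the optimizer, loss, and per-epoch validation structure stay the same.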
### **Quantization**
The model was converted to **FP16 (half) precision**, which roughly halves the memory taken by the weights and reduces latency while maintaining high accuracy.
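
The source does not state how the FP16 conversion was performed. One common approach in PyTorch is to call `.half()` on the module; the sketch below shows the effect on parameter memory, using a small linear layer as a stand-in for the real model (an assumption, as is the helper name `parameter_bytes`).

```python
import torch
from torch import nn


def parameter_bytes(module: nn.Module) -> int:
    """Total bytes occupied by a module's parameters."""
    return sum(p.numel() * p.element_size() for p in module.parameters())


# Stand-in layer (assumption); with the real model, .half() would be
# called on the loaded sequence classifier instead.
layer = nn.Linear(768, 2)
fp32_bytes = parameter_bytes(layer)

layer = layer.half()  # cast all floating-point parameters to torch.float16
fp16_bytes = parameter_bytes(layer)

print(f"FP32: {fp32_bytes} bytes, FP16: {fp16_bytes} bytes")
```

FP16 inference is typically run on a GPU, since half-precision operator support on CPU varies between PyTorch versions.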
---
## **Limitations**
- **Evasion Techniques**: Attackers constantly evolve phishing techniques, which may reduce model effectiveness.
- **Dataset Bias**: The model was trained on a specific dataset; new phishing tactics may require retraining.
- **False Positives**: Some legitimate but unusual URLs might be classified as phishing.
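
One possible way to limit false positives is to flag a URL only when its phishing probability exceeds a threshold chosen on validation data, rather than always taking the larger logit. The sketch below assumes such probabilities are already available; the function names and the toy scores are hypothetical.

```python
def false_positive_rate(probs, labels, threshold):
    """Share of safe URLs (label 0) scored at or above the threshold."""
    safe_scores = [p for p, y in zip(probs, labels) if y == 0]
    if not safe_scores:
        return 0.0
    return sum(p >= threshold for p in safe_scores) / len(safe_scores)


def pick_threshold(probs, labels, max_fpr=0.01):
    """Lowest observed score whose false-positive rate stays within max_fpr."""
    for threshold in sorted(set(probs)):
        if false_positive_rate(probs, labels, threshold) <= max_fpr:
            return threshold
    return 1.0  # fallback: no observed score satisfies the limit


# Toy validation scores (1 = phishing, 0 = safe); not real model outputs.
probs = [0.05, 0.20, 0.60, 0.55, 0.90, 0.95]
labels = [0, 0, 0, 1, 1, 1]
strict = pick_threshold(probs, labels, max_fpr=0.0)
relaxed = pick_threshold(probs, labels, max_fpr=0.34)
print(strict, relaxed)
```

A stricter limit pushes the threshold up, which lowers false positives at the cost of missing more phishing URLs, so the right value depends on how the two error types are weighed.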
---
**Use this fine-tuned SecureBERT model for accurate and efficient phishing detection!**