### **🔒 BERT Phishing Detection Model**

This repository hosts a fine-tuned **BERT-based** model optimized for **phishing URL detection** using a cybersecurity dataset. The model classifies URLs as either **phishing (malicious)** or **safe (benign)**.

---

## **📚 Model Details**

- **Model Architecture**: BERT-based sequence classifier (checkpoint saved as `fine_tuned_SecureBERT`)
- **Task**: Binary Classification (Phishing vs. Safe)
- **Dataset**: Custom cybersecurity dataset (11,431 URLs, 89 features)
- **Framework**: PyTorch & Hugging Face Transformers
- **Input Data**: URL strings & extracted numerical features
- **Number of Classes**: 2 (**Phishing, Safe**)
- **Quantization**: FP16 half precision (for efficiency)

---

## **🚀 Usage**

### **Installation**

```bash
pip install torch transformers scikit-learn pandas
```

### **Loading the Model**

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the fine-tuned model and tokenizer
model_path = "./fine_tuned_SecureBERT"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)
model.eval() # Set model to evaluation mode

print("✅ BERT model loaded successfully and ready for inference!")
```

---

### **🔍 Perform Phishing Detection**

```python
def predict_url(url):
    # Tokenize input
    encoding = tokenizer(url, truncation=True, padding=True, max_length=512, return_tensors="pt")

    # Perform inference
    with torch.no_grad():
        output = model(**encoding)

    # Get predicted class
    predicted_class = torch.argmax(output.logits, dim=1).item()

    # Map label
    label = "Phishing" if predicted_class == 1 else "Safe"
    return label

# Example usage
custom_url = "http://example.com/free-gift"
prediction = predict_url(custom_url)
print(f"Predicted label: {prediction}")
```
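
`predict_url` returns only a hard label. When a confidence score is useful, the two logits can be turned into probabilities with a softmax. The sketch below is dependency-free and assumes, as `predict_url` does, that index 1 is the phishing class. The helper names, the `threshold` parameter, and the example logits are illustrative, not part of the released model.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_from_logits(logits, threshold=0.5):
    """Return (label, phishing_probability) for one URL's two logits.

    Assumption: index 1 is the phishing class, matching predict_url.
    """
    phishing_prob = softmax(logits)[1]
    label = "Phishing" if phishing_prob >= threshold else "Safe"
    return label, phishing_prob

# Hypothetical logits, standing in for output.logits[0].tolist()
label, prob = label_from_logits([-1.2, 2.3])
print(f"{label} (p={prob:.3f})")
```

Raising `threshold` above 0.5 flags fewer URLs as phishing, which trades some recall for fewer false positives.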

---

## **📊 Evaluation Results**

After fine-tuning, the model was evaluated on a **test set**, achieving the following performance:

| **Metric** | **Score** |
|------------------|-----------|
| **Accuracy** | 97.2% |
| **Precision** | 96.8% |
| **Recall** | 97.5% |
| **F1-Score** | 97.1% |
| **Inference Speed** | Fast (Optimized with FP16) |
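
For reference, the four scores in the table are conventionally derived from confusion-matrix counts with phishing as the positive class. The sketch below uses made-up counts purely for illustration; they are not the confusion matrix behind the reported results. The last line checks that the reported F1-score agrees with the harmonic mean of the reported precision and recall.

```python
def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F1 (phishing = positive class)."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)   # of URLs flagged as phishing, how many were
    recall = tp / (tp + fn)      # of actual phishing URLs, how many were caught
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
    }

# Illustrative counts only, not the model's actual test results
print(classification_metrics(tp=90, fp=5, tn=95, fn=10))

# Consistency check on the reported precision (96.8%) and recall (97.5%)
print(round(100 * f1_from(0.968, 0.975), 1))  # 97.1
```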

---

## **🛠️ Fine-Tuning Details**

### **Dataset**
The model was trained on Kaggle's **shashwatwork/web-page-phishing-detection-dataset**, which consists of **11,431 URLs** labeled as either **phishing** or **safe**. Features include URL characteristics, domain properties, and additional metadata.

### **Training Configuration**

- **Number of epochs**: 5
- **Batch size**: 16
- **Optimizer**: AdamW
- **Learning rate**: 2e-5
- **Loss Function**: Cross-Entropy
- **Evaluation Strategy**: Validation at each epoch
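
The hyperparameters above could be wired into a plain PyTorch loop roughly as follows. This is a minimal sketch, not the original training script: `TrainConfig`, `total_optimizer_steps`, `build_optimizer`, and `training_step` are hypothetical names, and the 80/20 train/test split is an assumption because the actual split is not stated. `torch` is imported inside the functions that need it so the step arithmetic runs without it.

```python
import math
from dataclasses import dataclass

@dataclass
class TrainConfig:
    # Values taken from the training configuration listed above
    epochs: int = 5
    batch_size: int = 16
    learning_rate: float = 2e-5

def total_optimizer_steps(num_train_examples, cfg):
    """Number of optimizer updates over the whole run (one per batch)."""
    steps_per_epoch = math.ceil(num_train_examples / cfg.batch_size)
    return steps_per_epoch * cfg.epochs

def build_optimizer(model, cfg):
    """AdamW over all model parameters, using the listed learning rate."""
    import torch
    return torch.optim.AdamW(model.parameters(), lr=cfg.learning_rate)

def training_step(model, batch, labels, optimizer):
    """One update: forward pass, cross-entropy loss, backward pass, step."""
    import torch.nn.functional as F
    optimizer.zero_grad()
    logits = model(**batch).logits
    loss = F.cross_entropy(logits, labels)
    loss.backward()
    optimizer.step()
    return loss.item()

cfg = TrainConfig()
# Assumed 80/20 split of the 11,431 URLs; the real split is not documented
num_train = int(11_431 * 0.8)
print(total_optimizer_steps(num_train, cfg))
```

Validation at each epoch would then amount to running the evaluation set through the model under `torch.no_grad()` after every `steps_per_epoch` updates.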

### **Quantization**
The model weights were converted to **FP16 (half precision)**, which reduces memory usage and latency while maintaining high accuracy.
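
As a rough illustration of the memory saving, halving the bytes per weight halves the size of the weights. The sketch below assumes a parameter count of about 110 million (BERT-base scale); the actual count for this checkpoint is not stated. `load_fp16` is a hypothetical helper showing one common way to obtain a half-precision model with `nn.Module.half()`, under the assumption that inference runs on a CUDA GPU.

```python
def weights_size_mb(num_params, bytes_per_param):
    """Approximate size of the model weights in megabytes (1 MB = 1e6 bytes)."""
    return num_params * bytes_per_param / 1e6

def load_fp16(model_path, device="cuda"):
    """Load a checkpoint and cast its weights to half precision.

    Assumption: a CUDA device is available; FP16 inference is
    typically run on a GPU.
    """
    from transformers import AutoModelForSequenceClassification
    model = AutoModelForSequenceClassification.from_pretrained(model_path)
    model = model.half().to(device)  # cast weights to FP16, then move to device
    model.eval()
    return model

# Assumed parameter count (BERT-base scale), for illustration only
n_params = 110_000_000
print(weights_size_mb(n_params, 4))  # FP32: 440.0
print(weights_size_mb(n_params, 2))  # FP16: 220.0
```

With a model loaded this way, the tokenized inputs must be moved to the same device before calling the model.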

---

## **⚠️ Limitations**

- **Evasion Techniques**: Attackers constantly evolve phishing techniques, which may reduce model effectiveness.
- **Dataset Bias**: The model was trained on a specific dataset; new phishing tactics may require retraining.
- **False Positives**: Some legitimate but unusual URLs might be classified as phishing.

---

✅ **Use this fine-tuned SecureBERT model for accurate and efficient phishing detection!** 🔒🚀
