	### 🔒 SecureBERT Phishing Detection Model

	This repository hosts a SecureBERT-based model fine-tuned for phishing URL detection on the shashwatwork/web-page-phishing-detection-dataset. The model classifies each URL as either phishing (malicious) or safe (benign).

	---

	## 📚 Model Details

	- Model Architecture: SecureBERT (Based on BERT)
	- Task: Binary Classification (Phishing vs. Safe)
	- Dataset: shashwatwork/web-page-phishing-detection-dataset (11,431 URLs, 88 features)
	- Framework: PyTorch & Hugging Face Transformers
	- Input Data: URL strings & extracted numerical features
	- Number of Classes: 2 (Phishing, Safe)
	- Quantization: FP16 (for efficiency)
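
The "extracted numerical features" listed above are derived from the URL itself. As a minimal sketch, a few lexical features can be computed with the standard library. The function name and feature names below are illustrative assumptions and do not necessarily match the dataset's actual columns:

```python
from urllib.parse import urlparse


def extract_url_features(url: str) -> dict:
    """Compute a few illustrative lexical features from a URL string.

    The feature names are hypothetical and do not necessarily match the
    columns of the training dataset.
    """
    parsed = urlparse(url)
    hostname = parsed.hostname or ""
    return {
        "url_length": len(url),
        "hostname_length": len(hostname),
        "num_dots": url.count("."),
        "num_hyphens": url.count("-"),
        "uses_https": int(parsed.scheme == "https"),
    }


features = extract_url_features("http://example.com/free-gift")
print(features)
```

Features of this kind are cheap to compute and can be inspected alongside the model's prediction when debugging a surprising classification.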

	---

	## 🚀 Usage

	### Installation

	```bash
	pip install torch transformers scikit-learn pandas
	```

	### Loading the Model

	```python
	import torch
	from transformers import AutoTokenizer, AutoModelForSequenceClassification

	# Load the fine-tuned model and tokenizer
	model_path = "./fine_tuned_SecureBERT"
	tokenizer = AutoTokenizer.from_pretrained(model_path)
	model = AutoModelForSequenceClassification.from_pretrained(model_path)
	model.eval() # Set model to evaluation mode

	print("✅ SecureBERT model loaded successfully and ready for inference!")
	```

	---

	### 🔍 Perform Phishing Detection

	```python
	def predict_url(url):
	    """Classify a single URL string as "Phishing" or "Safe"."""
	    # Tokenize input
	    encoding = tokenizer(url, truncation=True, padding=True, max_length=512, return_tensors="pt")

	    # Perform inference without tracking gradients
	    with torch.no_grad():
	        output = model(**encoding)

	    # Get predicted class index
	    predicted_class = torch.argmax(output.logits, dim=1).item()

	    # Map class index to label (1 = phishing, 0 = safe)
	    label = "Phishing" if predicted_class == 1 else "Safe"
	    return label

	# Example usage
	custom_url = "http://example.com/free-gift"
	prediction = predict_url(custom_url)
	print(f"Predicted label: {prediction}")
	```
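
`predict_url` returns only the arg-max label. Where a confidence score is useful, the two logits can be turned into probabilities with a softmax and compared against a tunable threshold. The following is a minimal pure-Python sketch, assuming index 1 is the phishing class (as in the label mapping above). The helper names and the 0.5 default threshold are illustrative, not part of the released model:

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_from_logits(logits, threshold=0.5):
    """Return (label, phishing_probability) for a [safe, phishing] logit pair."""
    phishing_prob = softmax(logits)[1]
    label = "Phishing" if phishing_prob >= threshold else "Safe"
    return label, phishing_prob


# Hypothetical logits, e.g. taken from output.logits[0].tolist()
label, prob = label_from_logits([0.0, 0.0])
print(label, prob)
```

Raising the threshold trades some recall for fewer false positives, which may matter when legitimate but unusual URLs are common in the input.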

	---

	## 📊 Evaluation Results

	After fine-tuning, the model was evaluated on a test set, achieving the following performance:

	| Metric          | Score                      |
	|-----------------|----------------------------|
	| Accuracy        | 97.2%                      |
	| Precision       | 96.8%                      |
	| Recall          | 97.5%                      |
	| F1-Score        | 97.1%                      |
	| Inference Speed | Fast (Optimized with FP16) |
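
For reference, the four classification metrics in the table follow from the confusion-matrix counts on the test set. The sketch below uses made-up counts (the actual confusion matrix is not published here) and treats phishing as the positive class:

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, for illustration only
metrics = classification_metrics(tp=90, fp=10, fn=5, tn=95)
print(metrics)
```

Because F1 is the harmonic mean of precision and recall, the reported values are mutually consistent: 2 · 0.968 · 0.975 / (0.968 + 0.975) ≈ 0.971.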

	---

	## 🛠️ Fine-Tuning Details

	### Dataset
	The model was trained on the shashwatwork/web-page-phishing-detection-dataset, which contains 11,431 URLs labeled as either phishing or safe. Its features cover URL characteristics, domain properties, and additional metadata.

	### Training Configuration

	- Number of epochs: 5
	- Batch size: 16
	- Optimizer: AdamW
	- Learning rate: 2e-5
	- Loss Function: Cross-Entropy
	- Evaluation Strategy: Validation at each epoch
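
To make the configuration concrete, the sketch below collects the listed hyperparameters in a plain dataclass and derives the number of optimizer steps. The class name, the absence of gradient accumulation, and the 80/20 train/test split are assumptions; the split actually used is not stated here.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters as listed in the training configuration."""
    num_epochs: int = 5
    batch_size: int = 16
    learning_rate: float = 2e-5
    optimizer: str = "AdamW"
    loss: str = "cross_entropy"


def total_optimizer_steps(num_train_examples: int, config: FineTuneConfig) -> int:
    """Steps = ceil(examples / batch size) per epoch, times the number of epochs."""
    steps_per_epoch = math.ceil(num_train_examples / config.batch_size)
    return steps_per_epoch * config.num_epochs


config = FineTuneConfig()
# Assumption: 80% of the 11,431 URLs are used for training.
num_train = int(11_431 * 0.8)
print(total_optimizer_steps(num_train, config))
```

The step count is mainly useful for sizing a learning-rate schedule (for example, a warmup expressed as a fraction of total steps).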

	### Quantization
	The model weights are stored in FP16 (half precision), which reduces latency and memory usage compared with FP32 while maintaining high accuracy.
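
The memory saving comes from storing each weight in 2 bytes instead of the 4 bytes used by FP32. A back-of-the-envelope sketch follows; the parameter count is a placeholder, not a measured value for this model:

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def weights_size_mib(num_params: int, dtype: str) -> float:
    """Approximate size of the model weights in MiB for a given dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**20


# Placeholder parameter count, roughly the scale of a BERT-base-sized encoder
num_params = 110_000_000
fp32_mib = weights_size_mib(num_params, "fp32")
fp16_mib = weights_size_mib(num_params, "fp16")
print(f"FP32: {fp32_mib:.0f} MiB, FP16: {fp16_mib:.0f} MiB")
```

This estimate covers the weights only; activations and framework overhead add to the real footprint at inference time.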

	---

	## ⚠️ Limitations

	- Evasion Techniques: Attackers constantly evolve phishing techniques, which may reduce model effectiveness.
	- Dataset Bias: The model was trained on a specific dataset; new phishing tactics may require retraining.
	- False Positives: Some legitimate but unusual URLs might be classified as phishing.

	---

	✅ Use this fine-tuned SecureBERT model for accurate and efficient phishing detection! 🔒🚀
