URLGuardian
This is a Transformer model fine-tuned for malicious URL detection. Given a fully qualified domain name (FQDN) or URL, it outputs the probability that the URL is malicious by identifying common suspicious patterns.
Model Details
Model Description
- Developed by: Anvilogic
- Model Type: Transformer
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 768 dimensions
- Finetuned from model: distilbert-base-cased
- Language(s) (NLP): Multilingual
- License: MIT
Full Model Architecture
DistilBERT:
  name: "distilbert-base-cased"
  params:
    layers: 6
    hidden_size: 768
    attention_heads: 12
    ff_dim: 3072
    max_seq_len: 512
    vocab_size: 28996
    total_params: 66M
    activation: "gelu"
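These values can be verified against the hosted configuration at load time; a minimal sketch using the standard AutoConfig API (the printed fields follow the DistilBERT config schema):

from transformers import AutoConfig

# Fetch the model configuration from the Hub
config = AutoConfig.from_pretrained("Anvilogic/URLGuardian")

print(config.n_layers)                 # 6
print(config.dim)                      # 768 (hidden size)
print(config.n_heads)                  # 12
print(config.hidden_dim)               # 3072 (feed-forward dim)
print(config.max_position_embeddings)  # 512
print(config.vocab_size)               # 28996
print(config.activation)               # "gelu"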
Usage
Direct Usage
First install the Transformers library:
pip install -U transformers torch
Then you can load this model and run inference.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load the fine-tuned model and tokenizer from the Hub
model_name = "Anvilogic/URLGuardian"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)  # binary head: benign vs. malicious

# Example URLs
sentences = ["paypal.com.secure-login.xyz", "bit.ly/fake-login"]

# Tokenize inputs
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Run inference
with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits  # Raw predictions
predictions = torch.argmax(logits, dim=-1)  # Convert to class labels

# Print results
print(predictions.tolist())  # Example output: [1, 0] (assuming 1 = malicious, 0 = benign)
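Since the model is described as outputting a probability, the raw logits can also be converted to class probabilities with a softmax. A minimal follow-on sketch reusing logits from above (the label order, 1 = malicious, is an assumption):

# Softmax over the two classes; column 1 assumed to be the malicious class
probabilities = torch.softmax(logits, dim=-1)
print(probabilities[:, 1].tolist())  # Per-URL malicious probability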
Downstream Usage
This model enables real-time malicious URL detection with a lightweight architecture, supporting large-scale inference for phishing prevention and cybersecurity monitoring.
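As a hedged sketch of such large-scale inference (the helper name score_urls and the batch size are illustrative, and it reuses the model and tokenizer loaded above):

import torch

def score_urls(urls, model, tokenizer, batch_size=64):
    """Return the malicious-class probability for each URL, in batches."""
    model.eval()
    scores = []
    for i in range(0, len(urls), batch_size):
        batch = urls[i:i + batch_size]
        inputs = tokenizer(batch, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        # Column 1 assumed to be the malicious class
        scores.extend(torch.softmax(logits, dim=-1)[:, 1].tolist())
    return scores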
Training Details
Framework Versions
- Python: 3.10.14
- Transformers: 4.49.0
- PyTorch: 2.2.2
- Tokenizers: 0.20.3
Training Data
The model was fine-tuned using Anvilogic/URL-Guardian-Dataset, which contains URLs along with their labels. The dataset was filtered and converted to the parquet format for efficient processing.
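The dataset can be pulled directly from the Hub with the datasets library; a minimal sketch (the train split and column names are assumptions, check the dataset card):

from datasets import load_dataset

# Load the training data from the Hub (stored as parquet)
dataset = load_dataset("Anvilogic/URL-Guardian-Dataset")
print(dataset)              # Available splits and row counts
print(dataset["train"][0])  # First example, e.g. a URL and its label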
Training Procedure
The model was optimized using binary cross-entropy loss (BCELoss); a sketch of the fine-tuning setup follows the hyperparameters below.
Training Hyperparameters
- Model Architecture: encoder fine-tuned from distilbert-base-cased
- Batch Size: 32
- Epochs: 3
- Learning Rate: 2e-5
- Warmup Steps: 100
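A hedged reconstruction of this setup with the Hugging Face Trainer API is sketched below. It is not the exact training script: the column names are assumptions, and Trainer applies cross-entropy to a two-label head by default, whereas the card reports BCELoss.

from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-cased", num_labels=2
)

# Assumed column names ("url", "label"); check the dataset card
dataset = load_dataset("Anvilogic/URL-Guardian-Dataset")

def tokenize(batch):
    return tokenizer(batch["url"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)

# Hyperparameters from the card
args = TrainingArguments(
    output_dir="urlguardian-finetune",
    per_device_train_batch_size=32,
    num_train_epochs=3,
    learning_rate=2e-5,
    warmup_steps=100,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    tokenizer=tokenizer,  # enables dynamic padding via DataCollatorWithPadding
)
trainer.train()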
Evaluation
In the final evaluation after training, the model achieved the following metrics on the test set:
Binary Classification Evaluator
- Accuracy: 0.9744
- F1 Score: 0.9742
- Precision: 0.9771
- Recall: 0.9712
- Average Precision: 0.9962
These results indicate the model's strong performance in identifying malicious URLs, with high precision and recall that make it well suited for cybersecurity applications.
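To recompute these metrics on a labeled test split, a minimal sketch with scikit-learn (the arrays are illustrative placeholders; in practice they come from ground-truth labels and the model outputs shown above):

from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    f1_score,
    precision_score,
    recall_score,
)

# Illustrative placeholders (assumed encoding: 1 = malicious, 0 = benign)
y_true = [1, 0, 1, 0]               # ground-truth labels
y_pred = [1, 0, 1, 1]               # argmax predictions from the model
scores = [0.98, 0.03, 0.91, 0.62]   # malicious-class probabilities

print("Accuracy         :", accuracy_score(y_true, y_pred))
print("F1 Score         :", f1_score(y_true, y_pred))
print("Precision        :", precision_score(y_true, y_pred))
print("Recall           :", recall_score(y_true, y_pred))
print("Average Precision:", average_precision_score(y_true, scores))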