URLGuardian
This is a Transformer model fine-tuned for malicious URL detection. Given a fully qualified domain name (FQDN) or URL, it outputs the probability that the URL is malicious by identifying common suspicious patterns.
Model Details
Model Description
- Developed by: Anvilogic
- Model Type: Transformer
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 768 dimensions
- Finetuned from model: distilbert-base-cased
- Language(s) (NLP): Multilingual
- License: MIT
Full Model Architecture
DistilBERT:
  name: "distilbert-base-cased"
  params:
    layers: 6
    hidden_size: 768
    attention_heads: 12
    ff_dim: 3072
    max_seq_len: 512
    vocab_size: 28996
    total_params: 66M
    activation: "gelu"
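These values can be verified against the hosted configuration at load time; a minimal sketch using the standard AutoConfig API (the printed fields follow the DistilBERT config schema):

from transformers import AutoConfig

# Fetch the model configuration from the Hub
config = AutoConfig.from_pretrained("Anvilogic/URLGuardian")

print(config.n_layers)                 # 6
print(config.dim)                      # 768 (hidden size)
print(config.n_heads)                  # 12
print(config.hidden_dim)               # 3072 (feed-forward dim)
print(config.max_position_embeddings)  # 512
print(config.vocab_size)               # 28996
print(config.activation)               # "gelu"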
Usage
Direct Usage
First install the Transformers library:
pip install -U transformers torch
Then you can load this model and run inference.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load the fine-tuned model and tokenizer from the Hub
model_name = "Anvilogic/URLGuardian"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)  # binary head: benign vs. malicious

# Example URLs
sentences = ["paypal.com.secure-login.xyz", "bit.ly/fake-login"]

# Tokenize inputs
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Run inference
with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits  # Raw predictions
predictions = torch.argmax(logits, dim=-1)  # Convert to class labels

# Print results
print(predictions.tolist())  # Example output: [1, 0] (assuming 1 = malicious, 0 = benign)
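Since the model is described as outputting a probability, the raw logits can also be converted to class probabilities with a softmax. A minimal follow-on sketch reusing logits from above (the label order, 1 = malicious, is an assumption):

# Softmax over the two classes; column 1 assumed to be the malicious class
probabilities = torch.softmax(logits, dim=-1)
print(probabilities[:, 1].tolist())  # Per-URL malicious probability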
Downstream Usage
This model enables real-time malicious URL detection with a lightweight architecture, supporting large-scale inference for phishing prevention and cybersecurity monitoring.
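As a hedged sketch of such large-scale inference (the helper name score_urls and the batch size are illustrative, and it reuses the model and tokenizer loaded above):

import torch

def score_urls(urls, model, tokenizer, batch_size=64):
    """Return the malicious-class probability for each URL, in batches."""
    model.eval()
    scores = []
    for i in range(0, len(urls), batch_size):
        batch = urls[i:i + batch_size]
        inputs = tokenizer(batch, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        # Column 1 assumed to be the malicious class
        scores.extend(torch.softmax(logits, dim=-1)[:, 1].tolist())
    return scores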
Training Details
Framework Versions
- Python: 3.10.14
- Transformers: 4.49.0
- PyTorch: 2.2.2
- Tokenizers: 0.20.3
Training Data
The model was fine-tuned using Anvilogic/URL-Guardian-Dataset, which contains URLs along with their labels. The dataset was filtered and converted to the parquet format for efficient processing.
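The dataset can be pulled directly from the Hub with the datasets library; a minimal sketch (the train split and column names are assumptions, check the dataset card):

from datasets import load_dataset

# Load the training data from the Hub (stored as parquet)
dataset = load_dataset("Anvilogic/URL-Guardian-Dataset")
print(dataset)              # Available splits and row counts
print(dataset["train"][0])  # First example, e.g. a URL and its label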
Training Procedure
The model was optimized using binary cross-entropy loss (BCELoss); a sketch of the fine-tuning setup follows the hyperparameters below.
Training Hyperparameters
- Model Architecture: encoder fine-tuned from distilbert-base-cased
- Batch Size: 32
- Epochs: 3
- Learning Rate: 2e-5
- Warmup Steps: 100
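A hedged reconstruction of this setup with the Hugging Face Trainer API is sketched below. It is not the exact training script: the column names are assumptions, and Trainer applies cross-entropy to a two-label head by default, whereas the card reports BCELoss.

from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-cased", num_labels=2
)

# Assumed column names ("url", "label"); check the dataset card
dataset = load_dataset("Anvilogic/URL-Guardian-Dataset")

def tokenize(batch):
    return tokenizer(batch["url"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)

# Hyperparameters from the card
args = TrainingArguments(
    output_dir="urlguardian-finetune",
    per_device_train_batch_size=32,
    num_train_epochs=3,
    learning_rate=2e-5,
    warmup_steps=100,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    tokenizer=tokenizer,  # enables dynamic padding via DataCollatorWithPadding
)
trainer.train()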
Evaluation
In the final evaluation after training, the model achieved the following metrics on the test set:
Binary Classification Evaluator
- Accuracy: 0.9744
- F1 Score: 0.9742
- Precision: 0.9771
- Recall: 0.9712
- Average Precision: 0.9962
These results indicate the model's strong performance in identifying malicious URLs, with high precision and recall that make it well suited for cybersecurity applications.
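To recompute these metrics on a labeled test split, a minimal sketch with scikit-learn (the arrays are illustrative placeholders; in practice they come from ground-truth labels and the model outputs shown above):

from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    f1_score,
    precision_score,
    recall_score,
)

# Illustrative placeholders (assumed encoding: 1 = malicious, 0 = benign)
y_true = [1, 0, 1, 0]               # ground-truth labels
y_pred = [1, 0, 1, 1]               # argmax predictions from the model
scores = [0.98, 0.03, 0.91, 0.62]   # malicious-class probabilities

print("Accuracy         :", accuracy_score(y_true, y_pred))
print("F1 Score         :", f1_score(y_true, y_pred))
print("Precision        :", precision_score(y_true, y_pred))
print("Recall           :", recall_score(y_true, y_pred))
print("Average Precision:", average_precision_score(y_true, scores))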