
Transformer

This is a Transformers model fine-tuned for malicious URL detection. Given a URL (FQDN), it outputs the probability that the URL is malicious by identifying common suspicious patterns.

Model Details

Model Description

  • Developed by: Anvilogic
  • Model Type: Transformer
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 (hidden size)
  • Finetuned from model: distilbert-base-cased
  • Language(s) (NLP): Multilingual
  • License: MIT

Full Model Architecture

DistilBERT:
  name: "distilbert-base-cased"
  params:
    layers: 6
    hidden_size: 768
    attention_heads: 12
    ff_dim: 3072
    max_seq_len: 512
    vocab_size: 28996
    total_params: 66M
    activation: "gelu"

Usage

Direct Usage

First install the Transformers library:

pip install -U transformers

Then you can load this model and run inference.

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
# Load pre-trained model and tokenizer
model_name = "Anvilogic/URLGuardian"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)  # binary classification: benign vs. malicious
# Example URLs
sentences = ["paypal.com.secure-login.xyz", "bit.ly/fake-login"]
# Tokenize inputs
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
# Run inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits  # Raw predictions
    predictions = torch.argmax(logits, dim=-1)  # Convert to class labels
# Print results
print(predictions.tolist())  # Class label per URL; assuming 1 = malicious and 0 = benign (check model.config.id2label)
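
Since the model is described as outputting a probability of maliciousness, you may want softmax scores rather than hard labels. Below is a minimal sketch that reuses the model, inputs, and sentences objects from the example above; treating column index 1 as the malicious class is an assumption, so verify it against model.config.id2label.

import torch.nn.functional as F

with torch.no_grad():
    probs = F.softmax(model(**inputs).logits, dim=-1)  # per-class probabilities
# Column 1 is assumed to be the malicious class (verify with model.config.id2label)
for url, score in zip(sentences, probs[:, 1].tolist()):
    print(f"{url}: {score:.4f}")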

Downstream Usage

This model enables real-time malicious URL detection with a lightweight architecture, supporting large-scale inference for phishing prevention and cybersecurity monitoring.
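
For lightweight, large-scale scoring, the model can also be wrapped in the Transformers text-classification pipeline. A minimal sketch follows; the label strings in the output depend on the checkpoint's id2label mapping, which is not documented here, and github.com is only an illustrative benign-looking input.

from transformers import pipeline

# Wrap the fine-tuned checkpoint in a text-classification pipeline
classifier = pipeline("text-classification", model="Anvilogic/URLGuardian")
# Batched inference over a list of URLs
results = classifier(["paypal.com.secure-login.xyz", "github.com"], batch_size=32)
print(results)  # list of {"label": ..., "score": ...} dicts; label names depend on id2label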

Training Details

Framework Versions

  • Python: 3.10.14
  • Transformers: 4.49.0
  • PyTorch: 2.2.2
  • Tokenizers: 0.20.3

Training Data

The model was fine-tuned on Anvilogic/URL-Guardian-Dataset, which contains URLs along with their labels. The dataset was filtered and converted to the Parquet format for efficient processing.
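
The dataset can be loaded with the datasets library. A minimal sketch, assuming the default configuration; check the dataset card for the exact split and column names.

from datasets import load_dataset

# Load the URL-Guardian dataset from the Hugging Face Hub
dataset = load_dataset("Anvilogic/URL-Guardian-Dataset")
print(dataset)              # available splits and column names
print(dataset["train"][0])  # first example ("train" split name is an assumption)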

Training Procedure

The model was optimized using BCELoss (binary cross-entropy loss).

Training Hyperparameters

  • Model Architecture: encoder fine-tuned from distilbert
  • Batch Size: 32
  • Epochs: 3
  • Learning Rate: 2e-5
  • Warmup Steps: 100
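
A minimal fine-tuning sketch with the listed hyperparameters, using the Transformers Trainer. The column names url and label are assumptions about the dataset schema, and Trainer's default loss for a two-label head is cross-entropy rather than BCELoss, so this is only an approximation of the original training procedure.

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Column names "url" and "label" are assumptions; check the dataset card for the real schema
dataset = load_dataset("Anvilogic/URL-Guardian-Dataset")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")

def tokenize(batch):
    return tokenizer(batch["url"], padding="max_length", truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-cased", num_labels=2)

training_args = TrainingArguments(
    output_dir="urlguardian-finetune",
    per_device_train_batch_size=32,   # Batch Size: 32
    num_train_epochs=3,               # Epochs: 3
    learning_rate=2e-5,               # Learning Rate: 2e-5
    warmup_steps=100,                 # Warmup Steps: 100
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],  # "train" split name is an assumption
)
trainer.train()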

Evaluation

In the final evaluation after training, the model achieved the following metrics on the test set:

Binary Classification Evaluator

Accuracy: 0.9744
F1 Score: 0.9742
Precision: 0.9771
Recall: 0.9712
Average Precision: 0.9962

These results indicate the model's high performance in identifying malicious URLs, with strong precision and recall that make it well suited for cybersecurity applications.
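
The reported metrics can be computed from model predictions with scikit-learn. A minimal sketch with toy stand-ins: y_true are ground-truth labels, y_pred are argmax predictions, and y_score are malicious-class probabilities.

from sklearn.metrics import (accuracy_score, average_precision_score, f1_score,
                             precision_score, recall_score)

# Toy stand-ins for real test-set outputs
y_true = [1, 0, 1, 1, 0]                   # ground-truth labels
y_pred = [1, 0, 1, 0, 0]                   # argmax predictions
y_score = [0.98, 0.05, 0.91, 0.40, 0.12]   # probability of the malicious class

print("Accuracy:", accuracy_score(y_true, y_pred))
print("F1 Score:", f1_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:", recall_score(y_true, y_pred))
print("Average Precision:", average_precision_score(y_true, y_score))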
