RoBERTa-based Spam Message Detection

Spam messages frequently carry malicious links or phishing attempts, posing significant threats to organizations and their users. Our RoBERTa-based spam message detection system detects and filters out spam messages, adding a layer of security that helps protect organizations against financial losses, legal consequences, and reputational harm.

Found this model useful? Your feedback is important and helps keep it relevant.

Metrics

  • Accuracy (validation): 0.9906
  • Precision (validation): 0.9971
  • Recall (validation): 0.9934

[Figures: loss (train / validation); confusion matrix (testing set)]
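For readers who want to relate accuracy, precision, and recall to a confusion matrix, the following is a minimal sketch of the standard binary-classification formulas, treating spam as the positive class. The function name `binary_metrics` and the counts in the usage lines are hypothetical and are not taken from this model's evaluation.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts.

    tp: spam predicted as spam      fp: ham predicted as spam
    fn: spam predicted as ham       tn: ham predicted as ham
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # Precision: of everything flagged as spam, how much really was spam.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of all real spam, how much was flagged.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical counts, for illustration only.
accuracy, precision, recall = binary_metrics(tp=8, fp=2, fn=1, tn=9)
print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

For a spam filter, high precision means few legitimate messages are wrongly flagged, while high recall means few spam messages slip through.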

Model Output

  • 0 is ham
  • 1 is spam
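The label mapping above can be applied to the model's raw class scores as in the sketch below. It assumes the checkpoint loads with the `transformers` `AutoTokenizer` and `AutoModelForSequenceClassification` classes and that `torch` is installed; the helper names `decode_prediction` and `classify` are hypothetical and are not part of the repository.

```python
# Label ids as documented above: 0 is ham, 1 is spam.
ID2LABEL = {0: "ham", 1: "spam"}


def decode_prediction(scores):
    """Map per-class scores [ham_score, spam_score] to a label name."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return ID2LABEL[best]


def classify(texts, model_name="mshenoda/roberta-spam"):
    """Classify a list of messages; downloads the model on first use."""
    # Imported here so decode_prediction works without these packages.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()

    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return [decode_prediction(row.tolist()) for row in logits]


if __name__ == "__main__":
    # Example messages are made up for illustration.
    print(classify(["Are we still meeting at 3pm?", "WIN a FREE prize, click now!"]))
```

Taking the argmax of the two scores is the simplest decision rule; a deployment that needs fewer false positives could instead threshold the softmax probability of the spam class.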

Dataset

https://huggingface.co/datasets/mshenoda/spam-messages

The dataset is composed of messages labeled as ham or spam, merged from three data sources:

  1. SMS Spam Collection: https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
  2. Telegram Spam Ham: https://huggingface.co/datasets/thehamkercat/telegram-spam-ham/tree/main
  3. Enron Spam: https://huggingface.co/datasets/SetFit/enron_spam/tree/main (only the message column and labels were used)

The preparation script for the Enron data is available at https://github.com/mshenoda/roberta-spam/tree/main/data/enron. The data is split into 80% training, 10% validation, and 10% test sets; the scripts used to split and merge the three data sources are available at https://github.com/mshenoda/roberta-spam/tree/main/data/utils.
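As a rough illustration of an 80/10/10 split, the sketch below shuffles a list of rows and slices it into three parts. It is not the repository's script: the function name `split_80_10_10`, the fixed seed, and the plain random shuffle (with no stratification by class) are all assumptions made for this example.

```python
import random


def split_80_10_10(rows, seed=0):
    """Shuffle rows and return (train, validation, test) lists."""
    rows = list(rows)
    # A seeded generator makes the split reproducible across runs.
    random.Random(seed).shuffle(rows)

    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = len(rows) * 8 // 10
    n_val = len(rows) // 10

    train = rows[:n_train]
    val = rows[n_train:n_train + n_val]
    test = rows[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


# Hypothetical (text, label) rows, for illustration only.
rows = [(f"message {i}", i % 2) for i in range(100)]
train, val, test = split_80_10_10(rows)
print(len(train), len(val), len(test))
```

With imbalanced classes, a stratified split that preserves the ham/spam ratio in each part is often preferable to a plain shuffle.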

Dataset Class Distribution

[Figures: class distribution for the training (80%), validation (10%), and testing (10%) sets]

Architecture

The model is a fine-tuned RoBERTa.

roberta-base: https://huggingface.co/roberta-base

paper: https://arxiv.org/abs/1907.11692

Code

https://github.com/mshenoda/roberta-spam
