Text Classification
Transformers
English
Inference Endpoints
ggrizzly committed (verified)
Commit f92b969 · 1 Parent(s): d10bbc5

Update README.md

Files changed (1)
  1. README.md +11 -15
README.md CHANGED
@@ -25,17 +25,12 @@ results:
  library_name: transformers
  ---
  # Is Spam all we need? A RoBERTa Based Approach To Spam Detection

- ### Intro
- This is based on [this](https://huggingface.co/mshenoda/roberta-spam) huggingFace model, but instead of fine-tuning it on all the data sources that the original author had, I only finetuned using the [telegram](https://huggingface.co/datasets/thehamkercat/telegram-spam-ham) and [enron](https://huggingface.co/datasets/SetFit/enron_spam) datasets.
-
- The idea behind this was a more diversified data source, preventing overfitting to the original distribution.
-
- This was done for an interview project, so if you find this by chance... hopefully it helps you too, but know there's **definitely** better resources out there... and that this was done in the span of one evening.
-
- This was fine-tuned by replicating the [sentiment analysis](https://huggingface.co/docs/transformers/main/en/model_doc/roberta#resources) Google collab example.
-
- ### Metrics
+ ## Intro
+ This is based on [this](https://huggingface.co/mshenoda/roberta-spam) Hugging Face model, but instead of fine-tuning it on all the data sources that the original author had, I only fine-tuned using the [telegram](https://huggingface.co/datasets/thehamkercat/telegram-spam-ham) and [enron](https://huggingface.co/datasets/SetFit/enron_spam) datasets. The idea behind this was a more diversified data source, preventing overfitting to the original distribution. This was fine-tuned by replicating the [sentiment analysis](https://huggingface.co/docs/transformers/main/en/model_doc/roberta#resources) Google Colab example (see the Code section for my version of it).
+
+ **NOTE**: This was done for an interview project, so if you find this by chance... hopefully it helps you too, but know there are **definitely** better resources out there... and that this was done in the span of one evening.
+
+ ## Metrics
  **Accuracy**: 0.9503
  Thrilling, I know, I also just got the chills.

@@ -54,23 +49,25 @@ The dataset used for testing was the original kaggle competition (as part of the

  1. SMS Spam Collection https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset

- ### Dataset Class Distribution
+ ## Dataset Class Distribution

  | | Total | Training | Testing |
  |:--------:|:-----:|:--------------:|:-----------:|
  | Counts | 59267 | 53693 (90.6%) | 5574 (9.4%) |

-
-
-
  | | Total | Spam | Ham | Set | % Total |
  |:--------:|:-----:|:-------------:|:-------------:|:-----:|:-------:|
  | Enron | 33345 | 16852 (50.5%) | 16493 (49.5%) | Train | 56.2% |
  | Telegram | 20348 | 6011 (29.5%) | 14337 (70.5%) | Train | 43.8% |
  | SMS | 5574 | 747 (13.5%) | 4827 (86.5%) | Test | 100% |

+ | | Distribution of number of characters per class label (100 bins) | Distribution of number of words per class label (100 bins) |
+ |:--------:|:---------------------------------------------------------------:|:----------------------------------------------------------:|
+ | SMS | ![image/png](https://cdn-uploads.huggingface.co/production/uploads/644ef0eb1565b54e4a656946/OjLvujmQyeQPlowW5lI5A.png) | ![image/png](https://cdn-uploads.huggingface.co/production/uploads/644ef0eb1565b54e4a656946/RFs92xoeIUDAsry6T1Ec4.png) |
+ | Enron (limiting a few outliers) | ![image/png](https://cdn-uploads.huggingface.co/production/uploads/644ef0eb1565b54e4a656946/Gd7le3W2U05DaQtjb971o.png) | ![image/png](https://cdn-uploads.huggingface.co/production/uploads/644ef0eb1565b54e4a656946/A40RySWIPWAcwSyKGh-rm.png) |
+ | Telegram | ![image/png](https://cdn-uploads.huggingface.co/production/uploads/644ef0eb1565b54e4a656946/ZqMEzunZbhwqOkBUpzv81.png) | ![image/png](https://cdn-uploads.huggingface.co/production/uploads/644ef0eb1565b54e4a656946/v0Y3MRgXUjRUX0prULu0v.png) |

+ ^ Note the tails: very interesting distributions. More to the point, it's good to see [Benford's law](https://en.wikipedia.org/wiki/Benford's_law) is alive and well in these.
-

  ## Architecture
  The model is fine tuned RoBERTa
@@ -80,5 +77,4 @@ roberta-base: https://huggingface.co/roberta-base
  paper: https://arxiv.org/abs/1907.11692

  ## Code
-
  TODO: Include Jupyter / code on Github
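The updated card's Code section is still a TODO, so the following is only a rough sketch of the recipe the Intro describes: fine-tuning roberta-base as a binary spam/ham classifier on the telegram and enron datasets with the transformers Trainer, in the spirit of the RoBERTa sentiment-analysis Colab it references. The column names, label mapping, output path, and hyperparameters below are assumptions for illustration, not the author's actual settings.

```python
# Hedged sketch only: column names, label values, and hyperparameters are assumed.
from datasets import concatenate_datasets, load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# The two training sources named in the Intro.
telegram = load_dataset("thehamkercat/telegram-spam-ham", split="train")
enron = load_dataset("SetFit/enron_spam", split="train")

def to_common_schema(dataset, text_col, label_col, spam_value):
    """Reduce a split to two columns: 'text' and an integer 'label' (1 = spam, 0 = ham).
    The source column names passed in are assumptions about each dataset."""
    return dataset.map(
        lambda ex: {"text": ex[text_col], "label": int(ex[label_col] == spam_value)},
        remove_columns=dataset.column_names,
    )

train_ds = concatenate_datasets([
    to_common_schema(telegram, "text", "text_type", "spam"),
    to_common_schema(enron, "text", "label_text", "spam"),
]).shuffle(seed=42)

# Tokenize; padding is handled per batch by the Trainer's default data collator.
train_ds = train_ds.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
)

args = TrainingArguments(
    output_dir="roberta-spam-telegram-enron",  # hypothetical output path
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=2,
)

Trainer(model=model, args=args, train_dataset=train_ds, tokenizer=tokenizer).train()
```

An SMS test split could be mapped to the same two-column schema and passed to the Trainer as eval_dataset for the accuracy figure reported above.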
 
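As a side note on the per-class length histograms referenced in the Dataset Class Distribution section (characters and words per message, 100 bins), here is a small sketch of how such plots can be produced for one of the sources. The dataset id and column names are assumptions for illustration, not the author's actual preprocessing.

```python
# Hedged sketch: the 'ucirvine/sms_spam' id and its 'sms'/'label' columns are assumed.
import matplotlib.pyplot as plt
from datasets import load_dataset

df = load_dataset("ucirvine/sms_spam", split="train").to_pandas()
df["n_chars"] = df["sms"].str.len()
df["n_words"] = df["sms"].str.split().str.len()

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
for label, group in df.groupby("label"):
    axes[0].hist(group["n_chars"], bins=100, alpha=0.5, label=f"class {label}")
    axes[1].hist(group["n_words"], bins=100, alpha=0.5, label=f"class {label}")
axes[0].set_title("Characters per message")
axes[1].set_title("Words per message")
for ax in axes:
    ax.legend()
fig.tight_layout()
plt.show()
```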
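Since the card is tagged for Text Classification with Transformers, here is a minimal usage sketch via the text-classification pipeline. The model id below is a placeholder for this repository's actual id, and the returned label names depend on how the labels were configured during fine-tuning.

```python
# Hedged usage sketch: replace the placeholder model id with this repo's actual id.
from transformers import pipeline

spam_filter = pipeline("text-classification", model="ggrizzly/<this-model-repo>")

print(spam_filter("Congratulations! You have won a free cruise. Reply WIN to claim."))
print(spam_filter("Hey, are we still on for lunch tomorrow?"))
```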