Update README.md

library_name: transformers
---

# Is Spam all we need? A RoBERTa Based Approach To Spam Detection

## Intro

This is based on [this](https://huggingface.co/mshenoda/roberta-spam) Hugging Face model, but instead of fine-tuning on all the data sources the original author used, I fine-tuned only on the [telegram](https://huggingface.co/datasets/thehamkercat/telegram-spam-ham) and [enron](https://huggingface.co/datasets/SetFit/enron_spam) datasets. The idea was to use a more diverse mix of sources and avoid overfitting to the original distribution. The fine-tuning replicates the [sentiment analysis](https://huggingface.co/docs/transformers/main/en/model_doc/roberta#resources) Google Colab example (see the Code section for my version of it).
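
As a rough sketch of what that kind of fine-tuning looks like with the `transformers` `Trainer` API (the column handling, hyperparameters, and output path below are illustrative assumptions, not the exact settings used):

```python
from datasets import concatenate_datasets, load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Start from roberta-base with a fresh 2-class (ham/spam) classification head.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# The two training sources named above. The "text"/"label" column names are an
# assumption for illustration; the real datasets may need a rename/map step first.
telegram = load_dataset("thehamkercat/telegram-spam-ham", split="train")
enron = load_dataset("SetFit/enron_spam", split="train")
train_raw = concatenate_datasets(
    [telegram.select_columns(["text", "label"]), enron.select_columns(["text", "label"])]
).shuffle(seed=42)

def tokenize(batch):
    # Truncate to RoBERTa's 512-token window; dynamic padding happens in the collator.
    return tokenizer(batch["text"], truncation=True)

train_ds = train_raw.map(tokenize, batched=True)

# Hyperparameters in the spirit of the Colab example, not the exact values used.
args = TrainingArguments(
    output_dir="roberta-spam",  # hypothetical output path
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds, tokenizer=tokenizer)
trainer.train()
```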

**NOTE**: This was done for an interview project, so if you find this by chance... hopefully it helps you too, but know there are **definitely** better resources out there... and that this was done in the span of one evening.

## Metrics

**Accuracy**: 0.9503 (on the held-out SMS test set described below)

Thrilling, I know, I also just got the chills.
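
For reference, that number can be reproduced in spirit with `evaluate` and a `text-classification` pipeline; the checkpoint path and the 0/1 label mapping below are placeholders:

```python
import evaluate
from transformers import pipeline

# Placeholder path: wherever the fine-tuned checkpoint was saved or pushed.
clf = pipeline("text-classification", model="./roberta-spam", truncation=True)

texts = ["WINNER!! Claim your free prize now!!!", "Are we still on for lunch tomorrow?"]
references = [1, 0]  # assumed mapping: 1 = spam, 0 = ham

# Default labels look like "LABEL_0"/"LABEL_1"; strip the prefix to recover the id.
predictions = [int(out["label"].split("_")[-1]) for out in clf(texts)]
print(evaluate.load("accuracy").compute(predictions=predictions, references=references))
```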

The dataset used for testing was the original kaggle competition (as part of the …

1. SMS Spam Collection https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset

## Dataset Class Distribution

|        | Total | Training      | Testing     |
|:------:|:-----:|:-------------:|:-----------:|
| Counts | 59267 | 53693 (90.6%) | 5574 (9.4%) |

|          | Total | Spam          | Ham           | Set   | % of Set |
|:--------:|:-----:|:-------------:|:-------------:|:-----:|:--------:|
| Enron    | 33345 | 16852 (50.5%) | 16493 (49.5%) | Train | 62.1%    |
| Telegram | 20348 | 6011 (29.5%)  | 14337 (70.5%) | Train | 37.9%    |
| SMS      | 5574  | 747 (13.5%)   | 4827 (86.5%)  | Test  | 100%     |
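
The per-source breakdown above can be regenerated with a small helper along these lines (a sketch; the binary `label` encoding and how each DataFrame is loaded are assumptions):

```python
import pandas as pd

def class_distribution(sources: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Summarise spam/ham counts per source, mirroring the table above."""
    rows = []
    for name, df in sources.items():
        spam = int((df["label"] == 1).sum())  # assumed encoding: 1 = spam, 0 = ham
        ham = len(df) - spam
        rows.append({
            "Source": name,
            "Total": len(df),
            "Spam": f"{spam} ({spam / len(df):.1%})",
            "Ham": f"{ham} ({ham / len(df):.1%})",
        })
    return pd.DataFrame(rows)

# Toy stand-ins; in practice these would be the full Enron, Telegram and SMS frames.
toy = {
    "Enron": pd.DataFrame({"label": [1, 0, 1, 0]}),
    "SMS": pd.DataFrame({"label": [0, 0, 0, 1]}),
}
print(class_distribution(toy))
```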

|          | Distribution of number of characters per class label (100 bins) | Distribution of number of words per class label (100 bins) |
|:--------:|:----------------------------------------------------------------:|:-----------------------------------------------------------:|
| SMS      | ![image/png](https://cdn-uploads.huggingface.co/production/uploads/644ef0eb1565b54e4a656946/OjLvujmQyeQPlowW5lI5A.png) | ![image/png](https://cdn-uploads.huggingface.co/production/uploads/644ef0eb1565b54e4a656946/RFs92xoeIUDAsry6T1Ec4.png) |
| Enron (limiting a few outliers) | ![image/png](https://cdn-uploads.huggingface.co/production/uploads/644ef0eb1565b54e4a656946/Gd7le3W2U05DaQtjb971o.png) | ![image/png](https://cdn-uploads.huggingface.co/production/uploads/644ef0eb1565b54e4a656946/A40RySWIPWAcwSyKGh-rm.png) |
| Telegram | ![image/png](https://cdn-uploads.huggingface.co/production/uploads/644ef0eb1565b54e4a656946/ZqMEzunZbhwqOkBUpzv81.png) | ![image/png](https://cdn-uploads.huggingface.co/production/uploads/644ef0eb1565b54e4a656946/v0Y3MRgXUjRUX0prULu0v.png) |

Note the tails in these, very interesting distributions. But more so, it is good to see [Benford's law](https://en.wikipedia.org/wiki/Benford%27s_law) alive and well in them.
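
The histograms were presumably generated with something along these lines (a matplotlib sketch, 100 bins as in the captions; the `text`/`label` column names are assumptions):

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_length_distributions(df: pd.DataFrame, source: str) -> None:
    """Histograms of characters and words per message, split by class label (100 bins)."""
    df = df.assign(n_chars=df["text"].str.len(),
                   n_words=df["text"].str.split().str.len())
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    for column, ax, unit in [("n_chars", axes[0], "characters"),
                             ("n_words", axes[1], "words")]:
        for label, group in df.groupby("label"):
            ax.hist(group[column], bins=100, alpha=0.5, label=f"label={label}")
        ax.set_title(f"{source}: {unit} per message")
        ax.set_xlabel(unit)
        ax.legend()
    plt.tight_layout()
    plt.show()

# Example with toy rows; in practice df is the SMS, Enron, or Telegram frame.
toy = pd.DataFrame({"text": ["WINNER!! Free prize", "see you at noon"], "label": [1, 0]})
plot_length_distributions(toy, "toy")
```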

## Architecture

The model is a fine-tuned RoBERTa, starting from the roberta-base checkpoint.

roberta-base: https://huggingface.co/roberta-base

paper: https://arxiv.org/abs/1907.11692
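
Concretely, that means the roberta-base encoder with a small sequence-classification head on top; a quick, hedged way to inspect it:

```python
from transformers import AutoModelForSequenceClassification

# roberta-base encoder plus a freshly initialised 2-class head (before any fine-tuning).
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
print(model.classifier)  # RobertaClassificationHead: dense -> dropout -> out_proj (2 logits)
```
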
## Code

TODO: Include the Jupyter notebook / code on GitHub.