Text Classification
Transformers
English
Inference Endpoints
ggrizzly committed (verified)
Commit f92b969 · 1 Parent(s): d10bbc5

Update README.md

Files changed (1)
  1. README.md +11 -15
README.md CHANGED
@@ -25,17 +25,12 @@ results:
  library_name: transformers
  ---
  # Is Spam all we need? A RoBERTa Based Approach To Spam Detection

- ### Intro
- This is based on [this](https://huggingface.co/mshenoda/roberta-spam) huggingFace model, but instead of fine-tuning it on all the data sources that the original author had, I only finetuned using the [telegram](https://huggingface.co/datasets/thehamkercat/telegram-spam-ham) and [enron](https://huggingface.co/datasets/SetFit/enron_spam) datasets.
-
- The idea behind this was a more diversified data source, preventing overfitting to the original distribution.
-
- This was done for an interview project, so if you find this by chance... hopefully it helps you too, but know there's **definitely** better resources out there... and that this was done in the span of one evening.
-
- This was fine-tuned by replicating the [sentiment analysis](https://huggingface.co/docs/transformers/main/en/model_doc/roberta#resources) Google collab example.
-
- ### Metrics
+ ## Intro
+ This is based on [this](https://huggingface.co/mshenoda/roberta-spam) Hugging Face model, but instead of fine-tuning it on all the data sources that the original author had, I only fine-tuned using the [telegram](https://huggingface.co/datasets/thehamkercat/telegram-spam-ham) and [enron](https://huggingface.co/datasets/SetFit/enron_spam) datasets. The idea behind this was a more diversified data source, preventing overfitting to the original distribution. This was fine-tuned by replicating the [sentiment analysis](https://huggingface.co/docs/transformers/main/en/model_doc/roberta#resources) Google Colab example (see the Code section for my version of it).
+
+ **NOTE**: This was done for an interview project, so if you find this by chance... hopefully it helps you too, but know there are **definitely** better resources out there... and that this was done in the span of one evening.
+
+ ## Metrics
  **Accuracy**: 0.9503
  Thrilling, I know, I also just got the chills.

@@ -54,23 +49,25 @@ The dataset used for testing was the original kaggle competition (as part of the

  1. SMS Spam Collection https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset

- ### Dataset Class Distribution
+ ## Dataset Class Distribution

  | | Total | Training | Testing |
  |:--------:|:-----:|:--------------:|:-----------:|
  | Counts | 59267 | 53693 (90.6%) | 5574 (9.4%) |

-
-
-
  | | Total | Spam | Ham | Set | % Total |
  |:--------:|:-----:|:-------------:|:-------------:|:-----:|:-------:|
  | Enron | 33345 | 16852 (50.5%) | 16493 (49.5%) | Train | 56.2% |
  | Telegram | 20348 | 6011 (29.5%) | 14337 (70.5%) | Train | 43.8% |
  | SMS | 5574 | 747 (13.5%) | 4827 (86.5%) | Test | 100% |

+ | | Distribution of number of characters per class label (100 bins) | Distribution of number of words per class label (100 bins) |
+ |:--------:|:---------------------------------------------------------------:|:----------------------------------------------------------:|
+ | SMS | ![image/png](https://cdn-uploads.huggingface.co/production/uploads/644ef0eb1565b54e4a656946/OjLvujmQyeQPlowW5lI5A.png) | ![image/png](https://cdn-uploads.huggingface.co/production/uploads/644ef0eb1565b54e4a656946/RFs92xoeIUDAsry6T1Ec4.png) |
+ | Enron (limiting a few outliers) | ![image/png](https://cdn-uploads.huggingface.co/production/uploads/644ef0eb1565b54e4a656946/Gd7le3W2U05DaQtjb971o.png) | ![image/png](https://cdn-uploads.huggingface.co/production/uploads/644ef0eb1565b54e4a656946/A40RySWIPWAcwSyKGh-rm.png) |
+ | Telegram | ![image/png](https://cdn-uploads.huggingface.co/production/uploads/644ef0eb1565b54e4a656946/ZqMEzunZbhwqOkBUpzv81.png) | ![image/png](https://cdn-uploads.huggingface.co/production/uploads/644ef0eb1565b54e4a656946/v0Y3MRgXUjRUX0prULu0v.png) |

+ ^ Note the tails: very interesting distributions. More to the point, it's good to see [Benford's law](https://en.wikipedia.org/wiki/Benford's_law) is alive and well in these.
-

  ## Architecture
  The model is fine tuned RoBERTa
@@ -80,5 +77,4 @@ roberta-base: https://huggingface.co/roberta-base
  paper: https://arxiv.org/abs/1907.11692

  ## Code
-
  TODO: Include Jupyter / code on Github
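The updated card's Code section is still a TODO, so the following is only a rough sketch of the recipe the Intro describes: fine-tuning roberta-base as a binary spam/ham classifier on the telegram and enron datasets with the transformers Trainer, in the spirit of the RoBERTa sentiment-analysis Colab it references. The column names, label mapping, output path, and hyperparameters below are assumptions for illustration, not the author's actual settings.

```python
# Hedged sketch only: column names, label values, and hyperparameters are assumed.
from datasets import concatenate_datasets, load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# The two training sources named in the Intro.
telegram = load_dataset("thehamkercat/telegram-spam-ham", split="train")
enron = load_dataset("SetFit/enron_spam", split="train")

def to_common_schema(dataset, text_col, label_col, spam_value):
    """Reduce a split to two columns: 'text' and an integer 'label' (1 = spam, 0 = ham).
    The source column names passed in are assumptions about each dataset."""
    return dataset.map(
        lambda ex: {"text": ex[text_col], "label": int(ex[label_col] == spam_value)},
        remove_columns=dataset.column_names,
    )

train_ds = concatenate_datasets([
    to_common_schema(telegram, "text", "text_type", "spam"),
    to_common_schema(enron, "text", "label_text", "spam"),
]).shuffle(seed=42)

# Tokenize; padding is handled per batch by the Trainer's default data collator.
train_ds = train_ds.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
)

args = TrainingArguments(
    output_dir="roberta-spam-telegram-enron",  # hypothetical output path
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=2,
)

Trainer(model=model, args=args, train_dataset=train_ds, tokenizer=tokenizer).train()
```

An SMS test split could be mapped to the same two-column schema and passed to the Trainer as eval_dataset for the accuracy figure reported above.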
 
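As a side note on the per-class length histograms referenced in the Dataset Class Distribution section (characters and words per message, 100 bins), here is a small sketch of how such plots can be produced for one of the sources. The dataset id and column names are assumptions for illustration, not the author's actual preprocessing.

```python
# Hedged sketch: the 'ucirvine/sms_spam' id and its 'sms'/'label' columns are assumed.
import matplotlib.pyplot as plt
from datasets import load_dataset

df = load_dataset("ucirvine/sms_spam", split="train").to_pandas()
df["n_chars"] = df["sms"].str.len()
df["n_words"] = df["sms"].str.split().str.len()

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
for label, group in df.groupby("label"):
    axes[0].hist(group["n_chars"], bins=100, alpha=0.5, label=f"class {label}")
    axes[1].hist(group["n_words"], bins=100, alpha=0.5, label=f"class {label}")
axes[0].set_title("Characters per message")
axes[1].set_title("Words per message")
for ax in axes:
    ax.legend()
fig.tight_layout()
plt.show()
```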
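Since the card is tagged for Text Classification with Transformers, here is a minimal usage sketch via the text-classification pipeline. The model id below is a placeholder for this repository's actual id, and the returned label names depend on how the labels were configured during fine-tuning.

```python
# Hedged usage sketch: replace the placeholder model id with this repo's actual id.
from transformers import pipeline

spam_filter = pipeline("text-classification", model="ggrizzly/<this-model-repo>")

print(spam_filter("Congratulations! You have won a free cruise. Reply WIN to claim."))
print(spam_filter("Hey, are we still on for lunch tomorrow?"))
```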