Update README.md
README.md
CHANGED
@@ -6,9 +6,10 @@ Spam messages frequently carry malicious links or phishing attempts posing signi
 
 ## Dataset
 The dataset is composed of messages labeled by ham or spam, merged from three data sources:
-
-
-
+
+1. SMS Spam Collection https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
+2. Telegram Spam Ham https://huggingface.co/datasets/thehamkercat/telegram-spam-ham/tree/main
+3. Enron Spam: https://huggingface.co/datasets/SetFit/enron_spam/tree/main (only used message column and labels)
 
 The prepare script for enron is available at https://github.com/mshenoda/roberta-spam/tree/main/data/enron.
 The data is split 80% train 10% validation, and 10% test sets; the scripts used to split and merge of the three data sources are available at: https://github.com/mshenoda/roberta-spam/tree/main/data/utils.
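
The README text in this diff describes merging three sources and splitting the result 80% / 10% / 10%. The following is a minimal sketch of that idea, not the scripts in the linked repository. The function names `merge_sources` and `split_train_val_test`, the `(text, label)` record shape, and the fixed shuffle seed are all assumptions made for illustration.

```python
import random


def merge_sources(*sources):
    """Concatenate (text, label) records from several sources into one list.

    Hypothetical helper: the record shape is an assumption, not the
    repository's actual data format.
    """
    merged = []
    for source in sources:
        merged.extend(source)
    return merged


def split_train_val_test(records, train_pct=80, val_pct=10, seed=42):
    """Shuffle records and split them into train, validation, and test lists.

    Percentages are integers so the split sizes come from integer arithmetic;
    whatever remains after train and validation goes to the test set.
    """
    shuffled = list(records)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_train = len(shuffled) * train_pct // 100
    n_val = len(shuffled) * val_pct // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


if __name__ == "__main__":
    # Toy stand-ins for the three sources; the real data is not reproduced here.
    sms = [(f"sms {i}", "ham") for i in range(50)]
    telegram = [(f"telegram {i}", "spam") for i in range(30)]
    enron = [(f"enron {i}", "ham") for i in range(20)]
    train, val, test = split_train_val_test(merge_sources(sms, telegram, enron))
    print(len(train), len(val), len(test))
```

A plain shuffle does not preserve the ham/spam ratio in each subset; a stratified split would be a reasonable alternative if class balance matters.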