# SVM Model with TF-IDF
This repository provides a pre-trained Support Vector Machine (SVM) model for text classification using Term Frequency-Inverse Document Frequency (TF-IDF). The repository also includes utilities for data preprocessing and feature extraction.
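As background, the TF-IDF-plus-SVM approach packaged here can be sketched with plain scikit-learn. This is an illustrative sketch only: the class names below are standard scikit-learn, not this repository's code, and the tiny corpus is made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy corpus standing in for real headlines (hypothetical data)
headlines = [
    "stocks rally as markets open higher",
    "team wins championship in overtime thriller",
    "central bank raises interest rates again",
    "star player traded before season deadline",
]
labels = ["business", "sports", "business", "sports"]

vectorizer = TfidfVectorizer()           # term frequency x inverse document frequency
X = vectorizer.fit_transform(headlines)  # sparse document-term matrix
clf = LinearSVC()                        # linear support vector machine
clf.fit(X, labels)

# New text must be transformed with the SAME fitted vectorizer
X_new = vectorizer.transform(["markets fall on rate fears"])
print(clf.predict(X_new))
```

The ```tfidf.py``` and ```svm.py``` files in this repository expose pre-fitted counterparts of these two objects, so you can transform and predict without re-training.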

## Start:
<br>Open your terminal. 
<br> Clone the repo by using the following command:
```
git clone https://huggingface.co/CIS5190abcd/svm
```
<br> Go to the svm directory using the following command:
```
cd svm
```
<br> Run ```ls``` to check the files inside the svm folder. Make sure ```tfidf.py```, ```svm.py``` and ```data_cleaning.py``` exist in this directory. If not, run the following commands:
```
git checkout origin/main -- tfidf.py
git checkout origin/main -- svm.py
git checkout origin/main -- data_cleaning.py
```
<br> Rerun ```ls``` and double-check that all the required files (```tfidf.py```, ```svm.py``` and ```data_cleaning.py```) exist. The output should look like this:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6755cffd784ff7ea9db10bd4/O9K5zYm7TKiIg9cYZpV1x.png)
<br> Stay inside the svm directory until the end of this guide.

## Installation
<br>Before running the code, ensure you have all the required libraries installed:

```
pip install nltk beautifulsoup4 scikit-learn pandas datasets fsspec huggingface_hub
```
<br> Start a Python interpreter directly in the terminal by typing the following command:
```
python
```
<br> Download the necessary NLTK resources for preprocessing.

```python
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
```
<br> After downloading all the required packages, **do not** exit the Python interpreter.


## How to use:
To run the existing SVM model on a new dataset, follow the steps below:

- Clean the Dataset
```python
from data_cleaning import clean
import pandas as pd
import nltk
nltk.download('stopwords')
```
<br> You can use any dataset you want by changing the file name inside ```pd.read_csv()```.
```python
df = pd.read_csv("hf://datasets/CIS5190abcd/headlines_test/test_cleaned_headlines.csv")
cleaned_df = clean(df)
```

- Extract TF-IDF Features
```python
from tfidf import tfidf

X_new_tfidf = tfidf.transform(cleaned_df['title'])
```

- Make Predictions

Assuming ```svm_model``` in ```svm.py``` is a fitted scikit-learn-style classifier (an assumption based on the repository description), call its ```predict``` method on the TF-IDF features:
```python
from svm import svm_model

# Predict labels for the TF-IDF features extracted above
predictions = svm_model.predict(X_new_tfidf)
print(predictions)
```
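If your dataset includes ground-truth labels, you can score the predictions with scikit-learn's ```accuracy_score```. The label values below are hypothetical; in practice you would substitute your own label column and the predictions from ```svm_model```:

```python
from sklearn.metrics import accuracy_score

# Hypothetical ground truth and predictions; in practice use something like
#   accuracy_score(cleaned_df['label'], svm_model.predict(X_new_tfidf))
y_true = ["sports", "business", "sports", "business"]
y_pred = ["sports", "business", "business", "business"]

print(accuracy_score(y_true, y_pred))  # 0.75 (3 of 4 correct)
```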
<br> Run ```exit()``` if you want to leave Python.

<br> Run ```cd ..``` if you want to exit the svm directory.