# SVM Model with TF-IDF
This repository provides a pre-trained Support Vector Machine (SVM) model for text classification using Term Frequency-Inverse Document Frequency (TF-IDF) features. The repository also includes utilities for data preprocessing and feature extraction.

## Start:
<br>Open your terminal. 
<br> Clone the repo with the following command:
```
git clone https://huggingface.co/CIS5190abcd/svm
```
<br> Go to the svm directory using the following command:
```
cd svm
```
<br> Run ```ls``` to check the files inside the svm folder. Make sure ```tfidf.py```, ```svm.py``` and ```data_cleaning.py``` exist in this directory. If any are missing, run the following commands:
```
git checkout origin/main -- tfidf.py
git checkout origin/main -- svm.py
git checkout origin/main -- data_cleaning.py
```
<br> Rerun ```ls``` and double-check that all the required files (```tfidf.py```, ```svm.py``` and ```data_cleaning.py```) exist. The output should look like this:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6755cffd784ff7ea9db10bd4/O9K5zYm7TKiIg9cYZpV1x.png)
<br> Stay inside the svm directory until the end of these instructions. 

## Installation
<br>Before running the code, ensure you have all the required libraries installed:

```
pip install nltk beautifulsoup4 scikit-learn pandas datasets
```
<br> Start a Python interpreter:
```
python
```
<br> Download the necessary NLTK resources for preprocessing: 

```python
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
```
<br> After downloading all the required resources, **do not** exit the interpreter. 


## How to use:
To run the existing SVM model on a new dataset, follow the steps below:

- Clean the Dataset
```python
from data_cleaning import clean
import pandas as pd
import nltk
nltk.download('stopwords')
```
<br> You can use any dataset you want by changing the file path inside ```pd.read_csv()```.
```python
# Load your data
df = pd.read_csv("hf://datasets/CIS5190abcd/headlines_test/test_cleaned_headlines.csv")

# Clean the data
cleaned_df = clean(df)

```

- Extract TF-IDF Features
```python
from tfidf import tfidf

# Transform the cleaned dataset
X_new_tfidf = tfidf.transform(cleaned_df['title'])

```

- Make Predictions
```python
# Make predictions with the pre-trained model
from svm import svm_model

predictions = svm_model.predict(X_new_tfidf)
```
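For context, the repository's ```tfidf``` and ```svm_model``` objects behave like a fitted scikit-learn vectorizer and classifier. The following is a rough, self-contained sketch of the same TF-IDF → SVM pipeline using scikit-learn directly; the toy corpus and labels are invented for illustration and are not part of this repository:

```python
# Self-contained sketch of a TF-IDF -> SVM text classification pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Tiny made-up training set: headline text -> label
train_texts = [
    "stocks rally as markets surge",
    "team wins championship final",
    "central bank raises interest rates",
    "star player scores winning goal",
]
train_labels = ["business", "sports", "business", "sports"]

# Fit the TF-IDF vectorizer on the training text
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Train a linear SVM on the TF-IDF features
clf = LinearSVC()
clf.fit(X_train, train_labels)

# Transform new text with the *fitted* vectorizer, then predict
X_new = vectorizer.transform(["markets fall on rate fears"])
predictions = clf.predict(X_new)
```

Note that new text is passed through ```transform()``` on the already-fitted vectorizer, never ```fit_transform()``` — the same pattern the ```tfidf.transform()``` call above follows.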
Run ```exit()``` if you want to leave Python.

Run ```cd ..``` if you want to exit the svm directory.