# SVM Model with TF-IDF
This repository provides a pre-trained Support Vector Machine (SVM) model for text classification using Term Frequency-Inverse Document Frequency (TF-IDF) features. It also includes utilities for data preprocessing and feature extraction.
## Installation
<br>Before running the code, ensure you have all the required libraries installed:

```bash
pip install nltk beautifulsoup4 scikit-learn pandas
```
<br> Download the necessary NLTK resources for preprocessing:
```python
import nltk
nltk.download('stopwords')
nltk.download('wordnet')

```
## How to Use
1. Data Cleaning
<br> The data_cleaning.py file contains a clean() function to preprocess the input dataset:
- Removes HTML tags.
- Removes non-alphanumeric characters and extra spaces.
- Converts text to lowercase.
- Removes stopwords.
- Lemmatizes words.

```python
from data_cleaning import clean
import pandas as pd

# Load your data
df = pd.read_csv('test_data_random_subset.csv')

# Clean the data
cleaned_df = clean(df)

```
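The body of `clean()` is not shown here, so the following is only a minimal sketch of the five steps listed above, applied to a single string. It is deliberately standard-library only: the regex tag stripper stands in for BeautifulSoup, `DEMO_STOPWORDS` is a tiny hypothetical stand-in for NLTK's English stopword list, and the `lemmatize` parameter defaults to a no-op where the repository presumably uses NLTK's WordNet lemmatizer. The name `clean_text` is an assumption, not part of the repository.

```python
import re
from html import unescape

# Tiny placeholder stopword set (assumption); the real code presumably
# uses nltk.corpus.stopwords.words('english').
DEMO_STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in"}


def clean_text(text, stopwords=DEMO_STOPWORDS, lemmatize=lambda word: word):
    """Apply the five cleaning steps to one string (illustrative only)."""
    text = re.sub(r"<[^>]+>", " ", text)         # 1. strip HTML tags (crude regex stand-in)
    text = unescape(text)                        #    decode entities such as &amp;
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)  # 2. drop non-alphanumeric characters
    text = text.lower()                          # 3. lowercase
    tokens = text.split()                        #    split() also collapses extra spaces
    tokens = [t for t in tokens if t not in stopwords]  # 4. remove stopwords
    tokens = [lemmatize(t) for t in tokens]      # 5. lemmatize (identity by default)
    return " ".join(tokens)


print(clean_text("<p>The <b>Markets</b> are   RISING &amp; fast!</p>"))
# markets rising fast
```

To apply a function like this to a whole column, something along the lines of `df['title'].apply(clean_text)` would work; passing `WordNetLemmatizer().lemmatize` as `lemmatize` swaps in real lemmatization.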

2. TF-IDF Feature Extraction
<br> The tfidf.py file contains the TF-IDF vectorization logic. It converts cleaned text data into numerical features suitable for training and testing the SVM model.
```python
from tfidf import tfidf

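# X_train and X_test are assumed to be DataFrames from your own train/test split.
# Fit the vocabulary on the training titles, then reuse it on the test titles.
X_train_tfidf = tfidf.fit_transform(X_train['title'])
X_test_tfidf = tfidf.transform(X_test['title'])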
```
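The snippet above uses `X_train` and `X_test` without showing where they come from. As a rough sketch, assuming `tfidf` is a scikit-learn `TfidfVectorizer` instance and that labels live in a column called `label` (a hypothetical name, as is the toy data), the split and vectorization could look like this:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 'title' matches the column used above,
# 'label' is an assumed name for the target column.
df = pd.DataFrame({
    "title": [
        "stocks rally on earnings", "team wins final match",
        "markets fall after report", "coach praises young players",
        "bank profits rise sharply", "striker scores late goal",
    ],
    "label": [0, 1, 0, 1, 0, 1],
})

# Hold out two rows for testing, keeping both classes represented.
X_train, X_test, y_train, y_test = train_test_split(
    df[["title"]], df["label"],
    test_size=2, random_state=42, stratify=df["label"],
)

# Assumed equivalent of the object exported by tfidf.py.
tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(X_train["title"])  # learn vocabulary + IDF weights
X_test_tfidf = tfidf.transform(X_test["title"])        # reuse them, no refitting

print(X_train_tfidf.shape, X_test_tfidf.shape)
```

Fitting only on the training split keeps test-set vocabulary from leaking into the features; both matrices end up with the same number of columns.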
3. Training and Testing the SVM Model
<br> The svm.py file contains the logic for training and testing the SVM model. It uses the TF-IDF-transformed features to classify text data.
```python
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# Train the SVM model
svm_model = SVC(kernel='linear', random_state=42)
svm_model.fit(X_train_tfidf, y_train)

# Predict and evaluate
y_pred = svm_model.predict(X_test_tfidf)
accuracy = accuracy_score(y_test, y_pred)
print(f"SVM Accuracy: {accuracy:.4f}")
print(classification_report(y_test, y_pred))
```

4. Testing a new dataset with the pre-trained model
<br>To evaluate the pre-trained model on a new dataset, follow the steps below:

- Clean the Dataset
```python
from data_cleaning import clean
import pandas as pd

# Load your dataset
df = pd.read_csv('test_data_random_subset.csv')

# Clean the data
cleaned_df = clean(df)

```

- Extract TF-IDF Features
```python
from tfidf import tfidf

# Transform the cleaned dataset
X_new_tfidf = tfidf.transform(cleaned_df['title'])

```

- Make Predictions
```python
from sklearn.metrics import accuracy_score
from svm import svm_model

# Make predictions
predictions = svm_model.predict(X_new_tfidf)

# Calculate the accuracy score (y_new holds the true labels of the new dataset)
accuracy = accuracy_score(y_new, predictions)
print(f"Accuracy Score: {accuracy:.4f}")

```
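How the fitted model and vectorizer are stored is not described here. One common approach, shown below purely as a sketch, is to save both together with `joblib` and load them before predicting, so that new text is transformed with the same vocabulary the SVM was trained on. The file name, the dictionary keys, and the toy data are all hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

# Hypothetical toy training data standing in for the real cleaned titles.
train_titles = [
    "stocks rally on earnings", "team wins final match",
    "markets fall after report", "coach praises young players",
]
train_labels = [0, 1, 0, 1]

# Fit the vectorizer and the SVM once.
tfidf = TfidfVectorizer()
svm_model = SVC(kernel="linear", random_state=42)
svm_model.fit(tfidf.fit_transform(train_titles), train_labels)

# Persist both objects together (hypothetical file name and keys).
model_path = os.path.join(tempfile.mkdtemp(), "svm_tfidf.joblib")
joblib.dump({"tfidf": tfidf, "svm": svm_model}, model_path)

# Later: load the saved pair and score a new labelled dataset.
loaded = joblib.load(model_path)
new_titles = ["markets rally after earnings", "young players win match"]
y_new = [0, 1]  # true labels of the new data (assumed to be available)

X_new_tfidf = loaded["tfidf"].transform(new_titles)  # transform only, never refit
predictions = loaded["svm"].predict(X_new_tfidf)
accuracy = accuracy_score(y_new, predictions)
print(f"Accuracy Score: {accuracy:.4f}")
```

Saving the vectorizer alongside the model matters because calling `fit_transform` on new data would build a different vocabulary, and the resulting columns would no longer match what the SVM expects.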