SVM Model with TF-IDF
This repository provides a pre-trained Support Vector Machine (SVM) model for text classification using Term Frequency-Inverse Document Frequency (TF-IDF) features. The repository also includes utilities for data preprocessing and feature extraction.
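For readers new to TF-IDF, here is a minimal sketch of the textbook weighting (term frequency times the log of inverse document frequency) in plain Python. It is not the code in this repository's tfidf.py; the function name `tfidf_weights` and the sample headlines are made up for illustration, and scikit-learn's vectorizer applies smoothing and normalization by default, so its numbers will differ.

```python
import math
from collections import Counter


def tfidf_weights(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical headlines, used only to show the effect of the weighting.
headlines = [
    "markets rally as rates fall",
    "rates rise as markets slip",
    "local team wins final",
]
weights = tfidf_weights(headlines)
# "rally" occurs in one headline and "markets" in two, so "rally" gets the
# larger weight in the first headline.
print(weights[0]["rally"] > weights[0]["markets"])
```

Words that are rare across the collection get larger weights than words shared by many documents, which is what makes the features useful for classification.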
There are two ways to test our model:
1. Colab (see the file in this repository for what the Colab looks like)
Start:
Download all the files.
Copy all the code below into Colab.
Before running the code, ensure you have all the required libraries installed:
pip install nltk beautifulsoup4 scikit-learn pandas datasets fsspec huggingface_hub
Download the necessary NLTK resources for preprocessing.
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
- Clean the Dataset
from data_cleaning import clean
import pandas as pd
import nltk
nltk.download('stopwords')
You can substitute any dataset you want by changing the file name inside pd.read_csv().
df = pd.read_csv("hf://datasets/CIS5190abcd/headlines_test/test_cleaned_headlines.csv")
cleaned_df = clean(df)
- Extract TF-IDF Features
from tfidf import tfidf
X_new_tfidf = tfidf.transform(cleaned_df['title'])
- Make Predictions
from svm import svm_model
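The prediction step above only imports svm_model. If svm_model is a fitted scikit-learn classifier (an assumption, not confirmed by this README), the usual next call would be its predict method on the TF-IDF matrix. The sketch below shows that pattern with stand-in objects trained on a few made-up headlines and labels; `vectorizer` and `model` are placeholders for the repository's tfidf and svm_model, not the pre-trained ones.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Made-up training data; the real model in this repo is already trained.
train_titles = [
    "stocks climb after earnings report",
    "central bank holds interest rates",
    "striker scores twice in cup final",
    "coach praises team after victory",
]
train_labels = ["finance", "finance", "sports", "sports"]

# Stand-ins for the repo's `tfidf` and `svm_model` objects (assumption).
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_titles)
model = LinearSVC()
model.fit(X_train, train_labels)

# New data is passed through transform(), not fit_transform(), so it is
# mapped onto the vocabulary learned at training time.
new_titles = ["bank reports strong earnings", "team wins the final"]
X_new = vectorizer.transform(new_titles)
predictions = model.predict(X_new)
print(list(zip(new_titles, predictions)))
```

Using transform() on new data matters because the classifier expects the same feature columns it saw during training.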
2. Terminal
Start:
Open your terminal.
Clone the repo using the following command:
git clone https://huggingface.co/CIS5190abcd/svm
Go to the svm directory using the following command:
cd svm
Run ls to check the files inside the svm folder. Make sure tfidf.py, svm.py, and data_cleaning.py exist in this directory. If not, run the following commands:
git checkout origin/main -- tfidf.py
git checkout origin/main -- svm.py
git checkout origin/main -- data_cleaning.py
Rerun ls and double-check that all the required files (tfidf.py, svm.py, and data_cleaning.py) exist.
Stay inside the svm directory until you are finished.
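As an alternative to reading the ls output by eye, the same check can be scripted. This is a small sketch using only the standard library; the helper name `missing_files` is hypothetical and not part of this repository.

```python
from pathlib import Path

# The three files this README says must be present in the svm directory.
REQUIRED_FILES = ("tfidf.py", "svm.py", "data_cleaning.py")


def missing_files(directory=".", required=REQUIRED_FILES):
    """Return the required file names that are not present in `directory`."""
    base = Path(directory)
    return [name for name in required if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All required files are present.")
```

Run it from inside the svm directory; any file it reports as missing can be restored with the git checkout commands above.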
Installation
Before running the code, ensure you have all the required libraries installed:
pip install nltk beautifulsoup4 scikit-learn pandas datasets fsspec huggingface_hub
Open a Python session directly in the terminal by typing the following command:
python
Download the necessary NLTK resources for preprocessing.
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
After downloading the required resources, do not exit Python.
How to use:
To run the existing SVM model on a new dataset, follow the steps below:
- Clean the Dataset
from data_cleaning import clean
import pandas as pd
import nltk
nltk.download('stopwords')
You can substitute any dataset you want by changing the file name inside pd.read_csv().
df = pd.read_csv("hf://datasets/CIS5190abcd/headlines_test/test_cleaned_headlines.csv")
cleaned_df = clean(df)
- Extract TF-IDF Features
from tfidf import tfidf
X_new_tfidf = tfidf.transform(cleaned_df['title'])
- Make Predictions
from svm import svm_model
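If you point pd.read_csv() at a local CSV of your own, note that the feature-extraction step reads the title column. A quick header check before loading may save a confusing error later. The sketch below uses only the standard library; `has_column` is a hypothetical helper, and it assumes the cleaning step keeps the column name unchanged.

```python
import csv


def has_column(csv_path, column="title"):
    """Return True if the CSV's header row contains `column`."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        # Read only the first row; an empty file yields an empty header.
        header = next(csv.reader(handle), [])
    return column in header
```

For example, has_column("my_headlines.csv") (a placeholder file name) returns False when the header row lacks title, in which case the column would need renaming before the steps above are run.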
To leave Python, run:
exit()
To leave the svm directory, run:
cd ..