# SVM Model with TF-IDF
Step-by-step instructions:
1. Install Required Packages
Before running the code, install the dependencies with pip:
pip install nltk beautifulsoup4 scikit-learn pandas
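Import the modules used by the pipeline and download the NLTK data that clean() depends on:
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from bs4 import BeautifulSoup
import re
import pandas as pd
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
# One-time download of the stopword list and the WordNet data
nltk.download('stopwords')
nltk.download('wordnet')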
2. File Description
- config.json: Configuration file for model and dataset parameters.
- ml.py: Python script containing the machine learning pipeline.
- model.pkl: Trained SVM model saved as a pickle file.
- tfidf.pkl: TF-IDF vectorizer saved as a pickle file.
- README.md: Documentation for the repository.
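The saved model.pkl and tfidf.pkl can be reused for prediction without retraining. The following is a minimal sketch, not the code in ml.py: load_pickle and predict_titles are hypothetical helper names, and it assumes both files were written with the standard pickle module and that the titles passed in have already been cleaned the same way as the training data.

```python
import pickle
from pathlib import Path


def load_pickle(path):
    """Load a single pickled object from disk."""
    with Path(path).open("rb") as f:
        return pickle.load(f)


def predict_titles(model, vectorizer, titles):
    """Vectorize already-cleaned titles and return the model's predictions.

    Both arguments are duck-typed: the vectorizer needs .transform()
    and the model needs .predict(), as scikit-learn objects do.
    """
    features = vectorizer.transform(titles)
    return model.predict(features)
```

Typical use would be `predict_titles(load_pickle("model.pkl"), load_pickle("tfidf.pkl"), titles)`. Only unpickle files you trust, since loading a pickle can execute arbitrary code.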
3. Data Cleaning
The clean() function preprocesses the title column of the input DataFrame before training. This includes:
- Removing HTML tags using BeautifulSoup.
- Removing non-alphanumeric characters and extra spaces.
- Converting text to lowercase.
- Removing stopwords using NLTK.
- Lemmatizing words using WordNetLemmatizer.
- Dropping rows whose cleaned titles are duplicates.
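def clean(df):
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()
    cleaned_headlines = []
    for headline in df['title']:
        # Strip HTML tags, keeping only the visible text
        headline = BeautifulSoup(headline, 'html.parser').get_text()
        # Drop everything except letters, digits, and whitespace
        headline = re.sub(r'[^a-zA-Z0-9\s]', '', headline)
        # Collapse runs of whitespace into single spaces
        headline = re.sub(r'\s+', ' ', headline).strip()
        headline = headline.lower()
        words = headline.split()
        # Remove stopwords, then lemmatize the remaining words
        words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
        cleaned_headline = ' '.join(words)
        # Collect the result so it can be written back to the DataFrame
        cleaned_headlines.append(cleaned_headline)
    df['title'] = cleaned_headlines
    df.drop_duplicates(subset=['title'], inplace=True)
    return df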
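To make the regex-based steps concrete, here is a small walk-through on a made-up headline. It is only an illustration: basic_clean is a hypothetical helper that applies the punctuation, whitespace, and lowercase steps with the standard library, and it deliberately leaves out the BeautifulSoup and NLTK steps so it runs without any downloads.

```python
import re


def basic_clean(text):
    """Apply the punctuation, whitespace, and lowercase steps to one string."""
    # Keep only letters, digits, and whitespace
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # Collapse runs of whitespace and trim the ends
    text = re.sub(r'\s+', ' ', text).strip()
    return text.lower()


sample = "Breaking:   Stocks  RALLY 5%!!"
print(basic_clean(sample))  # breaking stocks rally 5
```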
4. Run the SVM Model
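The training code expects X_train_tfidf, X_test_tfidf, y_train, and y_test to exist already, but this README does not show how they are built. One plausible way to produce them is sketched here with scikit-learn's train_test_split and TfidfVectorizer. The helper name split_and_vectorize, the 80/20 split, and the default vectorizer settings are all assumptions; the values actually used may be defined in config.json or ml.py.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split


def split_and_vectorize(titles, labels, test_size=0.2, random_state=42):
    """Split cleaned titles into train/test sets and build TF-IDF features."""
    X_train, X_test, y_train, y_test = train_test_split(
        titles, labels, test_size=test_size, random_state=random_state
    )
    vectorizer = TfidfVectorizer()
    # Fit on the training titles only so the test set does not
    # influence the vocabulary or the IDF weights
    X_train_tfidf = vectorizer.fit_transform(X_train)
    X_test_tfidf = vectorizer.transform(X_test)
    return X_train_tfidf, X_test_tfidf, y_train, y_test, vectorizer
```

With a cleaned DataFrame this could be called as `split_and_vectorize(df['title'], df['label'])`, where `label` is a hypothetical name for the target column.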
# Train a linear-kernel SVM on the TF-IDF features
svm_model = SVC(kernel='linear', random_state=42)
svm_model.fit(X_train_tfidf, y_train)
# Evaluate on the held-out test set
y_pred = svm_model.predict(X_test_tfidf)
accuracy = accuracy_score(y_test, y_pred)
print(f"SVM Accuracy: {accuracy:.4f}")
print(classification_report(y_test, y_pred))
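The file list mentions model.pkl and tfidf.pkl, while the snippet above stops at evaluation. The sketch below shows one way the two files could be written with the standard pickle module. save_artifacts and its out_dir parameter are hypothetical names, and the vectorizer passed in is assumed to be the fitted TF-IDF vectorizer used to build the training features.

```python
import pickle
from pathlib import Path


def save_artifacts(model, vectorizer, out_dir="."):
    """Pickle the trained model and fitted vectorizer into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = {"model": out / "model.pkl", "tfidf": out / "tfidf.pkl"}
    for key, obj in (("model", model), ("tfidf", vectorizer)):
        with paths[key].open("wb") as f:
            pickle.dump(obj, f)
    # Return the written paths so callers can log or verify them
    return paths
```

Saving the vectorizer alongside the model matters because new text must be transformed with the same vocabulary and IDF weights the model was trained on.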