# SVM Model with TF-IDF

Step-by-step instructions:

## 1. Install required packages

Before running the code, install the required packages:

```bash
pip install nltk beautifulsoup4 scikit-learn pandas
```

Then download the NLTK resources used for cleaning and import the dependencies:

```python
import re

import nltk
import pandas as pd
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.svm import SVC

# One-time download of the NLTK data used by clean()
nltk.download('stopwords')
nltk.download('wordnet')
```

## 2. File Description

- `config.json`: Configuration file for model and dataset parameters.
- `ml.py`: Python script containing the machine learning pipeline.
- `model.pkl`: Trained SVM model saved as a pickle file.
- `tfidf.pkl`: TF-IDF vectorizer saved as a pickle file.
- `README.md`: Documentation for the repository.

## 3. Data Cleaning

The `clean()` function preprocesses the input data before training. It:

- Removes HTML tags using BeautifulSoup.
- Removes non-alphanumeric characters and extra spaces.
- Converts text to lowercase.
- Removes stopwords using NLTK.
- Lemmatizes words using `WordNetLemmatizer`.
- Drops rows whose cleaned title duplicates an earlier one.

```python
def clean(df):
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()

    cleaned_headlines = []
    for headline in df['title']:
        # Strip HTML markup and keep only the visible text
        headline = BeautifulSoup(headline, 'html.parser').get_text()
        # Drop everything except letters, digits, and whitespace
        headline = re.sub(r'[^a-zA-Z0-9\s]', '', headline)
        # Collapse runs of whitespace into a single space
        headline = re.sub(r'\s+', ' ', headline).strip()
        headline = headline.lower()
        # Remove stopwords and lemmatize the remaining words
        words = headline.split()
        words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
        cleaned_headlines.append(' '.join(words))

    df['title'] = cleaned_headlines
    # Titles that become identical after cleaning are kept only once
    df.drop_duplicates(subset=['title'], inplace=True)
    return df
```

## 4. Run the SVM model

Train a linear-kernel SVM on the TF-IDF features, then print the test accuracy and a per-class classification report:

```python
from sklearn.metrics import accuracy_score, classification_report

# Train a linear-kernel SVM on the TF-IDF features
svm_model = SVC(kernel='linear', random_state=42)
svm_model.fit(X_train_tfidf, y_train)

# Evaluate on the held-out test set
y_pred = svm_model.predict(X_test_tfidf)
accuracy = accuracy_score(y_test, y_pred)
print(f"SVM Accuracy: {accuracy:.4f}")
print(classification_report(y_test, y_pred))
```
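
The training snippet uses `X_train_tfidf`, `X_test_tfidf`, `y_train`, and `y_test` without showing where they come from. The following is a minimal sketch of one way to produce them, not necessarily what `ml.py` does. The toy titles, the `labels` list, the 75/25 split, and the default `TfidfVectorizer` settings are all assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the cleaned 'title' column and its labels
titles = [
    "stock market rally continues",
    "central bank raise interest rate",
    "investor fear market crash",
    "bank report quarterly profit",
    "team win championship final",
    "player score winning goal",
    "coach praise young striker",
    "club sign star player",
]
labels = ["business"] * 4 + ["sport"] * 4

# Hold out 25% of the rows for testing (assumed ratio), keeping class balance
X_train, X_test, y_train, y_test = train_test_split(
    titles, labels, test_size=0.25, random_state=42, stratify=labels
)

# Fit the vocabulary and IDF weights on the training titles only,
# then reuse the fitted vectorizer to transform the test titles
tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

print(X_train_tfidf.shape, X_test_tfidf.shape)
```

Fitting the vectorizer on the training split alone keeps test-set vocabulary out of the learned features. Both matrices share the same number of columns, so they can be passed directly to `svm_model.fit` and `svm_model.predict`.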