# SVM Model with TF-IDF
Step-by-step instructions:
1. Install required packages
<br>Before running the code, install the required packages.

```bash
pip install nltk beautifulsoup4 scikit-learn pandas
```
<br> Download the required NLTK resources (stopwords and WordNet) and import the libraries used below.
```python
import nltk
nltk.download('stopwords')
nltk.download('wordnet')

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from bs4 import BeautifulSoup
import re
import pandas as pd
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
```

2. File Description
- config.json: Configuration file for model and dataset parameters.
- ml.py: Python script containing the machine learning pipeline.
- model.pkl: Trained SVM model saved as a pickle file.
- tfidf.pkl: TF-IDF vectorizer saved as a pickle file.
- README.md: Documentation for the repository.

3. Data Cleaning
<br> The `clean()` function preprocesses the headlines in the `title` column to prepare them for training. This includes:
- Removing HTML tags using BeautifulSoup.
- Removing non-alphanumeric characters and extra spaces.
- Converting text to lowercase.
- Removing stopwords using NLTK.
- Lemmatizing words using WordNetLemmatizer.
- Dropping rows whose cleaned headline duplicates an earlier one.


```python
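def clean(df):
    """Clean the 'title' column of df in place and return df."""
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()
    cleaned_headlines = []

    for headline in df['title']:
        # Strip HTML tags, keeping only the visible text
        headline = BeautifulSoup(headline, 'html.parser').get_text()
        # Drop everything except letters, digits, and whitespace
        headline = re.sub(r'[^a-zA-Z0-9\s]', '', headline)
        # Collapse runs of whitespace into a single space
        headline = re.sub(r'\s+', ' ', headline).strip()
        headline = headline.lower()

        # Remove stopwords, then lemmatize the remaining words
        words = headline.split()
        words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]

        cleaned_headline = ' '.join(words)
        cleaned_headlines.append(cleaned_headline)

    df['title'] = cleaned_headlines
    # Headlines that become identical after cleaning are kept only once
    df.drop_duplicates(subset=['title'], inplace=True)

    return df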
```
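The SVM step expects TF-IDF features named `X_train_tfidf` and `X_test_tfidf`, but this README does not show how they are built. The following is a minimal sketch of one way to produce them, assuming a train/test split with `train_test_split` and a default `TfidfVectorizer`. The toy headlines, the `label` column name, and the split ratio are illustrative assumptions, not taken from `ml.py`.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the output of clean();
# the 'label' column name is an assumption.
df = pd.DataFrame({
    "title": [
        "stock market rally continues",
        "team win championship final",
        "central bank raise interest rate",
        "star striker score hat trick",
        "investor fear recession risk",
        "coach announces squad world cup",
    ],
    "label": ["business", "sport", "business", "sport", "business", "sport"],
})

# Split first so the vectorizer never sees the test headlines
X_train, X_test, y_train, y_test = train_test_split(
    df["title"], df["label"],
    test_size=0.33, random_state=42, stratify=df["label"],
)

# Fit the vocabulary and IDF weights on the training set only,
# then reuse them to transform the test set
tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

print(X_train_tfidf.shape, X_test_tfidf.shape)
```

Fitting the vectorizer on the training split alone keeps test-set vocabulary out of the features, so the reported accuracy is not inflated by leakage.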

4. Run the SVM model
```python
# Train a linear-kernel SVM on the TF-IDF features of the training headlines
svm_model = SVC(kernel='linear', random_state=42)
svm_model.fit(X_train_tfidf, y_train)

# Evaluate on the held-out test set
y_pred = svm_model.predict(X_test_tfidf)
accuracy = accuracy_score(y_test, y_pred)
print(f"SVM Accuracy: {accuracy:.4f}")
print(classification_report(y_test, y_pred))
```
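
The file list mentions `model.pkl` and `tfidf.pkl`. Below is a rough sketch of how a trained model and its vectorizer could be pickled and later reloaded for prediction. The toy training data is hypothetical, and the files are written to a temporary directory so the repository's own pickles are not overwritten. How `ml.py` actually saves them may differ.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Hypothetical, already-cleaned training headlines and labels
train_titles = [
    "stock market rally continues",
    "team win championship final",
    "central bank raise interest rate",
    "star striker score hat trick",
]
train_labels = ["business", "sport", "business", "sport"]

tfidf = TfidfVectorizer()
svm_model = SVC(kernel="linear", random_state=42)
svm_model.fit(tfidf.fit_transform(train_titles), train_labels)

# Save both objects; the vectorizer is needed to transform new text
out_dir = tempfile.mkdtemp()
model_path = os.path.join(out_dir, "model.pkl")
tfidf_path = os.path.join(out_dir, "tfidf.pkl")
with open(model_path, "wb") as f:
    pickle.dump(svm_model, f)
with open(tfidf_path, "wb") as f:
    pickle.dump(tfidf, f)

# Later: load them back and classify a new headline
with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)
with open(tfidf_path, "rb") as f:
    loaded_tfidf = pickle.load(f)

new_titles = ["bank cut interest rate"]
predictions = loaded_model.predict(loaded_tfidf.transform(new_titles))
print(predictions)
```

New headlines should go through the same cleaning as the training data before being transformed. Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.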