CIS5190abcd
/

svm

yitingliii committed on Dec 13, 2024

Commit

39757a1

verified ·

1 Parent(s): 7044552

Update README.md

Files changed (1)

README.md CHANGED

@@ -1,36 +1,35 @@
 # SVM Model with TF-IDF
 Step by step instruction:
-1. install required packages:
-<br>Before running the code, install required packages.
 ```python
 pip install nltk beautifulsoup4 scikit-learn pandas
 ```
-<br> Download necessary packages.
 ```python
 import nltk
 nltk.download('stopwords')
 nltk.download('wordnet')
-from nltk.corpus import stopwords
-from nltk.stem import WordNetLemmatizer
-from bs4 import BeautifulSoup
-import re
-import pandas as pd
-from sklearn.svm import SVC
 ```
-2. Data Cleaning
-<br> The clean() function performs data preprocessing to prepare the input data for training. This includes:
-- Removing HTML tags using BeautifulSoup.
-- Removing non-alphanumeric characters and extra spaces.
-- Converting text to lowercase.
-- Removing stopwords using NLTK.
-- Lemmatizing words using WordNetLemmatizer.
 ```python
 from data_cleaning import clean
 # Load your data
 df = pd.read_csv('test_data_random_subset.csv')
@@ -40,13 +39,5 @@ cleaned_df = clean(df)
 ```
-3. run the SVM model
-```python
-svm_model = SVC(kernel='linear', random_state=42)
-svm_model.fit(X_train_tfidf, y_train)
-y_pred = svm_model.predict(X_test_tfidf)
-accuracy = accuracy_score(y_test, y_pred)
-print(f"Random Forest Accuracy: {accuracy:.4f}")
-print(classification_report(y_test, y_pred))
-```

 # SVM Model with TF-IDF
 Step by step instruction:
+## Installation
+<br>Before running the code, ensure you have all the required libraries installed:
 ```python
 pip install nltk beautifulsoup4 scikit-learn pandas
 ```
+<br> Download the NLTK resources needed for preprocessing:
 ```python
 import nltk
 nltk.download('stopwords')
 nltk.download('wordnet')
 ```
+## How to Use
+1. Pre-Trained Model and Vectorizer
+<br> The repository includes:
+- model.pkl: The pre-trained SVM model.
+- tfidf.pkl: The saved TF-IDF vectorizer used to transform the text data.
+2. Testing a new dataset
+<br> To test the model on a new dataset, follow these steps:
+- Step 1: Prepare the dataset:
+<br> Ensure the dataset is in CSV format and has three columns: title, outlet, and labels. The title column contains the text data to be classified.
+- Step 2: Preprocess the Data
+<br>Use the clean() function from data_cleaning.py to preprocess the text data:
 ```python
 from data_cleaning import clean
+import pandas as pd
 # Load your data
 df = pd.read_csv('test_data_random_subset.csv')
 ```
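For orientation, the cleaning steps that the earlier README listed for clean() (strip HTML tags, drop non-alphanumeric characters and extra spaces, lowercase, remove stopwords, lemmatize) can be sketched on a single string as below. This is not the code in data_cleaning.py: `clean_text` is a hypothetical helper, a regular expression stands in for BeautifulSoup so the sketch has no third-party dependencies, and the stopword set and lemmatizer are passed in as parameters.

```python
import re


def clean_text(text, stop_words=frozenset(), lemmatize=lambda word: word):
    """Hypothetical single-string version of the cleaning steps (not data_cleaning.clean)."""
    # 1. Strip HTML tags. Regex stand-in for BeautifulSoup, used here only
    #    to keep the sketch dependency-free.
    text = re.sub(r"<[^>]+>", " ", text)
    # 2. Replace non-alphanumeric characters with spaces.
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    # 3. Lowercase; split() also collapses runs of whitespace.
    words = text.lower().split()
    # 4. Drop stopwords, then 5. lemmatize each remaining word.
    return " ".join(lemmatize(w) for w in words if w not in stop_words)
```

With the NLTK resources downloaded above, one would presumably pass `set(stopwords.words('english'))` as `stop_words` and `WordNetLemmatizer().lemmatize` as `lemmatize`.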
+- Step 3: Load the pre-trained model and TF-IDF Vectorizer
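Step 3 might look roughly like the sketch below. It assumes that model.pkl and tfidf.pkl were written with Python's pickle module and that the loaded objects expose the usual scikit-learn `transform` and `predict` methods. The helper names `load_pickle` and `predict_titles` are illustrative and not part of the repository.

```python
import pickle
from pathlib import Path


def load_pickle(path):
    """Load one pickled object from disk. Only unpickle files you trust."""
    with Path(path).open("rb") as f:
        return pickle.load(f)


def predict_titles(model, vectorizer, titles):
    """Transform cleaned titles with the saved vectorizer, then classify them."""
    # Use transform(), not fit_transform(): the vectorizer must keep the
    # vocabulary it learned during training.
    features = vectorizer.transform(titles)
    return model.predict(features)


# Assumed usage with the files named in the README:
#   model = load_pickle("model.pkl")
#   vectorizer = load_pickle("tfidf.pkl")
#   predictions = predict_titles(model, vectorizer, cleaned_df["title"])
# The column name "title" on cleaned_df is an assumption based on Step 1.
```

If the files were saved with joblib rather than pickle, `joblib.load` would take the place of `load_pickle`.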