yitingliii committed on
Commit 39757a1 · verified · 1 Parent(s): 7044552

Update README.md

  1. README.md +16 -25
README.md CHANGED
@@ -1,36 +1,35 @@
  # SVM Model with TF-IDF
  Step by step instruction:
- 1. install required packages:
- <br>Before running the code, install required packages.

  ```python
  pip install nltk beautifulsoup4 scikit-learn pandas
  ```
- <br> Download necessary packages.
  ```python
  import nltk
  nltk.download('stopwords')
  nltk.download('wordnet')

- from nltk.corpus import stopwords
- from nltk.stem import WordNetLemmatizer
- from bs4 import BeautifulSoup
- import re
- import pandas as pd
- from sklearn.svm import SVC
  ```

- 2. Data Cleaning
- <br> The clean() function performs data preprocessing to prepare the input data for training. This includes:
- - Removing HTML tags using BeautifulSoup.
- - Removing non-alphanumeric characters and extra spaces.
- - Converting text to lowercase.
- - Removing stopwords using NLTK.
- - Lemmatizing words using WordNetLemmatizer.

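For orientation, the cleaning steps listed above can be approximated without third-party packages. This sketch is not the repository's `clean()`, whose source is not shown in this diff. It strips tags with a regular expression instead of BeautifulSoup, uses a tiny hypothetical stopword set instead of NLTK's list, omits lemmatization, and works on a single string rather than a DataFrame. The name `clean_text` is an assumption made for illustration.

```python
import re

# Hypothetical, deliberately tiny stopword set; the README relies on NLTK's English list.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "on", "for"}

TAG_RE = re.compile(r"<[^>]+>")                 # crude stand-in for BeautifulSoup
NON_ALNUM_RE = re.compile(r"[^A-Za-z0-9\s]")    # keeps ASCII letters, digits, whitespace


def clean_text(text):
    """Approximate the listed cleaning steps for one string (no lemmatization)."""
    text = TAG_RE.sub(" ", text)         # remove HTML tags
    text = NON_ALNUM_RE.sub(" ", text)   # remove non-alphanumeric characters
    text = text.lower()                  # convert to lowercase
    tokens = text.split()                # split() also collapses extra spaces
    tokens = [t for t in tokens if t not in STOPWORDS]  # remove stopwords
    return " ".join(tokens)
```

Note that the ASCII-only character class also drops accented letters; a production version would likely need a wider class or the original BeautifulSoup and NLTK pipeline.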
  ```python
  from data_cleaning import clean

  # Load your data
  df = pd.read_csv('test_data_random_subset.csv')
@@ -40,13 +39,5 @@ cleaned_df = clean(df)

  ```

- 3. run the SVM model
- ```python
- svm_model = SVC(kernel='linear', random_state=42)
- svm_model.fit(X_train_tfidf, y_train)
- y_pred = svm_model.predict(X_test_tfidf)
- accuracy = accuracy_score(y_test, y_pred)
- print(f"SVM Accuracy: {accuracy:.4f}")
- print(classification_report(y_test, y_pred))
- ```

  # SVM Model with TF-IDF
  Step by step instruction:
+ ## Installation
+ <br>Before running the code, ensure you have all the required libraries installed:

  ```python
  pip install nltk beautifulsoup4 scikit-learn pandas
  ```
+ <br> Download necessary NLTK resources for preprocessing.
  ```python
  import nltk
  nltk.download('stopwords')
  nltk.download('wordnet')

  ```
+ # How to Use:
+ 1. Pre-Trained Model and Vectorizer
+ <br> The repository includes:
+ - model.pkl: The pre-trained SVM model.
+ - tfidf.pkl: The saved TF-IDF vectorizer used to transform the text data.

+ 2. Testing a new dataset
+ <br> To test the model with a new dataset, follow these steps:
+ - Step 1: Prepare the dataset:
+ <br> Ensure the dataset is in CSV format and has three columns: title, outlet, and labels. The title column contains the text data to be classified.

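A quick header check can catch a malformed file before preprocessing. The sketch below uses only the standard library `csv` module and assumes the three column names are spelled exactly `title`, `outlet`, and `labels`. The helper name `missing_columns` is hypothetical and not part of the repository.

```python
import csv
import io

REQUIRED_COLUMNS = ("title", "outlet", "labels")  # names taken from the README text


def missing_columns(csv_file):
    """Return the required columns that are absent from the CSV header row."""
    reader = csv.reader(csv_file)
    header = next(reader, [])                    # an empty file yields an empty header
    present = {name.strip() for name in header}
    return [col for col in REQUIRED_COLUMNS if col not in present]


# Usage with an in-memory file; for a real file, pass open(path, newline="").
sample = io.StringIO("title,outlet,labels\nSome headline,Some outlet,0\n")
print(missing_columns(sample))  # prints []
```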
+ - Step 2: Preprocess the Data
+ <br>Use the clean() function from data_cleaning.py to preprocess the text data:

  ```python
  from data_cleaning import clean
+ import pandas as pd

  # Load your data
  df = pd.read_csv('test_data_random_subset.csv')
@@ -40,13 +39,5 @@ cleaned_df = clean(df)

  ```

+ - Step 3: Load the pre-trained model and TF-IDF Vectorizer
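A minimal sketch of the loading step, assuming `model.pkl` and `tfidf.pkl` were written with Python's `pickle` module. The diff does not show how they were saved; if `joblib` was used, `joblib.load` would be the matching call. The helper names `load_pickle` and `predict_labels` are hypothetical. Unpickling the real files requires scikit-learn to be installed, and pickle files should only be loaded from sources you trust.

```python
import pickle
from pathlib import Path


def load_pickle(path):
    """Load a single pickled object from disk."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


def predict_labels(model, vectorizer, texts):
    """Vectorize cleaned texts with the saved TF-IDF vectorizer, then classify.

    transform() is used rather than fit_transform(), so the vocabulary
    learned at training time is reused unchanged.
    """
    features = vectorizer.transform(texts)
    return model.predict(features)
```

With the repository files, usage would look like `model = load_pickle('model.pkl')` and `vectorizer = load_pickle('tfidf.pkl')`, followed by `predict_labels(model, vectorizer, cleaned_titles)`, where `cleaned_titles` stands for the list of preprocessed title strings.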