---
{}
---

# Model Card for Random Forest Headline Classifier

<!-- Provide a quick summary of what the model is/does. -->

NOTE: This is NOT our final model. It is one of the secondary models we explored while developing our final model; the final model is in the GBTrees repository on Hugging Face.

## Model Details

This model classifies news headlines as coming from either NBC News or Fox News.

### Model Description

<!-- Provide a longer summary of what this model is. -->

- **Developed by:** Jack Bader, Kaiyuan Wang, Pairan Xu
- **Task:** Binary classification (NBC News vs. Fox News)
- **Preprocessing:** TF-IDF vectorization applied to the headline text (see the training sketch below)
  - `stop_words="english"`
  - `max_features=1000`
- **Model type:** Random Forest
- **Framework:** scikit-learn
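
The following is a minimal sketch of how a model with the settings above could be trained. It is illustrative only: the training file name, the `title`/`label` column names (taken from the evaluation code further below), and the Random Forest hyperparameters are assumptions, not the exact training setup.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical training file; column names mirror the evaluation code below.
train_df = pd.read_csv("train_data.csv", encoding="Windows-1252")

X_train, X_val, y_train, y_val = train_test_split(
    train_df["title"], train_df["label"], test_size=0.2, random_state=42
)

# TF-IDF settings listed in the model description above
tfidf_vectorizer = TfidfVectorizer(stop_words="english", max_features=1000)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_val_tfidf = tfidf_vectorizer.transform(X_val)

# Random Forest classifier; these hyperparameters are illustrative defaults
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train_tfidf, y_train)

print("Validation accuracy:", rf_model.score(X_val_tfidf, y_val))
```

The fitted vectorizer and model could then be serialized with `joblib.dump` to produce artifacts like the `tfidf_vectorizer.pkl` and `best_rf_model.pkl` files used in the evaluation code below.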

#### Metrics

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

- Accuracy score
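
As a quick reference, accuracy is the fraction of headlines whose predicted outlet matches the true label. The toy example below (with made-up labels, where 0 and 1 stand for the two outlets) only illustrates how the metric is computed.

```python
from sklearn.metrics import accuracy_score

# Made-up labels purely to illustrate the metric; not real evaluation results.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

print(accuracy_score(y_true, y_pred))  # 4 correct out of 5 -> 0.8
```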

### Model Evaluation

```python
import pandas as pd
import joblib
from huggingface_hub import hf_hub_download
from sklearn.metrics import classification_report

# Mount Google Drive (this snippet is written for Google Colab)
from google.colab import drive
drive.mount('/content/drive')

# Load the test set
test_df = pd.read_csv("/content/drive/MyDrive/test_data_random_subset.csv", encoding="Windows-1252")

# Log in with a Hugging Face token
# The token can be found in the repo as Token.docx
!huggingface-cli login

# Download the trained model from the Hub
model_path = hf_hub_download(repo_id="CIS5190FinalProj/RandomForest", filename="best_rf_model.pkl")

# Download the fitted TF-IDF vectorizer
vectorizer_path = hf_hub_download(repo_id="CIS5190FinalProj/RandomForest", filename="tfidf_vectorizer.pkl")

# Load the model
pipeline = joblib.load(model_path)

# Load the vectorizer
tfidf_vectorizer = joblib.load(vectorizer_path)

# Extract the headlines from the test set
X_test = test_df['title']

# Transform the headlines into TF-IDF features
X_test_transformed = tfidf_vectorizer.transform(X_test)

# Make predictions with the loaded model
y_pred = pipeline.predict(X_test_transformed)

# Extract the labels as the target
y_test = test_df['label']

# Print the classification report (includes accuracy)
print(classification_report(y_test, y_pred))
```
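
Once the vectorizer and model have been loaded as above, they can also be used to score individual headlines. A minimal sketch follows; the example headline is made up, and the mapping of numeric labels to NBC News vs. Fox News is an assumption that should be checked against how the training labels were encoded.

```python
# Minimal inference sketch, reusing `tfidf_vectorizer` and `pipeline` from above.
# NOTE: which numeric label corresponds to NBC News vs. Fox News is an assumption;
# verify against the training label encoding.
sample_headlines = ["Senate passes budget bill after late-night session"]

sample_features = tfidf_vectorizer.transform(sample_headlines)
print(pipeline.predict(sample_features))
```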