---
{}
---

# Model Card for Random Forest Headline Classifier

<!-- Provide a quick summary of what the model is/does. -->

NOTE: This is NOT our final model. It is one of the secondary models we explored while developing our final model; the final model is in the GBTrees repository on Hugging Face.

## Model Details

This model classifies news headlines as coming from either NBC News or Fox News.

### Model Description

<!-- Provide a longer summary of what this model is. -->

- **Developed by:** Jack Bader, Kaiyuan Wang, Pairan Xu
- **Task:** Binary classification (NBC News vs. Fox News)
- **Preprocessing:** TF-IDF vectorization applied to the headline text (see the training sketch below)
  - `stop_words="english"`
  - `max_features=1000`
- **Model type:** Random Forest
- **Framework:** scikit-learn
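
The following is a minimal sketch of how a model with the settings above could be trained. It is illustrative only: the training file name, the `title`/`label` column names (taken from the evaluation code further below), and the Random Forest hyperparameters are assumptions, not the exact training setup.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical training file; column names mirror the evaluation code below.
train_df = pd.read_csv("train_data.csv", encoding="Windows-1252")

X_train, X_val, y_train, y_val = train_test_split(
    train_df["title"], train_df["label"], test_size=0.2, random_state=42
)

# TF-IDF settings listed in the model description above
tfidf_vectorizer = TfidfVectorizer(stop_words="english", max_features=1000)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_val_tfidf = tfidf_vectorizer.transform(X_val)

# Random Forest classifier; these hyperparameters are illustrative defaults
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train_tfidf, y_train)

print("Validation accuracy:", rf_model.score(X_val_tfidf, y_val))
```

The fitted vectorizer and model could then be serialized with `joblib.dump` to produce artifacts like the `tfidf_vectorizer.pkl` and `best_rf_model.pkl` files used in the evaluation code below.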

#### Metrics

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

- Accuracy score
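
As a quick reference, accuracy is the fraction of headlines whose predicted outlet matches the true label. The toy example below (with made-up labels, where 0 and 1 stand for the two outlets) only illustrates how the metric is computed.

```python
from sklearn.metrics import accuracy_score

# Made-up labels purely to illustrate the metric; not real evaluation results.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

print(accuracy_score(y_true, y_pred))  # 4 correct out of 5 -> 0.8
```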

### Model Evaluation

```python
import pandas as pd
import joblib
from huggingface_hub import hf_hub_download
from sklearn.metrics import classification_report

# Mount Google Drive (this snippet is written for Google Colab)
from google.colab import drive
drive.mount('/content/drive')

# Load the test set
test_df = pd.read_csv("/content/drive/MyDrive/test_data_random_subset.csv", encoding="Windows-1252")

# Log in with a Hugging Face token
# The token can be found in the repo as Token.docx
!huggingface-cli login

# Download the trained model from the Hub
model_path = hf_hub_download(repo_id="CIS5190FinalProj/RandomForest", filename="best_rf_model.pkl")

# Download the fitted TF-IDF vectorizer
vectorizer_path = hf_hub_download(repo_id="CIS5190FinalProj/RandomForest", filename="tfidf_vectorizer.pkl")

# Load the model
pipeline = joblib.load(model_path)

# Load the vectorizer
tfidf_vectorizer = joblib.load(vectorizer_path)

# Extract the headlines from the test set
X_test = test_df['title']

# Transform the headlines into TF-IDF features
X_test_transformed = tfidf_vectorizer.transform(X_test)

# Make predictions with the loaded model
y_pred = pipeline.predict(X_test_transformed)

# Extract the labels as the target
y_test = test_df['label']

# Print the classification report (includes accuracy)
print(classification_report(y_test, y_pred))
```
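
Once the vectorizer and model have been loaded as above, they can also be used to score individual headlines. A minimal sketch follows; the example headline is made up, and the mapping of numeric labels to NBC News vs. Fox News is an assumption that should be checked against how the training labels were encoded.

```python
# Minimal inference sketch, reusing `tfidf_vectorizer` and `pipeline` from above.
# NOTE: which numeric label corresponds to NBC News vs. Fox News is an assumption;
# verify against the training label encoding.
sample_headlines = ["Senate passes budget bill after late-night session"]

sample_features = tfidf_vectorizer.transform(sample_headlines)
print(pipeline.predict(sample_features))
```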