{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"# Imports\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"import ipywidgets as widgets\n",
"from wordcloud import WordCloud, STOPWORDS"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Title</th>\n",
" <th>Publisher</th>\n",
" <th>DateTime</th>\n",
" <th>Link</th>\n",
" <th>Category</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Chainlink (LINK) Falters, Hedera (HBAR) Wobble...</td>\n",
" <td>Analytics Insight</td>\n",
" <td>2023-08-30T06:54:49Z</td>\n",
" <td>https://news.google.com/articles/CBMibGh0dHBzO...</td>\n",
" <td>Business</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Funds punished for owning too few Nvidia share...</td>\n",
" <td>ZAWYA</td>\n",
" <td>2023-08-30T07:15:59Z</td>\n",
" <td>https://news.google.com/articles/CBMigwFodHRwc...</td>\n",
" <td>Business</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Crude oil prices stalled as hedge funds sold: ...</td>\n",
" <td>ZAWYA</td>\n",
" <td>2023-08-30T07:31:31Z</td>\n",
" <td>https://news.google.com/articles/CBMibGh0dHBzO...</td>\n",
" <td>Business</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Grayscale's Bitcoin Win Is Still Only Half the...</td>\n",
" <td>Bloomberg</td>\n",
" <td>2023-08-30T10:38:40Z</td>\n",
" <td>https://news.google.com/articles/CBMib2h0dHBzO...</td>\n",
" <td>Business</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>I'm a Home Shopping Editor, and These Are the ...</td>\n",
" <td>Better Homes & Gardens</td>\n",
" <td>2023-08-30T11:00:00Z</td>\n",
" <td>https://news.google.com/articles/CBMiPWh0dHBzO...</td>\n",
" <td>Business</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Title Publisher \\\n",
"0 Chainlink (LINK) Falters, Hedera (HBAR) Wobble... Analytics Insight \n",
"1 Funds punished for owning too few Nvidia share... ZAWYA \n",
"2 Crude oil prices stalled as hedge funds sold: ... ZAWYA \n",
"3 Grayscale's Bitcoin Win Is Still Only Half the... Bloomberg \n",
"4 I'm a Home Shopping Editor, and These Are the ... Better Homes & Gardens \n",
"\n",
" DateTime Link \\\n",
"0 2023-08-30T06:54:49Z https://news.google.com/articles/CBMibGh0dHBzO... \n",
"1 2023-08-30T07:15:59Z https://news.google.com/articles/CBMigwFodHRwc... \n",
"2 2023-08-30T07:31:31Z https://news.google.com/articles/CBMibGh0dHBzO... \n",
"3 2023-08-30T10:38:40Z https://news.google.com/articles/CBMib2h0dHBzO... \n",
"4 2023-08-30T11:00:00Z https://news.google.com/articles/CBMiPWh0dHBzO... \n",
"\n",
" Category \n",
"0 Business \n",
"1 Business \n",
"2 Business \n",
"3 Business \n",
"4 Business "
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Data Ingestion\n",
"df = pd.read_csv(\"../dataset/news_dataset.csv\")\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Text(0.5, 1.0, 'Category Distribution')"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 1000x500 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Distribution bar plot (count plot)\n",
"# Compute the per-category counts once and reuse them for both axes\n",
"category_counts = df[\"Category\"].value_counts()\n",
"plt.figure(figsize=(10, 5))\n",
"sns.barplot(x=category_counts.index, y=category_counts.values)\n",
"plt.ylabel(\"Number of News\")\n",
"plt.title(\"Category Distribution\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**There is no extreme class imbalance, although \"Health\" and \"Science\" each have roughly half as many articles as \"Sports\" (the majority category).**"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "8368a3df9eea413b99d2d0c5876fbcf6",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"interactive(children=(Dropdown(description='category', options=('Business', 'Entertainment', 'Headlines', 'Hea…"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Word cloud\n",
"categories = df[\"Category\"].unique().tolist()\n",
"\n",
"\n",
"@widgets.interact(category=categories)\n",
"def display_categorical_plots(category=categories[0]):\n",
"    # Draw a reproducible sample of 100 titles from the selected category\n",
"    subset = df[df[\"Category\"] == category].sample(n=100, random_state=42)\n",
"    text = subset[\"Title\"].values\n",
"    cloud = WordCloud(stopwords=STOPWORDS, background_color=\"black\", collocations=False, width=600, height=400).generate(\" \".join(text))\n",
"    plt.axis(\"off\")\n",
"    plt.imshow(cloud)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**The word cloud immediately shows that some redundant keywords, such as \"New\", appear frequently across different categories.**<br>\n",
"We can also see some action verbs, adjectives, and adverbs that should be removed to some extent before training the model.<br>\n",
"Other than that, the words in each cloud match the respective category name closely.<br><br>\n",
"The \"Headlines\" category contains mixed words, which is expected because a headline can be breaking news from any category. We will therefore hold out those instances as a test set without targets, just to analyze how the headlines are distributed across the other categories."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "news_venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
}
},
"nbformat": 4,
"nbformat_minor": 2
}