{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Comparison of Twitter's model of viral tweets with other tweets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook, we try to identify common features of Twitter's identified viral tweets found on the topic page \"Viral Tweets\".\n", "\n", "We also experiment to find if other tweets that have not figured on that topic page, can also be labeled as viral based on these common features. This should help homogeinize the data (those that are viral and those that are not) when training the model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import seaborn as sns\n", "import numpy as np\n", "\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "import plotly.express as px\n", "import plotly.graph_objects as go\n", "from plotly.subplots import make_subplots\n", "\n", "from helper.text_preprocessing import clear_reply_mentions\n", "\n", "from tqdm import tqdm\n", "\n", "#pd.set_option('display.max_rows', None)\n", "pd.set_option('display.max_columns', None)\n", "\n", "DATA_PATH = \"../../data\"\n", "VIRAL_TWEETS_PATH = f\"{DATA_PATH}/new/viral\"\n", "COVID_TWEETS_PATH = f\"{DATA_PATH}/new/covid\"\n", "\n", "PROCESSED_PATH_VIRAL = f'{DATA_PATH}/new/processed/viral'\n", "PROCESSED_PATH_COVID = f'{DATA_PATH}/new/processed/covid'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 0. Preprocessing" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "viral_dataset = pd.read_parquet(f\"{VIRAL_TWEETS_PATH}/all_tweets.parquet.gzip\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "covid_dataset = pd.read_parquet(f\"{COVID_TWEETS_PATH}/all_tweets.parquet.gzip\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "covid_users = pd.read_parquet(f\"{COVID_TWEETS_PATH}/users.parquet.gzip\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Keep only original tweets from **covid dataset**. Viral dataset doesn't have retweets" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def is_retweeted(referenced_tweets):\n", " for x in referenced_tweets:\n", " if x['type'] == 'retweeted':\n", " return True\n", " return False\n", "\n", "# Keep only original tweets\n", "referenced = covid_dataset.loc[~covid_dataset.referenced_tweets.isna()].copy()\n", "referenced.loc[:, 'is_retweet'] = referenced.referenced_tweets.apply(is_retweeted)\n", "retweeted = referenced[referenced.is_retweet]\n", "retweeted" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "original_covid_tweets = covid_dataset[~covid_dataset.id.isin(retweeted.id)]\n", "original_covid_tweets.to_parquet(f\"{COVID_TWEETS_PATH}/all_original_tweets.parquet.gzip\", index=False, compression=\"gzip\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Clear reply mentions at the beginning of tweets texts\n", "original_covid_tweets.loc[:, \"text\"] = original_covid_tweets.text.apply(clear_reply_mentions)\n", "viral_dataset.loc[:, \"text\"] = viral_dataset.text.apply(clear_reply_mentions)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. 
Exploration" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1 - General Exploration" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "viral_dataset = pd.read_parquet(f\"{VIRAL_TWEETS_PATH}/all_tweets.parquet.gzip\")\n", "viral_users = pd.read_parquet(f\"{VIRAL_TWEETS_PATH}/users.parquet.gzip\")\n", "viral_tweets_ids = pd.read_parquet(f\"{VIRAL_TWEETS_PATH}/viral_tweets_ids.parquet.gzip\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "original_covid_tweets = pd.read_parquet(f\"{COVID_TWEETS_PATH}/all_original_tweets.parquet.gzip\")\n", "covid_users = pd.read_parquet(f\"{COVID_TWEETS_PATH}/users.parquet.gzip\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "display(\"--- VIRAL DATASET ---\")\n", "\n", "display(f\"{len(viral_tweets_ids)} viral tweets collected\")\n", "display(f\"{len(viral_users)} viral users\")\n", "display(f\"{len(viral_dataset)} all tweets collected\")\n", "\n", "display(\"--- COVID DATASET ---\")\n", "\n", "display(f\"{len(original_covid_tweets)} original (not retweeted) covid tweets collected\")\n", "display(f\"{len(original_covid_tweets.author_id.unique())} covid users collected\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# REMOVE THIS WHEN DONE COLLECTION (WARNING NOT NECESSARILY)\n", "viral_dataset['viral'] = viral_dataset.id.isin(viral_tweets_ids.id)\n", "\n", "#viral_tweets = all_tweets[all_tweets.id.isin(viral_tweets.id)]\n", "#viral_tweets\n", "\n", "len(viral_dataset[viral_dataset.viral])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- merge tweets with user info" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "covid_users.columns" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "user_columns = ['author_id', 'followers_count', 'following_count', 'tweet_count', 'protected', 'verified', 'username']\n", "viral_dataset_with_users = viral_dataset.merge(viral_users.rename(columns={'id': 'author_id'})[user_columns], on='author_id')\n", "covid_dataset_with_users = original_covid_tweets.merge(covid_users.rename(columns={'id': 'author_id'})[user_columns], on='author_id')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.1.1 - Correlation between public metrics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Pearson Correlation between the different public metrics" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "public_metrics = ['retweet_count', 'like_count', 'reply_count', 'quote_count', 'followers_count', 'following_count']\n", "display(viral_dataset_with_users[public_metrics].corr())\n", "display(covid_dataset_with_users[public_metrics].corr())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "px.scatter(viral_dataset, x='like_count', y='retweet_count')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.1.2 - Exploring retweet count of viral vs non viral tweets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since we have a large number of tweets to plot, we'll only sample a few from each user" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_largest_n(all_tweets, by='retweet_count', n=100):\n", " '''Get the largest 100 
tweets by retweet count for every user\n", " '''\n", " top_n_per_user = all_tweets.groupby(by='author_id')[by].nlargest(n=n).reset_index(level=0, drop=True)\n", " tweets_for_plot = all_tweets[all_tweets.index.isin(top_n_per_user.index)].reset_index()\n", " return tweets_for_plot" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tweets_plot_df = get_largest_n(viral_dataset, by='retweet_count')\n", "fig = px.scatter(tweets_plot_df, x=tweets_plot_df.index, y='retweet_count', color='viral')\n", "\n", "fig.update_layout(title_text=\"Viral Dataset: Scatter plot of the retweet count for the top 100 tweets per user\", xaxis_title=\"Index\", yaxis_title=\"retweet count\")\n", "\n", "fig.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "covid_tweets_plot_df = original_covid_tweets.sort_values(by='retweet_count', ascending=False)[:10000]\n", "fig = px.scatter(covid_tweets_plot_df, x=covid_tweets_plot_df.reset_index().index, y='retweet_count')\n", "\n", "fig.update_layout(title_text=\"Covid Dataset: Scatter plot of the retweet count for the top 10000 tweets, sorted by retweet count\", xaxis_title=\"Index\", yaxis_title=\"retweet count\")\n", "\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Finding**: Viral tweets identified by Twitter are by no means more retweeted than other tweets posted by the same users. Are users who have tweeted viral tweets (as identified by Twitter) likely to have tweeted other viral tweets?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get the ratio of each tweet's retweet count wrt the mean retweet count of the user's tweets\n", "# Again, since we've retrieved 3200 tweets per user, we only take the average over those\n", "users_avg_retweets = viral_dataset.groupby(by='author_id').agg(mean_retweets=('retweet_count', 'mean'))\n", "tweets_merged_avg_retweets = viral_dataset.merge(right=users_avg_retweets, left_on='author_id', right_index=True)\n", "tweets_merged_avg_retweets['ratio_avg_retweets'] = tweets_merged_avg_retweets['retweet_count'] / tweets_merged_avg_retweets['mean_retweets']\n", "tweets_merged_avg_retweets_sorted = tweets_merged_avg_retweets.sort_values(by='ratio_avg_retweets').reset_index()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tweets_plot_df = get_largest_n(tweets_merged_avg_retweets_sorted, by='ratio_avg_retweets')\n", "\n", "fig = px.scatter(tweets_plot_df, x=tweets_plot_df.index, y='ratio_avg_retweets', color='viral')\n", "\n", "fig.update_layout(title_text=\"Scatter plot of the tweets sorted by the ratio #retweets / (user's mean #retweets)\", xaxis_title=\"Index\", yaxis_title=\"ratio\")\n", "\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Finding**: Cleaner separation. Viral tweets, as expected, sit at the far end of the spectrum. However, other tweets in the same range could qualify as viral as well; these tweets should be labeled as viral just like the ones identified by the Twitter model." 
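] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a rough sanity check, the sketch below (reusing the `tweets_merged_avg_retweets_sorted` frame computed above) counts how many tweets that Twitter did *not* label as viral fall in the same ratio range as the labeled viral tweets." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sketch: quantify how many non-viral tweets sit in the same ratio range as the Twitter-labeled viral ones\n", "viral_ratios = tweets_merged_avg_retweets_sorted[tweets_merged_avg_retweets_sorted.viral]['ratio_avg_retweets']\n", "min_viral_ratio = viral_ratios.min()\n", "\n", "in_viral_range = tweets_merged_avg_retweets_sorted[\n", " (~tweets_merged_avg_retweets_sorted.viral)\n", " & (tweets_merged_avg_retweets_sorted['ratio_avg_retweets'] >= min_viral_ratio)\n", "]\n", "display(f\"{len(in_viral_range)} non-viral tweets have a ratio >= {min_viral_ratio:.2f}, the smallest ratio among Twitter-labeled viral tweets\")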
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2 Finding the right threshold for virality" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.2.0 - Relabel viral tweets in the viral dataset by correcting the initial virality threshold (ONLY IN OLD PAPER SUBMITTED BY STUDENT)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's observe the retweet count of a user based on the tweet date." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sample_user = viral_users.id[10]\n", "author_tweets = viral_dataset[viral_dataset.author_id == sample_user]\n", "fig = px.scatter(author_tweets, x='created_at', y='retweet_count', color='viral')\n", "\n", "fig.update_layout(title_text=\"Scatter plot of the retweet count wrt to the tweet date for a single user\")\n", "\n", "fig.show() " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Finding**: The above graph of a user's retweet count wrt the tweet date, shows that the viral tweets taken from the Twitter \"Viral Tweets\" topic page, have been taken at certain points in time. **Other tweets with higher retweet counts** may have been on that Topic page at different points in time as well. In any case, they **should be qualified as viral all the same**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One quick fix for that is, for each user, mark as viral all tweets that have higher retweet count than the viral tweet we scraped for that user. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get the minimum retweet count out of the viral tweets for each user\n", "min_retweet_count_by_user = viral_dataset[viral_dataset.viral].groupby(by='author_id')[['retweet_count']].min()\n", "\n", "# Set as viral any tweet that has a retweet count higher or equal to the user's minimum retweet count we just computed\n", "viral_dataset_labeled = viral_dataset.merge(min_retweet_count_by_user, left_on='author_id', right_index=True, suffixes=(None, \"_user_viral_threshold\"))\n", "viral_dataset_labeled['viral'] = viral_dataset_labeled['retweet_count'] >= viral_dataset_labeled['retweet_count_user_viral_threshold']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Save this result \n", "#viral_dataset_labeled.to_parquet(f'{PROCESSED_PATH_VIRAL}/all_tweets.parquet.gzip', compression='gzip')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "display(f\"Number of identified viral tweets increased from {len(viral_tweets_ids)} to {len(viral_dataset_labeled[viral_dataset_labeled.viral])}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another problem we're facing is that we're **missing historical data** on the number of followers of a user. So we cannot use the metric of:\n", "$ \\frac{\\#retweets}{\\#followers}$ effectively. That's why we came up with the other metric: $\\frac{\\#retweets}{mean(\\#retweets)}$." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.2.1 Applying the virality followers metric to both datasets" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Applying the first metric on the covid dataset\n", "covid_dataset_with_users['virality_followers'] = covid_dataset_with_users['retweet_count'] / covid_dataset_with_users['followers_count'].astype(\"float64\")\n", "# Handle division by zero if user has 0 followers\n", "covid_dataset_with_users['virality_followers'] = covid_dataset_with_users.virality_followers.replace({np.inf: 0.0})" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(covid_dataset_with_users[(covid_dataset_with_users['virality_followers'] > 0.8)])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Applying the second metric on the viral dataset\n", "viral_dataset_with_users['virality_followers'] = viral_dataset_with_users['retweet_count'] / viral_dataset_with_users['followers_count'].astype(\"float64\")\n", "# Handle division by zero if user has 0 followers\n", "viral_dataset_with_users['virality_followers'] = viral_dataset_with_users.virality_followers.replace({np.inf: 0.0})" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(viral_dataset_with_users[(viral_dataset_with_users['virality_followers'] > 1)])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.2.2 Applying the virality avg retweets metric to viral dataset " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "viral_users_retweet_statistics = viral_dataset_with_users.groupby(by='author_id').retweet_count.agg(['min', 'mean', 'max'])\n", "viral_users_retweet_statistics = viral_users_retweet_statistics.rename(columns={\"min\": \"min_user_retweets\", \"max\": \"max_user_retweets\", \"mean\": \"mean_user_retweets\"})" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "viral_dataset_with_users = viral_dataset_with_users.merge(viral_users_retweet_statistics, on='author_id')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Applying the first metric on the viral dataset\n", "viral_dataset_with_users['virality_avg_retweets'] = viral_dataset_with_users['retweet_count'] / viral_dataset_with_users['mean_user_retweets'].astype(\"float64\")\n", "# Handle division by zero if user has 0 followers\n", "viral_dataset_with_users['virality_avg_retweets'] = viral_dataset_with_users.virality_avg_retweets.replace({np.inf: 0.0})" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(viral_dataset_with_users[(viral_dataset_with_users['virality_avg_retweets'] > 1)])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.2.3 How many tweets are covered by metric 1?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "temp = viral_dataset_with_users[viral_dataset_with_users.virality_followers > 0]\n", "temp_2 = viral_dataset_with_users[viral_dataset_with_users.virality_avg_retweets > 0]\n", "viral_temp = viral_dataset_with_users[viral_dataset_with_users.viral]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = px.ecdf(viral_dataset_with_users[viral_dataset_with_users.viral], x='virality_followers')\n", "\n", "# TODO: percentage y axis\n", "# TODO: Only take the scraped tweets\n", "fig.update_layout(title_text=\"Percentage of viral tweets recognized by Metric 1: number of followers\", xaxis_title=\"Metric 1: virality_followers\", yaxis_title=\"Percentage\")\n", "\n", "fig.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig1 = sns.displot(temp, x='virality_followers', kind='ecdf')\n", "\n", "plt.xscale('log')\n", "plt.title(\"Proportion of tweets labeled as viral as function of Metric 1: number of followers (logscale)\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "temp_2 = viral_dataset_with_users[viral_dataset_with_users.virality_avg_retweets > 0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = px.ecdf(viral_temp, x='virality_avg_retweets')\n", "\n", "fig.update_layout(title_text=\"Percentage of viral tweets recognized by Metric 2 avg retweets\", xaxis_title=\"Metric 2: avg retweets\", yaxis_title=\"Percentage\")\n", "\n", "fig.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = sns.displot(temp_2, x='virality_avg_retweets', kind='ecdf')\n", "\n", "plt.xscale('log')\n", "plt.title(\"Proportion of tweets labeled as viral as function of Metric 2: avg retweets (logscale)\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "TODO: Plot the percentage of viral tweets labeled vs the # of new tweets labeled wrt to the varying threshold of the metric we use. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#viral_dataset_with_users = viral_dataset_with_users.groupby(by='virality_followers').count()\n", "viral_dataset_with_users = pd.read_parquet(f\"{PROCESSED_PATH_VIRAL}/all_tweets.parquet.gzip\")\n", "# Applying the second metric on the viral dataset\n", "viral_dataset_with_users['virality_followers'] = viral_dataset_with_users['retweet_count'] / viral_dataset_with_users['followers_count'].astype(\"float64\")\n", "# Handle division by zero if user has 0 followers\n", "viral_dataset_with_users['virality_followers'] = viral_dataset_with_users.virality_followers.replace({np.inf: 0.0})\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "viral_dataset_with_users" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "viral_dataset_with_users_truncated = viral_dataset_with_users[viral_dataset_with_users.virality_followers > 0.1]\n", "#viral_dataset_with_users['viral_metric_1'] = viral_dataset_with_users['']\n", "len(viral_dataset_with_users_truncated)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ready_to_plot = viral_dataset_with_users_truncated.copy()\n", "ready_to_plot['viral'] = ready_to_plot['viral'].replace({False: None})\n", "ready_to_plot = ready_to_plot.groupby(by='virality_followers').count()[['text', 'viral']].cumsum().rename(columns={'text':'tweets'})" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = px.line(ready_to_plot, x='viral', y='tweets', hover_data=[ready_to_plot.index])#, log_y=True)\n", "\n", "fig.update_layout(title_text=\"Line plot of #viral tweets labeled as viral vs # new tweets labeled as viral by varying threshold of Metric 1 (#followers)\", xaxis_title=\"Number of viral tweets labeled as viral\", yaxis_title=\"Number of new tweets labeled as viral\")\n", "fig.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ready_to_plot = viral_dataset_with_users_truncated.copy()\n", "ready_to_plot['viral'] = ready_to_plot['viral'].replace({False: None})\n", "ready_to_plot = ready_to_plot.groupby(by='virality_followers').count()[['text', 'viral']].cumsum().rename(columns={'text':'tweets'})\n", "ready_to_plot['tweets'] = len(viral_dataset_with_users) - ready_to_plot.tweets" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = px.line(ready_to_plot, x='viral', y='tweets', hover_data=[ready_to_plot.index])#, log_y=True)\n", "\n", "fig.update_layout(title_text=\"Line plot of #viral tweets labeled as viral vs # new tweets labeled as viral by varying threshold of Metric 1 (#followers)\", xaxis_title=\"Number of viral tweets labeled as viral\", yaxis_title=\"Number of new tweets labeled as viral\")\n", "fig.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "'''\n", "tempo3 = tempo2.copy()\n", "tempo3['viral'] = tempo3['viral'].replace({False: None})\n", "tempo3 = tempo3.groupby(by='virality_followers').count()[['text', 'viral']].rename(columns={'text':'tweets'})\n", "tempo3['viral_cumsum'] = tempo3.viral.cumsum()\n", "tempo3\n", "'''" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "min_threshold = viral_dataset_with_users.virality_followers.min()\n", "max_threshold = 
viral_dataset_with_users.virality_followers.max()\n", "display(f\"sampling from {min_threshold} to {max_threshold}\")\n", "thresholds_space = np.linspace(min_threshold, max_threshold, num=10000)\n", "\n", "number_of_viral_tweets = len(viral_dataset_with_users[viral_dataset_with_users.viral]) \n", "\n", "percentages_of_viral_covered = []\n", "nb_of_tweets_labeled_as_viral = []\n", "\n", "for i in thresholds_space:\n", " new_tweets_labeled = viral_dataset_with_users[viral_dataset_with_users.virality_followers >= i]\n", " percentage_of_viral_covered = len(new_tweets_labeled[new_tweets_labeled.viral]) / number_of_viral_tweets\n", " nb_of_tweets_labeled_as_viral.append(len(new_tweets_labeled))\n", " percentages_of_viral_covered.append(percentage_of_viral_covered)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result_to_plot = pd.DataFrame({'percentage_of_viral_covered':percentages_of_viral_covered, 'nb_of_tweets_labeled_as_viral':nb_of_tweets_labeled_as_viral, 'thresholds': thresholds_space})\n", "\n", "px.scatter(\n", " result_to_plot,\n", " x='percentage_of_viral_covered',\n", " y='nb_of_tweets_labeled_as_viral', log_y=True, hover_name='thresholds')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result_to_plot.to_csv('new_tweets_labeled_vs_percentage_of_viral.csv', index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.2.4 Comparing several metrics wrt distributions of viral tweets covered" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def plot_distribution_for_metric(\n", " df, metric='virality_followers', num_experiments=1000, generate_thresholds_from_viral_quantiles=True, min_threshold=None, max_threshold=None, remove_duplicates=True, output_filename=None):\n", " viral_tweets = df[df.viral]\n", " number_of_viral_tweets = len(viral_tweets)\n", " \n", " if not generate_thresholds_from_viral_quantiles: \n", " # If not, generate a linear space of the thresholds between min and max of the metric values\n", " if not min_threshold:\n", " min_threshold = df[metric].min()\n", " if not max_threshold:\n", " max_threshold = df[metric].max()\n", " display(f\"sampling from {min_threshold} to {max_threshold}\")\n", " thresholds_space = np.linspace(min_threshold, max_threshold, num=num_experiments)\n", " else:\n", " # Take quantiles of metric for different percentages of viral tweets covered (from 0 to 100)\n", " thresholds_space = viral_tweets[metric].quantile([i / 100 for i in range(101)]) \n", " display(f\"sampling from {thresholds_space.min()} to {thresholds_space.max()}\")\n", "\n", " percentages_of_viral_covered = []\n", " nb_of_tweets_labeled_as_viral = []\n", "\n", " for i in thresholds_space:\n", " new_tweets_labeled = df[df[metric] >= i]\n", " percentage_of_viral_covered = len(new_tweets_labeled[new_tweets_labeled.viral]) / number_of_viral_tweets\n", " nb_of_tweets_labeled_as_viral.append(len(new_tweets_labeled))\n", " percentages_of_viral_covered.append(percentage_of_viral_covered)\n", " \n", " results_to_plot = pd.DataFrame({\n", " f'percentage_of_viral_covered_{metric}':percentages_of_viral_covered,\n", " f'nb_of_tweets_labeled_as_viral_{metric}':nb_of_tweets_labeled_as_viral,\n", " f'thresholds_{metric}': thresholds_space})\n", "\n", " #if remove_duplicates:\n", " # results_to_plot = results_to_plot.sort_values(by='nb_of_tweets_labeled_as_viral').drop_duplicates(subset=['percentage_of_viral_covered'], 
keep='first')\n", "\n", " # Discard rows where 100% of viral tweets are covered\n", " #results_to_plot = results_to_plot[results_to_plot.percentage_of_viral_covered < 1.0]\n", " # TODO: take min of 100% coverage\n", "\n", " fig = px.scatter(\n", " results_to_plot,\n", " x=f'percentage_of_viral_covered_{metric}',\n", " y=f'nb_of_tweets_labeled_as_viral_{metric}', hover_name=f'thresholds_{metric}')#log_y=True, trendline='ols' \n", "\n", " fig.update_layout(title_text=f\"Percentage of viral covered vs new tweets labeled as viral according to varying metric {metric}\")\n", " fig.show()\n", "\n", " display(f\"Result length {len(results_to_plot)}\")\n", " if not output_filename:\n", " output_filename = metric\n", " results_to_plot.to_csv(f'{output_filename}_viral_covered_vs_new_tweets_labeled.csv', index=False) \n", " \n", " return results_to_plot" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#viral_dataset_with_users = viral_dataset_with_users.groupby(by='virality_followers').count()\n", "METRIC_1 = 'virality_followers'\n", "viral_dataset_with_users = pd.read_parquet(f\"{PROCESSED_PATH_VIRAL}/all_tweets.parquet.gzip\")\n", "# Applying the second metric on the viral dataset\n", "viral_dataset_with_users[METRIC_1] = viral_dataset_with_users['retweet_count'] / viral_dataset_with_users['followers_count'].astype(\"float64\")\n", "# Handle division by zero if user has 0 followers\n", "viral_dataset_with_users[METRIC_1] = viral_dataset_with_users[METRIC_1].replace({np.inf: 0.0})" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_1 = plot_distribution_for_metric(viral_dataset_with_users, metric='virality_followers', num_experiments=10000)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Metric 2: retweet / user avg retweets\n", "METRIC_2 = 'virality_avg_retweets'\n", "viral_users_retweet_statistics = viral_dataset_with_users.groupby(by='author_id').retweet_count.agg(['min', 'mean', 'max', 'median'])\n", "viral_users_retweet_statistics = viral_users_retweet_statistics.rename(columns={\n", " \"min\": \"min_user_retweets\", \"max\": \"max_user_retweets\", \"mean\": \"mean_user_retweets\", \"median\": \"median_user_retweets\"})\n", "\n", "viral_dataset_with_users = viral_dataset_with_users.merge(viral_users_retweet_statistics, on='author_id')\n", "\n", "viral_dataset_with_users[METRIC_2] = viral_dataset_with_users['retweet_count'] / viral_dataset_with_users['mean_user_retweets'].astype(\"float64\")\n", "# Handle division by zero if user has 0 followers\n", "viral_dataset_with_users[METRIC_2] = viral_dataset_with_users[METRIC_2].replace({np.inf: 0.0})" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_2 = plot_distribution_for_metric(viral_dataset_with_users, metric='virality_avg_retweets', num_experiments=10000)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Metric 3: Minimum retweet count (Hard threshold)\n", "METRIC_3 = 'retweet_count'\n", "\n", "viral_tweets = viral_dataset_with_users[viral_dataset_with_users.viral]\n", "min_viral_retweet_count = viral_tweets.retweet_count.min()\n", "max_viral_retweet_count = viral_tweets.retweet_count.max()\n", "\n", "df_3 = plot_distribution_for_metric(\n", " viral_dataset_with_users, metric=METRIC_3, num_experiments=10000,\n", " min_threshold=min_viral_retweet_count, max_threshold=max_viral_retweet_count, 
generate_thresholds_from_viral_quantiles=False,\n", " output_filename='hard_threshold')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Metric 4 from Maldonado paper 'Virality Prediction for News Tweets Using RoBERTa'\n", "def roberta_paper_metric(x):\n", " g = x['retweet_count'] + x['like_count']\n", " h = x['followers_count'] - x['following_count']\n", " A = 10\n", "\n", " r = max(x['retweet_count'], 1)\n", " f = max(x['like_count'], 1)\n", " w = max(x['followers_count'], 1)\n", " d = max(x['following_count'], 1)\n", " h = max(h, 1)\n", "\n", " num = g * d * (A * r + f)\n", " denom = w * r * (A * d + h)\n", " #if denom == 0:\n", " # return 0\n", " return num / denom" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "METRIC_4 = 'roberta_paper_metric'\n", "viral_dataset_with_users[METRIC_4] = viral_dataset_with_users.apply(lambda x: roberta_paper_metric(x), axis='columns')\n", "\n", "df_4 = plot_distribution_for_metric(\n", " viral_dataset_with_users, metric=METRIC_4, num_experiments=100000)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "METRIC_5 = 'virality_retweet_percentile_per_user'\n", "\n", "# Take only tweets with positive retweet count, otherwise the quantiles will be very heavy-tailed\n", "#tweets_with_retweets = viral_dataset_with_users[viral_dataset_with_users.retweet_count > 0]\n", "\n", "viral_tweets = viral_dataset_with_users[viral_dataset_with_users.viral]\n", "percentiles = [i/100 for i in range(101)]\n", "number_of_viral_tweets = len(viral_tweets)\n", "\n", "percentages_of_viral_covered = []\n", "nb_of_tweets_labeled_as_viral = []\n", "\n", "for i in tqdm(percentiles):\n", " temp = viral_dataset_with_users.groupby(by='author_id')[['retweet_count']].quantile(i).rename(columns={'retweet_count': f'percentile_{i}'})\n", " temp = viral_dataset_with_users.merge(temp, on='author_id')\n", "\n", " new_tweets_labeled = temp[temp['retweet_count'] >= temp[f'percentile_{i}']]\n", " percentage_of_viral_covered = len(new_tweets_labeled[new_tweets_labeled.viral]) / number_of_viral_tweets\n", " nb_of_tweets_labeled_as_viral.append(len(new_tweets_labeled))\n", " percentages_of_viral_covered.append(percentage_of_viral_covered)\n", "\n", "df_5 = pd.DataFrame({\n", " f'percentage_of_viral_covered_{METRIC_5}':percentages_of_viral_covered,\n", " f'nb_of_tweets_labeled_as_viral_{METRIC_5}':nb_of_tweets_labeled_as_viral,\n", " f'thresholds_{METRIC_5}': percentiles})\n", "\n", "fig = px.scatter(\n", " df_5,\n", " x=f'percentage_of_viral_covered_{METRIC_5}',\n", " y=f'nb_of_tweets_labeled_as_viral_{METRIC_5}', hover_name=f'thresholds_{METRIC_5}')#log_y=True, trendline='ols' \n", "\n", "fig.update_layout(title_text=f\"Percentage of viral covered vs new tweets labeled as viral according to varying metric {METRIC_5}\")\n", "fig.show()\n", "\n", "display(f\"Result length {len(df_5)}\")\n", "df_5.to_csv(f'{METRIC_5}_viral_covered_vs_new_tweets_labeled.csv', index=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Metric 6: Median\n", "METRIC_6 = 'virality_median_retweets'\n", "\n", "positive_median_dataset = viral_dataset_with_users[viral_dataset_with_users['median_user_retweets'] > 0].copy()\n", "positive_median_dataset.loc[:, METRIC_6] = positive_median_dataset['retweet_count'] / positive_median_dataset['median_user_retweets'].astype(\"float64\")\n", "# Handle division by zero if user has 0 
followers\n", "positive_median_dataset.loc[:, METRIC_6] = positive_median_dataset[METRIC_6].replace({np.inf: 0.0, np.nan:0.0})" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_6 = plot_distribution_for_metric(\n", " positive_median_dataset, metric=METRIC_6, num_experiments=10000, remove_duplicates=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# log(retweet_counts) / followers_count\n", "METRIC_7 = 'log_retweets_over_followers'\n", "\n", "positive_retweet_and_follower_count = viral_dataset_with_users[(viral_dataset_with_users.retweet_count > 0) & (viral_dataset_with_users.followers_count > 0)].copy()\n", "\n", "positive_retweet_and_follower_count.loc[:, METRIC_7] = (np.log(positive_retweet_and_follower_count['retweet_count']) / positive_retweet_and_follower_count['followers_count']).astype(\"float64\")\n", "positive_retweet_and_follower_count.loc[:, METRIC_7] = positive_retweet_and_follower_count[METRIC_7].replace({np.inf: 0.0, np.nan:0.0})\n", "\n", "df_7 = plot_distribution_for_metric(\n", " positive_retweet_and_follower_count, metric=METRIC_7, num_experiments=10000, remove_duplicates=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "METRIC_8 = 'retweets_over_log_followers'\n", "\n", "positive_retweet_and_follower_count.loc[:, METRIC_8] = (positive_retweet_and_follower_count['retweet_count'] / np.log(positive_retweet_and_follower_count['followers_count'])).astype(\"float64\")\n", "positive_retweet_and_follower_count.loc[:, METRIC_8] = positive_retweet_and_follower_count[METRIC_8].replace({np.inf: 0.0, np.nan:0.0})\n", "\n", "df_8 = plot_distribution_for_metric(\n", " positive_retweet_and_follower_count, metric=METRIC_8, num_experiments=10000, remove_duplicates=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "METRIC_9 = 'log_retweets_over_log_followers'\n", "\n", "positive_retweet_and_follower_count.loc[:, METRIC_9] = (np.log(positive_retweet_and_follower_count['retweet_count']) / np.log(positive_retweet_and_follower_count['followers_count'])).astype(\"float64\")\n", "positive_retweet_and_follower_count.loc[:, METRIC_9] = positive_retweet_and_follower_count[METRIC_9].replace({np.inf: 0.0, np.nan:0.0})\n", "\n", "df_9 = plot_distribution_for_metric(\n", " positive_retweet_and_follower_count, metric=METRIC_9, num_experiments=10000, remove_duplicates=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "final_result = pd.concat([df_1, df_2, df_3, df_4, df_5, df_6, df_7, df_8, df_9], axis=1)\n", "final_result.to_csv('final_result_viral_coverage.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.3 Viral Dataset Exploration: Comparison between viral and non viral tweets using other features " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TODO: Only take viral tweets from scraped. 
Since sentiment is already computed on the other dataset, we relabel dataset viral by checking if in scraped ids \n", "# (DONE)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "viral_dataset_labeled = pd.read_parquet(f'{PROCESSED_PATH_VIRAL}/all_tweets.parquet.gzip')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "display(f\"{len(viral_dataset_labeled[viral_dataset_labeled.viral])} viral tweets out of {len(viral_dataset_labeled)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.3.1 - Language" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "languages_aggregates = viral_dataset_labeled.groupby(by='lang', as_index=False)[['id']].count().rename(columns={'id': 'count'})\n", "languages_aggregates = languages_aggregates.sort_values(by='count', ascending=False)\n", "languages_aggregates.loc[languages_aggregates['count'] < 10000, 'lang'] = 'Other Languages'\n", "fig = px.pie(languages_aggregates, values='count', names='lang', title='Distribution of Tweets languages')\n", "\n", "fig.update_layout(\n", " autosize=False,\n", " width=500,\n", " height=500\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.crosstab(index = viral_dataset_labeled['lang'] == 'en', columns=viral_dataset_labeled['viral']) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.3.2 - Media" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Has media\n", "labels = [\"Media\", \"No Media\"]\n", "viral_has_media = len(viral_dataset_labeled[(viral_dataset_labeled.viral == True) & (viral_dataset_labeled.has_media == True)])\n", "viral_no_media = len(viral_dataset_labeled[(viral_dataset_labeled.viral == True) & (viral_dataset_labeled.has_media == False)])\n", "normal_has_media = len(viral_dataset_labeled[(viral_dataset_labeled.viral == False) & (viral_dataset_labeled.has_media == True)])\n", "normal_no_media = len(viral_dataset_labeled[(viral_dataset_labeled.viral == False) & (viral_dataset_labeled.has_media == False)])\n", "\n", "\n", "# Create subplots: use 'domain' type for Pie subplot\n", "fig = make_subplots(rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]])\n", "fig.add_trace(go.Pie(labels=labels, values=[viral_has_media, viral_no_media], name=\"Viral with Media\"),\n", " 1, 1)\n", "fig.add_trace(go.Pie(labels=labels, values=[normal_has_media, normal_no_media], name=\"Tweet with Media\"),\n", " 1, 2)\n", "\n", "# Use `hole` to create a donut-like pie chart\n", "fig.update_traces(hole=.4, hoverinfo=\"label+percent+name\")\n", "\n", "fig.update_layout(\n", " width=1000,\n", " height=500,\n", " title_text=\"Percentage of tweets with some kind of media\",\n", " # Add annotations in the center of the donut pies.\n", " annotations=[dict(text='Viral', x=0.18, y=0.5, font_size=20, showarrow=False),\n", " dict(text='Non-Viral', x=0.82, y=0.5, font_size=20, showarrow=False)])\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Calculating the p-value between the target `viral` and `has_media`\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scipy.stats import chi2_contingency \n", "\n", "# Calculating the p-value\n", "contingency_media = pd.crosstab(index = viral_dataset_labeled['has_media'], columns=viral_dataset_labeled['viral']) \n", 
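"# Rows are has_media values, columns are viral values; chi2_contingency below tests whether the two are independent\n",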
"display(contingency_media)\n", "# Display with percentages\n", "display(pd.crosstab(index = viral_dataset_labeled['has_media'], columns=viral_dataset_labeled['viral'], normalize='columns') )\n", "\n", "c, p, dof, expected = chi2_contingency(contingency_media) \n", "display(f'p-value {p}')\n", "c, p, dof, expected" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Finding**: Viral tweets have more chance of having some kind of media (Video, Image, GIF..) embedded than non viral tweets." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.3.2 - Context annotations (Topics)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "viral_tweets_topic_domains = viral_dataset_labeled[viral_dataset_labeled.viral == True] \\\n", " .explode('topic_domains') \\\n", " .dropna(axis=0, subset=['topic_domains']) \\\n", " .topic_domains \n", "\n", "tweets_topic_domains = viral_dataset_labeled[viral_dataset_labeled.viral == False] \\\n", " .explode('topic_domains') \\\n", " .dropna(axis=0, subset=['topic_domains']) \\\n", " .topic_domains\n", "\n", "viral_topics_domains_sorted = viral_tweets_topic_domains.groupby(viral_tweets_topic_domains).count().sort_values(ascending=False)\n", "tweet_topics_domains_sorted = tweets_topic_domains.groupby(tweets_topic_domains).count().sort_values(ascending=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pickle\n", "\n", "with open(f'{DATA_PATH}/topic_domains.pickle', 'rb') as handle:\n", " topic_domains = pickle.load(handle)\n", "\n", "top_10_viral_topic_domains = viral_topics_domains_sorted[:10]\n", "top_10_tweet_topic_domains = tweet_topics_domains_sorted[:10]\n", "\n", "display(f\"Top 10 topic domains in viral tweets: \\n {[topic_domains.get(x)['name'] for x in top_10_viral_topic_domains.index.values]}\")\n", "display(f\"Top 10 topic domains in general tweets: \\n {[topic_domains.get(x)['name'] for x in top_10_tweet_topic_domains.index.values]}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "viral_labels = [topic_domains.get(x)['name'] for x in top_10_viral_topic_domains.index.values]\n", "non_viral_labels = [topic_domains.get(x)['name'] for x in top_10_tweet_topic_domains.index.values]\n", "\n", "# Create subplots: use 'domain' type for Pie subplot\n", "fig2 = make_subplots(rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]])\n", "fig2.add_trace(go.Pie(labels=viral_labels, values=top_10_viral_topic_domains.values, name=\"Viral Tweet Topic domain\"),\n", " 1, 1)\n", "fig2.add_trace(go.Pie(labels=non_viral_labels, values=top_10_tweet_topic_domains.values, name=\"Non-Viral Tweet Topic domain\"),\n", " 1, 2)\n", "\n", "# Use `hole` to create a donut-like pie chart\n", "fig2.update_traces(hole=.4, hoverinfo=\"label+percent+name\")\n", "\n", "fig2.update_layout(\n", " width=1000,\n", " height=500,\n", " title_text=\"Top 10 topic domains for viral vs non-viral tweets\",\n", " # Add annotations in the center of the donut pies.\n", " annotations=[dict(text='Viral', x=0.18, y=0.5, font_size=20, showarrow=False),\n", " dict(text='Non-Viral', x=0.82, y=0.5, font_size=20, showarrow=False)])\n", "fig2.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.3.3 - Tweet Length" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "viral_dataset_labeled.loc[:, 'tweet_length'] = viral_dataset_labeled.text.apply(len)" ] }, { 
"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "display(viral_dataset_labeled[['tweet_length', 'retweet_count']].corr())\n", "\n", "avg_tweet_length_viral = viral_dataset_labeled[viral_dataset_labeled.viral].tweet_length.mean()\n", "avg_tweet_length_non_viral = viral_dataset_labeled[~viral_dataset_labeled.viral].tweet_length.mean()\n", "\n", "display(f'viral avg tweet length: {avg_tweet_length_viral} \\n non-viral avg tweet length: {avg_tweet_length_non_viral}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some tweets are replies to others so **mentions are automatically inserted at the beginning of the tweet**, but they do not count in the Twitter max character count, so we should discard them." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "viral_dataset_labeled.loc[:, \"text\"] = viral_dataset_labeled.text.apply(clear_reply_mentions)\n", "viral_dataset_labeled.loc[:, 'tweet_length'] = viral_dataset_labeled.text.apply(len)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "display(viral_dataset_labeled[['tweet_length', 'retweet_count']].corr())\n", "\n", "avg_tweet_length_viral = viral_dataset_labeled[viral_dataset_labeled.viral].tweet_length.mean()\n", "avg_tweet_length_non_viral = viral_dataset_labeled[~viral_dataset_labeled.viral].tweet_length.mean()\n", "\n", "display(f'viral avg tweet length: {avg_tweet_length_viral} \\n non-viral avg tweet length: {avg_tweet_length_non_viral}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Calculating the welch’s t-test (scipy t-test) for continuous variable `tweet_length`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scipy.stats import ttest_ind\n", "\n", "ttest_ind(viral_dataset_labeled[viral_dataset_labeled.viral].tweet_length, viral_dataset_labeled[~viral_dataset_labeled.viral].tweet_length, equal_var=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.3.4 - Sentiment " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the sentiment analysis, we used huggingface's [default sentiment analysis model](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english?text=I+like+you.+I+love+you). We instantiate a huggingface pipeline using that default model, and we pass the tweets text to it, outputting a **label** (e.g. POSITIVE, NEGATIVE) alongside a **confidence score**. This will only be applied to english tweets.\n", "\n", "**NOTE**: Feel free to skip the following cells if you already have the processed data. Sentiment analysis takes some time (around 2 hours on the whole data). " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from transformers import pipeline\n", "\n", "# Device = 0 means it will use the Cuda at index 0\n", "sentiment_classifier = pipeline(\"sentiment-analysis\", device=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This will only be applied to **english tweets**. All the viral tweets we scraped are in English, so we won't be losing viral data when filtering." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "english_viral_dataset = viral_dataset_labeled[viral_dataset_labeled.lang == 'en']\n", "english_viral_dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we use the pandas `apply` function, with `result_type` to *expand*, so that the sentiment scores and label will be output into different columns." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "applied = english_viral_dataset.apply(lambda x: sentiment_classifier(x.text)[0], axis=1, result_type='expand')\n", "#pd.concat([small_test_set, applied], axis='columns')\n", "applied" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sentiment_features = pd.concat([english_viral_dataset, applied], axis=1)\n", "sentiment_features" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sentiment_features = sentiment_features.rename(columns={\"label\": \"sentiment\", \"score\": \"sentiment_score\"})" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sentiment_features.to_parquet(f\"{PROCESSED_PATH_VIRAL}/all_english_tweets_with_users_with_sentiment.parquet.gzip\", index=False, compression=\"gzip\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get the processed data already" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sentiment_features = pd.read_parquet(f\"{PROCESSED_PATH_VIRAL}/all_english_tweets_with_users_with_sentiment.parquet.gzip\")\n", "display(f\"{len(sentiment_features[sentiment_features.viral])} viral tweets out of {len(sentiment_features)}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Tweets with sentiment scores over 70%\n", "display(f\"Tweets with sentiment analysis confidence scores above 0.7: {len(sentiment_features[sentiment_features.sentiment_score > 0.7])}\")\n", "display(f\"{len(sentiment_features[sentiment_features.sentiment == 'POSITIVE'])} positive tweets\")\n", "display(f\"{len(sentiment_features[sentiment_features.sentiment == 'NEGATIVE'])} negative tweets\")\n", "\n", "confident_sentiment_tweets = sentiment_features[sentiment_features.sentiment_score > 0.7]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We keep only retweeted tweets to pan out tweets with zero retweets with little utility.\n", "#retweeted_tweets = confident_sentiment_tweets[confident_sentiment_tweets.retweet_count > 0]\n", "\n", "labels = [\"Positive\", \"Negative\"]\n", "viral_positive = len(confident_sentiment_tweets[(confident_sentiment_tweets.viral == True) & (confident_sentiment_tweets.sentiment == 'POSITIVE')])\n", "viral_negative = len(confident_sentiment_tweets[(confident_sentiment_tweets.viral == True) & (confident_sentiment_tweets.sentiment == 'NEGATIVE')])\n", "normal_positive = len(confident_sentiment_tweets[(confident_sentiment_tweets.viral == False) & (confident_sentiment_tweets.sentiment == 'POSITIVE')])\n", "normal_negative = len(confident_sentiment_tweets[(confident_sentiment_tweets.viral == False) & (confident_sentiment_tweets.sentiment == 'NEGATIVE')])\n", "\n", "\n", "# Create subplots: use 'domain' type for Pie subplot\n", "fig = make_subplots(rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]])\n", "fig.add_trace(go.Pie(labels=labels, 
values=[viral_positive, viral_negative], name=\"Positive Viral Tweets\"),\n", " 1, 1)\n", "fig.add_trace(go.Pie(labels=labels, values=[normal_positive, normal_negative], name=\"Positive Non-Viral Tweets\"),\n", " 1, 2)\n", "\n", "# Use `hole` to create a donut-like pie chart\n", "fig.update_traces(hole=.4, hoverinfo=\"label+percent+name\")\n", "\n", "fig.update_layout(\n", " width=1000,\n", " height=500,\n", " title_text=\"Distribution of positive and negative sentiment in viral vs non-viral tweets\",\n", " # Add annotations in the center of the donut pies.\n", " annotations=[dict(text='Viral', x=0.18, y=0.5, font_size=20, showarrow=False),\n", " dict(text='Non-Viral', x=0.82, y=0.5, font_size=20, showarrow=False)])\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Calculating the p-value between the target `viral` and positive sentiment\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scipy.stats import chi2_contingency \n", "\n", "#confident_sentiment_tweets.loc[:, 'is_positive'] = confident_sentiment_tweets.sentiment == 'POSITIVE'\n", "\n", "# Calculating the p-value\n", "contingency_sentiment = pd.crosstab(index = confident_sentiment_tweets['sentiment'], columns=confident_sentiment_tweets['viral']) \n", "# Display with percentages\n", "contingency_sentiment_normalized_percentage = pd.crosstab(\n", " index = confident_sentiment_tweets['sentiment'], columns=confident_sentiment_tweets['viral'], normalize='columns') \n", "display(contingency_sentiment_normalized_percentage)\n", "\n", "c, p, dof, expected = chi2_contingency(contingency_sentiment) \n", "display(f'p-value {p}')\n", "c, p, dof, expected" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Calculating the p-value between the target `viral` and negative sentiment\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scipy.stats import chi2_contingency \n", "\n", "confident_sentiment_tweets.loc[:, 'is_negative'] = confident_sentiment_tweets.sentiment == 'NEGATIVE'\n", "\n", "# Calculating the p-value\n", "contingency_negative_sentiment = pd.crosstab(index = confident_sentiment_tweets['is_negative'], columns=confident_sentiment_tweets['viral']) \n", "# Display with percentages\n", "contingency_negative_sentiment_normalized_percentage = pd.crosstab(\n", " index = confident_sentiment_tweets['is_negative'], columns=confident_sentiment_tweets['viral'], normalize='columns') \n", "display(contingency_negative_sentiment_normalized_percentage)\n", "\n", "c, p, dof, expected = chi2_contingency(contingency_negative_sentiment) \n", "display(f'p-value {p}')\n", "c, p, dof, expected" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "'''\n", "import spacy\n", "import vaderSentiment\n", "from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer\n", "\n", "nlp = spacy.load(\"en_core_web_sm\")\n", "\n", "spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS\n", "print('Number of stop words: %d' % len(spacy_stopwords))\n", "print('First ten stop words:',list(spacy_stopwords)[:10])\n", "'''" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "'''\n", "# Remove new lines \n", "remove_new_lines = lambda x: \" \".join(x.split())\n", "viral_dataset_labeled['processed_text'] = viral_dataset_labeled['text'].apply(remove_new_lines)\n", "\n", "\n", "english_tweets = viral_dataset_labeled[viral_dataset_labeled.lang 
== 'en']\n", "'''" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.3.5 - Number of hashtags " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "viral_dataset_labeled.loc[:, \"nb_of_hashtags\"] = viral_dataset_labeled.hashtags.apply(lambda x: len(x) if np.all(x) else 0)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "labels = [\"Hashtags\", \"No Hashtags\"]\n", "viral_has_hashtags = len(viral_dataset_labeled[(viral_dataset_labeled.viral) & (viral_dataset_labeled.nb_of_hashtags >= 1)])\n", "viral_no_hashtags = len(viral_dataset_labeled[(viral_dataset_labeled.viral) & (viral_dataset_labeled.nb_of_hashtags == 0)])\n", "normal_has_hashtags = len(viral_dataset_labeled[(~viral_dataset_labeled.viral) & (viral_dataset_labeled.nb_of_hashtags >= 1)])\n", "normal_no_hashtags = len(viral_dataset_labeled[(~viral_dataset_labeled.viral) & (viral_dataset_labeled.nb_of_hashtags == 0)])\n", "\n", "\n", "# Create subplots: use 'domain' type for Pie subplot\n", "fig = make_subplots(rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]])\n", "fig.add_trace(go.Pie(labels=labels, values=[viral_has_hashtags, viral_no_hashtags], name=\"Viral with Hashtags\"),\n", " 1, 1)\n", "fig.add_trace(go.Pie(labels=labels, values=[normal_has_hashtags, normal_no_hashtags], name=\"Tweet with No Hashtags\"),\n", " 1, 2)\n", "\n", "# Use `hole` to create a donut-like pie chart\n", "fig.update_traces(hole=.4, hoverinfo=\"label+percent+name\")\n", "\n", "fig.update_layout(\n", " width=1000,\n", " height=500,\n", " title_text=\"Percentage of tweets with hashtags\",\n", " # Add annotations in the center of the donut pies.\n", " annotations=[dict(text='Viral', x=0.18, y=0.5, font_size=20, showarrow=False),\n", " dict(text='Non-Viral', x=0.82, y=0.5, font_size=20, showarrow=False)])\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Calculating the p-value between the target `viral` and `has_hashtags`\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scipy.stats import chi2_contingency \n", "\n", "viral_dataset_labeled['has_hashtags'] = viral_dataset_labeled.nb_of_hashtags >= 1\n", "\n", "# Calculating the p-value\n", "contingency_has_hashtags = pd.crosstab(index = viral_dataset_labeled['has_hashtags'], columns=viral_dataset_labeled['viral']) \n", "# Display with percentages\n", "contingency_has_hashtags_normalized_percentage = pd.crosstab(\n", " index = viral_dataset_labeled['has_hashtags'], columns=viral_dataset_labeled['viral'], normalize='columns') \n", "display(contingency_has_hashtags_normalized_percentage)\n", "\n", "c, p, dof, expected = chi2_contingency(contingency_has_hashtags) \n", "display(f'p-value {p}')\n", "c, p, dof, expected" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.3.6 - Verified account" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Verified account\n", "labels = [\"Verified\", \"Not verified\"]\n", "viral_is_verified = len(viral_dataset_labeled[(viral_dataset_labeled.viral) & (viral_dataset_labeled.verified)])\n", "viral_not_verified = len(viral_dataset_labeled[(viral_dataset_labeled.viral) & (~viral_dataset_labeled.verified)])\n", "normal_is_verified = len(viral_dataset_labeled[(~viral_dataset_labeled.viral) & (viral_dataset_labeled.verified)])\n", "normal_not_verified = len(viral_dataset_labeled[(~viral_dataset_labeled.viral) & 
(~viral_dataset_labeled.verified)])\n", "\n", "\n", "# Create subplots: use 'domain' type for Pie subplot\n", "fig = make_subplots(rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]])\n", "fig.add_trace(go.Pie(labels=labels, values=[viral_is_verified, viral_not_verified], name=\"Viral with verified accounts\"),\n", " 1, 1)\n", "fig.add_trace(go.Pie(labels=labels, values=[normal_is_verified, normal_not_verified], name=\"Tweet with an unverified account\"),\n", " 1, 2)\n", "\n", "# Use `hole` to create a donut-like pie chart\n", "fig.update_traces(hole=.4, hoverinfo=\"label+percent+name\")\n", "\n", "fig.update_layout(\n", " width=1000,\n", " height=500,\n", " title_text=\"Percentage of tweets from verified accounts\",\n", " # Add annotations in the center of the donut pies.\n", " annotations=[dict(text='Viral', x=0.18, y=0.5, font_size=20, showarrow=False),\n", " dict(text='Non-Viral', x=0.82, y=0.5, font_size=20, showarrow=False)])\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Calculating the p-value between the target `viral` and `is_verified`\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scipy.stats import chi2_contingency \n", "\n", "# Calculating the p-value\n", "contingency_verified = pd.crosstab(index = viral_dataset_labeled['verified'], columns=viral_dataset_labeled['viral']) \n", "# Display with percentages\n", "contingency_verified_normalized_percentage = pd.crosstab(\n", " index = viral_dataset_labeled['verified'], columns=viral_dataset_labeled['viral'], normalize='columns') \n", "display(contingency_verified_normalized_percentage)\n", "\n", "c, p, dof, expected = chi2_contingency(contingency_verified) \n", "display(f'p-value {p}')\n", "c, p, dof, expected" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.3.7 - Has mentions" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "viral_dataset_labeled.loc[:, \"nb_of_mentions\"] = viral_dataset_labeled.mentions.apply(lambda x: len(x) if np.all(x) else 0)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scipy.stats import chi2_contingency \n", "\n", "# Calculating the p-value\n", "contingency_has_mentions = pd.crosstab(index = viral_dataset_labeled['nb_of_mentions'] > 0, columns=viral_dataset_labeled['viral']) \n", "display(contingency_has_mentions)\n", "# Display with percentages\n", "display(pd.crosstab(index = viral_dataset_labeled['nb_of_mentions'] > 0, columns=viral_dataset_labeled['viral'], normalize='columns') )\n", "\n", "c, p, dof, expected = chi2_contingency(contingency_has_mentions) \n", "display(f'p-value {p}')\n", "c, p, dof, expected" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.3.8 - Save result of preprocessing to disk" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "viral_dataset_labeled.to_parquet(f'{PROCESSED_PATH_VIRAL}/all_english_tweets_with_users_with_sentiment.parquet.gzip', index=False, compression=\"gzip\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "viral_dataset_labeled.columns\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.4 - Covid dataset Exploration" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we concern ourselves only with original tweets (no retweets)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "original_covid_tweets = pd.read_parquet(f\"{COVID_TWEETS_PATH}/all_original_tweets.parquet.gzip\")\n", "original_covid_tweets.loc[:, \"text\"] = original_covid_tweets.text.apply(clear_reply_mentions)\n", "\n", "covid_users = pd.read_parquet(f\"{COVID_TWEETS_PATH}/users.parquet.gzip\")\n", "\n", "display(\"--- COVID DATASET ---\")\n", "\n", "display(f\"{len(original_covid_tweets)} original (not retweeted) covid tweets collected\")\n", "display(f\"{len(original_covid_tweets.author_id.unique())} covid users collected\")\n", "\n", "original_covid_tweets" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "user_columns = ['author_id', 'followers_count', 'following_count', 'tweet_count', 'protected', 'verified', 'username']\n", "covid_dataset_with_users = original_covid_tweets.merge(covid_users.rename(columns={'id': 'author_id'})[user_columns], on='author_id')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Applying the first metric on the covid dataset\n", "covid_dataset_with_users['virality_followers'] = covid_dataset_with_users['retweet_count'] / covid_dataset_with_users['followers_count'].astype(\"float64\")\n", "# Handle division by zero if user has 0 followers\n", "covid_dataset_with_users['virality_followers'] = covid_dataset_with_users.virality_followers.replace({np.inf: 0.0})" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "covid_dataset_with_users" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "px.histogram(covid_dataset_with_users, x='followers_count', y = 'virality_followers', log_y=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "covid_dataset_with_users['viral'] = covid_dataset_with_users.virality_followers > 1\n", "covid_dataset_with_users[covid_dataset_with_users.viral]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.4.1 - Language" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "languages_aggregates = covid_dataset_with_users.groupby(by='lang', as_index=False)[['id']].count().rename(columns={'id': 'count'})\n", "languages_aggregates = languages_aggregates.sort_values(by='count', ascending=False)\n", "languages_aggregates.loc[languages_aggregates['count'] < 10000, 'lang'] = 'Other Languages'\n", "fig = px.pie(languages_aggregates, values='count', names='lang', title='Distribution of Tweets languages')\n", "\n", "fig.update_layout(\n", " autosize=False,\n", " width=500,\n", " height=500\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "english_covid_tweets = covid_dataset_with_users[covid_dataset_with_users.lang == 'en']\n", "display(f\"{len(english_covid_tweets)} english covid tweets\")\n", "\n", "english_viral_covid_tweets = english_covid_tweets[english_covid_tweets.viral]\n", "display(f\"{len(english_viral_covid_tweets)} viral english covid tweets\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.4.2 - Media" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Has media\n", "labels = [\"Media\", \"No Media\"]\n", "viral_has_media = len(covid_dataset_with_users[(covid_dataset_with_users.viral == True) & (covid_dataset_with_users.has_media == 
True)])\n", "viral_no_media = len(covid_dataset_with_users[(covid_dataset_with_users.viral == True) & (covid_dataset_with_users.has_media == False)])\n", "normal_has_media = len(covid_dataset_with_users[(covid_dataset_with_users.viral == False) & (covid_dataset_with_users.has_media == True)])\n", "normal_no_media = len(covid_dataset_with_users[(covid_dataset_with_users.viral == False) & (covid_dataset_with_users.has_media == False)])\n", "\n", "\n", "# Create subplots: use 'domain' type for Pie subplot\n", "fig = make_subplots(rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]])\n", "fig.add_trace(go.Pie(labels=labels, values=[viral_has_media, viral_no_media], name=\"Viral with Media\"),\n", " 1, 1)\n", "fig.add_trace(go.Pie(labels=labels, values=[normal_has_media, normal_no_media], name=\"Tweet with Media\"),\n", " 1, 2)\n", "\n", "# Use `hole` to create a donut-like pie chart\n", "fig.update_traces(hole=.4, hoverinfo=\"label+percent+name\")\n", "\n", "fig.update_layout(\n", " width=1000,\n", " height=500,\n", " title_text=\"Percentage of tweets with some kind of media\",\n", " # Add annotations in the center of the donut pies.\n", " annotations=[dict(text='Viral', x=0.18, y=0.5, font_size=20, showarrow=False),\n", " dict(text='Non-Viral', x=0.82, y=0.5, font_size=20, showarrow=False)])\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.4.3 - Tweet Length" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "covid_dataset_with_users.loc[:, 'tweet_length'] = covid_dataset_with_users.text.apply(len)\n", "covid_dataset_with_users[['tweet_length', 'retweet_count']].corr()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.4.4 - Sentiment" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from transformers import pipeline\n", "\n", "# Device = 0 means it will use the Cuda at index 0\n", "sentiment_classifier = pipeline(\"sentiment-analysis\", device=0)\n", "\n", "english_covid_dataset = covid_dataset_with_users[covid_dataset_with_users.lang == 'en']\n", "english_covid_dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we compute sentiments again. To avoid having to compute the sentiments again, we've already preprocessed the data and computed the sentiments and saved it to parquet. Feel free to skip the next 2 cells." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "applied = english_covid_dataset.apply(lambda x: sentiment_classifier(x.text)[0], axis=1, result_type='expand')\n", "#pd.concat([small_test_set, applied], axis='columns')\n", "applied" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sentiment_features = pd.concat([english_covid_dataset, applied], axis=1)\n", "sentiment_features = sentiment_features.rename(columns={\"label\": \"sentiment\", \"score\": \"sentiment_score\"})" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sentiment_features = pd.read_parquet(f\"{PROCESSED_PATH_COVID}/english_tweets_with_users_with_sentiment.parquet.gzip\")\n", "sentiment_features" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Tweets with sentiment scores over 70%\n", "display(f\"Tweets with sentiment analysis confidence scores above 0.7: {len(sentiment_features[sentiment_features.sentiment_score > 0.7])}\")\n", "display(f\"{len(sentiment_features[sentiment_features.sentiment == 'POSITIVE'])} positive tweets\")\n", "display(f\"{len(sentiment_features[sentiment_features.sentiment == 'NEGATIVE'])} negative tweets\")\n", "\n", "confident_sentiment_tweets = sentiment_features[sentiment_features.sentiment_score > 0.7]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We keep only retweeted tweets to pan out tweets with zero retweets with little utility.\n", "labels = [\"Positive\", \"Negative\"]\n", "viral_positive = len(confident_sentiment_tweets[(confident_sentiment_tweets.viral == True) & (confident_sentiment_tweets.sentiment == 'POSITIVE')])\n", "viral_negative = len(confident_sentiment_tweets[(confident_sentiment_tweets.viral == True) & (confident_sentiment_tweets.sentiment == 'NEGATIVE')])\n", "normal_positive = len(confident_sentiment_tweets[(confident_sentiment_tweets.viral == False) & (confident_sentiment_tweets.sentiment == 'POSITIVE')])\n", "normal_negative = len(confident_sentiment_tweets[(confident_sentiment_tweets.viral == False) & (confident_sentiment_tweets.sentiment == 'NEGATIVE')])\n", "\n", "\n", "# Create subplots: use 'domain' type for Pie subplot\n", "fig = make_subplots(rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]])\n", "fig.add_trace(go.Pie(labels=labels, values=[viral_positive, viral_negative], name=\"Positive Viral Tweets\"),\n", " 1, 1)\n", "fig.add_trace(go.Pie(labels=labels, values=[normal_positive, normal_negative], name=\"Positive Non-Viral Tweets\"),\n", " 1, 2)\n", "\n", "# Use `hole` to create a donut-like pie chart\n", "fig.update_traces(hole=.4, hoverinfo=\"label+percent+name\")\n", "\n", "fig.update_layout(\n", " width=1000,\n", " height=500,\n", " title_text=\"Distribution of positive and negative sentiment in viral vs non-viral tweets\",\n", " # Add annotations in the center of the donut pies.\n", " annotations=[dict(text='Viral', x=0.18, y=0.5, font_size=20, showarrow=False),\n", " dict(text='Non-Viral', x=0.82, y=0.5, font_size=20, showarrow=False)])\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.4.5 - Number of Hashtags" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "covid_dataset_with_users.loc[:, \"nb_of_hashtags\"] = covid_dataset_with_users.hashtags.apply(lambda x: len(x) if np.all(x) else 0)" ] }, { "cell_type": "code", 
"execution_count": null, "metadata": {}, "outputs": [], "source": [ "labels = [\"Hashtags\", \"No Hashtags\"]\n", "viral_has_hashtags = len(covid_dataset_with_users[(covid_dataset_with_users.viral) & (covid_dataset_with_users.nb_of_hashtags >= 1)])\n", "viral_no_hashtags = len(covid_dataset_with_users[(covid_dataset_with_users.viral) & (covid_dataset_with_users.nb_of_hashtags == 0)])\n", "normal_has_hashtags = len(covid_dataset_with_users[(~covid_dataset_with_users.viral) & (covid_dataset_with_users.nb_of_hashtags > 1)])\n", "normal_no_hashtags = len(covid_dataset_with_users[(~covid_dataset_with_users.viral) & (covid_dataset_with_users.nb_of_hashtags == 0)])\n", "\n", "\n", "# Create subplots: use 'domain' type for Pie subplot\n", "fig = make_subplots(rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]])\n", "fig.add_trace(go.Pie(labels=labels, values=[viral_has_hashtags, viral_no_hashtags], name=\"Viral with Hashtags\"),\n", " 1, 1)\n", "fig.add_trace(go.Pie(labels=labels, values=[normal_has_hashtags, normal_no_hashtags], name=\"Tweet with No Hashtags\"),\n", " 1, 2)\n", "\n", "# Use `hole` to create a donut-like pie chart\n", "fig.update_traces(hole=.4, hoverinfo=\"label+percent+name\")\n", "\n", "fig.update_layout(\n", " width=1000,\n", " height=500,\n", " title_text=\"Percentage of tweets with hashtags\",\n", " # Add annotations in the center of the donut pies.\n", " annotations=[dict(text='Viral', x=0.18, y=0.5, font_size=20, showarrow=False),\n", " dict(text='Non-Viral', x=0.82, y=0.5, font_size=20, showarrow=False)])\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.4.6 - Verified Account" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Has media\n", "labels = [\"Verified\", \"Not verified\"]\n", "viral_is_verified = len(covid_dataset_with_users[(covid_dataset_with_users.viral) & (covid_dataset_with_users.verified)])\n", "viral_not_verified = len(covid_dataset_with_users[(covid_dataset_with_users.viral) & (~covid_dataset_with_users.verified)])\n", "normal_is_verified = len(covid_dataset_with_users[(~covid_dataset_with_users.viral) & (covid_dataset_with_users.verified)])\n", "normal_not_verified = len(covid_dataset_with_users[(~covid_dataset_with_users.viral) & (~covid_dataset_with_users.verified)])\n", "\n", "\n", "# Create subplots: use 'domain' type for Pie subplot\n", "fig = make_subplots(rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]])\n", "fig.add_trace(go.Pie(labels=labels, values=[viral_is_verified, viral_not_verified], name=\"Viral with verified accounts\"),\n", " 1, 1)\n", "fig.add_trace(go.Pie(labels=labels, values=[normal_is_verified, normal_not_verified], name=\"Tweet with an unverified account\"),\n", " 1, 2)\n", "\n", "# Use `hole` to create a donut-like pie chart\n", "fig.update_traces(hole=.4, hoverinfo=\"label+percent+name\")\n", "\n", "fig.update_layout(\n", " width=1000,\n", " height=500,\n", " title_text=\"Percentage of tweets from verified accounts\",\n", " # Add annotations in the center of the donut pies.\n", " annotations=[dict(text='Viral', x=0.18, y=0.5, font_size=20, showarrow=False),\n", " dict(text='Non-Viral', x=0.82, y=0.5, font_size=20, showarrow=False)])\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.4.7 - Save dataframe with analysis to disk" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ 
"covid_dataset_with_users.to_parquet(f'{PROCESSED_PATH_COVID}/all_english_tweets_with_users_with_sentiment.parquet.gzip', index=False, compression=\"gzip\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Questions for TJ:\n", "\n", "Learn threshold? Use unsupervised learning (anomaly detection), x axis date y retweet count, isolation coordinate\n", "Ratio\n", "Try to come up with Different metrics (one cannot be used for second dataset)\n", "\n", "Preprocessing:\n", " - Remove tweets with no retweets or likes? NO\n", " - Define threshold using the metric? DONE (label above viral tweet)\n", " - Skewed distribution if we use only Twitter viral tweets (1000) DONE\n", "- Which features? (Any new ideas)\n", " - Topic\n", " - Hashtags relevant? (Most likely different from coronavirus and we already have topics).\n", " - Has media\n", " - Sentiment? [TODO]\n", " - Tweet length [TODO]\n", " - RETRIEVE USERS THAT LIKED OR RETWEETED USING API [TODO]\n", " - Word cloud of entities [TODO]\n", "- Check bigrams and trigrams distribution\n", "- Normalize features (like, retweets, reply etc...)? DEPENDS, Included in first model, will be removed from second model with covid set.\n", "- BertTweet [DO NOT REMOVE STOP WORDS FOR LANGUAGE MODELS, FOR ]\n", "- Next steps (now that data collection part is done and data analysis almost done)\n", " - Hydrate Covid dataset id\n", "- Viral generator (Trump generator)\n", "\n", "1st classifier: hashtags, twitter entities (context annotations, domain annotations, entities), mentions, domain of urls (youtube.com let’s say)\n", "2nd classifier: bag of words with tf-idf, remove stopwords and other entities that you used in the 1st classifier\n", "3rd: language model\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.15" }, "vscode": { "interpreter": { "hash": "71d2f77bccee14ca7852d7b7a1fa8ea4708b81087104d93973081337557f0ee6" } } }, "nbformat": 4, "nbformat_minor": 4 }