# Comparison of Twitter's model of viral tweets with other tweets

In this notebook, we try to identify common features of the tweets that Twitter identifies as viral, found on the topic page "Viral Tweets".

We also test whether other tweets that never appeared on that topic page can be labeled as viral based on these common features. This should help homogenize the data (tweets that are viral and tweets that are not) when training the model.

In [None]:
import pandas as pd
import seaborn as sns
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from helper.text_preprocessing import clear_reply_mentions

from tqdm import tqdm

#pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

DATA_PATH = "../../data"
VIRAL_TWEETS_PATH = f"{DATA_PATH}/new/viral"
COVID_TWEETS_PATH = f"{DATA_PATH}/new/covid"

PROCESSED_PATH_VIRAL = f'{DATA_PATH}/new/processed/viral'
PROCESSED_PATH_COVID = f'{DATA_PATH}/new/processed/covid'

## 0. Preprocessing

In [None]:
viral_dataset = pd.read_parquet(f"{VIRAL_TWEETS_PATH}/all_tweets.parquet.gzip")

In [None]:
covid_dataset = pd.read_parquet(f"{COVID_TWEETS_PATH}/all_tweets.parquet.gzip")

In [None]:
covid_users = pd.read_parquet(f"{COVID_TWEETS_PATH}/users.parquet.gzip")

- Keep only original tweets from the **covid dataset**. The viral dataset doesn't contain retweets.

In [None]:
def is_retweeted(referenced_tweets):
    for x in referenced_tweets:
        if x['type'] == 'retweeted':
            return True
    return False

# Keep only original tweets
referenced = covid_dataset.loc[~covid_dataset.referenced_tweets.isna()].copy()
referenced.loc[:, 'is_retweet'] = referenced.referenced_tweets.apply(is_retweeted)
retweeted = referenced[referenced.is_retweet]
retweeted

In [None]:
original_covid_tweets = covid_dataset[~covid_dataset.id.isin(retweeted.id)]
original_covid_tweets.to_parquet(f"{COVID_TWEETS_PATH}/all_original_tweets.parquet.gzip", index=False, compression="gzip")
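
As a quick sanity check of the retweet filter above, the sketch below applies the same logic to a small hand-made frame. The toy rows and the helper name `is_retweet_row` are assumptions made for illustration; only the shape of `referenced_tweets` (a list of dicts with a `type` key, or a missing value) is taken from the notebook.

```python
import pandas as pd

def is_retweet_row(referenced_tweets):
    """Return True if any referenced tweet has type 'retweeted'."""
    # Missing values (None or NaN) are not iterable: treat them as "no references".
    if not hasattr(referenced_tweets, '__iter__'):
        return False
    return any(ref['type'] == 'retweeted' for ref in referenced_tweets)

# Hypothetical toy data: one original tweet, one retweet, one reply.
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'referenced_tweets': [
        None,
        [{'type': 'retweeted', 'id': 10}],
        [{'type': 'replied_to', 'id': 11}],
    ],
})

toy['is_retweet'] = toy['referenced_tweets'].apply(is_retweet_row)
original_ids = toy.loc[~toy['is_retweet'], 'id'].tolist()
print(original_ids)
```

Replies are kept here, as in the notebook: only rows that reference a tweet with type `retweeted` are dropped.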

In [None]:
# Clear reply mentions at the beginning of tweets texts
original_covid_tweets.loc[:, "text"] = original_covid_tweets.text.apply(clear_reply_mentions)
viral_dataset.loc[:, "text"] = viral_dataset.text.apply(clear_reply_mentions)

## 1. Exploration

### 1.1 - General Exploration

In [None]:
viral_dataset = pd.read_parquet(f"{VIRAL_TWEETS_PATH}/all_tweets.parquet.gzip")
viral_users = pd.read_parquet(f"{VIRAL_TWEETS_PATH}/users.parquet.gzip")
viral_tweets_ids = pd.read_parquet(f"{VIRAL_TWEETS_PATH}/viral_tweets_ids.parquet.gzip")

In [None]:
original_covid_tweets = pd.read_parquet(f"{COVID_TWEETS_PATH}/all_original_tweets.parquet.gzip")
covid_users = pd.read_parquet(f"{COVID_TWEETS_PATH}/users.parquet.gzip")

In [None]:
display("--- VIRAL DATASET ---")

display(f"{len(viral_tweets_ids)} viral tweets collected")
display(f"{len(viral_users)} viral users")
display(f"{len(viral_dataset)} all tweets collected")

display("--- COVID DATASET ---")

display(f"{len(original_covid_tweets)} original (not retweeted) covid tweets collected")
display(f"{len(original_covid_tweets.author_id.unique())} covid users collected")

In [None]:
# REMOVE THIS WHEN DONE COLLECTION (WARNING NOT NECESSARILY)
viral_dataset['viral'] = viral_dataset.id.isin(viral_tweets_ids.id)

#viral_tweets = all_tweets[all_tweets.id.isin(viral_tweets.id)]
#viral_tweets

len(viral_dataset[viral_dataset.viral])

- merge tweets with user info

In [None]:
covid_users.columns

In [None]:
user_columns = ['author_id', 'followers_count', 'following_count', 'tweet_count', 'protected', 'verified', 'username']
viral_dataset_with_users = viral_dataset.merge(viral_users.rename(columns={'id': 'author_id'})[user_columns], on='author_id')
covid_dataset_with_users = original_covid_tweets.merge(covid_users.rename(columns={'id': 'author_id'})[user_columns], on='author_id')
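
The merges above assume that each user appears only once in the users table. A minimal sketch of how that assumption could be checked, using pandas' `validate` argument on a hypothetical toy frame (the toy values are illustrative, not taken from the datasets):

```python
import pandas as pd

# Hypothetical toy data: three tweets written by two users.
tweets = pd.DataFrame({'id': [1, 2, 3], 'author_id': [10, 10, 20]})
users = pd.DataFrame({'id': [10, 20], 'followers_count': [100, 5]})

merged = tweets.merge(
    users.rename(columns={'id': 'author_id'}),
    on='author_id',
    how='inner',             # the default: tweets without a matching user are dropped
    validate='many_to_one',  # raises MergeError if a user id is duplicated in `users`
)
print(merged)
```

With the default inner join, comparing `len(merged)` with `len(tweets)` is a cheap way to see whether any tweets were lost because their author is missing from the users table.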

#### 1.1.1 - Correlation between public metrics

- Pearson Correlation between the different public metrics

In [None]:
public_metrics = ['retweet_count', 'like_count', 'reply_count', 'quote_count', 'followers_count', 'following_count']
display(viral_dataset_with_users[public_metrics].corr())
display(covid_dataset_with_users[public_metrics].corr())

In [None]:
px.scatter(viral_dataset, x='like_count', y='retweet_count')

#### 1.1.2 - Exploring retweet count of viral vs non viral tweets

Since we have a large number of tweets to plot, we only keep the most retweeted tweets of each user.

In [None]:
def get_largest_n(all_tweets, by='retweet_count', n=100):
    '''Get the n largest tweets (ranked by the `by` column) for every user
    '''
    # Use the `n` argument rather than a hard-coded 100
    top_n_per_user = all_tweets.groupby(by='author_id')[by].nlargest(n=n).reset_index(level=0, drop=True)
    tweets_for_plot = all_tweets[all_tweets.index.isin(top_n_per_user.index)].reset_index()
    return tweets_for_plot

In [None]:
tweets_plot_df = get_largest_n(viral_dataset, by='retweet_count')
fig = px.scatter(tweets_plot_df, x=tweets_plot_df.index, y='retweet_count', color='viral')

fig.update_layout(title_text="Viral Dataset Scatter plot of the retweet count for the top 100 tweets per user", xaxis_title="Index", yaxis_title="retweet count")

fig.show()

In [None]:
covid_tweets_plot_df = original_covid_tweets.sort_values(by='retweet_count', ascending=False)[:10000]
fig = px.scatter(covid_tweets_plot_df, x=covid_tweets_plot_df.reset_index().index, y='retweet_count')

fig.update_layout(title_text="Covid Dataset Scatter plot of retweet count sorted by retweet count on a 10000 sample", xaxis_title="Index", yaxis_title="retweet count")

fig.show()

**Finding**: Viral tweets identified by Twitter are by no means more viral than other tweets by the same users. Are users who have tweeted viral tweets (as identified by Twitter) likely to have tweeted other viral tweets?

In [None]:
# Get the ratio of each tweet's retweet count to the mean retweet count of the user's tweets
# Again, since we've retrieved 3200 tweets per user, we're only taking the average over those
users_avg_retweets = viral_dataset.groupby(by='author_id').agg(mean_retweets=('retweet_count', 'mean'))
tweets_merged_avg_retweets = viral_dataset.merge(right=users_avg_retweets, left_on='author_id', right_index=True)
tweets_merged_avg_retweets['ratio_avg_retweets'] = tweets_merged_avg_retweets['retweet_count'] / tweets_merged_avg_retweets['mean_retweets']
tweets_merged_avg_retweets_sorted = tweets_merged_avg_retweets.sort_values(by='ratio_avg_retweets').reset_index()
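
To make the ratio concrete, here is a minimal sketch on hypothetical toy data (the values are made up). It uses `groupby(...).transform('mean')` instead of the merge above, which should give the same per-tweet ratio without creating an intermediate frame.

```python
import pandas as pd

# Hypothetical toy data: user 'a' is rarely retweeted, user 'b' is retweeted a lot.
toy = pd.DataFrame({
    'author_id': ['a', 'a', 'a', 'b', 'b'],
    'retweet_count': [1, 2, 9, 100, 300],
})

# Per-tweet copy of the author's mean retweet count (4 for 'a', 200 for 'b').
user_mean = toy.groupby('author_id')['retweet_count'].transform('mean')
toy['ratio_avg_retweets'] = toy['retweet_count'] / user_mean
print(toy)
```

In this toy example the tweet with 9 retweets from user `a` gets a ratio of 2.25, higher than the tweet with 300 retweets from user `b` (1.5): the metric measures how unusual a tweet is for its author, not its absolute reach.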

In [None]:
tweets_plot_df = get_largest_n(tweets_merged_avg_retweets_sorted, by='ratio_avg_retweets')

fig = px.scatter(tweets_plot_df, x=tweets_plot_df.index, y='ratio_avg_retweets', color='viral')

fig.update_layout(title_text="Scatter plot of the tweets sorted by the ratio #retweets/(the mean user avg #retweets)", xaxis_title="Index", yaxis_title="ratio")

fig.show()

**Finding**: Cleaner separation. Viral tweets, as expected, are on the other end of the spectrum. However, other tweets in the same range could qualify as viral as well, and arguably should also have been identified as viral by the Twitter model.

### 1.2 Finding the right threshold for virality

#### 1.2.0 - Relabel viral tweets in the viral dataset by correcting the initial virality threshold (ONLY IN OLD PAPER SUBMITTED BY STUDENT)

Let's observe the retweet count of a user based on the tweet date.

In [None]:
sample_user = viral_users.id[10]
author_tweets = viral_dataset[viral_dataset.author_id == sample_user]
fig = px.scatter(author_tweets, x='created_at', y='retweet_count', color='viral')

fig.update_layout(title_text="Scatter plot of the retweet count wrt the tweet date for a single user")

fig.show() 

**Finding**: The above graph of a user's retweet count wrt the tweet date shows that the viral tweets taken from the Twitter "Viral Tweets" topic page were collected at certain points in time. **Other tweets with higher retweet counts** may have been on that topic page at other points in time as well. In any case, they **should be qualified as viral all the same**.

One quick fix is, for each user, to mark as viral all tweets whose retweet count is at least as high as that of the viral tweet we scraped for that user.

In [None]:
# Get the minimum retweet count out of the viral tweets for each user
min_retweet_count_by_user = viral_dataset[viral_dataset.viral].groupby(by='author_id')[['retweet_count']].min()

# Set as viral any tweet that has a retweet count higher or equal to the user's minimum retweet count we just computed
viral_dataset_labeled = viral_dataset.merge(min_retweet_count_by_user, left_on='author_id', right_index=True, suffixes=(None, "_user_viral_threshold"))
viral_dataset_labeled['viral'] = viral_dataset_labeled['retweet_count'] >= viral_dataset_labeled['retweet_count_user_viral_threshold']
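
The relabeling rule can be illustrated on hypothetical toy data. This sketch computes the per-user threshold with `where` and `groupby(...).transform('min')` rather than a merge; the toy values and the column name `viral_relabeled` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data: each user has exactly one tweet scraped from the topic page.
toy = pd.DataFrame({
    'author_id': ['a', 'a', 'a', 'b', 'b'],
    'retweet_count': [5, 50, 80, 10, 3],
    'viral': [False, True, False, True, False],
})

# Keep retweet counts of viral tweets only (NaN elsewhere), then take the
# per-user minimum and broadcast it back to every tweet of that user.
threshold = (
    toy['retweet_count'].where(toy['viral'])
    .groupby(toy['author_id'])
    .transform('min')
)
toy['viral_relabeled'] = toy['retweet_count'] >= threshold
print(toy)
```

One difference from the merge above: a user without any viral tweet gets a `NaN` threshold here, so all of their tweets stay non-viral, whereas the inner merge drops such users from the result entirely.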

In [None]:
# Save this result 
#viral_dataset_labeled.to_parquet(f'{PROCESSED_PATH_VIRAL}/all_tweets.parquet.gzip', compression='gzip')

In [None]:
display(f"Number of identified viral tweets increased from {len(viral_tweets_ids)} to {len(viral_dataset_labeled[viral_dataset_labeled.viral])}")

Another problem we're facing is that we're **missing historical data** on the number of followers of a user: we only know the follower count at collection time, not at the time each tweet was posted. So we cannot use the metric
$\frac{\#retweets}{\#followers}$ effectively. That's why we came up with the other metric: $\frac{\#retweets}{mean(\#retweets)}$.
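
Both metrics are plain ratios, so the main practical concern is a zero denominator. A minimal sketch, assuming a hypothetical helper `safe_ratio` that is not part of the notebook, which maps both `inf` (x/0) and `NaN` (0/0) to 0.0:

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator, denominator):
    """Element-wise numerator / denominator, returning 0.0 where the denominator is 0."""
    numerator = pd.Series(numerator, dtype='float64')
    denominator = pd.Series(denominator, dtype='float64')
    ratio = numerator / denominator
    # x/0 gives +/-inf and 0/0 gives NaN in floating point: map both to 0.0.
    return ratio.replace([np.inf, -np.inf], np.nan).fillna(0.0)

# Hypothetical toy data.
toy = pd.DataFrame({
    'author_id': ['a', 'a', 'b'],
    'retweet_count': [10, 30, 5],
    'followers_count': [100, 100, 0],
})

# Metric 1: retweets / followers
toy['virality_followers'] = safe_ratio(toy['retweet_count'], toy['followers_count'])

# Metric 2: retweets / mean retweets of the same user
user_mean = toy.groupby('author_id')['retweet_count'].transform('mean')
toy['virality_avg_retweets'] = safe_ratio(toy['retweet_count'], user_mean)
print(toy)
```

Note that this treats 0/0 differently from the cells in this notebook, which only replace `inf` and leave `NaN` in place.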

#### 1.2.1 Applying the virality followers metric to both datasets

In [None]:
# Applying the first metric on the covid dataset
covid_dataset_with_users['virality_followers'] = covid_dataset_with_users['retweet_count'] / covid_dataset_with_users['followers_count'].astype("float64")
# Handle division by zero if user has 0 followers
covid_dataset_with_users['virality_followers'] = covid_dataset_with_users.virality_followers.replace({np.inf: 0.0})

In [None]:
len(covid_dataset_with_users[(covid_dataset_with_users['virality_followers'] > 0.8)])

In [None]:
# Applying the first metric (retweets / followers) on the viral dataset
viral_dataset_with_users['virality_followers'] = viral_dataset_with_users['retweet_count'] / viral_dataset_with_users['followers_count'].astype("float64")
# Handle division by zero if user has 0 followers
viral_dataset_with_users['virality_followers'] = viral_dataset_with_users.virality_followers.replace({np.inf: 0.0})

In [None]:
len(viral_dataset_with_users[(viral_dataset_with_users['virality_followers'] > 1)])

#### 1.2.2 Applying the virality avg retweets metric to viral dataset 

In [None]:
viral_users_retweet_statistics = viral_dataset_with_users.groupby(by='author_id').retweet_count.agg(['min', 'mean', 'max'])
viral_users_retweet_statistics = viral_users_retweet_statistics.rename(columns={"min": "min_user_retweets", "max": "max_user_retweets", "mean": "mean_user_retweets"})

In [None]:
viral_dataset_with_users = viral_dataset_with_users.merge(viral_users_retweet_statistics, on='author_id')

In [None]:
# Applying the second metric (retweets / mean user retweets) on the viral dataset
viral_dataset_with_users['virality_avg_retweets'] = viral_dataset_with_users['retweet_count'] / viral_dataset_with_users['mean_user_retweets'].astype("float64")
# Handle division by zero if the user's mean retweet count is 0 (note: 0/0 yields NaN, which is left as-is)
viral_dataset_with_users['virality_avg_retweets'] = viral_dataset_with_users.virality_avg_retweets.replace({np.inf: 0.0})

In [None]:
len(viral_dataset_with_users[(viral_dataset_with_users['virality_avg_retweets'] > 1)])

#### 1.2.3 How many tweets are covered by metric 1?

In [None]:
temp = viral_dataset_with_users[viral_dataset_with_users.virality_followers > 0]
temp_2 = viral_dataset_with_users[viral_dataset_with_users.virality_avg_retweets > 0]
viral_temp = viral_dataset_with_users[viral_dataset_with_users.viral]

In [None]:
fig = px.ecdf(viral_dataset_with_users[viral_dataset_with_users.viral], x='virality_followers')

# TODO: percentage y axis
# TODO: Only take the scraped tweets
fig.update_layout(title_text="Percentage of viral tweets recognized by Metric 1: number of followers", xaxis_title="Metric 1: virality_followers", yaxis_title="Percentage")

fig.show()

In [None]:
fig1 = sns.displot(temp, x='virality_followers', kind='ecdf')

plt.xscale('log')
plt.title("Proportion of tweets labeled as viral as function of Metric 1: number of followers (logscale)")

In [None]:
temp_2 = viral_dataset_with_users[viral_dataset_with_users.virality_avg_retweets > 0]

In [None]:
fig = px.ecdf(viral_temp, x='virality_avg_retweets')

fig.update_layout(title_text="Percentage of viral tweets recognized by Metric 2 avg retweets", xaxis_title="Metric 2: avg retweets", yaxis_title="Percentage")

fig.show()

In [None]:
fig = sns.displot(temp_2, x='virality_avg_retweets', kind='ecdf')

plt.xscale('log')
plt.title("Proportion of tweets labeled as viral as function of Metric 2: avg retweets (logscale)")

TODO: Plot the percentage of viral tweets labeled vs the # of new tweets labeled as the threshold of the metric we use varies.

In [None]:
#viral_dataset_with_users = viral_dataset_with_users.groupby(by='virality_followers').count()
viral_dataset_with_users = pd.read_parquet(f"{PROCESSED_PATH_VIRAL}/all_tweets.parquet.gzip")
# Applying the first metric (retweets / followers) on the viral dataset
viral_dataset_with_users['virality_followers'] = viral_dataset_with_users['retweet_count'] / viral_dataset_with_users['followers_count'].astype("float64")
# Handle division by zero if user has 0 followers
viral_dataset_with_users['virality_followers'] = viral_dataset_with_users.virality_followers.replace({np.inf: 0.0})


In [None]:
viral_dataset_with_users

In [None]:
viral_dataset_with_users_truncated = viral_dataset_with_users[viral_dataset_with_users.virality_followers > 0.1]
#viral_dataset_with_users['viral_metric_1'] = viral_dataset_with_users['']
len(viral_dataset_with_users_truncated)

In [None]:
ready_to_plot = viral_dataset_with_users_truncated.copy()
ready_to_plot['viral'] = ready_to_plot['viral'].replace({False: None})
ready_to_plot = ready_to_plot.groupby(by='virality_followers').count()[['text', 'viral']].cumsum().rename(columns={'text':'tweets'})

In [None]:
fig = px.line(ready_to_plot, x='viral', y='tweets', hover_data=[ready_to_plot.index])#, log_y=True)

fig.update_layout(title_text="Line plot of #viral tweets labeled as viral vs # new tweets labeled as viral by varying threshold of Metric 1 (#followers)", xaxis_title="Number of viral tweets labeled as viral", yaxis_title="Number of new tweets labeled as viral")
fig.show()

In [None]:
ready_to_plot = viral_dataset_with_users_truncated.copy()
ready_to_plot['viral'] = ready_to_plot['viral'].replace({False: None})
ready_to_plot = ready_to_plot.groupby(by='virality_followers').count()[['text', 'viral']].cumsum().rename(columns={'text':'tweets'})
ready_to_plot['tweets'] = len(viral_dataset_with_users) - ready_to_plot.tweets

In [None]:
fig = px.line(ready_to_plot, x='viral', y='tweets', hover_data=[ready_to_plot.index])#, log_y=True)

fig.update_layout(title_text="Line plot of #viral tweets labeled as viral vs # new tweets labeled as viral by varying threshold of Metric 1 (#followers)", xaxis_title="Number of viral tweets labeled as viral", yaxis_title="Number of new tweets labeled as viral")
fig.show()

In [None]:
'''
tempo3 = tempo2.copy()
tempo3['viral'] = tempo3['viral'].replace({False: None})
tempo3 = tempo3.groupby(by='virality_followers').count()[['text', 'viral']].rename(columns={'text':'tweets'})
tempo3['viral_cumsum'] = tempo3.viral.cumsum()
tempo3
'''

In [None]:
min_threshold = viral_dataset_with_users.virality_followers.min()
max_threshold = viral_dataset_with_users.virality_followers.max()
display(f"sampling from {min_threshold} to {max_threshold}")
thresholds_space = np.linspace(min_threshold, max_threshold, num=10000)

number_of_viral_tweets = len(viral_dataset_with_users[viral_dataset_with_users.viral]) 

percentages_of_viral_covered = []
nb_of_tweets_labeled_as_viral = []

for i in thresholds_space:
    new_tweets_labeled = viral_dataset_with_users[viral_dataset_with_users.virality_followers >= i]
    percentage_of_viral_covered = len(new_tweets_labeled[new_tweets_labeled.viral]) / number_of_viral_tweets
    nb_of_tweets_labeled_as_viral.append(len(new_tweets_labeled))
    percentages_of_viral_covered.append(percentage_of_viral_covered)

In [None]:
result_to_plot = pd.DataFrame({'percentage_of_viral_covered':percentages_of_viral_covered, 'nb_of_tweets_labeled_as_viral':nb_of_tweets_labeled_as_viral, 'thresholds': thresholds_space})

px.scatter(
 result_to_plot,
 x='percentage_of_viral_covered',
 y='nb_of_tweets_labeled_as_viral', log_y=True, hover_name='thresholds')

In [None]:
result_to_plot.to_csv('new_tweets_labeled_vs_percentage_of_viral.csv', index=False)
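
The loop above filters the whole frame once per threshold, which gets slow with 10000 thresholds. As a possible alternative, the same two curves can be computed from sorted values with `np.searchsorted`. This is a sketch under the assumption that the metric contains no `NaN`; the function name `coverage_curve` is hypothetical.

```python
import numpy as np

def coverage_curve(metric_values, is_viral, thresholds):
    """For each threshold t, return (share of viral tweets with metric >= t,
    number of tweets with metric >= t)."""
    metric_values = np.asarray(metric_values, dtype=float)
    is_viral = np.asarray(is_viral, dtype=bool)
    thresholds = np.asarray(thresholds, dtype=float)

    sorted_all = np.sort(metric_values)
    sorted_viral = np.sort(metric_values[is_viral])

    # In a sorted array, the number of values >= t is
    # len(array) - (index of the first value >= t).
    nb_labeled = len(sorted_all) - np.searchsorted(sorted_all, thresholds, side='left')
    nb_viral = len(sorted_viral) - np.searchsorted(sorted_viral, thresholds, side='left')
    return nb_viral / len(sorted_viral), nb_labeled

# Hypothetical toy data: four tweets, the last two are viral.
covered, labeled = coverage_curve(
    metric_values=[0.1, 0.5, 0.9, 2.0],
    is_viral=[False, False, True, True],
    thresholds=[0.0, 0.5, 1.0],
)
print(covered, labeled)
```

Sorting once and doing one binary search per threshold avoids building a filtered DataFrame in every iteration.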

#### 1.2.4 Comparing several metrics wrt distributions of viral tweets covered

In [None]:
def plot_distribution_for_metric(
    df, metric='virality_followers', num_experiments=1000, generate_thresholds_from_viral_quantiles=True, min_threshold=None, max_threshold=None, remove_duplicates=True, output_filename=None):
    viral_tweets = df[df.viral]
    number_of_viral_tweets = len(viral_tweets)

    if not generate_thresholds_from_viral_quantiles:
        # If not, generate a linear space of the thresholds between min and max of the metric values
        # Compare against None so that an explicit threshold of 0 is respected
        if min_threshold is None:
            min_threshold = df[metric].min()
        if max_threshold is None:
            max_threshold = df[metric].max()
        display(f"sampling from {min_threshold} to {max_threshold}")
        thresholds_space = np.linspace(min_threshold, max_threshold, num=num_experiments)
    else:
        # Take quantiles of metric for different percentages of viral tweets covered (from 0 to 100)
        thresholds_space = viral_tweets[metric].quantile([i / 100 for i in range(101)])
        display(f"sampling from {thresholds_space.min()} to {thresholds_space.max()}")

    percentages_of_viral_covered = []
    nb_of_tweets_labeled_as_viral = []

    # For each threshold, label as viral every tweet whose metric is >= the threshold
    for i in thresholds_space:
        new_tweets_labeled = df[df[metric] >= i]
        percentage_of_viral_covered = len(new_tweets_labeled[new_tweets_labeled.viral]) / number_of_viral_tweets
        nb_of_tweets_labeled_as_viral.append(len(new_tweets_labeled))
        percentages_of_viral_covered.append(percentage_of_viral_covered)

    results_to_plot = pd.DataFrame({
        f'percentage_of_viral_covered_{metric}':percentages_of_viral_covered,
        f'nb_of_tweets_labeled_as_viral_{metric}':nb_of_tweets_labeled_as_viral,
        f'thresholds_{metric}': thresholds_space})

    #if remove_duplicates:
    #    results_to_plot = results_to_plot.sort_values(by='nb_of_tweets_labeled_as_viral').drop_duplicates(subset=['percentage_of_viral_covered'], keep='first')

    # Discard rows where 100% of viral tweets are covered
    #results_to_plot = results_to_plot[results_to_plot.percentage_of_viral_covered < 1.0]
    # TODO: take min of 100% coverage

    fig = px.scatter(
        results_to_plot,
        x=f'percentage_of_viral_covered_{metric}',
        y=f'nb_of_tweets_labeled_as_viral_{metric}', hover_name=f'thresholds_{metric}')#log_y=True, trendline='ols'

    fig.update_layout(title_text=f"Percentage of viral covered vs new tweets labeled as viral according to varying metric {metric}")
    fig.show()

    display(f"Result length {len(results_to_plot)}")
    if not output_filename:
        output_filename = metric
    results_to_plot.to_csv(f'{output_filename}_viral_covered_vs_new_tweets_labeled.csv', index=False)

    return results_to_plot
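
To see what the quantile-based thresholds do, here is a minimal sketch on hypothetical toy data with only three quantiles. Using the q-th quantile of the viral tweets' metric as the threshold keeps roughly a share 1 - q of the viral tweets, plus every non-viral tweet that scores at least as high.

```python
import pandas as pd

# Hypothetical toy data: eight tweets, four of them viral.
toy = pd.DataFrame({
    'metric': [0.1, 0.2, 0.4, 0.5, 0.8, 1.0, 2.0, 4.0],
    'viral':  [False, False, True, False, True, True, False, True],
})
viral_values = toy.loc[toy['viral'], 'metric']

rows = []
for q in [0.0, 0.5, 1.0]:
    threshold = viral_values.quantile(q)
    labeled = toy[toy['metric'] >= threshold]
    rows.append({
        'quantile': q,
        'threshold': threshold,
        'viral_covered': labeled['viral'].sum() / len(viral_values),
        'tweets_labeled': len(labeled),
    })

curve = pd.DataFrame(rows)
print(curve)
```

At q = 0 the threshold is the lowest viral score, so every viral tweet is covered and six of the eight toy tweets end up labeled; at q = 1 only the top-scoring tweet remains.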

In [None]:
#viral_dataset_with_users = viral_dataset_with_users.groupby(by='virality_followers').count()
METRIC_1 = 'virality_followers'
viral_dataset_with_users = pd.read_parquet(f"{PROCESSED_PATH_VIRAL}/all_tweets.parquet.gzip")
# Applying the first metric (retweets / followers) on the viral dataset
viral_dataset_with_users[METRIC_1] = viral_dataset_with_users['retweet_count'] / viral_dataset_with_users['followers_count'].astype("float64")
# Handle division by zero if user has 0 followers
viral_dataset_with_users[METRIC_1] = viral_dataset_with_users[METRIC_1].replace({np.inf: 0.0})

In [None]:
df_1 = plot_distribution_for_metric(viral_dataset_with_users, metric='virality_followers', num_experiments=10000)

In [None]:
# Metric 2: retweet / user avg retweets
METRIC_2 = 'virality_avg_retweets'
viral_users_retweet_statistics = viral_dataset_with_users.groupby(by='author_id').retweet_count.agg(['min', 'mean', 'max', 'median'])
viral_users_retweet_statistics = viral_users_retweet_statistics.rename(columns={
 "min": "min_user_retweets", "max": "max_user_retweets", "mean": "mean_user_retweets", "median": "median_user_retweets"})

viral_dataset_with_users = viral_dataset_with_users.merge(viral_users_retweet_statistics, on='author_id')

viral_dataset_with_users[METRIC_2] = viral_dataset_with_users['retweet_count'] / viral_dataset_with_users['mean_user_retweets'].astype("float64")
# Handle division by zero if the user's mean retweet count is 0
viral_dataset_with_users[METRIC_2] = viral_dataset_with_users[METRIC_2].replace({np.inf: 0.0})

In [None]:
df_2 = plot_distribution_for_metric(viral_dataset_with_users, metric='virality_avg_retweets', num_experiments=10000)

In [None]:
# Metric 3: Minimum retweet count (Hard threshold)
METRIC_3 = 'retweet_count'

viral_tweets = viral_dataset_with_users[viral_dataset_with_users.viral]
min_viral_retweet_count = viral_tweets.retweet_count.min()
max_viral_retweet_count = viral_tweets.retweet_count.max()

df_3 = plot_distribution_for_metric(
 viral_dataset_with_users, metric=METRIC_3, num_experiments=10000,
 min_threshold=min_viral_retweet_count, max_threshold=max_viral_retweet_count, generate_thresholds_from_viral_quantiles=False,
 output_filename='hard_threshold')

In [None]:
# Metric 4 from Maldonado paper 'Virality Prediction for News Tweets Using RoBERTa'
def roberta_paper_metric(x):
    g = x['retweet_count'] + x['like_count']
    h = x['followers_count'] - x['following_count']
    A = 10

    r = max(x['retweet_count'], 1)
    f = max(x['like_count'], 1)
    w = max(x['followers_count'], 1)
    d = max(x['following_count'], 1)
    h = max(h, 1)

    num = g * d * (A * r + f)
    denom = w * r * (A * d + h)
    #if denom == 0:
    #    return 0
    return num / denom
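
Applying the function above row by row can be slow on a large frame. The sketch below is a column-wise version of the same formula as implemented in this notebook (not checked against the paper); the name `roberta_metric_vectorized` and the toy rows are assumptions made for illustration.

```python
import pandas as pd

def roberta_metric_vectorized(df, A=10):
    """Column-wise version of the row-wise metric defined in the notebook."""
    # Work in float64 so the products below cannot overflow 64-bit integers.
    retweets = df['retweet_count'].astype('float64')
    likes = df['like_count'].astype('float64')
    followers = df['followers_count'].astype('float64')
    following = df['following_count'].astype('float64')

    g = retweets + likes                        # raw engagement, not clipped
    r = retweets.clip(lower=1)
    f = likes.clip(lower=1)
    w = followers.clip(lower=1)
    d = following.clip(lower=1)
    h = (followers - following).clip(lower=1)

    return (g * d * (A * r + f)) / (w * r * (A * d + h))

# Hypothetical toy rows, including an all-zero account.
toy = pd.DataFrame({
    'retweet_count': [5, 0],
    'like_count': [10, 0],
    'followers_count': [100, 0],
    'following_count': [50, 0],
})
scores = roberta_metric_vectorized(toy)
print(scores)
```

Because every factor of the denominator is clipped to at least 1, the denominator is never zero, so no special case for division by zero is needed.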

In [None]:
METRIC_4 = 'roberta_paper_metric'
viral_dataset_with_users[METRIC_4] = viral_dataset_with_users.apply(lambda x: roberta_paper_metric(x), axis='columns')

df_4 = plot_distribution_for_metric(
 viral_dataset_with_users, metric=METRIC_4, num_experiments=100000)

In [None]:
METRIC_5 = 'virality_retweet_percentile_per_user'

# Take only tweets with positive retweet count, otherwise the quantiles will be very heavy-tailed
#tweets_with_retweets = viral_dataset_with_users[viral_dataset_with_users.retweet_count > 0]

viral_tweets = viral_dataset_with_users[viral_dataset_with_users.viral]
percentiles = [i/100 for i in range(101)]
number_of_viral_tweets = len(viral_tweets)

percentages_of_viral_covered = []
nb_of_tweets_labeled_as_viral = []

for i in tqdm(percentiles):
    temp = viral_dataset_with_users.groupby(by='author_id')[['retweet_count']].quantile(i).rename(columns={'retweet_count': f'percentile_{i}'})
    temp = viral_dataset_with_users.merge(temp, on='author_id')

    new_tweets_labeled = temp[temp['retweet_count'] >= temp[f'percentile_{i}']]
    percentage_of_viral_covered = len(new_tweets_labeled[new_tweets_labeled.viral]) / number_of_viral_tweets
    nb_of_tweets_labeled_as_viral.append(len(new_tweets_labeled))
    percentages_of_viral_covered.append(percentage_of_viral_covered)

df_5 = pd.DataFrame({
 f'percentage_of_viral_covered_{METRIC_5}':percentages_of_viral_covered,
 f'nb_of_tweets_labeled_as_viral_{METRIC_5}':nb_of_tweets_labeled_as_viral,
 f'thresholds_{METRIC_5}': percentiles})

fig = px.scatter(
 df_5,
 x=f'percentage_of_viral_covered_{METRIC_5}',
 y=f'nb_of_tweets_labeled_as_viral_{METRIC_5}', hover_name=f'thresholds_{METRIC_5}')#log_y=True, trendline='ols' 

fig.update_layout(title_text=f"Percentage of viral covered vs new tweets labeled as viral according to varying metric {METRIC_5}")
fig.show()

display(f"Result length {len(df_5)}")
df_5.to_csv(f'{METRIC_5}_viral_covered_vs_new_tweets_labeled.csv', index=False)
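
A related quantity can be computed in a single pass: the per-user percentile rank of each tweet. This is a sketch on hypothetical toy data, and it is not identical to the quantile thresholds used in the loop above (which interpolate between values), so results near a threshold may differ slightly.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    'author_id': ['a', 'a', 'a', 'a', 'b', 'b'],
    'retweet_count': [1, 2, 2, 4, 10, 30],
})

# With method='max' and pct=True, the rank is the share of the user's tweets
# whose retweet count is less than or equal to this tweet's retweet count.
toy['retweet_pct_rank'] = (
    toy.groupby('author_id')['retweet_count'].rank(method='max', pct=True)
)
print(toy)
```

A threshold on `retweet_pct_rank` could then be swept directly, without regrouping and merging for each percentile.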

In [None]:
# Metric 6: Median
METRIC_6 = 'virality_median_retweets'

positive_median_dataset = viral_dataset_with_users[viral_dataset_with_users['median_user_retweets'] > 0].copy()
positive_median_dataset.loc[:, METRIC_6] = positive_median_dataset['retweet_count'] / positive_median_dataset['median_user_retweets'].astype("float64")
# Handle remaining inf / NaN values (users with a median retweet count of 0 were already filtered out above)
positive_median_dataset.loc[:, METRIC_6] = positive_median_dataset[METRIC_6].replace({np.inf: 0.0, np.nan:0.0})

In [None]:
df_6 = plot_distribution_for_metric(
 positive_median_dataset, metric=METRIC_6, num_experiments=10000, remove_duplicates=True)

In [None]:
# log(retweet_counts) / followers_count
METRIC_7 = 'log_retweets_over_followers'

positive_retweet_and_follower_count = viral_dataset_with_users[(viral_dataset_with_users.retweet_count > 0) & (viral_dataset_with_users.followers_count > 0)].copy()

positive_retweet_and_follower_count.loc[:, METRIC_7] = (np.log(positive_retweet_and_follower_count['retweet_count']) / positive_retweet_and_follower_count['followers_count']).astype("float64")
positive_retweet_and_follower_count.loc[:, METRIC_7] = positive_retweet_and_follower_count[METRIC_7].replace({np.inf: 0.0, np.nan:0.0})

df_7 = plot_distribution_for_metric(
 positive_retweet_and_follower_count, metric=METRIC_7, num_experiments=10000, remove_duplicates=True)

In [None]:
METRIC_8 = 'retweets_over_log_followers'

positive_retweet_and_follower_count.loc[:, METRIC_8] = (positive_retweet_and_follower_count['retweet_count'] / np.log(positive_retweet_and_follower_count['followers_count'])).astype("float64")
positive_retweet_and_follower_count.loc[:, METRIC_8] = positive_retweet_and_follower_count[METRIC_8].replace({np.inf: 0.0, np.nan:0.0})

df_8 = plot_distribution_for_metric(
 positive_retweet_and_follower_count, metric=METRIC_8, num_experiments=10000, remove_duplicates=True)

In [None]:
METRIC_9 = 'log_retweets_over_log_followers'

positive_retweet_and_follower_count.loc[:, METRIC_9] = (np.log(positive_retweet_and_follower_count['retweet_count']) / np.log(positive_retweet_and_follower_count['followers_count'])).astype("float64")
positive_retweet_and_follower_count.loc[:, METRIC_9] = positive_retweet_and_follower_count[METRIC_9].replace({np.inf: 0.0, np.nan:0.0})

df_9 = plot_distribution_for_metric(
 positive_retweet_and_follower_count, metric=METRIC_9, num_experiments=10000, remove_duplicates=True)

In [None]:
final_result = pd.concat([df_1, df_2, df_3, df_4, df_5, df_6, df_7, df_8, df_9], axis=1)
final_result.to_csv('final_result_viral_coverage.csv')

### 1.3 Viral Dataset Exploration: Comparison between viral and non viral tweets using other features 

In [None]:
# TODO: Only take viral tweets from scraped. Since sentiment is already computed on the other dataset, we relabel dataset viral by checking if in scraped ids 
# (DONE)

In [None]:
viral_dataset_labeled = pd.read_parquet(f'{PROCESSED_PATH_VIRAL}/all_tweets.parquet.gzip')

In [None]:
display(f"{len(viral_dataset_labeled[viral_dataset_labeled.viral])} viral tweets out of {len(viral_dataset_labeled)}")

#### 1.3.1 - Language

In [None]:
languages_aggregates = viral_dataset_labeled.groupby(by='lang', as_index=False)[['id']].count().rename(columns={'id': 'count'})
languages_aggregates = languages_aggregates.sort_values(by='count', ascending=False)
languages_aggregates.loc[languages_aggregates['count'] < 10000, 'lang'] = 'Other Languages'
fig = px.pie(languages_aggregates, values='count', names='lang', title='Distribution of Tweets languages')

fig.update_layout(
 autosize=False,
 width=500,
 height=500
)

In [None]:
pd.crosstab(index = viral_dataset_labeled['lang'] == 'en', columns=viral_dataset_labeled['viral']) 

#### 1.3.2 - Media

In [None]:
# Has media
labels = ["Media", "No Media"]
viral_has_media = len(viral_dataset_labeled[(viral_dataset_labeled.viral == True) & (viral_dataset_labeled.has_media == True)])
viral_no_media = len(viral_dataset_labeled[(viral_dataset_labeled.viral == True) & (viral_dataset_labeled.has_media == False)])
normal_has_media = len(viral_dataset_labeled[(viral_dataset_labeled.viral == False) & (viral_dataset_labeled.has_media == True)])
normal_no_media = len(viral_dataset_labeled[(viral_dataset_labeled.viral == False) & (viral_dataset_labeled.has_media == False)])


# Create subplots: use 'domain' type for Pie subplot
fig = make_subplots(rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]])
fig.add_trace(go.Pie(labels=labels, values=[viral_has_media, viral_no_media], name="Viral with Media"),
 1, 1)
fig.add_trace(go.Pie(labels=labels, values=[normal_has_media, normal_no_media], name="Tweet with Media"),
 1, 2)

# Use `hole` to create a donut-like pie chart
fig.update_traces(hole=.4, hoverinfo="label+percent+name")

fig.update_layout(
 width=1000,
 height=500,
 title_text="Percentage of tweets with some kind of media",
 # Add annotations in the center of the donut pies.
 annotations=[dict(text='Viral', x=0.18, y=0.5, font_size=20, showarrow=False),
 dict(text='Non-Viral', x=0.82, y=0.5, font_size=20, showarrow=False)])
fig.show()

Calculating the p-value of a chi-squared test of independence between the target `viral` and `has_media`


In [None]:
from scipy.stats import chi2_contingency 

# Calculating the p-value
contingency_media = pd.crosstab(index = viral_dataset_labeled['has_media'], columns=viral_dataset_labeled['viral']) 
display(contingency_media)
# Display with percentages
display(pd.crosstab(index = viral_dataset_labeled['has_media'], columns=viral_dataset_labeled['viral'], normalize='columns') )

c, p, dof, expected = chi2_contingency(contingency_media) 
display(f'p-value {p}')
c, p, dof, expected

**Finding**: Viral tweets are more likely than non-viral tweets to have some kind of media (video, image, GIF...) embedded.
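
For intuition about what the chi-squared test compares, the sketch below computes the expected counts under independence and the uncorrected statistic by hand with NumPy. The 2x2 counts are made up for illustration and are not results from the dataset.

```python
import numpy as np

# Hypothetical 2x2 contingency table (rows: has_media False/True,
# columns: viral False/True). These counts are illustrative only.
observed = np.array([[30.0, 10.0],
                     [20.0, 40.0]])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 2)

# Under independence, each cell is expected to hold row_total * col_total / total.
expected = row_totals * col_totals / observed.sum()
chi2_stat = ((observed - expected) ** 2 / expected).sum()

print(expected)
print(chi2_stat)
```

For 2x2 tables, `scipy.stats.chi2_contingency` applies Yates' continuity correction by default, so its statistic will generally be slightly smaller than this uncorrected value.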

#### 1.3.2 - Context annotations (Topics)

In [None]:
viral_tweets_topic_domains = viral_dataset_labeled[viral_dataset_labeled.viral == True] \
 .explode('topic_domains') \
 .dropna(axis=0, subset=['topic_domains']) \
 .topic_domains 

tweets_topic_domains = viral_dataset_labeled[viral_dataset_labeled.viral == False] \
 .explode('topic_domains') \
 .dropna(axis=0, subset=['topic_domains']) \
 .topic_domains

viral_topics_domains_sorted = viral_tweets_topic_domains.groupby(viral_tweets_topic_domains).count().sort_values(ascending=False)
tweet_topics_domains_sorted = tweets_topic_domains.groupby(tweets_topic_domains).count().sort_values(ascending=False)
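
The explode-and-count pattern above can be tried on a small hypothetical frame. The domain ids below are placeholders, and `value_counts` is used as a shorter way to count and sort in one step.

```python
import pandas as pd

# Hypothetical toy data: `topic_domains` holds a list of domain ids or is missing.
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'topic_domains': [['10', '46'], None, ['46']],
})

domain_counts = (
    toy.explode('topic_domains')                # one row per (tweet, domain) pair
       .dropna(subset=['topic_domains'])        # drop tweets without annotations
       ['topic_domains']
       .value_counts()                          # counts, sorted in descending order
)
print(domain_counts)
```

Since a tweet can carry several domains, the counts are numbers of (tweet, domain) pairs, not numbers of tweets.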

In [None]:
import pickle

with open(f'{DATA_PATH}/topic_domains.pickle', 'rb') as handle:
    topic_domains = pickle.load(handle)

top_10_viral_topic_domains = viral_topics_domains_sorted[:10]
top_10_tweet_topic_domains = tweet_topics_domains_sorted[:10]

display(f"Top 10 topic domains in viral tweets: \n {[topic_domains.get(x)['name'] for x in top_10_viral_topic_domains.index.values]}")
display(f"Top 10 topic domains in general tweets: \n {[topic_domains.get(x)['name'] for x in top_10_tweet_topic_domains.index.values]}")

In [None]:
viral_labels = [topic_domains.get(x)['name'] for x in top_10_viral_topic_domains.index.values]
non_viral_labels = [topic_domains.get(x)['name'] for x in top_10_tweet_topic_domains.index.values]

# Create subplots: use 'domain' type for Pie subplot
fig2 = make_subplots(rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]])
fig2.add_trace(go.Pie(labels=viral_labels, values=top_10_viral_topic_domains.values, name="Viral Tweet Topic domain"),
 1, 1)
fig2.add_trace(go.Pie(labels=non_viral_labels, values=top_10_tweet_topic_domains.values, name="Non-Viral Tweet Topic domain"),
 1, 2)

# Use `hole` to create a donut-like pie chart
fig2.update_traces(hole=.4, hoverinfo="label+percent+name")

fig2.update_layout(
 width=1000,
 height=500,
 title_text="Top 10 topic domains for viral vs non-viral tweets",
 # Add annotations in the center of the donut pies.
 annotations=[dict(text='Viral', x=0.18, y=0.5, font_size=20, showarrow=False),
 dict(text='Non-Viral', x=0.82, y=0.5, font_size=20, showarrow=False)])
fig2.show()

#### 1.3.3 - Tweet Length

In [None]:
viral_dataset_labeled.loc[:, 'tweet_length'] = viral_dataset_labeled.text.apply(len)

In [None]:
display(viral_dataset_labeled[['tweet_length', 'retweet_count']].corr())

avg_tweet_length_viral = viral_dataset_labeled[viral_dataset_labeled.viral].tweet_length.mean()
avg_tweet_length_non_viral = viral_dataset_labeled[~viral_dataset_labeled.viral].tweet_length.mean()

display(f'viral avg tweet length: {avg_tweet_length_viral} \n non-viral avg tweet length: {avg_tweet_length_non_viral}')

Some tweets are replies to other tweets, so **mentions are automatically inserted at the beginning of the tweet**. These mentions do not count toward Twitter's maximum character count, so we strip them before measuring tweet length.
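
To make the idea concrete, here is a minimal sketch of stripping leading reply mentions with a regular expression. The function `strip_leading_mentions` is a hypothetical stand-in: the project's own `clear_reply_mentions` helper is not shown in this notebook and may behave differently.

```python
import re

# Hypothetical stand-in for the project's `clear_reply_mentions` helper,
# whose actual implementation is not shown in this notebook.
LEADING_MENTIONS = re.compile(r"^(?:\s*@\w+)+\s*")


def strip_leading_mentions(text: str) -> str:
    """Remove the run of @mentions at the very start of a tweet."""
    return LEADING_MENTIONS.sub("", text)


reply = "@alice @bob thanks for sharing, cc @carol"
stripped = strip_leading_mentions(reply)

print(stripped)                    # mentions later in the text are kept
print(len(reply), len(stripped))   # length before and after stripping
```

Only the mentions at the start of the text are removed; a mention in the middle of a tweet is part of what the author typed and still counts toward its length.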

In [None]:
viral_dataset_labeled.loc[:, "text"] = viral_dataset_labeled.text.apply(clear_reply_mentions)
viral_dataset_labeled.loc[:, 'tweet_length'] = viral_dataset_labeled.text.apply(len)

In [None]:
display(viral_dataset_labeled[['tweet_length', 'retweet_count']].corr())

avg_tweet_length_viral = viral_dataset_labeled[viral_dataset_labeled.viral].tweet_length.mean()
avg_tweet_length_non_viral = viral_dataset_labeled[~viral_dataset_labeled.viral].tweet_length.mean()

display(f'viral avg tweet length: {avg_tweet_length_viral} \n non-viral avg tweet length: {avg_tweet_length_non_viral}')

Calculating Welch's t-test (`scipy.stats.ttest_ind` with `equal_var=False`) for the continuous variable `tweet_length`
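
As a reminder of what this test computes, the sketch below derives Welch's t statistic and the Welch-Satterthwaite degrees of freedom by hand, using only the standard library. The two samples are toy numbers, not values from the dataset, and `welch_t` is a name introduced here for illustration.

```python
import math
import statistics


def welch_t(a, b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    va, vb = statistics.variance(a), statistics.variance(b)  # sample variances
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb  # squared standard error of the mean difference
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


# Toy samples (not taken from the dataset)
group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10]

t_stat, dof = welch_t(group_a, group_b)
print(t_stat, dof)
```

Unlike Student's t-test, this version does not assume the two groups have equal variance, which is why the degrees of freedom are generally not an integer.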

In [None]:
from scipy.stats import ttest_ind

ttest_ind(viral_dataset_labeled[viral_dataset_labeled.viral].tweet_length, viral_dataset_labeled[~viral_dataset_labeled.viral].tweet_length, equal_var=False)

#### 1.3.4 - Sentiment 

For the sentiment analysis, we use Hugging Face's [default sentiment analysis model](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english?text=I+like+you.+I+love+you). We instantiate a Hugging Face pipeline with that default model and pass it the text of each tweet; it outputs a **label** (e.g. POSITIVE, NEGATIVE) alongside a **confidence score**. This is only applied to English tweets.

**NOTE**: Feel free to skip the following cells if you already have the processed data. Sentiment analysis takes some time (around 2 hours on the whole data). 

In [None]:
from transformers import pipeline

# device=0 runs the model on the CUDA GPU at index 0
sentiment_classifier = pipeline("sentiment-analysis", device=0)

This will only be applied to **English tweets**. All the viral tweets we scraped are in English, so we won't lose any viral data when filtering.

In [None]:
english_viral_dataset = viral_dataset_labeled[viral_dataset_labeled.lang == 'en']
english_viral_dataset

Here we use the pandas `apply` function with `result_type='expand'`, so that the sentiment label and score are written to separate columns.
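
The sketch below illustrates the expansion on a two-row toy frame. `fake_classifier` is a hypothetical stand-in that only mimics the output shape of the pipeline (a list containing one dict with `label` and `score`), so the example runs without downloading a model.

```python
import pandas as pd


def fake_classifier(text):
    """Stand-in with the same output shape as the pipeline: a list with one dict."""
    label = "POSITIVE" if "love" in text else "NEGATIVE"
    return [{"label": label, "score": 0.99}]


tweets = pd.DataFrame({"text": ["I love this", "this is awful"]})

# Each row returns a dict; result_type="expand" turns the dict keys into columns
expanded = tweets.apply(
    lambda row: fake_classifier(row.text)[0], axis=1, result_type="expand"
)

# Attach the new columns to the original rows and give them clearer names
with_sentiment = pd.concat([tweets, expanded], axis=1).rename(
    columns={"label": "sentiment", "score": "sentiment_score"}
)
print(with_sentiment)
```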

In [None]:
applied = english_viral_dataset.apply(lambda x: sentiment_classifier(x.text)[0], axis=1, result_type='expand')
#pd.concat([small_test_set, applied], axis='columns')
applied

In [None]:
sentiment_features = pd.concat([english_viral_dataset, applied], axis=1)
sentiment_features

In [None]:
sentiment_features = sentiment_features.rename(columns={"label": "sentiment", "score": "sentiment_score"})

In [None]:
sentiment_features.to_parquet(f"{PROCESSED_PATH_VIRAL}/all_english_tweets_with_users_with_sentiment.parquet.gzip", index=False, compression="gzip")

Load the already-processed data

In [None]:
sentiment_features = pd.read_parquet(f"{PROCESSED_PATH_VIRAL}/all_english_tweets_with_users_with_sentiment.parquet.gzip")
display(f"{len(sentiment_features[sentiment_features.viral])} viral tweets out of {len(sentiment_features)}")

In [None]:
# Tweets with sentiment scores over 70%
display(f"Tweets with sentiment analysis confidence scores above 0.7: {len(sentiment_features[sentiment_features.sentiment_score > 0.7])}")
display(f"{len(sentiment_features[sentiment_features.sentiment == 'POSITIVE'])} positive tweets")
display(f"{len(sentiment_features[sentiment_features.sentiment == 'NEGATIVE'])} negative tweets")

confident_sentiment_tweets = sentiment_features[sentiment_features.sentiment_score > 0.7]

In [None]:
# Optional filter (currently disabled): keep only tweets with at least one retweet, since tweets with zero retweets are of little use here.
#retweeted_tweets = confident_sentiment_tweets[confident_sentiment_tweets.retweet_count > 0]

labels = ["Positive", "Negative"]
viral_positive = len(confident_sentiment_tweets[(confident_sentiment_tweets.viral == True) & (confident_sentiment_tweets.sentiment == 'POSITIVE')])
viral_negative = len(confident_sentiment_tweets[(confident_sentiment_tweets.viral == True) & (confident_sentiment_tweets.sentiment == 'NEGATIVE')])
normal_positive = len(confident_sentiment_tweets[(confident_sentiment_tweets.viral == False) & (confident_sentiment_tweets.sentiment == 'POSITIVE')])
normal_negative = len(confident_sentiment_tweets[(confident_sentiment_tweets.viral == False) & (confident_sentiment_tweets.sentiment == 'NEGATIVE')])


# Create subplots: use 'domain' type for Pie subplot
fig = make_subplots(rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]])
fig.add_trace(go.Pie(labels=labels, values=[viral_positive, viral_negative], name="Viral Tweets"),
 1, 1)
fig.add_trace(go.Pie(labels=labels, values=[normal_positive, normal_negative], name="Non-Viral Tweets"),
 1, 2)

# Use `hole` to create a donut-like pie chart
fig.update_traces(hole=.4, hoverinfo="label+percent+name")

fig.update_layout(
 width=1000,
 height=500,
 title_text="Distribution of positive and negative sentiment in viral vs non-viral tweets",
 # Add annotations in the center of the donut pies.
 annotations=[dict(text='Viral', x=0.18, y=0.5, font_size=20, showarrow=False),
 dict(text='Non-Viral', x=0.82, y=0.5, font_size=20, showarrow=False)])
fig.show()

Calculating the p-value between the target `viral` and positive sentiment
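
For intuition, the chi-square test of independence on a 2x2 table can be worked out by hand. The sketch below uses toy counts rather than the notebook's data, and `chi2_2x2` is a name introduced here. Note that `scipy.stats.chi2_contingency` applies Yates' continuity correction by default when the table has one degree of freedom, so the `yates=True` variant is the one comparable to its output.

```python
import math


def chi2_2x2(table, yates=True):
    """Chi-square test of independence for a 2x2 table of counts.

    Returns (statistic, p_value). With one degree of freedom the p-value
    equals erfc(sqrt(statistic / 2)).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            diff = abs(table[i][j] - expected)
            if yates:
                # Yates' continuity correction: shrink each deviation by up to 0.5
                diff = max(0.0, diff - 0.5)
            statistic += diff ** 2 / expected
    return statistic, math.erfc(math.sqrt(statistic / 2))


# Toy counts: rows = sentiment (NEGATIVE, POSITIVE), columns = viral (False, True)
toy_table = [[10, 20], [30, 40]]

stat_plain, p_plain = chi2_2x2(toy_table, yates=False)
stat_yates, p_yates = chi2_2x2(toy_table, yates=True)
print(stat_plain, p_plain)
print(stat_yates, p_yates)
```

A small p-value means the observed counts deviate from what independence between sentiment and virality would predict; it says nothing about the direction or size of the effect, which is why the normalized crosstab is displayed as well.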


In [None]:
from scipy.stats import chi2_contingency 

#confident_sentiment_tweets.loc[:, 'is_positive'] = confident_sentiment_tweets.sentiment == 'POSITIVE'

# Calculating the p-value
contingency_sentiment = pd.crosstab(index = confident_sentiment_tweets['sentiment'], columns=confident_sentiment_tweets['viral']) 
# Display with percentages
contingency_sentiment_normalized_percentage = pd.crosstab(
 index = confident_sentiment_tweets['sentiment'], columns=confident_sentiment_tweets['viral'], normalize='columns') 
display(contingency_sentiment_normalized_percentage)

c, p, dof, expected = chi2_contingency(contingency_sentiment) 
display(f'p-value {p}')
c, p, dof, expected

Calculating the p-value between the target `viral` and negative sentiment


In [None]:
from scipy.stats import chi2_contingency 

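# Use assign() instead of .loc on a filtered slice, which can trigger a SettingWithCopyWarning
confident_sentiment_tweets = confident_sentiment_tweets.assign(is_negative=confident_sentiment_tweets.sentiment == 'NEGATIVE')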

# Calculating the p-value
contingency_negative_sentiment = pd.crosstab(index = confident_sentiment_tweets['is_negative'], columns=confident_sentiment_tweets['viral']) 
# Display with percentages
contingency_negative_sentiment_normalized_percentage = pd.crosstab(
 index = confident_sentiment_tweets['is_negative'], columns=confident_sentiment_tweets['viral'], normalize='columns') 
display(contingency_negative_sentiment_normalized_percentage)

c, p, dof, expected = chi2_contingency(contingency_negative_sentiment) 
display(f'p-value {p}')
c, p, dof, expected

In [None]:
'''
import spacy
import vaderSentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

nlp = spacy.load("en_core_web_sm")

spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
print('Number of stop words: %d' % len(spacy_stopwords))
print('First ten stop words:',list(spacy_stopwords)[:10])
'''

In [None]:
'''
# Remove new lines 
remove_new_lines = lambda x: " ".join(x.split())
viral_dataset_labeled['processed_text'] = viral_dataset_labeled['text'].apply(remove_new_lines)


english_tweets = viral_dataset_labeled[viral_dataset_labeled.lang == 'en']
'''

#### 1.3.5 - Number of hashtags 

In [None]:
# Tweets without hashtags have None in this column, so count them as 0
viral_dataset_labeled.loc[:, "nb_of_hashtags"] = viral_dataset_labeled.hashtags.apply(lambda x: len(x) if x is not None else 0)

In [None]:
labels = ["Hashtags", "No Hashtags"]
viral_has_hashtags = len(viral_dataset_labeled[(viral_dataset_labeled.viral) & (viral_dataset_labeled.nb_of_hashtags >= 1)])
viral_no_hashtags = len(viral_dataset_labeled[(viral_dataset_labeled.viral) & (viral_dataset_labeled.nb_of_hashtags == 0)])
normal_has_hashtags = len(viral_dataset_labeled[(~viral_dataset_labeled.viral) & (viral_dataset_labeled.nb_of_hashtags >= 1)])
normal_no_hashtags = len(viral_dataset_labeled[(~viral_dataset_labeled.viral) & (viral_dataset_labeled.nb_of_hashtags == 0)])


# Create subplots: use 'domain' type for Pie subplot
fig = make_subplots(rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]])
fig.add_trace(go.Pie(labels=labels, values=[viral_has_hashtags, viral_no_hashtags], name="Viral Tweets"),
 1, 1)
fig.add_trace(go.Pie(labels=labels, values=[normal_has_hashtags, normal_no_hashtags], name="Non-Viral Tweets"),
 1, 2)

# Use `hole` to create a donut-like pie chart
fig.update_traces(hole=.4, hoverinfo="label+percent+name")

fig.update_layout(
 width=1000,
 height=500,
 title_text="Percentage of tweets with hashtags",
 # Add annotations in the center of the donut pies.
 annotations=[dict(text='Viral', x=0.18, y=0.5, font_size=20, showarrow=False),
 dict(text='Non-Viral', x=0.82, y=0.5, font_size=20, showarrow=False)])
fig.show()

Calculating the p-value between the target `viral` and `has_hashtags`


In [None]:
from scipy.stats import chi2_contingency 

viral_dataset_labeled['has_hashtags'] = viral_dataset_labeled.nb_of_hashtags >= 1

# Calculating the p-value
contingency_has_hashtags = pd.crosstab(index = viral_dataset_labeled['has_hashtags'], columns=viral_dataset_labeled['viral']) 
# Display with percentages
contingency_has_hashtags_normalized_percentage = pd.crosstab(
 index = viral_dataset_labeled['has_hashtags'], columns=viral_dataset_labeled['viral'], normalize='columns') 
display(contingency_has_hashtags_normalized_percentage)

c, p, dof, expected = chi2_contingency(contingency_has_hashtags) 
display(f'p-value {p}')
c, p, dof, expected

#### 1.3.6 - Verified account

In [None]:
# Verified account
labels = ["Verified", "Not verified"]
viral_is_verified = len(viral_dataset_labeled[(viral_dataset_labeled.viral) & (viral_dataset_labeled.verified)])
viral_not_verified = len(viral_dataset_labeled[(viral_dataset_labeled.viral) & (~viral_dataset_labeled.verified)])
normal_is_verified = len(viral_dataset_labeled[(~viral_dataset_labeled.viral) & (viral_dataset_labeled.verified)])
normal_not_verified = len(viral_dataset_labeled[(~viral_dataset_labeled.viral) & (~viral_dataset_labeled.verified)])


# Create subplots: use 'domain' type for Pie subplot
fig = make_subplots(rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]])
fig.add_trace(go.Pie(labels=labels, values=[viral_is_verified, viral_not_verified], name="Viral Tweets"),
 1, 1)
fig.add_trace(go.Pie(labels=labels, values=[normal_is_verified, normal_not_verified], name="Non-Viral Tweets"),
 1, 2)

# Use `hole` to create a donut-like pie chart
fig.update_traces(hole=.4, hoverinfo="label+percent+name")

fig.update_layout(
 width=1000,
 height=500,
 title_text="Percentage of tweets from verified accounts",
 # Add annotations in the center of the donut pies.
 annotations=[dict(text='Viral', x=0.18, y=0.5, font_size=20, showarrow=False),
 dict(text='Non-Viral', x=0.82, y=0.5, font_size=20, showarrow=False)])
fig.show()

Calculating the p-value between the target `viral` and `is_verified`


In [None]:
from scipy.stats import chi2_contingency 

# Calculating the p-value
contingency_verified = pd.crosstab(index = viral_dataset_labeled['verified'], columns=viral_dataset_labeled['viral']) 
# Display with percentages
contingency_verified_normalized_percentage = pd.crosstab(
 index = viral_dataset_labeled['verified'], columns=viral_dataset_labeled['viral'], normalize='columns') 
display(contingency_verified_normalized_percentage)

c, p, dof, expected = chi2_contingency(contingency_verified) 
display(f'p-value {p}')
c, p, dof, expected

#### 1.3.7 - Has mentions

In [None]:
# Tweets without mentions have None in this column, so count them as 0
viral_dataset_labeled.loc[:, "nb_of_mentions"] = viral_dataset_labeled.mentions.apply(lambda x: len(x) if x is not None else 0)

In [None]:
from scipy.stats import chi2_contingency 

# Calculating the p-value
contingency_has_mentions = pd.crosstab(index = viral_dataset_labeled['nb_of_mentions'] > 0, columns=viral_dataset_labeled['viral']) 
display(contingency_has_mentions)
# Display with percentages
display(pd.crosstab(index = viral_dataset_labeled['nb_of_mentions'] > 0, columns=viral_dataset_labeled['viral'], normalize='columns') )

c, p, dof, expected = chi2_contingency(contingency_has_mentions) 
display(f'p-value {p}')
c, p, dof, expected

#### 1.3.8 - Save result of preprocessing to disk

In [None]:
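# NOTE: this is the same path the sentiment features were written to in 1.3.4, so this call overwrites that file
viral_dataset_labeled.to_parquet(f'{PROCESSED_PATH_VIRAL}/all_english_tweets_with_users_with_sentiment.parquet.gzip', index=False, compression="gzip")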

In [None]:
viral_dataset_labeled.columns


### 1.4 - Covid dataset Exploration

Here we concern ourselves only with original tweets (no retweets).

In [None]:
original_covid_tweets = pd.read_parquet(f"{COVID_TWEETS_PATH}/all_original_tweets.parquet.gzip")
original_covid_tweets.loc[:, "text"] = original_covid_tweets.text.apply(clear_reply_mentions)

covid_users = pd.read_parquet(f"{COVID_TWEETS_PATH}/users.parquet.gzip")

display("--- COVID DATASET ---")

display(f"{len(original_covid_tweets)} original (non-retweet) covid tweets collected")
display(f"{len(original_covid_tweets.author_id.unique())} covid users collected")

original_covid_tweets

In [None]:
user_columns = ['author_id', 'followers_count', 'following_count', 'tweet_count', 'protected', 'verified', 'username']
covid_dataset_with_users = original_covid_tweets.merge(covid_users.rename(columns={'id': 'author_id'})[user_columns], on='author_id')

In [None]:
# Applying the first metric (retweets per follower) to the covid dataset
covid_dataset_with_users['virality_followers'] = covid_dataset_with_users['retweet_count'] / covid_dataset_with_users['followers_count'].astype("float64")
# Handle division by zero if the user has 0 followers: n/0 gives inf and 0/0 gives NaN
covid_dataset_with_users['virality_followers'] = covid_dataset_with_users.virality_followers.replace([np.inf, -np.inf], np.nan).fillna(0.0)

In [None]:
covid_dataset_with_users

In [None]:
px.histogram(covid_dataset_with_users, x='followers_count', y = 'virality_followers', log_y=True)

In [None]:
covid_dataset_with_users['viral'] = covid_dataset_with_users.virality_followers > 1
covid_dataset_with_users[covid_dataset_with_users.viral]
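
The retweets-per-follower labeling above can be checked on a toy frame. This is a minimal sketch with made-up counts; it shows the two division-by-zero cases for accounts with zero followers (a positive count over zero gives `inf`, zero over zero gives `NaN`) and one way to map both to 0 before applying the threshold.

```python
import numpy as np
import pandas as pd

# Made-up counts: two accounts with zero followers, two regular accounts
toy = pd.DataFrame({
    "retweet_count": [5, 0, 3, 1],
    "followers_count": [0, 0, 2, 10],
})

ratio = toy["retweet_count"] / toy["followers_count"].astype("float64")

# 5/0 gives inf and 0/0 gives NaN; treat both as "no measurable virality"
toy["virality_followers"] = ratio.replace([np.inf, -np.inf], np.nan).fillna(0.0)

# A tweet is labeled viral when it has more retweets than its author has followers
toy["viral"] = toy["virality_followers"] > 1
print(toy)
```

Mapping zero-follower accounts to 0 is a modeling choice: it keeps them out of the viral class regardless of how many retweets they received.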

#### 1.4.1 - Language

In [None]:
languages_aggregates = covid_dataset_with_users.groupby(by='lang', as_index=False)[['id']].count().rename(columns={'id': 'count'})
languages_aggregates = languages_aggregates.sort_values(by='count', ascending=False)
languages_aggregates.loc[languages_aggregates['count'] < 10000, 'lang'] = 'Other Languages'
fig = px.pie(languages_aggregates, values='count', names='lang', title='Distribution of Tweets languages')

fig.update_layout(
 autosize=False,
 width=500,
 height=500
)

In [None]:
english_covid_tweets = covid_dataset_with_users[covid_dataset_with_users.lang == 'en']
display(f"{len(english_covid_tweets)} english covid tweets")

english_viral_covid_tweets = english_covid_tweets[english_covid_tweets.viral]
display(f"{len(english_viral_covid_tweets)} viral english covid tweets")

#### 1.4.2 - Media

In [None]:
# Has media
labels = ["Media", "No Media"]
viral_has_media = len(covid_dataset_with_users[(covid_dataset_with_users.viral == True) & (covid_dataset_with_users.has_media == True)])
viral_no_media = len(covid_dataset_with_users[(covid_dataset_with_users.viral == True) & (covid_dataset_with_users.has_media == False)])
normal_has_media = len(covid_dataset_with_users[(covid_dataset_with_users.viral == False) & (covid_dataset_with_users.has_media == True)])
normal_no_media = len(covid_dataset_with_users[(covid_dataset_with_users.viral == False) & (covid_dataset_with_users.has_media == False)])


# Create subplots: use 'domain' type for Pie subplot
fig = make_subplots(rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]])
fig.add_trace(go.Pie(labels=labels, values=[viral_has_media, viral_no_media], name="Viral Tweets"),
 1, 1)
fig.add_trace(go.Pie(labels=labels, values=[normal_has_media, normal_no_media], name="Non-Viral Tweets"),
 1, 2)

# Use `hole` to create a donut-like pie chart
fig.update_traces(hole=.4, hoverinfo="label+percent+name")

fig.update_layout(
 width=1000,
 height=500,
 title_text="Percentage of tweets with some kind of media",
 # Add annotations in the center of the donut pies.
 annotations=[dict(text='Viral', x=0.18, y=0.5, font_size=20, showarrow=False),
 dict(text='Non-Viral', x=0.82, y=0.5, font_size=20, showarrow=False)])
fig.show()

#### 1.4.3 - Tweet Length

In [None]:
covid_dataset_with_users.loc[:, 'tweet_length'] = covid_dataset_with_users.text.apply(len)
covid_dataset_with_users[['tweet_length', 'retweet_count']].corr()

#### 1.4.4 - Sentiment

In [None]:
from transformers import pipeline

# device=0 runs the model on the CUDA GPU at index 0
sentiment_classifier = pipeline("sentiment-analysis", device=0)

english_covid_dataset = covid_dataset_with_users[covid_dataset_with_users.lang == 'en']
english_covid_dataset

Here we compute the sentiments again, this time for the covid dataset. Since this is slow, we have already computed them and saved the result to parquet, so feel free to skip the next 2 cells and load the saved file instead.
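
One way to avoid skipping cells by hand is a small "load if cached, otherwise compute and save" helper. The sketch below is an assumption about how this could be organized, not something the notebook does; `load_or_compute` and `slow_computation` are hypothetical names, and the demo uses JSON in a temporary directory so it runs without the real data.

```python
import json
import tempfile
from pathlib import Path


def load_or_compute(path, compute, load, save):
    """Return load(path) if the file exists; otherwise compute, save and return."""
    path = Path(path)
    if path.exists():
        return load(path)
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    save(result, path)
    return result


def load_json(path):
    return json.loads(path.read_text())


def save_json(obj, path):
    path.write_text(json.dumps(obj))


calls = []  # records how many times the slow step actually runs


def slow_computation():
    calls.append(1)
    return {"label": "POSITIVE", "score": 0.99}


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "sentiment.json"
    first = load_or_compute(cache_file, slow_computation, load_json, save_json)
    second = load_or_compute(cache_file, slow_computation, load_json, save_json)

print(first == second, len(calls))
```

In this notebook, `load` could be `pd.read_parquet` and `save` a small function calling `df.to_parquet(path, index=False, compression="gzip")`, with `compute` wrapping the two sentiment cells.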

In [None]:
applied = english_covid_dataset.apply(lambda x: sentiment_classifier(x.text)[0], axis=1, result_type='expand')
#pd.concat([small_test_set, applied], axis='columns')
applied

In [None]:
sentiment_features = pd.concat([english_covid_dataset, applied], axis=1)
sentiment_features = sentiment_features.rename(columns={"label": "sentiment", "score": "sentiment_score"})

In [None]:
sentiment_features = pd.read_parquet(f"{PROCESSED_PATH_COVID}/english_tweets_with_users_with_sentiment.parquet.gzip")
sentiment_features

In [None]:
# Tweets with sentiment scores over 70%
display(f"Tweets with sentiment analysis confidence scores above 0.7: {len(sentiment_features[sentiment_features.sentiment_score > 0.7])}")
display(f"{len(sentiment_features[sentiment_features.sentiment == 'POSITIVE'])} positive tweets")
display(f"{len(sentiment_features[sentiment_features.sentiment == 'NEGATIVE'])} negative tweets")

confident_sentiment_tweets = sentiment_features[sentiment_features.sentiment_score > 0.7]

In [None]:
# Count positive and negative tweets among viral and non-viral tweets
labels = ["Positive", "Negative"]
viral_positive = len(confident_sentiment_tweets[(confident_sentiment_tweets.viral == True) & (confident_sentiment_tweets.sentiment == 'POSITIVE')])
viral_negative = len(confident_sentiment_tweets[(confident_sentiment_tweets.viral == True) & (confident_sentiment_tweets.sentiment == 'NEGATIVE')])
normal_positive = len(confident_sentiment_tweets[(confident_sentiment_tweets.viral == False) & (confident_sentiment_tweets.sentiment == 'POSITIVE')])
normal_negative = len(confident_sentiment_tweets[(confident_sentiment_tweets.viral == False) & (confident_sentiment_tweets.sentiment == 'NEGATIVE')])


# Create subplots: use 'domain' type for Pie subplot
fig = make_subplots(rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]])
fig.add_trace(go.Pie(labels=labels, values=[viral_positive, viral_negative], name="Viral Tweets"),
 1, 1)
fig.add_trace(go.Pie(labels=labels, values=[normal_positive, normal_negative], name="Non-Viral Tweets"),
 1, 2)

# Use `hole` to create a donut-like pie chart
fig.update_traces(hole=.4, hoverinfo="label+percent+name")

fig.update_layout(
 width=1000,
 height=500,
 title_text="Distribution of positive and negative sentiment in viral vs non-viral tweets",
 # Add annotations in the center of the donut pies.
 annotations=[dict(text='Viral', x=0.18, y=0.5, font_size=20, showarrow=False),
 dict(text='Non-Viral', x=0.82, y=0.5, font_size=20, showarrow=False)])
fig.show()

#### 1.4.5 - Number of Hashtags

In [None]:
# Tweets without hashtags have None in this column, so count them as 0
covid_dataset_with_users.loc[:, "nb_of_hashtags"] = covid_dataset_with_users.hashtags.apply(lambda x: len(x) if x is not None else 0)

In [None]:
labels = ["Hashtags", "No Hashtags"]
viral_has_hashtags = len(covid_dataset_with_users[(covid_dataset_with_users.viral) & (covid_dataset_with_users.nb_of_hashtags >= 1)])
viral_no_hashtags = len(covid_dataset_with_users[(covid_dataset_with_users.viral) & (covid_dataset_with_users.nb_of_hashtags == 0)])
normal_has_hashtags = len(covid_dataset_with_users[(~covid_dataset_with_users.viral) & (covid_dataset_with_users.nb_of_hashtags >= 1)])
normal_no_hashtags = len(covid_dataset_with_users[(~covid_dataset_with_users.viral) & (covid_dataset_with_users.nb_of_hashtags == 0)])


# Create subplots: use 'domain' type for Pie subplot
fig = make_subplots(rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]])
fig.add_trace(go.Pie(labels=labels, values=[viral_has_hashtags, viral_no_hashtags], name="Viral Tweets"),
 1, 1)
fig.add_trace(go.Pie(labels=labels, values=[normal_has_hashtags, normal_no_hashtags], name="Non-Viral Tweets"),
 1, 2)

# Use `hole` to create a donut-like pie chart
fig.update_traces(hole=.4, hoverinfo="label+percent+name")

fig.update_layout(
 width=1000,
 height=500,
 title_text="Percentage of tweets with hashtags",
 # Add annotations in the center of the donut pies.
 annotations=[dict(text='Viral', x=0.18, y=0.5, font_size=20, showarrow=False),
 dict(text='Non-Viral', x=0.82, y=0.5, font_size=20, showarrow=False)])
fig.show()

#### 1.4.6 - Verified Account

In [None]:
# Verified account
labels = ["Verified", "Not verified"]
viral_is_verified = len(covid_dataset_with_users[(covid_dataset_with_users.viral) & (covid_dataset_with_users.verified)])
viral_not_verified = len(covid_dataset_with_users[(covid_dataset_with_users.viral) & (~covid_dataset_with_users.verified)])
normal_is_verified = len(covid_dataset_with_users[(~covid_dataset_with_users.viral) & (covid_dataset_with_users.verified)])
normal_not_verified = len(covid_dataset_with_users[(~covid_dataset_with_users.viral) & (~covid_dataset_with_users.verified)])


# Create subplots: use 'domain' type for Pie subplot
fig = make_subplots(rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]])
fig.add_trace(go.Pie(labels=labels, values=[viral_is_verified, viral_not_verified], name="Viral Tweets"),
 1, 1)
fig.add_trace(go.Pie(labels=labels, values=[normal_is_verified, normal_not_verified], name="Non-Viral Tweets"),
 1, 2)

# Use `hole` to create a donut-like pie chart
fig.update_traces(hole=.4, hoverinfo="label+percent+name")

fig.update_layout(
 width=1000,
 height=500,
 title_text="Percentage of tweets from verified accounts",
 # Add annotations in the center of the donut pies.
 annotations=[dict(text='Viral', x=0.18, y=0.5, font_size=20, showarrow=False),
 dict(text='Non-Viral', x=0.82, y=0.5, font_size=20, showarrow=False)])
fig.show()

#### 1.4.7 - Save dataframe with analysis to disk

In [None]:
covid_dataset_with_users.to_parquet(f'{PROCESSED_PATH_COVID}/all_english_tweets_with_users_with_sentiment.parquet.gzip', index=False, compression="gzip")

Questions for TJ:

- Learn threshold? Use unsupervised learning (anomaly detection), x axis date y retweet count, isolation coordinate
- Ratio
- Try to come up with different metrics (one cannot be used for second dataset)

Preprocessing:
 - Remove tweets with no retweets or likes? NO
 - Define threshold using the metric? DONE (label above viral tweet)
 - Skewed distribution if we use only Twitter viral tweets (1000) DONE
- Which features? (Any new ideas)
 - Topic
 - Hashtags relevant? (Most likely different from coronavirus and we already have topics).
 - Has media
 - Sentiment? [TODO]
 - Tweet length [TODO]
 - RETRIEVE USERS THAT LIKED OR RETWEETED USING API [TODO]
 - Word cloud of entities [TODO]
- Check bigrams and trigrams distribution
- Normalize features (like, retweets, reply etc...)? DEPENDS, Included in first model, will be removed from second model with covid set.
- BertTweet [DO NOT REMOVE STOP WORDS FOR LANGUAGE MODELS, FOR ]
- Next steps (now that data collection part is done and data analysis almost done)
 - Hydrate Covid dataset id
- Viral generator (Trump generator)

- 1st classifier: hashtags, Twitter entities (context annotations, domain annotations, entities), mentions, domain of URLs (e.g. youtube.com)
- 2nd classifier: bag of words with tf-idf, remove stopwords and other entities that were used in the 1st classifier
- 3rd classifier: language model
