---
title: Sentimentr
desc: A tool to visualize bias in news headlines about presidential candidates
published: true
date_published: 2020-01-12
tags: nlp
figs:
  linear:
    src: /assets/article-count-raw-and-opinion.png
    alt: Linear Graph of Article Counts
    caption: Bar plots of how many articles had a given candidate's name in it. Top is a raw count of total articles. Bottom separates it by news group.
    full_width: yes
    left: no
  log:
    src: /assets/article-count-log.png
    alt: Logarithmic Graph of Article Counts
    caption: Logarithmic barplots of how many articles had a given candidate's name in it. Blue represents CNN, orange is Fox News, and green is The New York Times.
    full_width: yes
    left: no
  bar:
    src: /assets/news-barplot-average-scores.png
    alt: Sentiment scores over time
    caption: Bar plots of average sentiment scores separated by model and candidate.
    left: no
    full_width: yes
  avg_over_time:
    src: /assets/average-4wk-scores-over-time-top-6.png
    alt: Sentiment scores over time
    caption: Line plots of sentiment scores separated by model and candidate.
    left: no
    full_width: yes
  avg_w_debates:
    src: /assets/average-4wk-scores-over-time-top-6-with-debates.png
    alt: Sentiment scores over time
    caption: Line plots of sentiment scores separated by model and candidate with debates superimposed over.
    left: no
    full_width: yes
---

<script async src="//embedr.flickr.com/assets/client-code.js" charset="utf-8"></script>
<div class="figure d-flex flex-column align-items-center my-4">
  <a  data-flickr-embed="true" href="https://www.flickr.com/photos/janitors/30280548214" title="2016 U.S. presidential election party, Riga, Latvia"><img src="https://live.staticflickr.com/5523/30280548214_2810c4f91f.jpg"
         width="500" 
         height="334" 
         alt="_DSC4896"
         class="figure-img img-fluid rounded">
  </a>
  <figcaption class="figure-caption mt-2 text-center">Click the photo to view photo credits.</figcaption>
</div>


With the presidential primaries just around the corner, I thought it would be interesting to see whether news outlets show a consistent bias toward one candidate or another.  Could I show quantitatively that Fox News runs more favorable headlines about Trump and that CNN does the opposite?

The ideal news source is unbiased and does not focus all of its attention on one candidate; however, we live in a time when "fake news" has entered the everyday vernacular. Unfortunately, scorn runs both ways between liberals and conservatives, with each side claiming that it knows the truth and lambasting the other for being deceived and following villainous leaders.

I gathered thousands of headlines from CNN, Fox News, and The New York Times that contain the keywords Trump, Biden, Sanders, Warren, Harris, or Buttigieg.  I had to exclude many headlines that contained the names of multiple candidates, because scoring them would require a separate model tailored to each candidate.
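
A filter like this can be sketched in a few lines of Python. This is an illustration, not the exact code behind the analysis: it assumes candidates are matched by surname as whole, case-sensitive words, and the helper names (`candidates_in`, `single_candidate`) and the third sample headline are made up for the example.

```python
import re

# Surnames taken from the keyword list above. Matching on surname alone is an
# assumption; it would also catch unrelated people such as "Warren Buffett".
CANDIDATES = ["Trump", "Biden", "Sanders", "Warren", "Harris", "Buttigieg"]


def candidates_in(headline):
    """Return the set of candidate surnames that appear as whole words."""
    return {
        name for name in CANDIDATES
        if re.search(rf"\b{name}\b", headline)
    }


def single_candidate(headline):
    """Return the one candidate named in the headline, or None if zero or several are named."""
    found = candidates_in(headline)
    return next(iter(found)) if len(found) == 1 else None


headlines = [
    "Here's why Trump keeps pumping up Bernie Sanders",
    "Buttigieg on Trump: 'Senate is the jury today, but we are the jury tomorrow'",
    "Biden unveils new climate plan",  # hypothetical headline for illustration
]

# Keep only headlines that name exactly one candidate.
kept = [(h, single_candidate(h)) for h in headlines if single_candidate(h)]
```

Only the last headline survives here, tagged with `Biden`; the first two name two candidates each and are dropped.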

Here are a few headlines that name more than one candidate, which makes it difficult to measure a single sentiment for each candidate.

*  *Here's why Trump keeps pumping up Bernie Sanders*  
*  *Buttigieg on Trump: 'Senate is the jury today, but we are the jury tomorrow'*
*  *Elizabeth Warren sought to 'raise a concern' with Bernie Sanders in post-debate exchange, Sanders campaign says*

For this reason I decided to drop all headlines with the names of multiple candidates for this analysis.  Thankfully, I still ended up with over 5,000 articles. Take a look at the distribution of articles for each candidate and for each news source.


<|linear|>

<|log|>

<!-- {% include figure image_path="/assets/images/graphs/article-count-raw-and-opinion.png" alt="Linear Graph of Article Counts" caption="Bar plots of how many articles had a given candidate's name in it. Top is a raw count of total articles. Bottom separates it by news group."%} -->

<!--{% include figure image_path="/assets/images/graphs/article-count-log.png" alt="Logarithmic Graph of Article Counts" caption="Logarithmic barplots of how many articles had a given candidate's name in it. Blue represents CNN, orange is Fox News, and green is The New York Times."%}
-->
Trump is by far the most talked-about candidate, and for good reason: he is the sitting president and the sole Republican candidate. Biden comes next, followed by Sanders and Warren at about the same level, and finally Harris and Buttigieg.

I was surprised by the sheer volume of CNN articles and by how few came from The New York Times.


# Sentiment Analysis Models

I used three different sentiment analysis models, two of which are pre-made packages. VADER and TextBlob are Python packages that offer sentiment analysis trained on different subjects. VADER is a lexicon-based approach built around social media text such as tweets, which contain emoticons and slang.  TextBlob is a Naive Bayes approach trained on IMDB review data.  My model is an LSTM whose base language model is built on the [AWD-LSTM](https://arxiv.org/abs/1708.02182). I first trained its language model on news articles, then trained it on hand-labeled (by me 😤) article headlines.
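
To give a feel for what a "lexicon approach" means, here is a toy scorer. It is not VADER: the word list and valence values below are invented for illustration, and VADER itself uses a much larger human-rated lexicon plus rules for things like negation, punctuation, and capitalization that this sketch leaves out.

```python
import re

# Hypothetical mini-lexicon mapping words to valence scores.
# These words and numbers are made up for the example.
TOY_LEXICON = {
    "wins": 1.5,
    "surges": 1.0,
    "praises": 1.0,
    "slams": -1.5,
    "attacks": -1.0,
    "scandal": -2.0,
}


def lexicon_score(headline):
    """Average the valence of every lexicon word in the headline; 0.0 if none match."""
    words = re.findall(r"[a-z']+", headline.lower())
    hits = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    return sum(hits) / len(hits) if hits else 0.0


print(lexicon_score("Candidate slams rival over scandal"))
print(lexicon_score("Senator wins debate"))
```

The point of the sketch is that a lexicon model has no training step on the target text: it only looks words up. That is why the domain the lexicon was built for (social media, in VADER's case) matters when it is applied to news headlines.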

Here are the average scores for each candidate.

<|bar|>

<!-- {% include figure image_path="/assets/images/graphs/news-barplot-average-scores.png" alt="Sentiment scores over time" caption="Bar plots of average sentiment scores separated by model and candidate."%} -->

Next, here are the average scores over time.


<|avg_over_time|>

<!-- {% include figure image_path="/assets/images/graphs/average-4wk-scores-over-time-top-6.png" alt="Sentiment scores over time" caption="Line plots of sentiment scores separated by model and candidate."%} -->

I should also note that these scores have been smoothed with a sliding average over a 4-week window.  Without smoothing, the plot looks much more chaotic; with it, long-term trends should be easier to see.  Even so, it is a bit hard to tell whether there are any consistent trends. The next thing I tried was to superimpose debate dates onto the plots for the Democratic candidates to see whether a candidate's debate performance shows up in the scores afterward. In some cases, there does seem to be a rise or drop in scores after a debate, but whether the two are actually correlated remains unknown.

<|avg_w_debates|>

<!-- {% include figure image_path="/assets/images/graphs/average-4wk-scores-over-time-top-6-with-debates.png" alt="Sentiment scores over time" caption="Line plots of sentiment scores separated by model and candidate with debates superimposed over."%} -->
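
For reference, a 4-week sliding average of the kind used in the line plots above could be computed along these lines. This is a minimal sketch under assumptions the post does not spell out: scores are already aggregated into one value per week, and the window is trailing (each point averages the current week and up to three before it). The function name and sample numbers are made up.

```python
from collections import deque


def sliding_average(values, window=4):
    """Trailing moving average.

    Early points average over however many values are available so far,
    so the output has the same length as the input.
    """
    recent = deque(maxlen=window)  # automatically drops the oldest value
    smoothed = []
    for value in values:
        recent.append(value)
        smoothed.append(sum(recent) / len(recent))
    return smoothed


# Hypothetical weekly mean sentiment scores for one candidate and one model.
weekly_scores = [1, 2, 3, 4, 5]
print(sliding_average(weekly_scores))  # [1.0, 1.5, 2.0, 2.5, 3.5]
```

A centered window would place each smoothed point in the middle of its four weeks instead of at the end, which shifts where rises and drops appear relative to the debate dates.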