What I Discovered About Trump and Clinton From Analyzing 4 Million Facebook Posts


On Facebook, headlines are often more important than the articles themselves. Most headlines are browsed, not clicked. Think about your own Facebook behavior: how often do you click on links? Because of this, headlines frame our positions on topics without our even having to read the content. It’s quick, simple, and we feel informed. But with respect to politics, this news feed browsing behavior creates an electorate that can become dangerously uninformed.

These same headlines also leave breadcrumbs of the 2016 political narrative, which we can analyze. For this study, we focused on four things:

  1. Exploring media coverage frequency and bias of “Trump” and “Clinton” across different media sources (Headlines)
  2. Comparing social media attention in 2016 to social media attention during the 2012 Obama vs. Romney campaign (Headlines)
  3. Describing other topics the mainstream media brought up when describing Trump and Clinton during the 2016 election (Headlines)
  4. Quantifying the differences in Facebook audience engagement for Clinton and Trump (Facebook Post Engagement)

Analysis Setup

After assembling a corpus of >4 million Facebook posts from >500 sources, I down-selected to 15 of the top sources on Facebook. Because the choice of sources should be highly scrutinized for bias, I chose mainstream sources across the political spectrum, as identified by a past study conducted by the Berkeley data science group. Because the Berkeley study was conducted a few years ago, I also included Huffington Post, Fox & Friends, and Time to refresh it with a few Facebook sources that have recently become more active.

Selected sources based on Berkeley data science study, 2013, https://datascience.berkeley.edu/data-media-map-bitly/

Hillary and Donald: Relentless Media Coverage

Named Entity Recognition (NER) is a machine learning task that automatically identifies, extracts, and labels things (people, companies, locations, etc.) in unstructured text. In this study, I used Stanford’s NER software with NLTK to extract ‘Hillary Clinton’ and ‘Donald Trump’ from Facebook headlines and post descriptions. Quantifying how frequently each candidate appeared in headlines allows us to perform statistical comparisons. Another way to structure this data would be basic string-matching for ‘trump’ and ‘clinton’ across the available headlines, although this is less accurate.
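The string-matching alternative mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the function names, the patterns, and the sample headlines are all made up for the example.

```python
import re
from collections import Counter

# Hypothetical keyword patterns. Word boundaries keep "trumpet" from matching,
# but the verb "trump" ("hearts trump spades") would still be counted, which
# is one reason string matching is less accurate than NER.
CANDIDATE_PATTERNS = {
    "Trump": re.compile(r"\btrump\b", re.IGNORECASE),
    "Clinton": re.compile(r"\bclinton\b", re.IGNORECASE),
}


def count_mentions(headlines):
    """Count how many headlines mention each candidate at least once."""
    counts = Counter()
    for headline in headlines:
        for name, pattern in CANDIDATE_PATTERNS.items():
            if pattern.search(headline):
                counts[name] += 1
    return counts


def mention_share(headlines):
    """Return the percentage of headlines that mention each candidate."""
    total = len(headlines)
    counts = count_mentions(headlines)
    return {
        name: (100.0 * counts[name] / total) if total else 0.0
        for name in CANDIDATE_PATTERNS
    }


# Made-up headlines, for illustration only.
sample = [
    "Trump rallies supporters in Ohio",
    "Clinton and Trump prepare for debate",
    "Local team wins championship",
    "The trumpet section steals the show",
]
print(mention_share(sample))
```

A headline that names both candidates counts once for each, so the two percentages can add up to more than the share of headlines mentioning either candidate.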

As an example, NER allows us to extract “[Person], Obama”, and “[Location], Sun Stop” from the following tweet:


Applying this method to the corpus of recent headlines from the top 15 media sources, we find that Donald Trump appeared in Facebook posts nearly 2x as frequently as Hillary Clinton in the past 3 months. This is independent of any social media advertising by either candidate: the 2x mention frequency reflects only the organic articles published by the top 15 news sources.

Percentage of all content published by top media sources, mentioning “Clinton” or “Trump”

When we break down each candidate’s coverage by media source, we find, surprisingly, that the far-right and far-left sources spent more content slots publishing stories about the opposing candidate. Why? Both candidates were under attack. Fox (Trump-supporting) attacked Clinton frequently, while many of the other sources (Clinton-supporting or neutral) attacked Trump in their headlines more often than they promoted Clinton.

Percentage of all content published by top media sources, mentioning “Clinton” or “Trump”

Comparing To Previous Elections: 2016 IS DIFFERENT

If we repeat the above study for the 2012 election, and search for “Mitt Romney” and “Barack Obama” in the 2012 headlines, we find that the 2012 election coverage was much more balanced. (There wasn’t enough Facebook data generated in 2008 to repeat this study for the Obama vs. McCain election cycle.)

Percentage of all content published by top media sources, mentioning “Obama” or “Romney”

And we can break down the 2012 coverage, seeing some notable differences across the board, but overall observing much more balanced coverage.

Percentage of all content published by top media sources, mentioning “Obama” or “Romney”

Returning to the 2016 election, we find that coverage was persistent and stable until the month before the presidential debates began. The scandals, endorsements, and public debates fueled unprecedented social media attention. During the weeks prior to the third presidential debate, more than 30% of all content published by top media sources mentioned Clinton or Trump by name.

Percentage Of Headlines Published Per Week Mentioning Trump Or Clinton In Top News Media Sources During Final 3 Months Of Election
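A weekly percentage like the one charted above can be computed by bucketing posts by ISO week and dividing the mentioning headlines by all headlines in that week. The sketch below assumes each post is a `(date, headline)` pair and uses simple keyword matching; the function name, the pattern, and the sample posts are hypothetical, not taken from the study.

```python
import re
from collections import defaultdict
from datetime import date

# Hypothetical pattern: matches either surname as a whole word.
CANDIDATE_RE = re.compile(r"\b(trump|clinton)\b", re.IGNORECASE)


def weekly_mention_share(posts):
    """posts: iterable of (date, headline) pairs.

    Returns {(iso_year, iso_week): percent of that week's headlines
    mentioning either candidate}, ordered by week.
    """
    totals = defaultdict(int)
    mentions = defaultdict(int)
    for published, headline in posts:
        iso = published.isocalendar()
        week = (iso[0], iso[1])
        totals[week] += 1
        if CANDIDATE_RE.search(headline):
            mentions[week] += 1
    return {week: 100.0 * mentions[week] / totals[week] for week in sorted(totals)}


# Made-up posts, for illustration only.
posts = [
    (date(2016, 10, 3), "Trump speaks at rally"),
    (date(2016, 10, 4), "Weather update for the weekend"),
    (date(2016, 10, 10), "Clinton responds to critics"),
]
for week, percent in weekly_mention_share(posts).items():
    print(week, f"{percent:.1f}%")
```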

Sentiment Polarity — Very Challenging For This Dataset

There’s one dimension that we didn’t explore in this study: how positive or negative was the mainstream media while covering Trump and Clinton? This is a particularly difficult task to automate using NLP. For example, “Donald Trump loves violence” or “Hillary Clinton supports corporate greed” would produce conflicting sentiment results, because each pairs a positive verb with a negative object. And I certainly wasn’t going to manually review and classify millions of sentences for sentiment.

Lexalytics offers industry-leading, commercial-grade text sentiment analysis software, and I worked with them to run a quick test. We looked at headlines where only one of the candidates was mentioned, and generated average sentiment scores for Trump-mentioned and Clinton-mentioned headlines. Unfortunately, the results were not statistically significant. Given the time pressure, I didn’t have a chance to take another look, but perhaps when the post-election mayhem calms down we can revisit this with a supervised learning approach.

Regardless, special thanks to my friends at Lexalytics, who provided the sentiment prediction credits, technical support, and feedback. I definitely recommend working with them if you need to produce quick results for social listening or textual sentiment analysis.

What Other Topics Are Included In These Headlines?

One technique for contrasting two text corpora is to look at which other words are used when “Clinton” or “Trump” appears in the headlines. To do this, we can use the weighted log-odds ratio, a common technique for identifying which words are most likely to be used in one of two text corpora. It is particularly useful when you want to exclude words or phrases that overlap between corpora. To do this properly, we apply word stemming (clinton ~ clinton’s, and trump ~ trump’s) and remove common stopwords (if, and, or, but, etc.) to clean up the text. We then segment the headlines into Trump-mentioned and Clinton-mentioned corpora, and calculate the weighted log-odds ratio of each remaining word to compare these two groups of headlines. What we find is a word cloud of largely negative words for both candidates.
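To make the steps above concrete, here is a minimal sketch of one common formulation of the weighted log-odds ratio: the log-odds ratio with a Dirichlet prior taken from the pooled word counts, reported as a z-score. Whether the study used exactly this formulation is an assumption. The stopword list, the possessive-stripping stand-in for stemming, and the sample headlines are all hypothetical.

```python
import math
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not the list used in the study).
STOPWORDS = {"if", "and", "or", "but", "the", "a", "an", "of", "to", "in", "on", "for", "is"}


def tokenize(headline):
    """Lowercase, strip possessive 's (a crude stand-in for stemming), drop stopwords."""
    words = re.findall(r"[a-z]+(?:['’]s)?", headline.lower())
    stems = [re.sub(r"['’]s$", "", word) for word in words]
    return [word for word in stems if word not in STOPWORDS]


def weighted_log_odds(corpus_a, corpus_b):
    """Return {word: z-score}. Positive favors corpus_a, negative favors corpus_b."""
    counts_a = Counter(w for h in corpus_a for w in tokenize(h))
    counts_b = Counter(w for h in corpus_b for w in tokenize(h))
    prior = counts_a + counts_b  # pooled counts act as the prior
    n_a = sum(counts_a.values())
    n_b = sum(counts_b.values())
    n_prior = sum(prior.values())
    scores = {}
    for word, alpha in prior.items():
        y_a, y_b = counts_a[word], counts_b[word]
        # Smoothed log-odds of the word within each corpus.
        log_odds_a = math.log((y_a + alpha) / (n_a + n_prior - y_a - alpha))
        log_odds_b = math.log((y_b + alpha) / (n_b + n_prior - y_b - alpha))
        # Approximate variance of the difference, used to weight the ratio.
        variance = 1.0 / (y_a + alpha) + 1.0 / (y_b + alpha)
        scores[word] = (log_odds_a - log_odds_b) / math.sqrt(variance)
    return scores


# Made-up headlines, for illustration only.
trump_headlines = ["Trump's wall plan draws criticism", "Trump tape sparks outrage"]
clinton_headlines = ["Clinton's email probe reopened", "Clinton email server questions"]

scores = weighted_log_odds(trump_headlines, clinton_headlines)
for word, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{word:12s} {score:+.2f}")
```

Words that appear about equally often in both corpora score near zero, which is what makes this useful for filtering out shared vocabulary.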

Most frequent words used by top news sources when mentioning “Trump” or “Clinton”

This shows statistically that both candidates have been under constant attack by the media — the political narrative is unapologetically negative… And as a side-note, the entire election can effectively be summarized by the chart above.

Donald Trump: Master of Social Media Engagement

The last thing I looked at was the activation and engagement of the Trump and Clinton Facebook audiences. First, I looked at the posts on Trump’s and Clinton’s verified Facebook pages. Each candidate posted at roughly the same cadence over the past few months. Donald shared more photos and status updates, while Hillary shared more links and videos, and each type of content propagates differently through the Facebook news feed algorithms. Despite this, I decided to clump all post types together for each candidate. Also, since Donald’s Facebook page currently has ~13.5 million fans and Hillary’s has ~8.8 million, I normalized the results: this analysis is based on average engagement per one million Facebook fans. What we find is that in every dimension, engagement on Donald Trump’s page is significantly higher than on Hillary’s page.

With his content and his audience, Donald Trump has found product-market fit on social media.

As an example of how to interpret the chart below: the average piece of content that Donald posted on his page generated 57% more comments than the average piece of content that Hillary posted on her page.

Trump-divided-by-Clinton Facebook engagement, normalized per 1 million Facebook fans
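The normalization behind this ratio is simple arithmetic, sketched below. The fan counts are the approximate figures quoted in this article; the per-post comment averages are hypothetical placeholders, not measured values, and the function names are made up for the example.

```python
def per_million_fans(avg_engagement_per_post, fans):
    """Scale an average per-post engagement count to a per-million-fans basis."""
    return avg_engagement_per_post / (fans / 1_000_000)


def engagement_ratio(avg_a, fans_a, avg_b, fans_b):
    """Normalized engagement of page A divided by that of page B."""
    return per_million_fans(avg_a, fans_a) / per_million_fans(avg_b, fans_b)


# Approximate fan counts quoted in the article.
TRUMP_FANS = 13_500_000
CLINTON_FANS = 8_800_000

# Hypothetical per-post averages, for illustration only.
trump_avg_comments = 9_000.0
clinton_avg_comments = 4_000.0

ratio = engagement_ratio(
    trump_avg_comments, TRUMP_FANS, clinton_avg_comments, CLINTON_FANS
)
print(f"Trump-divided-by-Clinton comments ratio: {ratio:.2f}")
print(f"Percent more comments per post: {100 * (ratio - 1):.0f}%")
```

Read this way, a ratio of 1.57 on the chart corresponds to 57% more engagement per post per million fans, and a ratio below 1 would mean less.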

Equally interesting: if we repeat this comparison between Fox News and the New York Times, we see an even more substantial imbalance. Fox News Facebook posts generated, on average, more than 7X the comments of New York Times Facebook posts.

Fox-divided-by-NYT Facebook engagement, normalized per 1 million Facebook fans


  1. Every time a new communications medium achieves scale, the political narrative shifts to keep up and take advantage. FDR (radio), JFK and Reagan (TV), Obama (Internet), Trump (Social Media). What will be the next big shift?? VR?
  2. Our Facebook news feeds are fantastically personalized. Radio and television made the world feel like a smaller place, but with machine learning and personalized news feeds, social media micro-focuses content before delivering it to you. It gives you exactly what you want, but leaves you unexposed to, and uninformed about, different viewpoints. It’s easy to jump on the ‘YES I AGREE’ rage click bandwagon, but these days it takes much more time, critical thinking, curiosity, and energy to find, listen to, and absorb viewpoints and content that oppose our own beliefs.
  3. Machine learning is not perfect, but the statistical significance of these findings is cause for reflection. Is our media delivering what we need to make informed decisions? Facebook is not a media company, but is Facebook enabling good or evil with respect to media? Union or divisiveness? How can we discover unbiased or two-sided sources of content? Does the average citizen care whether they have easy access to two-sided or unbiased content and headlines?
  4. The excitement, and danger, of social media is that the media delivers headlines and articles so quickly that it’s difficult to fact check the narrative. The truth isn’t as important anymore — people move from topic to topic without sufficient analysis or thought. Did a candidate just tweet or post something crazy? You won’t believe what they say tomorrow… Headlines are quick to write, simple to digest, and the audience quickly moves on to the next headline. Digital publishers, and headline framers receive a lot of negative attention due to the incentives and rewards of ‘clickbait’ publishing, but perhaps the true digital weapons of mass destruction are talented headline framers, convincing the electorate of ideas, opinions, or lies in 10-word headlines without being held accountable for their influence? Not accusatory, just a thought :p
  5. Negative election cycles are not necessarily toxic to our democracy — free speech is healthy and foundational to our country. But the style of attacks is important — I believe that attacking candidates on issues can help the electorate gain different perspective on a policy or stance. But perhaps the ‘crooked’, ‘nasty’, and ‘deplorable’ rhetoric is demeaning and distracting. Perhaps it regresses the democracy, when taken too far?
  6. There is much more to explore with this dataset, and with the ~250 million additional datapoints that I’ve gathered on candidate audiences and comment threads, just waiting to be analyzed. If you’d like to help unravel this social media mess, reach out 🙂 patrick dot martinchek [at] gmail dot com

The headline of this article was created with help from my friends at headlines.ai. If you’re a writer, blogger, social media manager, or advertiser, their product makes it simple to boost your CTRs. And they’ll give you a free trial 🙂 contact@headlines.ai
