Sentiment analysis on Trump's tweets using Python
In this article we will:
- Extract Twitter data using tweepy and learn how to handle it using pandas.
- Do some basic statistics and visualizations with numpy, matplotlib and seaborn.
- Do sentiment analysis of extracted (Trump's) tweets using textblob.
Phew! It's been a while since I wrote something kinda nice. I hope you find this a bit useful and/or interesting. This is based on a workshop I taught in Mexico City. I'll explain the whole post, along with the code, in the simplest way possible. Anyway, the original blog post can be found on my blog, and all the code can be found in the repo I used for this workshop.
What will we need?
First of all, we need to have Python installed.
I'm almost sure that all the code will run in Python 2.7, but I'll use Python 3.6. I highly recommend installing Anaconda, a very useful Python distribution for managing packages that includes a lot of useful tools, such as Jupyter Notebooks. I'll explain the code assuming we're working in a Jupyter Notebook, but it will also run if you're writing a simple script from your text editor; you'll just need to adapt it (it's not hard).
The requirements that we'll need to install are:
- NumPy: This is the fundamental package for scientific computing with Python. Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data.
- Pandas: This is an open source library providing high-performance, easy-to-use data structures and data analysis tools.
- Tweepy: This is an easy-to-use Python library for accessing the Twitter API.
- Matplotlib: This is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms.
- Seaborn: This is a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics.
- Textblob: This is a Python library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks.
All of them are "pip installable". At the end of this article you'll find more references about these Python libraries.
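If you want to get them all in one shot, something like the following in a terminal should cover it (package names here are the standard PyPI ones):
pip install numpy pandas tweepy matplotlib seaborn textblob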
Now that we have all the requirements, let's get started!
1. Extracting Twitter data (tweepy + pandas)
1.1. Importing our libraries
This will be the most difficult part of the whole post...
Just kidding, obviously it won't. It'll be just as easy as copying and pasting the following code into your notebook:
# General:
import tweepy # To consume Twitter's API
import pandas as pd # To handle data
import numpy as np # For number computing
# For plotting and visualization:
from IPython.display import display
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Excellent! We can now just run this cell of code and go to the next subsection.
1.2. Creating a Twitter App
In order to extract tweets for later analysis, we need to log in to our Twitter account and create an app. The website to do this is https://apps.twitter.com/. (If you don't know how to do this, you can follow this tutorial video to create an account and an application.)
From the app we're creating, we will save the following information in a script called credentials.py:
- Consumer Key (API Key)
- Consumer Secret (API Secret)
- Access Token
- Access Token Secret
An example of this script is the following:
# Twitter App access keys for @user
# Consume:
CONSUMER_KEY = ''
CONSUMER_SECRET = ''
# Access:
ACCESS_TOKEN = ''
ACCESS_SECRET = ''
The reason for creating this extra file is that we want to export only the values of these variables while keeping them unseen in our main code (our notebook). We are now able to consume Twitter's API. In order to do this, we will create a function that handles our key authentication. We will add this function in another code cell and run it:
# We import our access keys:
from credentials import * # This will allow us to use the keys as variables
# API's setup:
def twitter_setup():
    """
    Utility function to set up the Twitter API
    with our access keys.
    """
    # Authentication and access using keys:
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
    # Return API with authentication:
    api = tweepy.API(auth)
    return api
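As an optional sanity check (a small sketch, not part of the original workshop flow), tweepy's verify_credentials method lets us confirm that the keys actually authenticate before we start extracting:
# Optional: check that our keys authenticate correctly:
api = twitter_setup()
me = api.verify_credentials()
print("Authenticated as: {}".format(me.screen_name))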
So far, so easy, right? We're ready to extract tweets in the next section.
1.3. Tweets extraction
Now that we've created a function to set up the Twitter API, we can use it to create an "extractor" object. After that, we will use Tweepy's function extractor.user_timeline(screen_name, count) to extract the count most recent tweets from the user screen_name.
As mentioned in the title, I've chosen @realDonaldTrump as the user to extract data from for later analysis. Yeah, we wanna keep it interesting, LOL.
The way to extract Twitter's data is as follows:
# We create an extractor object:
extractor = twitter_setup()
# We create a tweet list as follows:
tweets = extractor.user_timeline(screen_name="realDonaldTrump", count=200)
print("Number of tweets extracted: {}.\n".format(len(tweets)))
# We print the most recent 5 tweets:
print("5 recent tweets:\n")
for tweet in tweets[:5]:
    print(tweet.text)
    print()
With this we will have an output similar to this one, and we can compare it against the Twitter account (to check that we're being consistent):
Number of tweets extracted: 200.
5 recent tweets:
On behalf of @FLOTUS Melania & myself, THANK YOU for today's update & GREAT WORK! #SouthernBaptist @SendRelief,… https://t.co/4yZCeXCt6n
I will be going to Texas and Louisiana tomorrow with First Lady. Great progress being made! Spending weekend working at White House.
Stock Market up 5 months in a row!
'President Donald J. Trump Proclaims September 3, 2017, as a National Day of Prayer' #HurricaneHarvey #PrayForTexas… https://t.co/tOMfFWwEsN
Texas is healing fast thanks to all of the great men & women who have been working so hard. But still so much to do. Will be back tomorrow!
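One caveat worth knowing: a single call to user_timeline returns at most 200 tweets. If you want a larger sample, tweepy's Cursor object handles the pagination for you; a minimal sketch (the 1000 here is an arbitrary target, and the API only exposes a few thousand of a user's most recent tweets):
# Paginate with a Cursor to collect more than 200 tweets:
more_tweets = [tweet for tweet in tweepy.Cursor(extractor.user_timeline,
                                                screen_name="realDonaldTrump",
                                                count=200).items(1000)]
print("Number of tweets extracted: {}.".format(len(more_tweets)))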
We now have an extractor and extracted data, stored in the tweets variable. I should mention at this point that each element in that list is a tweet object from Tweepy; we will learn how to handle this data in the next subsection.
1.4. Creating a (pandas) DataFrame
We now have the initial information we need to construct a pandas DataFrame, which will let us manipulate the data very easily.
IPython's display function renders output in a friendly way, and a dataframe's head method lets us visualize its first 5 rows (or the first n rows, where n is passed as an argument).
So, using a Python list comprehension:
# We create a pandas dataframe as follows:
data = pd.DataFrame(data=[tweet.text for tweet in tweets], columns=['Tweets'])
# We display the first 10 elements of the dataframe:
display(data.head(10))
This will create an output similar to this:
 | Tweets |
---|---|
0 | On behalf of @FLOTUS Melania & myself, THA... |
1 | I will be going to Texas and Louisiana tomorro... |
2 | Stock Market up 5 months in a row! |
3 | 'President Donald J. Trump Proclaims September... |
4 | Texas is healing fast thanks to all of the gre... |
5 | ...get things done at a record clip. Many big ... |
6 | General John Kelly is doing a great job as Chi... |
7 | Wow, looks like James Comey exonerated Hillary... |
8 | THANK YOU to all of the incredible HEROES in T... |
9 | RT @FoxNews: .@KellyannePolls on Harvey recove... |
So we now have a nice table with ordered data.
An interesting thing is the number of internal methods and attributes that the tweet structure from Tweepy has:
# Internal methods of a single tweet object:
print(dir(tweets[0]))
This outputs the following list of elements:
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_api', '_json', 'author', 'contributors', 'coordinates', 'created_at', 'destroy', 'entities', 'favorite', 'favorite_count', 'favorited', 'geo', 'id', 'id_str', 'in_reply_to_screen_name', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'is_quote_status', 'lang', 'parse', 'parse_list', 'place', 'possibly_sensitive', 'retweet', 'retweet_count', 'retweeted', 'retweets', 'source', 'source_url', 'text', 'truncated', 'user']
The interesting part here is the quantity of metadata contained in a single tweet. If we want data such as the creation date or the source of the tweet, we can access it through these attributes. An example is the following:
# We print info from the first tweet:
print(tweets[0].id)
print(tweets[0].created_at)
print(tweets[0].source)
print(tweets[0].favorite_count)
print(tweets[0].retweet_count)
print(tweets[0].geo)
print(tweets[0].coordinates)
print(tweets[0].entities)
Obtaining an output like this:
903778130850131970
2017-09-02 00:34:32
Twitter for iPhone
24572
5585
None
None
{'hashtags': [{'text': 'SouthernBaptist', 'indices': [90, 106]}], 'symbols': [], 'user_mentions': [{'screen_name': 'FLOTUS', 'name': 'Melania Trump', 'id': 818876014390603776, 'id_str': '818876014390603776', 'indices': [13, 20]}, {'screen_name': 'sendrelief', 'name': 'Send Relief', 'id': 3228928584, 'id_str': '3228928584', 'indices': [107, 118]}], 'urls': [{'url': 'https://t.co/4yZCeXCt6n', 'expanded_url': 'https://twitter.com/i/web/status/903778130850131970', 'display_url': 'twitter.com/i/web/status/9…', 'indices': [121, 144]}]}
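Since entities is just a dictionary, we can unpack it directly. For example, a small sketch that pulls the hashtags and mentions out of the first tweet's metadata shown above:
# Extract hashtags and mentions from the entities metadata:
entities = tweets[0].entities
hashtags = [tag['text'] for tag in entities['hashtags']]
mentions = [user['screen_name'] for user in entities['user_mentions']]
print("Hashtags: {}".format(hashtags))   # e.g. ['SouthernBaptist']
print("Mentions: {}".format(mentions))   # e.g. ['FLOTUS', 'sendrelief']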
We're now able to order the relevant data and add it to our dataframe.
1.5. Adding relevant info to our dataframe
As we can see, we can obtain a lot of data from a single tweet. But not all of it is always useful. In our case we will just add some of it to our dataframe. For this we'll use Python list comprehensions; a new column is added to the dataframe simply by putting the column name between square brackets and assigning the content. The code goes as follows:
# We add relevant data:
data['len'] = np.array([len(tweet.text) for tweet in tweets])
data['ID'] = np.array([tweet.id for tweet in tweets])
data['Date'] = np.array([tweet.created_at for tweet in tweets])
data['Source'] = np.array([tweet.source for tweet in tweets])
data['Likes'] = np.array([tweet.favorite_count for tweet in tweets])
data['RTs'] = np.array([tweet.retweet_count for tweet in tweets])
And to display the dataframe again and see the changes, we just...:
# Display of first 10 elements from dataframe:
display(data.head(10))
 | Tweets | len | ID | Date | Source | Likes | RTs |
---|---|---|---|---|---|---|---|
0 | On behalf of @FLOTUS Melania & myself, THA... | 144 | 903778130850131970 | 2017-09-02 00:34:32 | Twitter for iPhone | 24572 | 5585 |
1 | I will be going to Texas and Louisiana tomorro... | 132 | 903770196388831233 | 2017-09-02 00:03:00 | Twitter for iPhone | 44748 | 8825 |
2 | Stock Market up 5 months in a row! | 34 | 903766326631698432 | 2017-09-01 23:47:38 | Twitter for iPhone | 44518 | 9134 |
3 | 'President Donald J. Trump Proclaims September... | 140 | 903705867891204096 | 2017-09-01 19:47:23 | Media Studio | 47009 | 15127 |
4 | Texas is healing fast thanks to all of the gre... | 143 | 903603043714957312 | 2017-09-01 12:58:48 | Twitter for iPhone | 77680 | 15398 |
5 | ...get things done at a record clip. Many big ... | 113 | 903600265420578819 | 2017-09-01 12:47:46 | Twitter for iPhone | 54664 | 11424 |
6 | General John Kelly is doing a great job as Chi... | 140 | 903597166249246720 | 2017-09-01 12:35:27 | Twitter for iPhone | 59840 | 11678 |
7 | Wow, looks like James Comey exonerated Hillary... | 130 | 903587428488839170 | 2017-09-01 11:56:45 | Twitter for iPhone | 110667 | 35936 |
8 | THANK YOU to all of the incredible HEROES in T... | 110 | 903348312421670912 | 2017-08-31 20:06:35 | Twitter for iPhone | 112012 | 29064 |
9 | RT @FoxNews: .@KellyannePolls on Harvey recove... | 140 | 903234878124249090 | 2017-08-31 12:35:50 | Twitter for iPhone | 0 | 6638 |
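As a side note, an equivalent (and arguably tidier) way to get the same table is to build the whole dataframe in a single pass from a list of dictionaries; just a sketch, using the same column names:
# Same dataframe built in one step:
data = pd.DataFrame([{'Tweets': tweet.text,
                      'len': len(tweet.text),
                      'ID': tweet.id,
                      'Date': tweet.created_at,
                      'Source': tweet.source,
                      'Likes': tweet.favorite_count,
                      'RTs': tweet.retweet_count} for tweet in tweets])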
Now that we have extracted the data and have it in an easy-to-handle, ordered form, we're ready to do a bit more manipulation, visualize some plots and gather some statistics. The first part of the post is done.
2. Visualization and basic statistics
2.1. Averages and popularity
We first want to calculate some basic statistics, such as the average tweet length in characters, the tweets with the most likes and retweets, etc.
From now on, I'll just add the input code with its output right below it.
To obtain the mean, using numpy:
# We extract the mean of lengths:
mean = np.mean(data['len'])
print("The average tweet length: {}".format(mean))
The average tweet length: 125.925
To extract more data, we will use some pandas' functionalities:
# We extract the tweets with the most FAVs and the most RTs:
fav_max = np.max(data['Likes'])
rt_max = np.max(data['RTs'])
fav = data[data.Likes == fav_max].index[0]
rt = data[data.RTs == rt_max].index[0]
# Max FAVs:
print("The tweet with the most likes is: \n{}".format(data['Tweets'][fav]))
print("Number of likes: {}".format(fav_max))
print("{} characters.\n".format(data['len'][fav]))
# Max RTs:
print("The tweet with the most retweets is: \n{}".format(data['Tweets'][rt]))
print("Number of retweets: {}".format(rt_max))
print("{} characters.\n".format(data['len'][rt]))
The tweet with the most likes is:
The United States condemns the terror attack in Barcelona, Spain, and will do whatever is necessary to help. Be tough & strong, we love you!
Number of likes: 222205
144 characters.
The tweet with the most retweets is:
The United States condemns the terror attack in Barcelona, Spain, and will do whatever is necessary to help. Be tough & strong, we love you!
Number of retweets: 66099
144 characters.
In this case the tweet with the most likes is also the tweet with the most retweets, which is common but not guaranteed. What we do is find the maximum value in the 'Likes' column and the maximum in the 'RTs' column using numpy's max function, and then look up the index in each column that holds that maximum. Since more than one tweet could share the same (maximum) number of likes/retweets, we just take the first one found, which is why we use .index[0] to assign the index to the variables fav and rt. To print the winning tweet, we access the data the same way we would access a matrix or any indexed object.
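By the way, pandas can do this lookup in one step with idxmax, which returns the index of the first occurrence of the maximum; an equivalent sketch:
# Equivalent lookup using pandas' idxmax:
fav = data['Likes'].idxmax()
rt = data['RTs'].idxmax()
print(data['Tweets'][fav])
print(data['Tweets'][rt])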
We're now ready to plot some stuff.
2.2. Time series
Pandas has its own object for time series. Since we have a whole vector of creation dates, we can construct time series for tweet lengths, likes and retweets.
The way we do it is:
# We create time series for data:
tlen = pd.Series(data=data['len'].values, index=data['Date'])
tfav = pd.Series(data=data['Likes'].values, index=data['Date'])
tret = pd.Series(data=data['RTs'].values, index=data['Date'])
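Because these series are indexed by date, pandas can also aggregate them over time. For instance, a sketch that averages the likes per day (assuming the index converts cleanly with pd.to_datetime):
# Daily average of likes (resampling needs a DatetimeIndex):
tfav.index = pd.to_datetime(tfav.index)
daily_likes = tfav.resample('D').mean()
print(daily_likes.head())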
And if we want to plot the time series, pandas already has its own method in the object. We can plot a time series as follows:
# Lengths over time:
tlen.plot(figsize=(16,4), color='r');
This creates the following output:
And to plot the likes versus the retweets in the same chart:
# Likes vs retweets visualization:
tfav.plot(figsize=(16,4), label="Likes", legend=True)
tret.plot(figsize=(16,4), label="Retweets", legend=True);
This will create the following output:
2.3. Pie charts of sources
We're almost done with this second section of the post. Now we will plot the sources in a pie chart, since we realized that not every tweet was sent from the same source. We first collect all the sources:
# We obtain all possible sources:
sources = []
for source in data['Source']:
    if source not in sources:
        sources.append(source)
# We print sources list:
print("Creation of content sources:")
for source in sources:
    print("* {}".format(source))
From the following output, we see that this Twitter account basically tweets from two sources:
Creation of content sources:
* Twitter for iPhone
* Media Studio
We now count the number of tweets per source and create a pie chart. You'll notice that this code cell is not the most optimized one... Please keep in mind that it was 4 in the morning when I was designing this workshop.
# We create a numpy vector mapped to labels:
percent = np.zeros(len(sources))
for source in data['Source']:
    for index in range(len(sources)):
        if source == sources[index]:
            percent[index] += 1
# Normalize counts to fractions (the pie chart converts them to percentages):
percent /= len(data['Source'])
# Pie chart:
pie_chart = pd.Series(percent, index=sources, name='Sources')
pie_chart.plot.pie(fontsize=11, autopct='%.2f', figsize=(6, 6));
With this we obtain an output like this one:
And we can see the percentage of tweets per source.
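By the way, a more direct route to the same chart is pandas' value_counts, which does the counting and ordering in a single call; a sketch:
# Same pie chart, with value_counts doing the counting:
data['Source'].value_counts().plot.pie(fontsize=11, autopct='%.2f', figsize=(6, 6));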
We can now proceed to do sentiment analysis.
3. Sentiment analysis
3.1. Importing textblob
As we mentioned at the beginning of this post, textblob allows us to do sentiment analysis in a very simple way. We will also use Python's re library for working with regular expressions. Here I'll provide two utility functions: a) one to clean the text of a tweet (removing links, mentions and any non-alphanumeric characters), and b) a classifier that analyzes the polarity of each tweet after its text has been cleaned. I won't explain the specifics of how the cleaning function works, since that would take a while and it's better understood from the official re documentation.
The code that I'm providing is:
from textblob import TextBlob
import re
def clean_tweet(tweet):
    '''
    Utility function to clean the text in a tweet by removing
    links and special characters using regex.
    '''
    return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())
def analize_sentiment(tweet):
    '''
    Utility function to classify the polarity of a tweet
    using textblob.
    '''
    analysis = TextBlob(clean_tweet(tweet))
    if analysis.sentiment.polarity > 0:
        return 1
    elif analysis.sentiment.polarity == 0:
        return 0
    else:
        return -1
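Before applying this to the tweets, a quick spot check on made-up sentences doesn't hurt (the expected values here are what textblob's default analyzer should return for such clearly worded examples):
# Quick sanity check on sample sentences:
print(analize_sentiment("I love this, it is great!"))    # expected: 1
print(analize_sentiment("This is a tweet."))             # expected: 0
print(analize_sentiment("I hate this, it is terrible.")) # expected: -1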
The way this works is that textblob already provides a trained analyzer (cool, right?). Textblob can also work with different machine learning models used in natural language processing. If you want to train your own classifier (or at least see how one works), feel free to check the following link. It might be relevant here, since we're working with a pre-trained model for which we don't know what training data was used.
Anyway, getting back to the code, we will just add an extra column to our data containing the result of the sentiment analysis, and then display the dataframe to see the update:
# We create a column with the result of the analysis:
data['SA'] = np.array([ analize_sentiment(tweet) for tweet in data['Tweets'] ])
# We display the updated dataframe with the new column:
display(data.head(10))
Obtaining the new output:
 | Tweets | len | ID | Date | Source | Likes | RTs | SA |
---|---|---|---|---|---|---|---|---|
0 | On behalf of @FLOTUS Melania & myself, THA... | 144 | 903778130850131970 | 2017-09-02 00:34:32 | Twitter for iPhone | 24572 | 5585 | 1 |
1 | I will be going to Texas and Louisiana tomorro... | 132 | 903770196388831233 | 2017-09-02 00:03:00 | Twitter for iPhone | 44748 | 8825 | 1 |
2 | Stock Market up 5 months in a row! | 34 | 903766326631698432 | 2017-09-01 23:47:38 | Twitter for iPhone | 44518 | 9134 | 0 |
3 | 'President Donald J. Trump Proclaims September... | 140 | 903705867891204096 | 2017-09-01 19:47:23 | Media Studio | 47009 | 15127 | 0 |
4 | Texas is healing fast thanks to all of the gre... | 143 | 903603043714957312 | 2017-09-01 12:58:48 | Twitter for iPhone | 77680 | 15398 | 1 |
5 | ...get things done at a record clip. Many big ... | 113 | 903600265420578819 | 2017-09-01 12:47:46 | Twitter for iPhone | 54664 | 11424 | 1 |
6 | General John Kelly is doing a great job as Chi... | 140 | 903597166249246720 | 2017-09-01 12:35:27 | Twitter for iPhone | 59840 | 11678 | 1 |
7 | Wow, looks like James Comey exonerated Hillary... | 130 | 903587428488839170 | 2017-09-01 11:56:45 | Twitter for iPhone | 110667 | 35936 | 1 |
8 | THANK YOU to all of the incredible HEROES in T... | 110 | 903348312421670912 | 2017-08-31 20:06:35 | Twitter for iPhone | 112012 | 29064 | 1 |
9 | RT @FoxNews: .@KellyannePolls on Harvey recove... | 140 | 903234878124249090 | 2017-08-31 12:35:50 | Twitter for iPhone | 0 | 6638 | 0 |
As we can see, the last column contains the sentiment analysis (SA). We now just need to check the results.
3.2. Analyzing the results
To verify the results in a simple way, we will count the neutral, positive and negative tweets and extract the percentages.
# We construct lists with classified tweets:
pos_tweets = [ tweet for index, tweet in enumerate(data['Tweets']) if data['SA'][index] > 0]
neu_tweets = [ tweet for index, tweet in enumerate(data['Tweets']) if data['SA'][index] == 0]
neg_tweets = [ tweet for index, tweet in enumerate(data['Tweets']) if data['SA'][index] < 0]
Now that we have the lists, we just print the percentages:
# We print percentages:
print("Percentage of positive tweets: {}%".format(len(pos_tweets)*100/len(data['Tweets'])))
print("Percentage of neutral tweets: {}%".format(len(neu_tweets)*100/len(data['Tweets'])))
print("Percentage de negative tweets: {}%".format(len(neg_tweets)*100/len(data['Tweets'])))
Obtaining the following result:
Percentage of positive tweets: 51.0%
Percentage of neutral tweets: 27.0%
Percentage of negative tweets: 22.0%
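If you prefer, the same percentages can be obtained in one line; a sketch using value_counts with normalization:
# Percentage per class (1 = positive, 0 = neutral, -1 = negative):
print(data['SA'].value_counts(normalize=True) * 100)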
Keep in mind that we're working only with the 200 most recent tweets from D. Trump (last updated: September 2nd). For more representative results we should consider more tweets. An interesting exercise (an invitation to the reader) is to analyze the polarity of the tweets from each source separately; it might turn out that tweets from one source are consistently more positive or negative than the other's. Anyway, I hope you found this interesting.
As we saw, we can extract, manipulate, visualize and analyze data in a very simple way with Python. I hope this leaves the reader curious enough to explore further with these tools.
You might find small mistakes in the translation of the material (I originally designed the workshop in Spanish). Please feel free to comment or suggest anything that comes to mind; that would complement some ideas I already have for further work.
I'll now leave some references for documentation and tutorials on the used libraries. Hope to hear from you!
References:
- Official documentation - Tweepy
- Official documentation - NumPy
- Official tutorial - NumPy
- Official tutorial - Pandas
- Official documentation - Pandas
- Official documentation - Matplotlib
- Official tutorial - Pyplot
- Official website - Seaborn
- Official documentation - TextBlob
- Tutorial: Building a Text Classification System - TextBlob