JSON Data to a pandas df without read_json() method
Dealing with JSON data; pulling it in and reading it was something I strugled with significantly when I began coding. This article is for anyone who is facing that issue or just wants a diffrent point of view.
Recently I was challenged to create a sentiment analysis from Twitter data; so I created this api endpoint and it can be used for this example.
http://www.sportsbookanalysis.com/restapi/tweets/?format=json
To access this data we need json and request libraries or we can use the built in pandas read_json() method. We will go through not using the pd.read_json() method because it is good practice and it is helpful know what is going on when using the data outside of pandas, such as in js. Moving from Python to Javascript learning how to deal with json objects in greater detail will lower make picking up js much easier.
In actual practice it clearly would be more sensible to just use the read_json method and then pass in the appropriate "orient" parameter.https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html
import json, requests
response = requests.get("http://www.sportsbookanalysis.com/restapi/tweets/?format=json")
print(response)
<Response [200]>
This <Response [200]> was one of my first "gothas" because I see that format=json, however this is just giving us a status code 200 which is a good thing to see in our case. However, to access the actual data we need .json()
response = requests.get("http://www.sportsbookanalysis.com/restapi/tweets/?format=json").json()
# Dont print this as its fairly large
From here we can see what is going on with the data. It can be hit or miss, but the first thing I try is to just print out the keys of the data. If that fails it means the data is in a list so we can use indexing
print(response.keys())
#AttributeError: 'list' object has no attribute 'keys'
print(len(response))
print(response[1250])
#output {'id': 42750, 'team': 'NYG', 'sport': 'NFL', 'tw_time': '2019-07-27T11:00:26Z',
#'tweet': "Giants' Saquon Barkley rips Daniel Jones critics: You don't know football #nyg #giants #NFL https://t.co/5T6oz6sRc3",
#'score': -0.296, 'followers': 3688}
print(response[1250].keys())
#dict_keys(['id', 'team', 'sport', 'tw_time', 'tweet', 'score', 'followers'])
#and finally we can do soemthing like
for i in response[1250].keys():
print("{}:{}".format(i,response[1250][i]))
#output id:42750
#team:NYG
#sport:NFL
#tw_time:2019-07-27T11:00:26Z
#tweet:Giants' Saquon Barkley rips Daniel Jones critics: You don't know football #nyg #giants #NFL https://t.co/5T6oz6sRc3
#score:-0.296
#followers:3688
Going another step further we can extract all of the the team and score data into seperate lists and then assign it to empty dataframes.
team = []
score = []
timeposted = []
for x in response:
team.append(x['team'])
score.append(x['score'])
timeposted.append(x['tw_time'])
df = pd.DataFrame() # create empty dataframe
df['score'] = score
df['team'] = team
df['timeposted'] = timeposted ###add new columns based off lists
print(df.head())
I think something in a function like this that is adaptable based on the values you want whether the json is coming from an api where you can specify the field desired or the specific values wanted like in your case ‘team’ and ‘score’ at varying lengths is better than creating multiple separate lists and appending. Very accessible article though!
Thank you for the comment. This is excellent and will incorporate moving forward