Codementor Events

JSON Data to a pandas df without read_json() method

Published Aug 09, 2019Last updated Aug 10, 2019
JSON Data to a pandas df without  read_json() method

Dealing with JSON data; pulling it in and reading it was something I strugled with significantly when I began coding. This article is for anyone who is facing that issue or just wants a diffrent point of view.

Recently I was challenged to create a sentiment analysis from Twitter data; so I created this api endpoint and it can be used for this example.
http://www.sportsbookanalysis.com/restapi/tweets/?format=json

To access this data we need json and request libraries or we can use the built in pandas read_json() method. We will go through not using the pd.read_json() method because it is good practice and it is helpful know what is going on when using the data outside of pandas, such as in js. Moving from Python to Javascript learning how to deal with json objects in greater detail will lower make picking up js much easier.
In actual practice it clearly would be more sensible to just use the read_json method and then pass in the appropriate "orient" parameter.https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html

import json, requests

response = requests.get("http://www.sportsbookanalysis.com/restapi/tweets/?format=json")
print(response)
<Response [200]>

This <Response [200]> was one of my first "gothas" because I see that format=json, however this is just giving us a status code 200 which is a good thing to see in our case. However, to access the actual data we need .json()

response = requests.get("http://www.sportsbookanalysis.com/restapi/tweets/?format=json").json()
# Dont print this as its fairly large

From here we can see what is going on with the data. It can be hit or miss, but the first thing I try is to just print out the keys of the data. If that fails it means the data is in a list so we can use indexing

print(response.keys())
#AttributeError: 'list' object has no attribute 'keys'
print(len(response))
print(response[1250])
#output {'id': 42750, 'team': 'NYG', 'sport': 'NFL', 'tw_time': '2019-07-27T11:00:26Z', 
#'tweet': "Giants' Saquon Barkley rips Daniel Jones critics: You don't know football #nyg #giants #NFL https://t.co/5T6oz6sRc3", 
#'score': -0.296, 'followers': 3688}

print(response[1250].keys())
#dict_keys(['id', 'team', 'sport', 'tw_time', 'tweet', 'score', 'followers'])
#and finally we can do soemthing like
for i in response[1250].keys():
    print("{}:{}".format(i,response[1250][i]))
#output id:42750
#team:NYG
#sport:NFL
#tw_time:2019-07-27T11:00:26Z
#tweet:Giants' Saquon Barkley rips Daniel Jones critics: You don't know football #nyg #giants #NFL https://t.co/5T6oz6sRc3
#score:-0.296
#followers:3688

Going another step further we can extract all of the the team and score data into seperate lists and then assign it to empty dataframes.

team = []
score = []
timeposted = []
for x in response:
        team.append(x['team'])
        score.append(x['score'])
        timeposted.append(x['tw_time'])
df = pd.DataFrame() # create empty dataframe 
df['score'] = score 
df['team'] = team
df['timeposted'] = timeposted ###add new columns based off lists

print(df.head())
Discover and read more posts from Anthony Carnevale
get started
post comments2Replies
Billy Doyle
5 years ago

I think something in a function like this that is adaptable based on the values you want whether the json is coming from an api where you can specify the field desired or the specific values wanted like in your case ‘team’ and ‘score’ at varying lengths is better than creating multiple separate lists and appending. Very accessible article though!

def json_func(json_request, field=None, 
              keys=['team', 'score', 'tw_time']):
    """
    json_request: requests.get().json() variable
    field: specific field wanted from json
    keys: the underlying json data contained 
    in the field
    
    returns pd.DataFrame with keys as columns
    """
    
    df_d = {}
    lst_lst = [[] for i in range(len(keys))]
    
    if field is not None:
        json_request = json_request[field]
            
    for lst, key in zip(lst_lst, keys):
        for item in range(len(json_request)):
            lst.append(json_request[item][key])
            
        df_d[key] = lst
    
    return pd.DataFrame(df_d)
Anthony Carnevale
5 years ago

Thank you for the comment. This is excellent and will incorporate moving forward