Basic Pandas
Pandas is a data analysis library for Python. In this post, I will show you how it can help you quickly get insights from different datasets.
Install pyenv
We will install pyenv first. pyenv is a convenient tool if you want to use multiple Python versions on your laptop.
$ brew install pyenv
Install Python 3 and pandas
We use Python 3 here.
$ pyenv install 3.5.0
$ pyenv global 3.5.0
$ pyenv rehash
$ pip install pandas
Start coding!
OK, let's start our pandas adventure! By the way, Visual Studio Code is the best editor to work with pandas. Don't forget to install the Python extension.
First, we need to import the pandas library. Just create a demo.py file and add the line below.
import pandas as pd
Download quiz.csv and users.json, which are used to demo what pandas can do.
You can read a JSON file using pd.read_json. It stores the data in a DataFrame, which you can think of as a virtual table.
# read data from JSON and store it in a DataFrame
user_df = pd.read_json('users.json')
# show the first 5 rows
print(user_df.head())
Loading CSV data is basically the same operation as above, just with a different file format. pandas supports many file formats, such as JSON, CSV, and Excel.
quiz_df = pd.read_csv('quiz.csv')
print(quiz_df.head())
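If you do not have the two files at hand yet, here is a minimal sketch of the same loading step using in-memory text instead of files. The sample rows and the name column are made up for illustration; the real users.json and quiz.csv may look different.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for users.json and quiz.csv.
sample_users_json = (
    '[{"email": "a@example.com", "name": "Ann"},'
    ' {"email": "b@example.com", "name": "Bob"}]'
)
sample_quiz_csv = (
    "email,years,familiar language\n"
    "a@example.com,3,Python\n"
    "c@example.com,5,Java\n"
)

# read_json and read_csv accept file-like objects as well as file paths.
sample_user_df = pd.read_json(io.StringIO(sample_users_json))
sample_quiz_df = pd.read_csv(io.StringIO(sample_quiz_csv))

print(sample_user_df.shape)   # (number of rows, number of columns)
print(sample_quiz_df.dtypes)  # the type pandas inferred for each column
```

Checking shape and dtypes right after loading is a quick way to confirm that pandas parsed the file the way you expected.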
Now we can start looking for insights in the data. First, let's find the max year in the quiz data.
# find max year in quiz data
max_years = quiz_df['years'].max()
print(max_years)
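Next, let's get the rows with the max year. pandas uses a boolean mask to filter data: the comparison below produces a Series of True/False values, and indexing the DataFrame with that Series keeps only the rows marked True. You will find that a boolean mask is a powerful tool when you want to query data with complicated conditions.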
quiz_df['years'] == max_years
quiz_df[quiz_df['years'] == max_years]
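To show what a more complicated condition could look like, here is a small sketch that combines two masks. The sample rows are made up; only the column names years and familiar language come from the quiz data used in this post.

```python
import pandas as pd

# Hypothetical sample rows using the same column names as quiz.csv.
sample_quiz_df = pd.DataFrame({
    "years": [1, 5, 3, 5],
    "familiar language": ["Python", "Java", "Python", "Python"],
})

# Combine masks with & (and), | (or), ~ (not).
# Each condition needs its own parentheses because & binds tighter than >= and ==.
mask = (sample_quiz_df["years"] >= 3) & (sample_quiz_df["familiar language"] == "Python")

experienced_python = sample_quiz_df[mask]
print(experienced_python)
```

Note that Python's plain and / or keywords do not work on a Series; the element-wise operators & and | are the ones to use here.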
# compute the average years in quiz data
mean_years = quiz_df['years'].mean()
print(mean_years)
#%%
# count how many times each familiar language appears
result = quiz_df["familiar language"].value_counts()
print(result)
#%%
# find users of the most popular language
# value_counts() sorts by count in descending order, so the first index is the most common value
popular_language = result.index[0]
quiz_user_with_popular_language = quiz_df[quiz_df['familiar language']==popular_language]
print(quiz_user_with_popular_language)
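As a side note, here is a small sketch of two related ways to look at the same counts, again on made-up sample rows. idxmax() returns the label with the highest count without relying on the sort order, and normalize=True gives shares instead of raw counts.

```python
import pandas as pd

# Hypothetical sample rows; only the column name comes from quiz.csv.
sample_quiz_df = pd.DataFrame({
    "familiar language": ["Python", "Java", "Python", "Go"],
})

counts = sample_quiz_df["familiar language"].value_counts()

# Label of the largest count, independent of how the Series is sorted.
popular_language = counts.idxmax()

# Fraction of rows per language instead of raw counts.
shares = sample_quiz_df["familiar language"].value_counts(normalize=True)

print(popular_language)
print(shares)
```

If two languages are tied for first place, both approaches return only one of them, so check the counts yourself when ties matter.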
#%%
# join quiz with user using a right join, which keeps every row of quiz_df
quiz_with_user = pd.merge(user_df, quiz_df, how='right', on='email')
print(quiz_with_user)
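To see what a right join does to rows without a match, here is a minimal sketch on made-up data. The name column and the sample emails are assumptions for illustration; indicator=True is a standard pd.merge option that adds a _merge column telling you where each row came from.

```python
import pandas as pd

# Hypothetical sample data; only the email key mirrors the files in this post.
sample_user_df = pd.DataFrame({
    "email": ["a@example.com", "b@example.com"],
    "name": ["Ann", "Bob"],
})
sample_quiz_df = pd.DataFrame({
    "email": ["a@example.com", "c@example.com"],
    "years": [3, 5],
})

# A right join keeps every row of the right table (the quiz data).
# Quiz rows with no matching user get NaN in the user columns.
joined = pd.merge(sample_user_df, sample_quiz_df, how="right", on="email", indicator=True)
print(joined)
```

In this sample, the user b@example.com disappears because they have no quiz row, while the quiz row for c@example.com stays with a missing name.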
#%%
# drop rows that contain any missing value, such as quiz rows with no matching user
result = quiz_with_user.dropna()
print(result)
#%%
# find users willing to use a code editor
result = result[result['will you want to use code editor']=='T']
print(result)
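One thing to keep in mind is that dropna() with no arguments removes a row if any column is missing. Here is a small sketch, on made-up rows, of limiting the check to one column with the subset parameter before filtering. The name column is an assumption; the question column name is the one used above.

```python
import pandas as pd

# Hypothetical joined data; the name column is made up for illustration.
sample_joined = pd.DataFrame({
    "email": ["a@example.com", "b@example.com", "c@example.com"],
    "name": ["Ann", None, "Cy"],
    "will you want to use code editor": ["T", "T", "F"],
})

# Drop only the rows where the user's name is missing.
matched = sample_joined.dropna(subset=["name"])

# Keep the matched users who answered 'T'.
willing = matched[matched["will you want to use code editor"] == "T"]
print(willing)
```

Using subset keeps rows that are complete where it matters, even if some unrelated column happens to be empty.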