Codementor Events

#01 | Getting Started with Pandas

Published Oct 07, 2022
#01 | Getting Started with Pandas

A clear introduction to Pandas, a Python library to manipulate tabular data, where you can discover its many possibilities and get a concise overview.

Read the original article here, in Hashnode.

An array is any type of object that can store more than one object. For example, the list:

[100, 134, 87, 99]

Let's say we are talking about the revenue our e-commerce has had over the last 4 months:

list_revenue = [100, 134, 87, 99]

We want to calculate the total revenue (i.e., we sum up the objects within the list):

list_revenue.sum()
--------------------------------------------------------------------------- AttributeError Traceback (most recent call last) Input In [3], in <cell line: 1>()
----> 1 list_revenue.sum() AttributeError: 'list' object has no attribute 'sum'

The list is a poor object which doesn't contain powerful functions.

What can we do then?

We convert the list to a powerful object such as the Series, which comes from pandas library.

import pandas

pandas.Series(list_revenue)
>>>
0 100
1 134
2 87
3 99
dtype: int64
series_revenue = pandas.Series(list_revenue)

Now we have a powerful object that can perform the .sum():

series_revenue.sum()
>>> 420

Series.jpg

Within the Series, we can find more objects.

series_revenue
>>>
0 100
1 134
2 87
3 99
dtype: int64
series_revenue.index
>>> RangeIndex(start=0, stop=4, step=1)

Let's change the elements of the index:

series_revenue.index = ['1st Month', '2nd Month', '3rd Month', '4th Month']
series_revenue
>>>
1st Month 100
2nd Month 134
3rd Month 87
4th Month 99
dtype: int64
series_revenue.values
>>> array([100, 134, 87, 99])
series_revenue.name

The Series doesn't contain a name. Let's define it:

series_revenue.name = 'Revenue'
series_revenue
>>>
1st Month 100
2nd Month 134
3rd Month 87
4th Month 99
Name: Revenue, dtype: int64

The values of the Series (right-hand side) are determined by their data type (alias dtype):

series_revenue.dtype
>>> dtype('float64')

Let's change the values' dtype to be float (decimal numbers)

series_revenue.astype(float)
>>>
1st Month 100.0
2nd Month 134.0
3rd Month 87.0
4th Month 99.0
Name: Revenue, dtype: float64
series_revenue = series_revenue.astype(float)

What else could we do with the Series object?

series_revenue.describe()
>>>
count 4.000000
mean 105.000000
std 20.215506
min 87.000000
25% 96.000000
50% 99.500000
75% 108.500000
max 134.000000
Name: Revenue, dtype: float64
series_revenue.plot.bar();

output_39_0.png

series_revenue.plot.barh();

output_40_0.png

series_revenue.plot.pie();

output_41_0.png

DataFrame.jpg

The DataFrame is a set of Series.

We will create another Series series_expenses to later put them together into a DataFrame.

pandas.Series( data=[20, 23, 21, 18], index=['1st Month','2nd Month','3rd Month','4th Month'], name='Expenses'
)
>>>
1st Month 20
2nd Month 23
3rd Month 21
4th Month 18
Name: Expenses, dtype: int64
series_expenses = pandas.Series( data=[20, 23, 21, 18], index=['1st Month','2nd Month','3rd Month','4th Month'], name='Expenses'
)
pandas.DataFrame(data=[series_revenue, series_expenses])

df1.png

df_shop = pandas.DataFrame(data=[series_revenue, series_expenses])

Let's transpose the DataFrame to have the variables in columns:

df_shop.transpose()

df2.png

df_shop = df_shop.transpose()
df_shop.index
>>> Index(['1st Month', '2nd Month', '3rd Month', '4th Month'], dtype='object')
df_shop.columns
>>> Index(['Revenue', 'Expenses'], dtype='object')
df_shop.values
>>>
array([[100., 20.], [134., 23.], [87., 21.], [99., 18.]])
df_shop.shape
>>> (4, 2)

What else could we do with the DataFrame object?

df_shop.describe()

df3.png

df_shop.plot.bar();

output_63_0.png

df_shop.plot.pie(subplots=True);

output_64_0.png

df_shop.plot.line();

output_65_0.png

df_shop.plot.area();

output_66_0.png

We could also export the DataFrame to formatted data files:

df_shop.to_excel('data.xlsx')
df_shop.to_csv('data.csv')
url = 'https://raw.githubusercontent.com/jsulopzs/data/main/football_players_stats.json'
pandas.read_json(url, orient='index')

df4.png

df_football = pandas.read_json(url, orient='index')
df_football.Goals.plot.pie();

output_76_0.png

url = 'https://raw.githubusercontent.com/jsulopzs/data/main/best_tennis_players_stats.json'
pandas.read_json(path_or_buf=url, orient='index')

df5.png

df_tennis = pandas.read_json(path_or_buf=url, orient='index')
df_tennis.style.background_gradient()

df6.png

df_tennis.plot.pie(subplots=True, layout=(2,3), figsize=(10,6));

output_82_0.png

HTML Web Page

pandas.read_html('https://www.skysports.com/la-liga-table/2021', index_col='Team')[0]

df7.png

df_laliga = pandas.read_html('https://www.skysports.com/la-liga-table/2021', index_col='Team')[0]
df_laliga.Pts.plot.barh();

output_87_0.png

df_laliga.Pts.sort_values().plot.barh();

output_88_0.png

url = 'https://raw.githubusercontent.com/jsulopzs/data/main/internet_usage_spain.csv'
pandas.read_csv(filepath_or_buffer=url)

df8.png

df_internet = pandas.read_csv(filepath_or_buffer=url)
df_internet.hist();

output_93_0.png

df_internet.pivot_table(index='education', columns='internet_usage', aggfunc='size')

df-pivot.png

dfres = df_internet.pivot_table(index='education', columns='internet_usage', aggfunc='size')
dfres.style.background_gradient('Greens', axis=1)

dfpivot-color.png

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Discover and read more posts from Jesus Lopez
get started