Jesus Lopez

Teaching professionals build Python data workflows, APIs, AI tools, and deploy them to the cloud—step by step

#01 | Getting Started with Pandas

Published Oct 07, 2022

A clear introduction to Pandas, a Python library to manipulate tabular data, where you can discover its many possibilities and get a concise overview.

Read the original article here, in Hashnode.

An array is any type of object that can store more than one object. For example, the list:

[100, 134, 87, 99]

Let's say we are talking about the revenue our e-commerce has had over the last 4 months:

list_revenue = [100, 134, 87, 99]

We want to calculate the total revenue (i.e., we sum up the objects within the list):

list_revenue.sum()

--------------------------------------------------------------------------- AttributeError Traceback (most recent call last) Input In [3], in <cell line: 1>()
----> 1 list_revenue.sum() AttributeError: 'list' object has no attribute 'sum'

The list is a poor object which doesn't contain powerful functions.

What can we do then?

We convert the list to a powerful object such as the Series, which comes from pandas library.

import pandas

pandas.Series(list_revenue)

>>>
0 100
1 134
2 87
3 99
dtype: int64

series_revenue = pandas.Series(list_revenue)

Now we have a powerful object that can perform the .sum():

series_revenue.sum()

>>> 420

Within the Series, we can find more objects.

series_revenue

>>>
0 100
1 134
2 87
3 99
dtype: int64

series_revenue.index

>>> RangeIndex(start=0, stop=4, step=1)

Let's change the elements of the index:

series_revenue.index = ['1st Month', '2nd Month', '3rd Month', '4th Month']

series_revenue

>>>
1st Month 100
2nd Month 134
3rd Month 87
4th Month 99
dtype: int64

series_revenue.values

>>> array([100, 134, 87, 99])

series_revenue.name

The Series doesn't contain a name. Let's define it:

series_revenue.name = 'Revenue'

series_revenue

>>>
1st Month 100
2nd Month 134
3rd Month 87
4th Month 99
Name: Revenue, dtype: int64

The values of the Series (right-hand side) are determined by their data type (alias dtype):

series_revenue.dtype

>>> dtype('float64')

Let's change the values' dtype to be float (decimal numbers)

series_revenue.astype(float)

>>>
1st Month 100.0
2nd Month 134.0
3rd Month 87.0
4th Month 99.0
Name: Revenue, dtype: float64

series_revenue = series_revenue.astype(float)

What else could we do with the Series object?

series_revenue.describe()

>>>
count 4.000000
mean 105.000000
std 20.215506
min 87.000000
25% 96.000000
50% 99.500000
75% 108.500000
max 134.000000
Name: Revenue, dtype: float64

series_revenue.plot.bar();

series_revenue.plot.barh();

series_revenue.plot.pie();

The DataFrame is a set of Series.

We will create another Series series_expenses to later put them together into a DataFrame.

pandas.Series( data=[20, 23, 21, 18], index=['1st Month','2nd Month','3rd Month','4th Month'], name='Expenses'
)

>>>
1st Month 20
2nd Month 23
3rd Month 21
4th Month 18
Name: Expenses, dtype: int64

series_expenses = pandas.Series( data=[20, 23, 21, 18], index=['1st Month','2nd Month','3rd Month','4th Month'], name='Expenses'
)

pandas.DataFrame(data=[series_revenue, series_expenses])

df_shop = pandas.DataFrame(data=[series_revenue, series_expenses])

Let's transpose the DataFrame to have the variables in columns:

df_shop.transpose()

df_shop = df_shop.transpose()

df_shop.index

>>> Index(['1st Month', '2nd Month', '3rd Month', '4th Month'], dtype='object')

df_shop.columns

>>> Index(['Revenue', 'Expenses'], dtype='object')

df_shop.values

>>>
array([[100., 20.], [134., 23.], [87., 21.], [99., 18.]])

df_shop.shape

>>> (4, 2)

What else could we do with the DataFrame object?

df_shop.describe()

df_shop.plot.bar();

df_shop.plot.pie(subplots=True);

df_shop.plot.line();

df_shop.plot.area();

We could also export the DataFrame to formatted data files:

df_shop.to_excel('data.xlsx')

df_shop.to_csv('data.csv')

url = 'https://raw.githubusercontent.com/jsulopzs/data/main/football_players_stats.json'
pandas.read_json(url, orient='index')

df_football = pandas.read_json(url, orient='index')

df_football.Goals.plot.pie();

url = 'https://raw.githubusercontent.com/jsulopzs/data/main/best_tennis_players_stats.json'
pandas.read_json(path_or_buf=url, orient='index')

df_tennis = pandas.read_json(path_or_buf=url, orient='index')

df_tennis.style.background_gradient()

df_tennis.plot.pie(subplots=True, layout=(2,3), figsize=(10,6));

HTML Web Page

pandas.read_html('https://www.skysports.com/la-liga-table/2021', index_col='Team')[0]

df_laliga = pandas.read_html('https://www.skysports.com/la-liga-table/2021', index_col='Team')[0]

df_laliga.Pts.plot.barh();

df_laliga.Pts.sort_values().plot.barh();

url = 'https://raw.githubusercontent.com/jsulopzs/data/main/internet_usage_spain.csv'
pandas.read_csv(filepath_or_buffer=url)

df_internet = pandas.read_csv(filepath_or_buffer=url)

df_internet.hist();

df_internet.pivot_table(index='education', columns='internet_usage', aggfunc='size')

dfres = df_internet.pivot_table(index='education', columns='internet_usage', aggfunc='size')

dfres.style.background_gradient('Greens', axis=1)

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Python Pandas Data Science Data visualization Data analytics

Report

Enjoy this post? Give Jesus Lopez a like if it's helpful.

Jesus Lopez

Teaching professionals build Python data workflows, APIs, AI tools, and deploy them to the cloud—step by step

**👋 About me** Frustrated by wasting time fixing errors without guidance? I’ve been there. When I started, I just wanted to master programming so I could build real data projects—fast and confidently. Now, after 5+ years and th...

Discover and read more posts from Jesus Lopez

get started