anshika vohra

Data Enthusiastic

Pandas in Python

Published Feb 13, 2022 | Last updated Feb 20, 2022

Introduction :

Pandas is a Python library used for working with datasets.
It is used for exploring, cleaning, manipulating, and analyzing data.
The name "Pandas" refers to both "Panel Data" and "Python Data Analysis".

Importing Pandas :

import pandas

Now, it is ready to use.

import pandas
a={
"Fruits":["apple","mango","kiwi"],
"Qty":[1,2,3]
}
df=pandas.DataFrame(a)
print(df)

Importing Pandas with alias :

Usually, Pandas is imported under the alias pd.
An alias is an alternate name for referring to the same thing.

import pandas as pd
a={
"Fruits":["apple", "mango", "kiwi"],
"Qty":[1,2,3]
}
df=pd.DataFrame(a)
print(df)

Pandas Series :

A Series is a one-dimensional labelled array. It can hold data of any type, and all of its values share a single data type (dtype).
A Series is like a column in a table.

import pandas as pd
a=[1,2,3]
s=pd.Series(a)
print(s)

Labels :

If no index is specified, the values are labelled with their position number: the first element has index 0, the second has index 1, and so on.

We can access an element of a Series by its index label :

import pandas as pd
a=[1,2,3]
s=pd.Series(a)
print(s[0])
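
To make the default labelling concrete, here is a minimal sketch. The values 10, 20 and 30 are illustrative and not part of the original example; it also shows iloc, which selects by position rather than by label.

```python
import pandas as pd

# Illustrative values (an assumption, not from the article's example).
s = pd.Series([10, 20, 30])

# With no index given, pandas assigns the labels 0, 1, 2.
labels = s.index.tolist()

first = s[0]        # lookup by label 0
last = s.iloc[-1]   # lookup by position (last element)

print(labels, first, last)
```

Here s[0] works because 0 is one of the default labels; iloc is the position-based alternative and keeps working when the labels are changed.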

Creating Labels:

We can also define our own labels by passing a list to the index argument.

import pandas as pd
a=[1,2,3]
s=pd.Series(a,index=['a','b','c'])
print(s)
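
Once labels are defined, they can be used to look values up. A small sketch of that, reusing the same values and labels as above:

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])

# Look up a single value by its label.
value_b = s["b"]

# Passing a list of labels returns a smaller Series.
subset = s[["a", "c"]]

print(value_b)
print(subset)
```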

Pandas DataFrames :

A DataFrame is a two-dimensional table, like a spreadsheet, made up of rows and one or more columns.
A Series is like a single column of a table, whereas a DataFrame is the whole table.

import pandas as pd
a={
"Fruits":["apple","mango","kiwi"],
"Qty.":[1,2,3]
}
df=pd.DataFrame(a)
print(df)
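
The relationship between the two can be checked directly: selecting one column of a DataFrame gives back a Series. A minimal sketch using the same fruit data:

```python
import pandas as pd

df = pd.DataFrame({
    "Fruits": ["apple", "mango", "kiwi"],
    "Qty": [1, 2, 3],
})

# Selecting a single column returns a Series.
fruits = df["Fruits"]
print(type(fruits).__name__)

# shape is (number of rows, number of columns).
print(df.shape)
```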

Index in DataFrame :

As with a Series, we can give a DataFrame our own index labels by passing a list to the index argument.

import pandas as pd
a={
"Fruits":["apple","mango","kiwi"],
"Qty":[1,2,3]
}
df=pd.DataFrame(a,index=["x","y","z"])
print(df)

Loc :

The loc[] attribute returns one or more rows, selected by their index label.

import pandas as pd
a={
"Fruits":["apple","mango","banana"],
"Qty.":[1,2,3]
}
df=pd.DataFrame(a)
print(df.loc[0])

You can also access the rows of a DataFrame by their named index using the loc[] attribute :

import pandas as pd
a={
"Fruits":["apple","mango","kiwi"],
"Qty.":[1,2,3]
}
df=pd.DataFrame(a,index=['x','y','z'])
print(df.loc['x'])
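
The sketch below illustrates what loc[] returns in a few common cases, using the same fruit data: one label gives a Series, a list of labels gives a DataFrame, and a row label plus a column label gives a single value.

```python
import pandas as pd

df = pd.DataFrame(
    {"Fruits": ["apple", "mango", "kiwi"], "Qty": [1, 2, 3]},
    index=["x", "y", "z"],
)

one_row = df.loc["x"]          # a single row, returned as a Series
two_rows = df.loc[["x", "z"]]  # several rows, returned as a DataFrame
qty_y = df.loc["y", "Qty"]     # one cell: row "y", column "Qty"

print(one_row)
print(two_rows)
print(qty_y)
```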

Reading CSV File:

CSV stands for Comma-Separated Values.
Pandas provides the read_csv() function to load a CSV file into a DataFrame.
I will be using the 'data.csv' file as an example.

import pandas as pd
a=pd.read_csv('data.csv')
print(a)

For a large DataFrame, print() shows only the first 5 and last 5 rows, along with the headers.
If you want to print the entire DataFrame, use the to_string() method.

import pandas as pd
a=pd.read_csv('data.csv')
print(a.to_string())
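
If no 'data.csv' file is at hand, the same calls can be tried on CSV text held in memory. This is only a sketch: the CSV content below is made up for illustration, and io.StringIO stands in for the file.

```python
import io
import pandas as pd

# Hypothetical CSV content (an assumption; the real data.csv is not shown).
csv_text = "Fruits,Qty\napple,1\nmango,2\nkiwi,3\n"

# read_csv accepts a file path or a file-like object such as StringIO.
df = pd.read_csv(io.StringIO(csv_text))

# to_string() must be called (note the parentheses) to get the full text.
text = df.to_string()
print(text)
```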

Analyzing the Data :

Head() Method :

The head() method returns the headers and the specified number of rows from the top of the dataset.

# Get the quick overview by printing 3 rows of the dataset :
import pandas as pd
a=pd.read_csv('data.csv')
print(a.head(3))

NOTE : If the number of rows is not specified, the head() method returns 5 rows.

Tail Method :

The tail() method returns the headers and the specified number of rows from the bottom of the dataset.

# Get the last 10 rows of the dataset
import pandas as pd
a=pd.read_csv('data.csv')
print(a.tail(10))
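
To see exactly which rows head() and tail() pick, here is a small sketch on a made-up 20-row DataFrame (the column name n is hypothetical):

```python
import pandas as pd

# Hypothetical data: one column holding the numbers 1 to 20.
df = pd.DataFrame({"n": range(1, 21)})

top = df.head(3)       # first 3 rows
bottom = df.tail(10)   # last 10 rows
default_top = df.head()  # no argument: 5 rows

print(top)
print(bottom)
```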

Information about Data :

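The info() method prints a summary of the dataset: the column names, the number of non-null values in each column, the data types, and the memory usage.

import pandas as pd
a=pd.read_csv('data.csv')
# info() prints the summary itself and returns None, so print() is not needed.
a.info()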
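
Because info() writes its summary out rather than returning it, the text can be captured by passing a buffer. A sketch under the assumption of a small made-up DataFrame with one missing value:

```python
import io
import pandas as pd

# Hypothetical data with one missing fruit name.
df = pd.DataFrame({
    "Fruits": ["apple", "mango", None],
    "Qty": [1, 2, 3],
})

# info() returns None; the buf argument redirects the summary text.
buffer = io.StringIO()
result = df.info(buf=buffer)
report = buffer.getvalue()

print(report)
```

The non-null count for Fruits in the report is lower than the row count, which is a quick way to spot columns with empty values.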

Data Cleaning:

Data cleaning means fixing wrong data in a dataset.
Wrong data can be empty values, duplicates, or data in the wrong format.

Remove Empty values :

One way to deal with empty values is to remove the rows that contain them.
The dropna() method is used to remove rows with empty values.

import pandas as pd
a=pd.read_csv('data.csv')
df=a.dropna()
print(df)

By default, the dropna() method returns a new DataFrame and leaves the original DataFrame unchanged.
If you want to change the original DataFrame, use inplace=True.

import pandas as pd
a=pd.read_csv('data.csv')
a.dropna(inplace=True)
print(a)
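
The difference between the two forms can be sketched on a small made-up DataFrame (the data here is hypothetical):

```python
import pandas as pd

# Hypothetical data: only the first row has no empty values.
original = pd.DataFrame({
    "Fruits": ["apple", None, "kiwi"],
    "Qty": [1.0, 2.0, None],
})

# Default: a new DataFrame is returned, the original keeps all 3 rows.
cleaned = original.dropna()
rows_before = len(original)

# inplace=True: the DataFrame itself is modified and nothing is returned.
modified = original.copy()
modified.dropna(inplace=True)

print(rows_before, len(cleaned), len(modified))
```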

Another way to deal with empty values is to replace them with a new value.
The fillna() method fills null values with the value you give it.

# Fill the null values with 130 :
import pandas as pd
a=pd.read_csv('data.csv')
a.fillna(130,inplace=True)
print(a)
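
Filling every column with the same number is not always what you want. As a sketch, fillna() also accepts a dictionary that maps column names to fill values, so only chosen columns are touched. The columns Qty and Price below are hypothetical.

```python
import pandas as pd

# Hypothetical numeric data with one empty value in each column.
df = pd.DataFrame({
    "Qty": [1.0, None, 3.0],
    "Price": [None, 5.0, 6.0],
})

# Fill every empty cell with 130.
filled_all = df.fillna(130)

# Fill only the Qty column; Price keeps its empty value.
filled_qty = df.fillna({"Qty": 0})

print(filled_all)
print(filled_qty)
```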

Removing Duplicates :

To discover duplicate rows in a dataset, use the duplicated() method.

import pandas as pd
a=pd.read_csv('data.csv')
print(a.duplicated())

The duplicated() method returns True for each row that is a duplicate of an earlier row, and False otherwise.

To remove duplicates from a dataset, use the drop_duplicates() method.

import pandas as pd
a=pd.read_csv('data.csv')
# With inplace=True the method returns None, so there is nothing to print here.
a.drop_duplicates(inplace=True)
print(a)
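
A small sketch of both methods on made-up data, where the third row repeats the first:

```python
import pandas as pd

# Hypothetical data: row 2 is an exact copy of row 0.
df = pd.DataFrame({
    "Fruits": ["apple", "mango", "apple"],
    "Qty": [1, 2, 1],
})

# The first occurrence is not flagged, only the later repeat.
flags = df.duplicated()

# Without inplace=True, a new DataFrame is returned.
unique_rows = df.drop_duplicates()

print(flags)
print(unique_rows)
```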

Cleaning Wrong Data :

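Wrong data can also be a value that is clearly incorrect or in the wrong format.
To replace such a value, use the loc[] attribute with the row label and the column label.

import pandas as pd
a=pd.read_csv('data.csv')
# Set the cell at row label 0, column label 7 to 45.
# If the columns of data.csv have names, use the column name in place of 7.
a.loc[0,7]=45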
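
With named columns, the same idea might look like the sketch below. The data and the limit of 10 are assumptions made up for illustration. It fixes one known bad cell, then uses a condition inside loc[] to correct every row that breaks a rule.

```python
import pandas as pd

# Hypothetical data: 200 and 500 are assumed to be data-entry mistakes.
df = pd.DataFrame({
    "Fruits": ["apple", "mango", "kiwi", "pear"],
    "Qty": [1, 200, 3, 500],
})

# Fix one known cell: row label 1, column "Qty".
df.loc[1, "Qty"] = 2

# Fix by rule: any Qty above the assumed limit of 10 is set to 10.
df.loc[df["Qty"] > 10, "Qty"] = 10

print(df)
```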

Correlation in Pandas :

The corr() method returns the correlation between each pair of numeric columns in a dataset, as a value between -1 and 1.

import pandas as pd
a=pd.read_csv('data.csv')
print(a.corr())
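
To get a feel for the numbers corr() produces, here is a sketch on made-up columns whose relationships are known in advance (all column names are hypothetical):

```python
import pandas as pd

# Hypothetical data: double_x rises with x, reversed_x falls as x rises.
df = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "double_x": [2, 4, 6, 8],
    "reversed_x": [4, 3, 2, 1],
})

corr = df.corr()

# A value close to 1 means the columns rise together;
# a value close to -1 means one falls as the other rises.
up = corr.loc["x", "double_x"]
down = corr.loc["x", "reversed_x"]

print(corr)
```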

Python Pandas Data Science Data analytics
