Pandas in Python
Introduction :
- Pandas is a Python library used for working with datasets.
- It is used for exploring, cleaning, manipulating, and analyzing data.
- The name "Pandas" refers to both "Panel Data" and "Python Data Analysis".
Importing Pandas :
import pandas
Once imported, Pandas is ready to use.
import pandas
a={
"Fruits":["apple","mango","kiwi"],
"Qty":[1,2,3]
}
df=pandas.DataFrame(a)
print(df)
Importing Pandas with alias :
Pandas is usually imported under the alias pd.
An alias is an alternate name for referring to the same thing.
import pandas as pd
a={
"Fruits":["apple", "mango", "kiwi"],
"Qty":[1,2,3]
}
df=pd.DataFrame(a)
print(df)
Pandas Series :
- A Series is a one-dimensional labelled array capable of holding data of any type (integers, strings, floats, Python objects); each Series has a single dtype.
- Series is like a column in a table.
import pandas as pd
a=[1,2,3]
s=pd.Series(a)
print(s)
Labels :
If no index is specified, the values are labelled with their position: the first element has index 0, the second has index 1, and so on.
- We can access elements in a Series by these labels :
import pandas as pd
a=[1,2,3]
s=pd.Series(a)
print(s[0])
Creating Labels:
We can also create our own labels with the index argument.
import pandas as pd
a=[1,2,3]
s=pd.Series(a,index=['a','b','c'])
print(s)
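Once a Series has custom labels, its values can be looked up by label. The following is a minimal sketch with made-up values; the variable names second and subset are only illustrative.

```python
import pandas as pd

# Build a Series with custom string labels instead of the default 0, 1, 2.
s = pd.Series([1, 2, 3], index=["a", "b", "c"])

# Look up a single value by its label.
second = s["b"]
print(second)

# Passing a list of labels returns a smaller Series.
subset = s[["a", "c"]]
print(subset)
```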
Pandas DataFrames :
- A DataFrame is a two-dimensional, table-like structure made of rows and columns.
- A Series is like a single column, whereas a DataFrame is the whole table.
import pandas as pd
a={
"Fruits":["apple","mango","kiwi"],
"Qty.":[1,2,3]
}
df=pd.DataFrame(a)
print(df)
Index in DataFrame :
As with a Series, we can also name the index labels of a DataFrame.
import pandas as pd
a={
"Fruits":["apple","mango","kiwi"],
"Qty":[1,2,3]
}
df=pd.DataFrame(a,index=["x","y","z"])
print(df)
Loc :
The loc[] attribute returns one or more specified rows.
import pandas as pd
a={
"Fruits":["apple","mango","banana"],
"Qty.":[1,2,3]
}
df=pd.DataFrame(a)
print(df.loc[0])
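To illustrate "one or more" rows, here is a small sketch: a single label gives back a Series, while a list of labels gives back a DataFrame. The names one_row and two_rows are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "Fruits": ["apple", "mango", "banana"],
    "Qty.": [1, 2, 3],
})

one_row = df.loc[0]        # a single label returns a Series
two_rows = df.loc[[0, 1]]  # a list of labels returns a DataFrame

print(type(one_row).__name__)
print(type(two_rows).__name__)
print(two_rows)
```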
You can also access DataFrame rows by their named index using the loc[] attribute :
import pandas as pd
a={
"Fruits":["apple","mango","kiwi"],
"Qty.":[1,2,3]
}
df=pd.DataFrame(a,index=['x','y','z'])
print(df.loc['x'])
Reading CSV File:
- CSV stands for Comma-Separated Values.
- Pandas provides the read_csv() function to load a CSV file into a DataFrame.
- The examples below use a file named 'data.csv'.
import pandas as pd
a=pd.read_csv('data.csv')
print(a)
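By default, when a DataFrame has more rows than the display.max_rows option (60 by default), print shows only the first 5 and last 5 rows, together with the headers.
If you want to print the entire DataFrame, use the to_string() method.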
import pandas as pd
a=pd.read_csv('data.csv')
print(a.to_string())
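If no 'data.csv' file is at hand, the same calls can be tried on CSV text held in memory. This is only a sketch: the CSV content and column names below are invented for illustration and do not come from the real file.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for 'data.csv'.
csv_text = "Fruits,Qty\napple,1\nmango,2\nkiwi,3\n"

# read_csv() accepts a file path or any file-like object.
a = pd.read_csv(io.StringIO(csv_text))

# to_string() must be called (note the parentheses) to get the full table as text.
full_text = a.to_string()
print(full_text)
```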
Analyzing the Data :
Head() Method :
The head() method returns the headers and a specified number of rows from the top of the dataset.
# Get a quick overview by printing the first 3 rows of the dataset :
import pandas as pd
a=pd.read_csv('data.csv')
print(a.head(3))
NOTE : If the number of rows is not specified, head() returns the first 5 rows.
Tail() Method :
The tail() method returns the headers and a specified number of rows from the bottom of the dataset.
# Get the last 10 rows of the dataset
import pandas as pd
a=pd.read_csv('data.csv')
print(a.tail(10))
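The behaviour of head() and tail() can be checked on a small generated table. The sketch below assumes a made-up DataFrame with 20 numbered rows rather than the real 'data.csv'.

```python
import pandas as pd

# Hypothetical data: 20 rows numbered 0 to 19.
df = pd.DataFrame({"n": range(20)})

top_default = df.head()   # no argument: first 5 rows
top_three = df.head(3)    # first 3 rows
bottom_two = df.tail(2)   # last 2 rows

print(top_three)
print(bottom_two)
```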
Information about Data :
The info() method prints a summary of the dataset: the column names, the number of non-null values in each column, the data types, and the memory usage.
import pandas as pd
a=pd.read_csv('data.csv')
# info() prints the summary itself, so no print() is needed
a.info()
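Note that info() writes its summary directly and returns None, so wrapping it in print() adds a stray "None" line. The sketch below uses invented data and captures the summary in a text buffer through the buf parameter so it can be inspected as a string.

```python
import io
import pandas as pd

# Hypothetical data with one missing fruit name.
df = pd.DataFrame({"Fruits": ["apple", None, "kiwi"], "Qty": [1, 2, 3]})

buffer = io.StringIO()
result = df.info(buf=buffer)   # the summary goes to the buffer; result is None
summary = buffer.getvalue()
print(summary)
```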
Data Cleaning:
- Data cleaning means fixing wrong data.
- Wrong data can be empty values, duplicates, or data in the wrong format.
Remove Empty values :
- One way to deal with empty values is to remove the rows that contain them.
- The dropna() method removes rows that contain empty values.
import pandas as pd
a=pd.read_csv('data.csv')
df=a.dropna()
print(df)
By default, the dropna() method returns a new DataFrame and does not change the original DataFrame.
If you want to change the original DataFrame, use inplace=True.
import pandas as pd
a=pd.read_csv('data.csv')
a.dropna(inplace=True)
print(a)
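The difference between the two forms can be seen on a small table. This is a sketch with invented data; rows_before and cleaned are illustrative names.

```python
import pandas as pd

# Hypothetical data with one missing value in "Qty".
df = pd.DataFrame({
    "Fruits": ["apple", "mango", "kiwi"],
    "Qty": [1.0, None, 3.0],
})

rows_before = len(df)
cleaned = df.dropna()        # returns a new DataFrame; df is untouched
print(len(df), len(cleaned))

df.dropna(inplace=True)      # modifies df itself and returns None
print(len(df))
```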
- Another way to deal with empty values is to replace them with a new value.
- The fillna() method fills null values with a specified value.
# Fill the null values with 130 :
import pandas as pd
a=pd.read_csv('data.csv')
a.fillna(130,inplace=True)
print(a)
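A single value passed to fillna() is written into every empty cell, whatever the column. When only some columns should be filled, a dict of column names can be passed instead. The sketch below uses invented data and column names.

```python
import pandas as pd

# Hypothetical data with one empty cell in each numeric column.
df = pd.DataFrame({
    "Fruits": ["apple", "mango", "kiwi"],
    "Qty": [1.0, None, 3.0],
    "Price": [None, 2.5, 4.0],
})

# A single value fills every empty cell.
filled_all = df.fillna(130)

# A dict fills only the listed columns, each with its own value.
filled_some = df.fillna({"Qty": 0})

print(filled_all)
print(filled_some)
```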
Removing Duplicates :
To discover duplicates in a dataset, use the duplicated() method.
import pandas as pd
a=pd.read_csv('data.csv')
print(a.duplicated())
The duplicated() method returns True for each row that repeats an earlier row and False otherwise.
To remove duplicates from a dataset, use the drop_duplicates() method.
import pandas as pd
a=pd.read_csv('data.csv')
a.drop_duplicates(inplace=True)
print(a)
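Both methods can be tried on a tiny table where the duplicate is easy to spot. This is a sketch with invented data; flags and unique are illustrative names.

```python
import pandas as pd

# Hypothetical data: the third row repeats the first.
df = pd.DataFrame({
    "Fruits": ["apple", "mango", "apple"],
    "Qty": [1, 2, 1],
})

flags = df.duplicated()         # True for a row that repeats an earlier row
unique = df.drop_duplicates()   # new DataFrame; the first occurrence is kept

print(flags)
print(unique)
```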
Cleaning Wrong Data :
Wrong data can be data in the wrong format.
To replace a wrong value in a specific cell, use the loc[] attribute.
import pandas as pd
a=pd.read_csv('data.csv')
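# loc[] is label-based: this targets row label 0 and the column labelled 7.
# Replace 7 with the actual column name from your file.
a.loc[0,7]=45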
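Because loc[] works with labels, a single cell is addressed as loc[row_label, column_label]. The sketch below uses invented data in which the value 200 plays the role of a typing mistake.

```python
import pandas as pd

# Hypothetical data: 200 stands in for a wrong entry.
df = pd.DataFrame({
    "Fruits": ["apple", "mango", "kiwi"],
    "Qty": [1, 200, 3],
})

# Overwrite the cell at row label 1, column "Qty".
df.loc[1, "Qty"] = 2
print(df)
```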
Correlation in Pandas :
The corr() method returns the correlation between each pair of numeric columns in a dataset.
import pandas as pd
a=pd.read_csv('data.csv')
# numeric_only=True (pandas 1.5+) skips non-numeric columns
print(a.corr(numeric_only=True))
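The result of corr() is a square table with one row and one column per numeric column; values near 1 mean two columns rise together, and values near -1 mean one rises as the other falls. The sketch below uses invented columns built to show both extremes.

```python
import pandas as pd

# Hypothetical data with exact linear relationships.
df = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "double_x": [2, 4, 6, 8],      # rises exactly with x
    "minus_x": [-1, -2, -3, -4],   # falls exactly as x rises
})

corr = df.corr()
print(corr)
```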