Kaggle Titanic: Machine Learning from Disaster | Feature Engineering Part 1
I barely remember when exactly I first watched the Titanic movie, but even now the Titanic remains a subject of discussion in the most diverse areas. The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew.
In this Kaggle challenge, we're asked to complete the analysis of what sorts of people were likely to survive. In particular, we're asked to apply the tools of machine learning to predict which passengers survived the tragedy.
More challenge information and the datasets are available on the Kaggle Titanic page. The data has been split into two groups:
training set (train.csv)
test set (test.csv)
TL;DR
Look at the Big Picture
The goal is to build a model that can predict the survival or the death of a given passenger based on a set of variables describing them, such as age, sex, or passenger class on the boat.
Frame the Problem
Framing the ML problem well is very important because it determines our problem space: which algorithms we will select, what performance measure we will use to evaluate our model, and how much effort we should spend tweaking it.
The test set should be used to see how well our model performs on unseen data. For the test set, the ground truth for each passenger is not provided. It is our job to predict these outcomes. For each passenger in the test set, we use the trained model to predict whether or not they survived the sinking of the Titanic. We will use Cross-validation for evaluating estimator performance.
Basically, two datasets are available: a train set and a test set. We'll use the training set to build our predictive model and the test set to validate it. This is a binary classification problem.
To solve this ML problem, we will cover topics like feature analysis, data visualization, missing data imputation, feature engineering, model fine-tuning and various classification models, ending with ensemble modeling.
Preprocessing
In Data Science or ML problem spaces, data preprocessing matters a lot: it means making the data usable and clean before using it, i.e. before fitting the model.
Now, real-world data is messy; it often contains:
- inconsistent values
- duplicate records
- missing values
- invalid data
- outliers
So what? Actually this is a matter of big concern, because most models can't handle missing data, so we need to deal with it ourselves. There are many approaches we can take to handle missing values in our data sets, such as:
- Remove observations/records that have missing values. But:
  - data may be missing at random, so by doing this we may lose a lot of data
  - data may be missing non-randomly, so by doing this we may lose a lot of data and also introduce potential biases
- Imputation:
  - replace missing values with other values
  - strategies: mean, median or highest-frequency value of the given feature (see the short sketch below)
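As a quick illustration of the imputation strategies above, here's a minimal pandas sketch (the toy DataFrame is purely hypothetical; we'll deal with the real Titanic columns later on):
# toy frame with missing values, purely for illustration
import pandas as pd
df = pd.DataFrame({'Age': [22.0, 38.0, None, 35.0],
                   'Embarked': ['S', 'C', None, 'S']})
# mean / median imputation for a numerical feature
df['Age_mean'] = df['Age'].fillna(df['Age'].mean())
df['Age_median'] = df['Age'].fillna(df['Age'].median())
# highest-frequency (mode) imputation for a categorical feature
df['Embarked_mode'] = df['Embarked'].fillna(df['Embarked'].mode()[0])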
Table of Contents
The steps we will go through are as follows:
- Get the Data and Explore
  Here we'll explore what's inside the dataset and, based on that, make our first commit on it.
- Feature Analysis to Gain Insights
  First we try to find outliers in our datasets. There are many methods to detect outliers, but here we will use the Tukey method. Then we will analyse the components of our features.
- Feature Engineering
  Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. In our dataset there are a few features we can engineer; I'd like to work on two of them:
  - Name
  - Family Size
- Predictive Modeling (In Part 2)
  Here, we will use various classification models and compare the results. We'll use cross-validation for evaluating estimator performance, fine-tune the models, observe the learning curve of the best estimator and finally do ensemble modeling with the three best predictive models.
- Submit Predictions
  Create a CSV file and submit to Kaggle.
Import
First we will load various libraries. It may look confusing, but we will see the use cases of each of them in detail later on.
# Data Processing and Visualization Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Data Modelling Libraries
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
GradientBoostingClassifier, ExtraTreesClassifier,
VotingClassifier)
from sklearn.model_selection import (GridSearchCV, cross_val_score, cross_val_predict,
StratifiedKFold, learning_curve)
from sklearn.metrics import (confusion_matrix, accuracy_score)
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
import warnings
from collections import Counter
sns.set(style = 'white' , context = 'notebook', palette = 'deep')
warnings.filterwarnings('ignore', category = DeprecationWarning)
%matplotlib inline
Get Data Sets
Using pandas, we now load the datasets: basically two files, one for training and the other for testing.
# load the datasets using pandas's read_csv method
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
# concat these two datasets, this will come handy while processing the data
dataset = pd.concat(objs=[train, test], axis=0).reset_index(drop=True)
# separately store the IDs of the test dataset;
# these will be used at the end of the task for the submission.
TestPassengerID = test['PassengerId']
Look Inside
Let's look at what we've just loaded: dataset size, shape, a short description and a few more things.
# shape of the data set
train.shape # (891, 12)
So it has 891 samples with 12 features. That's reasonably big; let's see the top 5 samples.
# first 5 records
train.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
It's more convenient to run each code snippet on jupyter cell.
Definitions of each feature and quick thoughts:
- PassengerId. Unique identification of the passenger. It shouldn't be necessary for the machine learning model.
- Survived. Survival (0 = No, 1 = Yes). Binary variable that will be our target variable.
- Pclass. Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd). Ready to go.
- Name. Name of the passenger. We need to parse before using it.
- Sex. Gender. Categorical variable that should be encoded. We can use a dummy variable to encode it.
- Age. Age in years.
- SibSp. Siblings / Spouses aboard the Titanic.
- Parch. Parents / Children aboard the Titanic.
- Ticket. Ticket number. Big mess.
- Fare. Passenger fare.
- Cabin. Cabin number.
- Embarked. Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton). Categorical feature that should be encoded. We can use feature mapping or make dummy variables for it.
The main conclusion is that we already have a set of features that we can easily use in our machine learning model. But features like Name, Ticket, Cabin require an additional effort before we can integrate them.
# using info method we can get quick overview of the data sets
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
One thing to notice: we have 891 samples or entries, but columns like Age, Cabin and Embarked have some missing values. We can't ignore those. However, let's generate the descriptive statistics to get basic quantitative information about the features of our data set.
# Descriptive Statistics
train.describe()
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
There are three aspects that usually catch my attention when I analyse descriptive statistics:
- Min and max values: This can give us an idea about the range of values and is helpful to detect outliers.
- Mean and standard deviation: The mean shows us the central tendency of the distribution, while the standard deviation quantifies its amount of variation.
- Count: Gives us a first perception of the volume of missing data.
Let's define a function for a more detailed missing-data analysis.
# Create table for missing data analysis
def find_missing_data(data):
Total = data.isnull().sum().sort_values(ascending = False)
Percentage = (data.isnull().sum()/data.isnull().count()).sort_values(ascending = False)
return pd.concat([Total,Percentage] , axis = 1 , keys = ['Total' , 'Percent'])
# run the method
find_missing_data(train)
Total Percent
Cabin 687 0.771044
Age 177 0.198653
Embarked 2 0.002245
Fare 0 0.000000
Ticket 0 0.000000
Parch 0 0.000000
SibSp 0 0.000000
Sex 0 0.000000
Name 0 0.000000
Pclass 0 0.000000
Survived 0 0.000000
PassengerId 0 0.000000
Let's create a heatmap plot to visualize the amount of missing values.
# checking only train set - visualize
sns.heatmap(train.isnull(), cbar = False ,
yticklabels = False , cmap = 'viridis')
We can see that the Cabin feature has a terrible amount of missing values: around 77% of the data is missing. Until now we've only looked at the train set; now let's see the amount of missing values in the whole dataset.
find_missing_data(dataset)
Total Percent
Cabin 1014 0.774637
Survived 418 0.319328
Age 263 0.200917
Embarked 2 0.001528
Fare 1 0.000764
Ticket 0 0.000000
SibSp 0 0.000000
Sex 0 0.000000
Pclass 0 0.000000
PassengerId 0 0.000000
Parch 0 0.000000
Name 0 0.000000
# checking only datasets set
sns.heatmap(dataset.isnull(), cbar = False ,
yticklabels = False , cmap = 'viridis')
As mentioned earlier, the ground truth of the test dataset is missing.
Problem Spaces
So, for the train data set we've seen its internal components and found some missing values. We've also had a look at the attributes we'll be working with.
Task: The goal is to predict the survival or the death of a given passenger based on a set of variables describing them, such as age, sex, or passenger class on the boat.
So, Survived is our target variable; this is the variable we're going to predict: 1 represents survived and 0 represents not survived. The rest of the attributes are feature variables, based on which we need to build a model that predicts whether a passenger survived or not.
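A quick way to check how balanced this target is on the training set (just a sanity check; from the descriptive statistics above, roughly 38% of passengers in the train set survived):
# class balance of the target variable in the training set
print(train['Survived'].value_counts())
print(train['Survived'].mean())   # fraction of survivors, roughly 0.38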
Preprocessing
As noted earlier, before fitting any model we need to make the data usable and clean: handle inconsistent values, duplicate records, missing values, invalid data and outliers. Let's start with feature analysis and outlier detection.
Feature Analysis
Outlier Detection
There are many methods to detect outliers. We will use the Tukey method to do it.
# Outlier detection
def detect_outliers(df, n, features):
    """
    Takes a dataframe df of features and returns a list of the indices
    corresponding to the observations containing more than n outliers according
    to the Tukey method.
    """
    outlier_indices = []
    # iterate over features (columns)
    for col in features:
        # 1st quartile (25%)
        Q1 = np.percentile(df[col], 25)
        # 3rd quartile (75%)
        Q3 = np.percentile(df[col], 75)
        # Interquartile range (IQR)
        IQR = Q3 - Q1
        # outlier step
        outlier_step = 1.5 * IQR
        # Determine a list of indices of outliers for feature col
        outlier_list_col = df[(df[col] < Q1 - outlier_step) |
                              (df[col] > Q3 + outlier_step)].index
        # append the found outlier indices for col to the list of outlier indices
        outlier_indices.extend(outlier_list_col)
    # select observations containing more than n outliers
    outlier_indices = Counter(outlier_indices)
    multiple_outliers = list(k for k, v in outlier_indices.items() if v > n)
    return multiple_outliers
# detect outliers from Age, SibSp , Parch and Fare
Outliers_to_drop = detect_outliers(train,2,["Age","SibSp","Parch","Fare"])
# Show the outliers rows
train.loc[Outliers_to_drop]
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
27 28 0 1 Fortune, Mr. Charles Alexander male 19.0 3 2 19950 263.00 C23 C25 C27 S
88 89 1 1 Fortune, Miss. Mabel Helen female 23.0 3 2 19950 263.00 C23 C25 C27 S
159 160 0 3 Sage, Master. Thomas Henry male NaN 8 2 CA. 2343 69.55 NaN S
180 181 0 3 Sage, Miss. Constance Gladys female NaN 8 2 CA. 2343 69.55 NaN S
201 202 0 3 Sage, Mr. Frederick male NaN 8 2 CA. 2343 69.55 NaN S
324 325 0 3 Sage, Mr. George John Jr male NaN 8 2 CA. 2343 69.55 NaN S
341 342 1 1 Fortune, Miss. Alice Elizabeth female 24.0 3 2 19950 263.00 C23 C25 C27 S
792 793 0 3 Sage, Miss. Stella Anna female NaN 8 2 CA. 2343 69.55 NaN S
846 847 0 3 Sage, Mr. Douglas Bullen male NaN 8 2 CA. 2343 69.55 NaN S
863 864 0 3 Sage, Miss. Dorothy Edith "Dolly" female NaN 8 2 CA. 2343 69.55 NaN S
# Drop outliers
train = train.drop(Outliers_to_drop, axis = 0).reset_index(drop=True)
# after removing outlier, let's re-concat the data sets
dataset = pd.concat(objs=[train, test], axis=0).reset_index(drop=True)
Now that we've removed the outliers, let's analyse the various features; at the same time we'll also handle the missing values during the analysis.
- Numerical Analysis
- Categorical Analysis
Numerical Analysis
First, let's analyse the correlation of the Survived feature with the other numerical features: SibSp, Parch, Age and Fare.
# Correlation matrix between numerical values (SibSp Parch Age and Fare values) and Survived
corr_numeric = sns.heatmap(dataset[["Survived","SibSp","Parch","Age","Fare"]].corr(),
annot=True, fmt = ".2f", cmap = "summer")
Only the Fare feature seems to have a significant correlation with the survival probability.
But that doesn't make the other features useless. Subpopulations in these features can still be correlated with survival. To estimate this, we need to explore these features in detail.
Age
Let's first look at the age distribution among survived and not-survived passengers.
# Explore the Age vs Survived features
age_survived = sns.FacetGrid(dataset, col='Survived')
age_survived = age_survived.map(sns.distplot, "Age")
So, it looks like the age distributions are not the same in the survived and not-survived subpopulations. Indeed, there is a peak corresponding to young passengers who survived. We also see that passengers between 60 and 80 survived less. So, even if Age is not strongly correlated with Survived, there are age categories of passengers that have a higher or lower chance of survival.
It seems that very young passengers have more chance to survive. Let's look at it one more time.
fig = sns.FacetGrid(dataset, hue = 'Survived', aspect = 4)
fig.map(sns.kdeplot, 'Age' , shade = True)
fig.set(xlim = (0, dataset['Age'].max()))
fig.add_legend()
Again we see that older passengers, between 65 and 80, survived less.
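To put rough numbers on that observation, we can bin Age and look at the survival rate per bin. This is just a small side check on the train set with arbitrary bin edges, not part of the main pipeline:
# bin ages into rough categories and compare survival rates (train set only,
# since the test set has no ground truth); rows with missing Age are skipped
age_bins = pd.cut(train['Age'], bins=[0, 10, 20, 40, 60, 80])
print(train.groupby(age_bins)['Survived'].mean())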
Missing Age Value
We have seen a significant number of missing values in the Age column. Missing Age values are a big issue; to address this problem, I've looked at the features most correlated with Age. Let's first look for a relationship between the Age and Sex features.
# visualize this using box plot
AS = sns.factorplot(y="Age", x="Sex", data = dataset, kind="box")
The age distribution seems to be almost the same in the male and female subpopulations, so Sex is not informative for predicting Age. Let's explore the Age and Pclass distributions.
facet = sns.FacetGrid(dataset, hue="Pclass", aspect=4)
facet.map(sns.kdeplot,'Age',shade= True)
facet.set(xlim=(0, train['Age'].max()))
facet.add_legend()
plt.show()
So, we see there are more young people in class 3. First class passengers seem to be older than second class, with third class following. But we still can't extract a rule for predicting age from this. Let's try another way to visualize the same relationship.
# using boxplot
PA = sns.factorplot(data = dataset , x = 'Pclass' , y = 'Age', kind = 'box')
Here we can get some information: first class passengers are older than second class passengers, who are in turn older than third class passengers. We can easily see that roughly 37, 29 and 24 are the median age values of the respective classes. This gives us a strategy: fill missing Age values with the median age of similar rows according to Pclass.
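Before hardcoding those numbers, it's worth confirming them directly (a quick check on the concatenated dataset):
# median age per passenger class, to confirm the values used in the imputation below
print(dataset.groupby('Pclass')['Age'].median())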
# a custom function for age imputation
def AgeImpute(df):
    Age = df[0]
    Pclass = df[1]
    if pd.isnull(Age):
        if Pclass == 1: return 37
        elif Pclass == 2: return 29
        else: return 24
    else:
        return Age
# Age Impute
dataset['Age'] = dataset[['Age', 'Pclass']].apply(AgeImpute, axis=1)
# Age feature imputed; no missing Age records remain
sns.heatmap(dataset.isnull(), yticklabels = False, cbar = False, cmap = 'summer')
The Fare feature is also missing some values; however, we will handle it later.
SibSp
Now, let's look at the Survived and SibSp features in detail.
# Explore SibSp feature vs Survived
# We'll use factorplot for the analysis
Sib_Sur = sns.factorplot(x="SibSp",y="Survived",data=train,
kind="bar", size = 6 , palette = "Blues")
Sib_Sur.despine(left=True)
Sib_Sur = Sib_Sur.set_ylabels("survival probability")
It seems that passengers with many siblings/spouses have less chance to survive. Single passengers (SibSp 0) or those with one or two siblings/spouses (SibSp 1 or 2) have more chance to survive.
Parch
Let's look at the Survived and Parch features in detail.
# Explore Parch feature vs Survived
# We'll use factorplot for the analysis
Sur_Par = sns.factorplot(x="Parch",y="Survived",data=train,
kind="bar", size = 6 , palette = "GnBu_d")
Sur_Par.despine(left=True)
Sur_Par = Sur_Par.set_ylabels("survival probability")
Small families have more chance to survive than single passengers.
Fare
Let's look at the Survived and Fare features in detail. We have seen that the Fare feature is also missing some values; let's handle that first.
dataset["Fare"].isnull().sum() # 1
Since we have only one missing value, I'd like to fill it with the median value.
dataset["Fare"] = dataset["Fare"].fillna(dataset["Fare"].median())
Categorical values
We can turn categorical values into numerical values. This is needed simply because we have to feed the training data to the model in numeric form. We can use feature mapping or create dummy variables.
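For reference, the two options look roughly like this (a sketch on a throwaway copy, so the actual encoding done below is unaffected):
# work on a copy so the real dataset isn't modified here
tmp = dataset.copy()
# option 1: feature mapping -- replace each category with an integer code
tmp['Sex_code'] = tmp['Sex'].map({'male': 0, 'female': 1})
# option 2: dummy variables -- one binary indicator column per category
tmp = pd.concat([tmp, pd.get_dummies(tmp['Embarked'], prefix='Emb')], axis=1)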
Sex
Let's take a quick look at the values in this feature.
print(dataset['Sex'].head()) # top 5
print(' ')
print(dataset['Sex'].tail()) # last 5
0 male
1 female
2 female
3 female
4 male
Name: Sex, dtype: object
1294 male
1295 female
1296 male
1297 male
1298 male
Name: Sex, dtype: object
A model can't take such values. We need to map the Sex column to numeric values so that our model can digest it.
# convert Sex into categorical value 0 for male and 1 for female
sex = pd.get_dummies(dataset['Sex'], drop_first = True)
dataset = pd.concat([dataset,sex], axis = 1)
# Now we no longer need the Sex feature, so we can drop it.
dataset.drop(['Sex'] , axis = 1 , inplace = True)
Let's see how many people survived based on their gender. We can guess that female passengers survived more than male ones, though this is just an assumption; after all, in the movie we heard "Women and Children First".
# using countplot to estimate amount
sns.countplot(data = train , x = 'Survived' , hue = 'Sex', palette = 'GnBu_d')
# let's see the percentage
train[["Sex","Survived"]].groupby('Sex').mean()
Survived
Sex
female 0.747573
male 0.190559
It is clear that males have less chance to survive than females. This is an important feature for our prediction task.
Pclass
Let's explore the passenger class feature together with the age feature. From this, we can see how many children, young and older people were in each passenger class.
facet = sns.FacetGrid(train, hue="Pclass",aspect=4)
facet.map(sns.kdeplot,'Age',shade= True)
facet.set(xlim=(0, train['Age'].max()))
facet.add_legend()
plt.show()
So, we see there are more young people in class 3. The older passengers were mostly in first class, which indicates that they were richer. So, most of the young people were in class three.
However, let's explore Pclass vs Survived using the Sex feature. This will give more information about the survival probability of each class according to gender.
Survived_Pclass = sns.factorplot(x="Pclass", y="Survived",
                                 hue="Sex", data=train, size=6,
                                 kind="bar", palette="BuGn_r")
Survived_Pclass.despine(left=True)
Survived_Pclass = Survived_Pclass.set_ylabels("survival probability")
Passenger survival is not the same across all classes. First class passengers have more chance to survive than second and third class passengers. And females survived more than males in every class.
Embarked
Port of Embarkation: C = Cherbourg, Q = Queenstown, S = Southampton. Categorical feature that should be encoded. We can use feature mapping or make dummy variables for it.
However, let's explore it in combination with the Pclass and Survived features, so that we can get an idea about the classes of the passengers for each port of embarkation.
# 'Embarked' vs 'Survived'
sns.barplot(dataset['Embarked'], dataset['Survived']);
It looks like people coming from Cherbourg have a higher chance to survive. But why? That's weird. Let's compare this feature with other variables.
# Count
print(dataset.groupby(['Embarked'])['PassengerId'].count())
# Compare with other variables
dataset.groupby(['Embarked']).mean()
Embarked
C 270
Q 123
S 904
Name: PassengerId, dtype: int64
Age Fare Parch PassengerId Pclass SibSp Survived male
Embarked
C 31.242296 62.336267 0.370370 690.655556 1.851852 0.400000 0.553571 0.581481
Q 25.963415 12.409012 0.113821 668.593496 2.894309 0.341463 0.389610 0.512195
S 28.973175 26.296450 0.409292 645.971239 2.347345 0.484513 0.339117 0.683628
Oh, passengers from C paid more and travelled in a better class than people embarking at Q and S. The number of passengers from S is larger than from the other ports, but the survival probability for C is higher than the others.
As we've seen earlier, the Embarked feature also has some missing values, so we can fill them with the most frequent value of Embarked, which is S (almost 904 passengers).
# count missing values
print(dataset["Embarked"].isnull().sum()) # 2
# Fill Embarked nan values of dataset set with 'S' most frequent value
dataset["Embarked"] = dataset["Embarked"].fillna("S")
# let's visualize it to confirm
sns.heatmap(dataset.isnull(), yticklabels = False,
cbar = False, cmap = 'summer')
And there it goes: now there are no missing values in the Embarked feature. Let's explore this feature a little bit more. We can visualize the survival probability together with the passenger classes embarked at the different ports.
# Counting passenger based on Pclass and Embarked
Embarked_Pc = sns.factorplot("Pclass", col="Embarked", data=dataset,
size=5, kind="count", palette="muted", hue = 'Survived')
Embarked_Pc.despine(left=True)
Embarked_Pc = Embarked_Pc.set_ylabels("Count")
Indeed, third class is the most frequent class for passengers coming from Southampton (S) and Queenstown (Q), while Cherbourg passengers are mostly in first class. From this, we can also get an idea about the economic conditions of these regions at that time.
However, we need to map the Embarked column to numeric values, so that our model can digest it.
# create dummy variable
embarked = pd.get_dummies(dataset['Embarked'], drop_first = True)
dataset = pd.concat([dataset,embarked], axis = 1)
# now that we don't need the Embarked column anymore, we can drop it.
dataset.drop(['Embarked'] , axis = 1 , inplace = True)
Commitment for Feature Analysis
So far, we've seen the various subpopulation components of each feature and filled the gaps in the missing values. We've made many visualizations of each component and tried to find some insight from them. Though we could dive deeper, I'd like to end here and focus on feature engineering.
We saw that we have several messy features like Name, Ticket and Cabin. We could do feature engineering on each of them and find some meaningful insight, but I'd like to work only on the Name variable. Ticket is, I think, not very important for the prediction task, and almost 77% of the data is missing in the Cabin variable.
However, let's have a quick look over our datasets.
dataset.head()
Age Cabin Fare Name Parch PassengerId Pclass SibSp Survived Ticket male Q S
0 22.0 NaN 7.2500 Braund, Mr. Owen Harris 0 1 3 1 0.0 A/5 21171 1 0 1
1 38.0 C85 71.2833 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 2 1 1 1.0 PC 17599 0 0 0
2 26.0 NaN 7.9250 Heikkinen, Miss. Laina 0 3 3 0 1.0 STON/O2. 3101282 0 0 1
3 35.0 C123 53.1000 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 4 1 1 1.0 113803 0 0 1
4 35.0 NaN 8.0500 Allen, Mr. William Henry 0 5 3 0 0.0 373450 1 0 1
Feature Engineering
Feature engineering is an informal topic, but it is considered essential in applied machine learning. Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work.
Some resources to get more depth on it -
- Data Preprocessing and Feature Exploration
- Makes a Good Feature
- Feature Engineering
- Automated Feature Engineering
Feature engineering is the art of converting raw data into useful features. There are several feature engineering techniques you can apply, for example:
- Box-Cox transformations
- Polynomials generation through non-linear expansions
But we don't want to be too serious about this right now; instead, we'll simply apply a couple of feature engineering approaches to extract useful information.
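For completeness, here's roughly what those two techniques look like with scikit-learn (a sketch only; we won't apply them in this notebook, and PowerTransformer is only available in newer scikit-learn versions):
from sklearn.preprocessing import PowerTransformer, PolynomialFeatures
# Box-Cox transform (requires strictly positive values, hence the +1 shift on Fare)
pt = PowerTransformer(method='box-cox')
fare_boxcox = pt.fit_transform(dataset[['Fare']] + 1)
# polynomial feature generation through a non-linear expansion of Age and Fare
poly = PolynomialFeatures(degree=2, include_bias=False)
age_fare_poly = poly.fit_transform(dataset[['Age', 'Fare']])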
Name
We can assume that people's title influences how they are treated. In our case, we have several titles (like Mr, Mrs, Miss, Master etc ), but only some of them are shared by a significant number of people. Accordingly, it would be interesting if we could group some of the titles and simplify our analysis.
Let's analyse the 'Name' and see if we can find a sensible way to group them. Then, we test our new groups and, if it works in an acceptable way, we keep it. For now, optimization will not be a goal. The focus is on getting something that can improve our current situation.
dataset['Name'].head(10)
0 Braund, Mr. Owen Harris
1 Cumings, Mrs. John Bradley (Florence Briggs Th...
2 Heikkinen, Miss. Laina
3 Futrelle, Mrs. Jacques Heath (Lily May Peel)
4 Allen, Mr. William Henry
5 Moran, Mr. James
6 McCarthy, Mr. Timothy J
7 Palsson, Master. Gosta Leonard
8 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
9 Nasser, Mrs. Nicholas (Adele Achem)
Name: Name, dtype: object
# Get Title from Name
dataset_title = [i.split(",")[1].split(".")[0].strip() for i in dataset["Name"]]
# add dataset_title to the main dataset named 'Title'
dataset["Title"] = pd.Series(dataset_title)
# count
dataset["Title"].value_counts()
Mr 753
Miss 255
Mrs 197
Master 60
Dr 8
Rev 8
Col 4
Mlle 2
Ms 2
Major 2
Lady 1
Don 1
the Countess 1
Jonkheer 1
Dona 1
Capt 1
Mme 1
Sir 1
Name: Title, dtype: int64
# Plot bar plot (titles and Age)
plt.figure(figsize=(18,5))
sns.barplot(x=dataset['Title'], y = dataset['Age'])
# Means per title
print(dataset.groupby('Title')['Age'].mean())
Title
Capt 70.000000
Col 54.000000
Don 40.000000
Dona 39.000000
Dr 42.750000
Jonkheer 38.000000
Lady 48.000000
Major 48.500000
Master 7.643000
Miss 22.261137
Mlle 24.000000
Mme 24.000000
Mr 30.926295
Mrs 35.898477
Ms 26.000000
Rev 41.250000
Sir 49.000000
the Countess 33.000000
Name: Age, dtype: float64
There are 18 titles in the dataset and most of them are very uncommon, so we'd like to group them into 4 categories.
# Convert to categorical values Title
dataset["Title"] = dataset["Title"].replace(['Lady', 'the Countess',
'Capt', 'Col','Don', 'Dr',
'Major', 'Rev', 'Sir', 'Jonkheer',
'Dona'], 'Rare')
dataset["Title"] = dataset["Title"].map({"Master":0, "Miss":1, "Ms" : 1 ,
"Mme":1, "Mlle":1, "Mrs":1, "Mr":2,
"Rare":3})
dataset["Title"] = dataset["Title"].astype(int)
# Drop Name variable
dataset.drop(labels = ["Name"], axis = 1, inplace = True)
# visualize the counts of the Title column
sns.countplot(dataset["Title"]).set_xticklabels(["Master","Miss-Mrs","Mr","Rare"]);
# Let's see, based on title what's the survival probability
sns.barplot(x='Title', y='Survived', data=dataset);
Catching Aspects:
- People with the title 'Mr' survived less than people with any other title.
- Titles with a survival rate higher than 70% are those that correspond to female (Miss-Mrs)
Our new category, 'Rare', could be broken down further. As we can see from the error bar (black line), there is significant uncertainty around the mean value. Probably one of the problems is that we are mixing male and female titles in the 'Rare' category. We should proceed with a more detailed analysis to sort this out. The 'Master' category seems to have a similar problem. For now, we will not make any changes, but we will keep these two situations in mind for future improvement of our data set.
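If we wanted to probe that later, one quick check (a sketch only, not applied here) would be to split the 'Rare' group by sex and compare survival rates:
# survival rate within the 'Rare' title group (mapped to code 3 above), split by
# the 'male' dummy created earlier, to see whether mixing sexes hides a real difference
rare = dataset[dataset['Title'] == 3]
print(rare.groupby('male')['Survived'].mean())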
From now on there is no Name feature; the Title feature represents it.
# viz top 5
dataset.head()
Age Cabin Fare Parch PassengerId Pclass SibSp Survived Ticket male Q S Title
0 22.0 NaN 7.2500 0 1 3 1 0.0 A/5 21171 1 0 1 2
1 38.0 C85 71.2833 0 2 1 1 1.0 PC 17599 0 0 0 1
2 26.0 NaN 7.9250 0 3 3 0 1.0 STON/O2. 3101282 0 0 1 1
3 35.0 C123 53.1000 0 4 1 1 1.0 113803 0 0 1 1
4 35.0 NaN 8.0500 0 5 3 0 0.0 373450 1 0 1 2
Family size
I'd like to create a Famize feature, which is the sum of SibSp and Parch plus one (the passenger themselves).
# Create a family size descriptor from SibSp and Parch
dataset["Famize"] = dataset["SibSp"] + dataset["Parch"] + 1
# Drop SibSp and Parch variables
dataset.drop(labels = ["SibSp",'Parch'], axis = 1, inplace = True)
# Visualize the survival probability by the Famize feature
facet = sns.FacetGrid(dataset, hue="Survived",aspect=4)
facet.map(sns.kdeplot,'Famize',shade= True)
facet.set(xlim=(0, dataset['Famize'].max()))
facet.add_legend()
plt.xlim(0);
Survival probability is worst for large families.
Cabin & Ticket
Now, the Cabin feature has a huge amount of missing data, so I'd like to drop it. Moreover, we also can't get much information from the Ticket feature for the prediction task.
# drop some useless features
dataset.drop(labels = ["Ticket",'Cabin','PassengerId'], axis = 1,
inplace = True)
Predictive Modeling
Continue with Part 2.
Next, we'll build the predictive models. We'll use cross-validation on some promising machine learning models, then do hyper-parameter tuning on a few selected models, and end up with an ensemble of the most prevalent ML algorithms.
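As a small preview of what that will look like, here's a cross-validation sketch using one of the classifiers imported earlier (splitting dataset back into its labelled training rows this way is an assumption about how Part 2 sets up X and y):
# split the engineered dataset back into the labelled training part;
# the train rows come first in the concatenated dataset
train_len = len(train)
X = dataset[:train_len].drop(labels=['Survived'], axis=1)
y = dataset[:train_len]['Survived'].astype(int)
# 10-fold stratified cross-validation of a single candidate model
kfold = StratifiedKFold(n_splits=10)
scores = cross_val_score(RandomForestClassifier(n_estimators=100),
                         X, y, cv=kfold, scoring='accuracy')
print(scores.mean(), scores.std())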
However, you can get the source code of today’s demonstration from the link below and can also follow me on GitHub for future code updates. Source Code : Titanic:ML
Say Hi On: Email | LinkedIn | Quora | GitHub | Medium | Twitter | Instagram