Kaggle Titanic: Machine Learning from Disaster | Modeling Part 2
Continued from Part 1.
Previously, we did the following:
- explored the data set
- performed advanced feature engineering
However, to get a sneak peek at the whole article (Parts 1 and 2), open up the notebook viewer; if you want to run each notebook cell, you can also use Binder.
Or go to the Kaggle-Play repo and launch Binder to run the notebook cells.
OK, it's time to build the model for our survival prediction problem.
TL;DR
Predictive Modeling
Here, we split the combined dataset back into separate train and test sets, using the original train length. To avoid overfitting we could hold out a single validation set, but that is not very effective on a small dataset. Instead, we take a K-fold approach and use StratifiedKFold to split the training data into 10 folds.
from sklearn.model_selection import StratifiedKFold

# Separate train dataset and test dataset
train = dataset[:len(train)]
test = dataset[len(train):]
test.drop(labels=["Survived"], axis=1, inplace=True)

## Separate train features and label
Y_train = train["Survived"].astype(int)
X_train = train.drop(labels=["Survived"], axis=1)

# Cross-validate models with stratified K-fold cross-validation
K_fold = StratifiedKFold(n_splits=10)
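To see what stratification buys us, here is a small check (my addition, assuming the variables defined above): each fold should keep roughly the same survival ratio as the full training set.
# Sanity check: every fold keeps roughly the overall Survived ratio
print(Y_train.mean())  # overall survival ratio
for fold_idx, (train_idx, valid_idx) in enumerate(K_fold.split(X_train, Y_train)):
    print(f"Fold {fold_idx}: valid size = {len(valid_idx)}, "
          f"survival ratio = {Y_train.iloc[valid_idx].mean():.3f}")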
Classifier
I compare 10 popular classifiers and evaluate the mean accuracy of each of them with a stratified k-fold cross-validation procedure.
- KNN
- AdaBoost
- Decision Tree
- Random Forest
- Extra Trees
- Support Vector Machine
- Gradient Boosting
- Logistic regression
- Linear Discriminant Analysis
- Multi-layer perceptron
Evaluation using Cross Validation
A great alternative is to use Scikit-Learn's cross-validation feature. The following performs K-fold cross-validation: it randomly splits the training set into 10 distinct subsets called folds, then trains and evaluates the models 10 times, picking a different fold for evaluation every time and training on the other 9 folds.
# Modeling step: test different algorithms
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (AdaBoostClassifier, RandomForestClassifier,
                              ExtraTreesClassifier, GradientBoostingClassifier)
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier

random_state = 2
models = []      # all candidate predictive models
cv_results = []  # per-model cross-validation scores
cv_means = []    # cross-validation mean value
cv_std = []      # cross-validation standard deviation
models.append(KNeighborsClassifier())
models.append(AdaBoostClassifier(DecisionTreeClassifier(random_state=random_state),
                                 random_state=random_state, learning_rate=0.1))
models.append(DecisionTreeClassifier(random_state=random_state))
models.append(RandomForestClassifier(random_state=random_state))
models.append(ExtraTreesClassifier(random_state=random_state))
models.append(SVC(random_state=random_state))
models.append(GradientBoostingClassifier(random_state=random_state))
models.append(LogisticRegression(random_state=random_state))
models.append(LinearDiscriminantAnalysis())
models.append(MLPClassifier(random_state=random_state))

for model in models:
    cv_results.append(cross_val_score(model, X_train, Y_train,
                                      scoring="accuracy", cv=K_fold, n_jobs=4))
for cv_result in cv_results:
    cv_means.append(cv_result.mean())
    cv_std.append(cv_result.std())
cv_frame = pd.DataFrame({
    "CrossValMeans": cv_means,
    "CrossValErrors": cv_std,
    "Algorithms": [
        "KNeighbors",
        "AdaBoost",
        "DecisionTree",
        "RandomForest",
        "ExtraTrees",
        "SVC",
        "GradientBoosting",
        "LogisticRegression",
        "LinearDiscriminantAnalysis",
        "MultipleLayerPerceptron"]
})

cv_plot = sns.barplot(x="CrossValMeans", y="Algorithms", data=cv_frame,
                      palette="husl", orient="h", **{'xerr': cv_std})
cv_plot.set_xlabel("Mean Accuracy")
cv_plot = cv_plot.set_title("CV Scores")
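For intuition, cross_val_score with our K_fold is roughly equivalent to the manual loop below: fit a fresh copy of the model on 9 folds, score it on the held-out fold, and repeat 10 times. This sketch is my addition and assumes the objects defined above.
from sklearn.base import clone

# Manual equivalent of cross_val_score for one model
model = RandomForestClassifier(random_state=random_state)
fold_scores = []
for train_idx, valid_idx in K_fold.split(X_train, Y_train):
    fold_model = clone(model)  # fresh, unfitted copy for this fold
    fold_model.fit(X_train.iloc[train_idx], Y_train.iloc[train_idx])
    fold_scores.append(fold_model.score(X_train.iloc[valid_idx], Y_train.iloc[valid_idx]))
print(np.mean(fold_scores))  # should be close to the cross_val_score mean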
Let's explore the following models separately:
- GBC Classifier
- Linear Discriminant Analysis
- Logistic Regression
- Random Forest Classifier
- Gaussian Naive Bayes
- Support Vector Machine
Let's start with the Gradient Boosting Classifier.
# GBC Classifier
GBC_Model = GradientBoostingClassifier()
scores = cross_val_score(GBC_Model, X_train, Y_train, cv = K_fold,
n_jobs = 4, scoring = 'accuracy')
print(scores)
round(np.mean(scores)*100, 2)
# output
[0.83146067 0.82954545 0.76136364 0.89772727 0.90909091 0.875
0.84090909 0.79545455 0.84090909 0.82954545]
84.11
Next, LDA
# Linear Discriminant Analysis
LDA_Model= LinearDiscriminantAnalysis()
scores = cross_val_score(LDA_Model, X_train, Y_train, cv = K_fold,
n_jobs = 4, scoring = 'accuracy')
print(scores)
round(np.mean(scores)*100, 2)
# output
[0.84269663 0.82954545 0.76136364 0.88636364 0.81818182 0.80681818
0.79545455 0.78409091 0.86363636 0.84090909]
82.29
Logistic Regression classifier.
# Logistic Regression
#
Log_Model = LogisticRegression(C=1)
scores = cross_val_score(Log_Model, X_train, Y_train, cv=K_fold,
n_jobs=4, scoring='accuracy')
print(scores)
round(np.mean(scores)*100, 2)
# output
[0.83146067 0.81818182 0.76136364 0.875 0.81818182 0.77272727
0.79545455 0.79545455 0.84090909 0.84090909]
81.5
Random Forest is an ensemble of decision tree classifiers. It should perform better than a single decision tree. Let's see.
# Random Forest Classifier Model
#
RFC_model = RandomForestClassifier(n_estimators=10)
scores = cross_val_score(RFC_model, X_train, Y_train, cv=K_fold,
n_jobs=4, scoring='accuracy')
print(scores)
round(np.mean(scores)*100, 2)
# output
[0.79775281 0.88636364 0.73863636 0.80681818 0.86363636 0.79545455
0.82954545 0.76136364 0.84090909 0.82954545]
81.5
Gaussian Naive Bayes often performs reasonably well on binary classification problems.
# Gaussian Naive Bayes
from sklearn.naive_bayes import GaussianNB

GNB_Model = GaussianNB()
scores = cross_val_score(GNB_Model, X_train, Y_train, cv=K_fold,
n_jobs=4, scoring='accuracy')
print(scores)
round(np.mean(scores)*100, 2)
# output
[0.78651685 0.81818182 0.75 0.86363636 0.77272727 0.79545455
0.80681818 0.78409091 0.85227273 0.84090909]
80.71
The Support Vector Machine (SVM) is a promising ML algorithm. It should also perform well.
# Support Vector Machine
SVM_Model = SVC()
scores = cross_val_score(SVM_Model, X_train, Y_train, cv=K_fold,
n_jobs=4, scoring='accuracy')
print(scores)
round(np.mean(scores)*100, 2)
# output
[0.69662921 0.65909091 0.64772727 0.72727273 0.76136364 0.70454545
0.76136364 0.73863636 0.72727273 0.78409091]
72.08
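The noticeably lower SVC score is not too surprising: RBF-kernel SVMs are sensitive to feature scales, and the engineered features are not standardized here. As a hedged aside (not part of the original notebook), scaling could be added inside a pipeline so that it is fit within each CV fold:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# SVC with standardized inputs; the scaler is fit inside each CV fold
Scaled_SVM = make_pipeline(StandardScaler(), SVC())
scores = cross_val_score(Scaled_SVM, X_train, Y_train, cv=K_fold,
                         n_jobs=4, scoring='accuracy')
print(round(np.mean(scores)*100, 2))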
Hyperparameter Tuning
I decided to choose the promising models (GradientBoosting, Linear Discriminant Analysis, RandomForest, Logistic Regression, and SVM) for the ensemble modeling, so now we need to fine-tune them.
One way to do that would be to fiddle with the hyperparameters manually until we find a great combination of values. This would be very tedious work, and we may not have time to explore many combinations. Instead, we should get Scikit-Learn's GridSearchCV to search for us. All we need to do is tell it which hyperparameters we want it to experiment with and what values to try out, and it will evaluate all possible combinations of hyperparameter values using cross-validation.
Here we perform grid search optimization for the GradientBoosting, RandomForest, Linear Discriminant Analysis, Logistic Regression, and SVC classifiers.
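Before running the searches, it helps to see where the fit counts in the logs below come from: the number of grid combinations times the number of CV folds. A quick sketch (my addition) using scikit-learn's ParameterGrid, applied to the GBC grid defined in the next subsection:
from sklearn.model_selection import ParameterGrid

# Same grid as the GBC search below
gb_param_grid = {'loss': ["deviance"],
                 'n_estimators': [100, 200, 300],
                 'learning_rate': [0.1, 0.05, 0.01, 0.001],
                 'max_depth': [4, 8, 16],
                 'min_samples_leaf': [100, 150, 250],
                 'max_features': [0.3, 0.1]}
n_candidates = len(ParameterGrid(gb_param_grid))
print(n_candidates)       # 216 candidate combinations
print(n_candidates * 10)  # 2160 fits with 10-fold CV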
Hyper-Parameter Tuning on GBC
# Gradient boosting tuning
GBC = GradientBoostingClassifier()
gb_param_grid = {
'loss' : ["deviance"],
'n_estimators' : [100,200,300],
'learning_rate': [0.1, 0.05, 0.01, 0.001],
'max_depth': [4, 8,16],
'min_samples_leaf': [100,150,250],
'max_features': [0.3, 0.1]
}
gsGBC = GridSearchCV(GBC, param_grid = gb_param_grid, cv=K_fold,
scoring="accuracy", n_jobs= 4, verbose = 1)
gsGBC.fit(X_train,Y_train)
GBC_best = gsGBC.best_estimator_
# Best score
gsGBC.best_score_
#output
Fitting 10 folds for each of 216 candidates, totalling 2160 fits
[Parallel(n_jobs=4)]: Done 42 tasks | elapsed: 2.6s
[Parallel(n_jobs=4)]: Done 626 tasks | elapsed: 12.9s
[Parallel(n_jobs=4)]: Done 1626 tasks | elapsed: 30.5s
[Parallel(n_jobs=4)]: Done 2160 out of 2160 | elapsed: 41.0s finished
0.8365493757094211
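Besides best_score_, it is worth inspecting which combination actually won and how the other candidates did; the exact values depend on the run, so no output is shown here.
# Winning hyperparameter combination
print(gsGBC.best_params_)

# Per-candidate results, best first
gbc_results = pd.DataFrame(gsGBC.cv_results_)
print(gbc_results[["params", "mean_test_score", "std_test_score"]]
      .sort_values("mean_test_score", ascending=False).head())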
Hyper-Parameter Tuning on RFC
# RFC parameters tuning
RFC = RandomForestClassifier()
## Search grid for optimal parameters
rf_param_grid = {"max_depth": [None],
"min_samples_split": [2, 6, 20],
"min_samples_leaf": [1, 4, 16],
"n_estimators" :[100,200,300,400],
"criterion": ["gini"]}
gsRFC = GridSearchCV(RFC, param_grid = rf_param_grid, cv=K_fold,
scoring="accuracy", n_jobs= 4, verbose = 1)
gsRFC.fit(X_train,Y_train)
RFC_best = gsRFC.best_estimator_
# Best score
gsRFC.best_score_
# output
Fitting 10 folds for each of 36 candidates, totalling 360 fits
[Parallel(n_jobs=4)]: Done 42 tasks | elapsed: 5.5s
[Parallel(n_jobs=4)]: Done 192 tasks | elapsed: 18.7s
[Parallel(n_jobs=4)]: Done 360 out of 360 | elapsed: 32.5s finished
0.8422247446083996
Hyper-Parameter Tuning on LR
# LogisticRegression parameters tuning
LRM = LogisticRegression()
## Search grid for optimal parameters
lr_param_grid = {"penalty" : ["l2"],
"tol" : [0.0001,0.0002,0.0003],
"max_iter": [100,200,300],
"C" :[0.01, 0.1, 1, 10, 100],
"intercept_scaling": [1, 2, 3, 4],
"solver":['liblinear'],
"verbose":[1]}
gsLRM = GridSearchCV(LRM, param_grid = lr_param_grid, cv=K_fold,
scoring="accuracy", n_jobs= 4, verbose = 1)
gsLRM.fit(X_train,Y_train)
LRM_best = gsLRM.best_estimator_
# Best score
gsLRM.best_score_
# output
Fitting 10 folds for each of 180 candidates, totalling 1800 fits
[Parallel(n_jobs=4)]: Done 351 tasks | elapsed: 2.6s
[LibLinear]
[Parallel(n_jobs=4)]: Done 1800 out of 1800 | elapsed: 4.4s finished
0.8240635641316686
Hyper-Parameter Tuning on LDA
# Linear Discriminant Analysis - Parameter Tuning
LDA = LinearDiscriminantAnalysis()
## Search grid for optimal parameters
lda_param_grid = {"solver" : ["svd"],
"tol" : [0.0001,0.0002,0.0003]}
gsLDA = GridSearchCV(LDA, param_grid = lda_param_grid, cv=K_fold,
scoring="accuracy", n_jobs= 4, verbose = 1)
gsLDA.fit(X_train,Y_train)
LDA_best = gsLDA.best_estimator_
# Best score
gsLDA.best_score_
# output
Fitting 10 folds for each of 3 candidates, totalling 30 fits
[Parallel(n_jobs=4)]: Done 23 out of 30 | elapsed: 1.9s remaining: 0.5s
[Parallel(n_jobs=4)]: Done 30 out of 30 | elapsed: 1.9s finished
0.8229284903518729
Hyper-Parameter Tuning on SVC
### SVC classifier
SVMC = SVC(probability=True)
svc_param_grid = {'kernel': ['rbf'],
'gamma': [0.0001, 0.001, 0.01, 0.1, 1],
'C': [1, 10, 50, 100, 200, 300]}
gsSVMC = GridSearchCV(SVMC, param_grid = svc_param_grid, cv = K_fold,
scoring="accuracy", n_jobs= -1, verbose = 1)
gsSVMC.fit(X_train,Y_train)
SVMC_best = gsSVMC.best_estimator_
# Best score
gsSVMC.best_score_
# output
Fitting 10 folds for each of 30 candidates, totalling 300 fits
[Parallel(n_jobs=-1)]: Done 50 tasks | elapsed: 3.2s
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed: 17.3s finished
0.8161180476730987
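To compare the tuned models at a glance, a small summary (my addition, assuming all five grid searches above have been run):
# Tuned cross-validation scores side by side
tuned_scores = pd.DataFrame({
    "Model": ["GradientBoosting", "RandomForest", "LogisticRegression",
              "LinearDiscriminantAnalysis", "SVC"],
    "BestCVScore": [gsGBC.best_score_, gsRFC.best_score_, gsLRM.best_score_,
                    gsLDA.best_score_, gsSVMC.best_score_]})
print(tuned_scores.sort_values("BestCVScore", ascending=False))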
Plot Learning Curves
Diagnose Bias and Variance to Reduce Error
Learning curves are a good way to see the effect of overfitting and underfitting on the training set, and the effect of the training size on accuracy. A learning curve plots the model's performance on the training set and the validation set as a function of training set size. To generate the plots, we simply train the model several times on differently sized subsets of the training set. In a nutshell, a learning curve shows how the error (or score) changes as the training set size increases.
If a model performs well on the training data but generalizes poorly according to the cross-validation metrics, it is overfitting. If it performs poorly on both, it is underfitting.
When the model is trained on very few training instances, it is incapable of generalizing properly, which is why the validation error will initially be quite large.
Underfitting: If the model is underfitting the training data, adding more training examples will not help. We need to use a more complex model or come up with better features.
Overfitting: One way to improve an overfitting model is to feed it more training data until the validation error reaches the training error. A quick numeric check of the train/validation gap is sketched below.
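Here is that quick numeric check (a sketch of mine, reusing RFC_model from earlier): compare accuracy on the data the model was fit on with its cross-validated accuracy; a large gap points to overfitting.
# Training accuracy vs cross-validated accuracy for the random forest
RFC_model.fit(X_train, Y_train)
train_acc = RFC_model.score(X_train, Y_train)
cv_acc = cross_val_score(RFC_model, X_train, Y_train,
                         cv=K_fold, scoring='accuracy').mean()
print(f"train accuracy: {train_acc:.3f}, cv accuracy: {cv_acc:.3f}")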
Bias-Variance Trade-Off
A model's generalization error can be expressed as the sum of three very different errors.
- Bias
- Variance
- Irreducible Error
Bias Error in Learning Curve
This part of the generalization error is due to wrong assumptions, such as assuming that the data is linear when it is actually quadratic.
- A high bias model is most likely to underfit the training data
Variance Error in Learning Curve
This part of the generalization error is due to the model's excessive sensitivity to small variations in the training data.
- A high variance model is most likely to overfit the training data
Irreducible Error in Learning Curve
This is due to the noisiness of the data itself. It is not a concern now, because we have already cleaned the data set.
Increasing a model's complexity will typically increase its variance and reduce its bias. Conversely, reducing a model's complexity increases its bias and reduces its variance.
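To make that concrete, a small sketch (my addition) that sweeps the depth of a single decision tree and compares training accuracy with cross-validated accuracy; as the depth grows, the training score keeps climbing while the CV score eventually stalls or drops.
# Sweep model complexity (tree depth) to see the bias/variance trade-off
for depth in [1, 2, 4, 8, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=random_state)
    tree.fit(X_train, Y_train)
    cv_acc = cross_val_score(tree, X_train, Y_train,
                             cv=K_fold, scoring='accuracy').mean()
    print(f"max_depth={depth}: train={tree.score(X_train, Y_train):.3f}, cv={cv_acc:.3f}")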
Now, we'll define a learning curve plotting function where the x and y axes are the training set size and the score (not the error), respectively. So the higher the score, the better the model performs.
# Plot learning curve
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=-1, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.
    title : string
        Title for the chart.
    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.
    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.
    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum y-values plotted.
    cv : integer, cross-validation generator, optional
        If an integer is passed, it is the number of folds.
        Specific cross-validation objects can be passed; see the
        sklearn.model_selection module for the list of possible objects.
    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).
    train_sizes : array-like, optional
        Fractions of the training set used for each point of the curve;
        np.linspace(.1, 1.0, 5) produces 5 evenly spaced fractions.
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")
    plt.legend(loc="best")
    return plt
# Gradient boosting - Learning Curve
plot_learning_curve(estimator = gsGBC.best_estimator_,title = "GBC learning curve",
X = X_train, y = Y_train, cv = K_fold);
# Random Forest - Learning Curve
plot_learning_curve(estimator = gsRFC.best_estimator_, title = "RF learning curve",
X = X_train, y = Y_train, cv = K_fold);
# Logistic Regression - Learning Curve (gsLRM.best_estimator_ could be used instead)
plot_learning_curve(estimator = Log_Model ,title = "Logistic Regression - Learning Curve",
X = X_train, y = Y_train, cv = K_fold);
# Linear Discriminant Analysis - Learning Curve
plot_learning_curve(estimator = gsLDA.best_estimator_ ,title = "Linear Discriminant - Learning Curve",
X = X_train, y = Y_train, cv = K_fold);
# Support Vector Machine - Learning Curve
plot_learning_curve(estimator = gsSVMC.best_estimator_,title = "SVC learning curve",
X = X_train, y = Y_train, cv = K_fold);
SVC seems to generalize better, since the training and cross-validation curves are close together. The Random Forest and GradientBoosting classifiers, in contrast, tend to overfit the training set. One way to improve an overfitting model is to feed it more training data until the validation error reaches the training error.
Ensemble Modeling
Another way to fine-tune our system is to combine the models that perform best. The group will often perform better than the best individual model, especially if the individual models make very different types of errors.
Building a model on top of many other models is called Ensemble Learning, and it is often a great way to push ML algorithms even further.
I use a voting classifier to combine the predictions coming from two of the classifiers. I prefer to pass 'soft' to the voting parameter so that the predicted probability of each vote is taken into account.
# about 84%
from sklearn.ensemble import VotingClassifier

VotingPredictor = VotingClassifier(estimators=[('rfc', RFC_best),
                                               ('gbc', GBC_best)],
                                   voting='soft', n_jobs=4)
VotingPredictor = VotingPredictor.fit(X_train, Y_train)
scores = cross_val_score(VotingPredictor, X_train, Y_train, cv=K_fold,
                         n_jobs=4, scoring='accuracy')
print(scores)
print(round(np.mean(scores)*100, 2))
# output
[0.79775281 0.84090909 0.72727273 0.90909091 0.90909091 0.85227273
0.85227273 0.77272727 0.88636364 0.84090909]
83.89
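For intuition, soft voting just averages the classifiers' predicted probabilities and picks the class with the higher average. A minimal manual sketch (assuming RFC_best and GBC_best were refit by the grid searches above), which should agree with VotingPredictor up to ties:
# Manual soft vote: average predict_proba outputs, then threshold
proba_rfc = RFC_best.predict_proba(test)
proba_gbc = GBC_best.predict_proba(test)
avg_proba = (proba_rfc + proba_gbc) / 2
manual_vote = (avg_proba[:, 1] >= 0.5).astype(int)
print((manual_vote == VotingPredictor.predict(test)).mean())  # agreement rate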
Submit Predictor
Predictive_Model = pd.DataFrame({
    "PassengerId": TestPassengerID,
    "Survived": VotingPredictor.predict(test)})
Predictive_Model.to_csv('titanic_model.csv', index=False)
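A quick sanity check on the generated file before submitting (my addition): the Kaggle test set has 418 passengers, and the file should contain exactly a PassengerId and a Survived column.
# Verify the submission file's shape and columns
submission = pd.read_csv('titanic_model.csv')
print(submission.shape)             # expected: (418, 2)
print(submission.columns.tolist())  # ['PassengerId', 'Survived']
print(submission['Survived'].value_counts())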
With this, we reach the end of the series. You can find Part 1 here. If you'd like to run each notebook cell, use Binder; it's awesome.
You can get the source code of the whole demonstration (Parts 1 & 2) from the link below, and you can also follow me on GitHub for future code updates. Source Code: Titanic:ML
Say Hi On: Email | LinkedIn | Quora | GitHub | Medium | Twitter | Instagram