ML with Python: Part-3
In the previous post, we saw the various steps involved in creating a machine learning (ML) model. You might have noticed that while building the ML model we considered multiple algorithms in a pipeline and then tuned hyperparameters for all of them. Wouldn't it be easier if automated tools existed to ease the burden of the repetitive and time-consuming tasks of machine learning pipeline design and hyperparameter optimization?
Here comes AutoML, taking over the machine learning model-building process: once a data set is in a relatively clean format, the AutoML system will be able to design and optimize a machine learning pipeline faster than 99% of the humans out there.
Many such AutoML tools are available; the most popular of them are:
- TPOT
- H2O
- Auto-sklearn
- Azure AutoML etc.
Below we will see an example of TPOT; the others work on a similar idea. TPOT works on top of scikit-learn and automates the most tedious part of machine learning by intelligently exploring thousands of possible pipelines to find the best one for your data.
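TPOT can be installed with pip (pip install tpot). To give a feel for how little code the basic workflow takes, here is a minimal sketch using scikit-learn's built-in iris dataset; the dataset, parameter values, and output file name are purely illustrative and not part of the Titanic example that follows.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

# load a small toy dataset and split it into train and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# let TPOT search for a good pipeline, then evaluate and export it
tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_iris_pipeline.py')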
I am using the same problem statement and dataset that I used in Part-2. But for simplicity I am diluting some of the pre-processing steps, as TPOT also applies multiple pre-processing steps itself (lists are provided below). The Jupyter Notebook file and the training/test files can also be downloaded from my git repository.
Alright, let's get started -
Import Libraries
import numpy as np
import pandas as pd
from tpot import TPOTClassifier
Data Load
test_df = pd.read_csv("test.csv")
train_df = pd.read_csv("train.csv")
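Before cleaning anything up, it is worth a quick look at what was loaded. This check is my own addition and assumes train.csv and test.csv are the standard Kaggle Titanic files.
# quick sanity check of shape, columns and missing values
print(train_df.shape, test_df.shape)
print(train_df.info())
print(train_df.isnull().sum())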
Data Cleanup
# drop columns that will not be used as features
# (PassengerId is kept in test_df for the submission file)
train_df = train_df.drop(['PassengerId', 'Cabin', 'Ticket', 'Name'], axis=1)
test_df = test_df.drop(['Cabin', 'Ticket', 'Name'], axis=1)
data = [train_df, test_df]
for dataset in data:
    mean = train_df["Age"].mean()
    std = train_df["Age"].std()
    is_null = dataset["Age"].isnull().sum()
    # compute random ages between (mean - std) and (mean + std) of the training set
    rand_age = np.random.randint(mean - std, mean + std, size=is_null)
    # fill NaN values in the Age column with the random values generated
    age_slice = dataset["Age"].copy()
    age_slice[np.isnan(age_slice)] = rand_age
    dataset["Age"] = age_slice
    dataset["Age"] = dataset["Age"].astype(int)
data = [train_df, test_df]
for dataset in data:
    dataset['Embarked'] = dataset['Embarked'].fillna('S')
Converting Features
data = [train_df, test_df]
for dataset in data:
    dataset['Fare'] = dataset['Fare'].fillna(0)
    dataset['Fare'] = dataset['Fare'].astype(int)
genders = {"male": 0, "female": 1}
data = [train_df, test_df]
for dataset in data:
    dataset['Sex'] = dataset['Sex'].map(genders)
ports = {"S": 0, "C": 1, "Q": 2}
data = [train_df, test_df]
for dataset in data:
    dataset['Embarked'] = dataset['Embarked'].map(ports)
data = [train_df, test_df]
for dataset in data:
    dataset['Age'] = dataset['Age'].astype(int)
    dataset.loc[dataset['Age'] <= 11, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 11) & (dataset['Age'] <= 18), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 18) & (dataset['Age'] <= 22), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 22) & (dataset['Age'] <= 27), 'Age'] = 3
    dataset.loc[(dataset['Age'] > 27) & (dataset['Age'] <= 33), 'Age'] = 4
    dataset.loc[(dataset['Age'] > 33) & (dataset['Age'] <= 40), 'Age'] = 5
    dataset.loc[(dataset['Age'] > 40) & (dataset['Age'] <= 66), 'Age'] = 6
    dataset.loc[dataset['Age'] > 66, 'Age'] = 6
data = [train_df, test_df]
for dataset in data:
    dataset.loc[dataset['Fare'] <= 7.91, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare'] = 2
    dataset.loc[(dataset['Fare'] > 31) & (dataset['Fare'] <= 99), 'Fare'] = 3
    dataset.loc[(dataset['Fare'] > 99) & (dataset['Fare'] <= 250), 'Fare'] = 4
    dataset.loc[dataset['Fare'] > 250, 'Fare'] = 5
    dataset['Fare'] = dataset['Fare'].astype(int)
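Before handing the data to TPOT, it is worth confirming that no missing values remain and that every feature is numeric. This check is my own addition, not part of the original flow.
# TPOT expects a fully numeric feature matrix with no NaNs
print(train_df.isnull().sum())
print(test_df.isnull().sum())
print(train_df.dtypes)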
Building Machine Learning Models
X_train = train_df.drop("Survived", axis=1)
Y_train = train_df["Survived"]
X_test = test_df.drop("PassengerId", axis=1).copy()
tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2)
tpot.fit(X_train, Y_train)
Result:
Generation 1 - Current best internal CV score: 0.8327578288714715
Generation 2 - Current best internal CV score: 0.8327578288714715
Generation 3 - Current best internal CV score: 0.8327578288714715
Generation 4 - Current best internal CV score: 0.833931853718029
Generation 5 - Current best internal CV score: 0.8395310000365276
Best pipeline: RandomForestClassifier(MultinomialNB(input_matrix, alpha=0.1, fit_prior=True), bootstrap=True, criterion=gini, max_features=0.5, min_samples_leaf=4, min_samples_split=17, n_estimators=100)
TPOTClassifier(config_dict=None, crossover_rate=0.1, cv=5,
disable_update_check=False, early_stop=None, generations=5,
max_eval_time_mins=5, max_time_mins=None, memory=None,
mutation_rate=0.9, n_jobs=1, offspring_size=None,
periodic_checkpoint_folder=None, population_size=20,
random_state=None, scoring=None, subsample=1.0, template=None,
use_dask=False, verbosity=2, warm_start=False)
Above you can see that TPOT has chosen a RandomForestClassifier (stacked on top of a MultinomialNB) as the best pipeline.
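One handy feature at this point is tpot.export(), which writes the winning pipeline out as standalone scikit-learn code that you can inspect and reuse without TPOT. The file name below is just an example.
# export the best pipeline found as a plain scikit-learn script
tpot.export('tpot_titanic_pipeline.py')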
Classification algorithms and parameters TPOT chooses from -
- 'sklearn.naive_bayes.BernoulliNB': { 'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.], 'fit_prior': [True, False] }
- 'sklearn.naive_bayes.MultinomialNB': { 'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.], 'fit_prior': [True, False] }
- 'sklearn.tree.DecisionTreeClassifier': { 'criterion': ['gini', 'entropy'], 'max_depth': range(1, 11), 'min_samples_split': range(2, 21), 'min_samples_leaf': range(1, 21) }
- 'sklearn.ensemble.ExtraTreesClassifier': { 'n_estimators': [100], 'criterion': ['gini', 'entropy'], 'max_features': np.arange(0.05, 1.01, 0.05), 'min_samples_split': range(2, 21), 'min_samples_leaf': range(1, 21), 'bootstrap': [True, False] }
- 'sklearn.ensemble.RandomForestClassifier': { 'n_estimators': [100], 'criterion': ['gini', 'entropy'], 'max_features': np.arange(0.05, 1.01, 0.05), 'min_samples_split': range(2, 21), 'min_samples_leaf': range(1, 21), 'bootstrap': [True, False] }
- 'sklearn.ensemble.GradientBoostingClassifier': { 'n_estimators': [100], 'learning_rate': [1e-3, 1e-2, 1e-1, 0.5, 1.], 'max_depth': range(1, 11), 'min_samples_split': range(2, 21), 'min_samples_leaf': range(1, 21), 'subsample': np.arange(0.05, 1.01, 0.05), 'max_features': np.arange(0.05, 1.01, 0.05) }
- 'sklearn.neighbors.KNeighborsClassifier': { 'n_neighbors': range(1, 101), 'weights': ['uniform', 'distance'], 'p': [1, 2] }
- 'sklearn.svm.LinearSVC': { 'penalty': ['l1', 'l2'], 'loss': ['hinge', 'squared_hinge'], 'dual': [True, False], 'tol': [1e-5, 1e-4, 1e-3, 1e-2, 1e-1], 'C': [1e-4, 1e-3, 1e-2, 1e-1, 0.5, 1., 5., 10., 15., 20., 25.] }
- 'sklearn.linear_model.LogisticRegression': { 'penalty': ['l1', 'l2'], 'C': [1e-4, 1e-3, 1e-2, 1e-1, 0.5, 1., 5., 10., 15., 20., 25.], 'dual': [True, False] }
- 'xgboost.XGBClassifier': { 'n_estimators': [100], 'max_depth': range(1, 11), 'learning_rate': [1e-3, 1e-2, 1e-1, 0.5, 1.], 'subsample': np.arange(0.05, 1.01, 0.05), 'min_child_weight': range(1, 21), 'nthread': [1] }
Preprocessors that could be applied by TPOT (a sketch of narrowing this search space with config_dict follows the list) -
- 'sklearn.preprocessing.Binarizer': { 'threshold': np.arange(0.0, 1.01, 0.05) }
- 'sklearn.decomposition.FastICA': { 'tol': np.arange(0.0, 1.01, 0.05) }
- 'sklearn.cluster.FeatureAgglomeration': { 'linkage': ['ward', 'complete', 'average'], 'affinity': ['euclidean', 'l1', 'l2', 'manhattan', 'cosine'] }
- 'sklearn.preprocessing.MaxAbsScaler': { }
- 'sklearn.preprocessing.MinMaxScaler': { }
- 'sklearn.preprocessing.Normalizer': { 'norm': ['l1', 'l2', 'max'] }
- 'sklearn.kernel_approximation.Nystroem': { 'kernel': ['rbf', 'cosine', 'chi2', 'laplacian', 'polynomial', 'poly', 'linear', 'additive_chi2', 'sigmoid'], 'gamma': np.arange(0.0, 1.01, 0.05), 'n_components': range(1, 11) }
- 'sklearn.decomposition.PCA': { 'svd_solver': ['randomized'], 'iterated_power': range(1, 11) }
- 'sklearn.preprocessing.PolynomialFeatures': { 'degree': [2], 'include_bias': [False], 'interaction_only': [False] }
- 'sklearn.kernel_approximation.RBFSampler': { 'gamma': np.arange(0.0, 1.01, 0.05) }
- 'sklearn.preprocessing.RobustScaler': { }
- 'sklearn.preprocessing.StandardScaler': { }
- 'tpot.builtins.ZeroCount': { }
- 'tpot.builtins.OneHotEncoder': { 'minimum_fraction': [0.05, 0.1, 0.15, 0.2, 0.25], 'sparse': [False] }
Prediction
Y_prediction = tpot.predict(X_test)
submission = pd.DataFrame({
    "PassengerId": test_df["PassengerId"],
    "Survived": Y_prediction
})
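To turn this into a file you can submit to Kaggle, write it out as CSV (the file name is just a suggestion):
# write the predictions to a CSV file in the format Kaggle expects
submission.to_csv("submission.csv", index=False)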
Limitation
Running TPOT isn't as simple as fitting one model on the dataset. It considers multiple machine learning algorithms (random forests, linear models, SVMs, etc.) in a pipeline with numerous preprocessing steps (missing value imputation, scaling, PCA, feature selection, etc.), the hyperparameters for all of the models and preprocessing steps, as well as multiple ways to ensemble or stack the algorithms within the pipeline. That's why it usually takes a long time to execute and isn't feasible for large datasets.
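If runtime is a concern, the constructor parameters visible in the output above give you several knobs: you can cap the total search time, limit the time spent per pipeline, evaluate candidates on a subsample of the data, stop early when the score stops improving, and parallelise across cores. A sketch with arbitrary, illustrative numbers:
# bound the search: at most 30 minutes overall, 2 minutes per candidate pipeline,
# train each candidate on half the data, stop after 3 generations without
# improvement, and use all available CPU cores
tpot_fast = TPOTClassifier(generations=100, population_size=20, verbosity=2,
                           max_time_mins=30, max_eval_time_mins=2,
                           subsample=0.5, early_stop=3, n_jobs=-1)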
Summary
All AutoML methods are developed to support data scientists, not to replace them. Such methods can free the data scientist from complicated tasks that can be solved better by machines. But analysing the results and drawing conclusions still has to be done by data scientists who also know the application domain.
Thank you sir! Please tell me which book I should follow for machine learning that contains the proofs with mathematics.
Aminzai, there's no single book that can help you master ML, as it's a complicated subject that spans many topics, purposes, and benefits in real-world applications. Though I haven't read most of them, I would suggest "Machine Learning" by Peter Flach. It's aimed at intermediate-to-advanced readers and goes into a greater amount of detail than other books.