Making your First Machine Learning Classifier in Scikit-learn (Python)
One of the most amazing things about Python's scikit-learn library is that is has a 4-step modeling pattern that makes it easy to code a machine learning classifier. While this tutorial uses a classifier called Logistic Regression, the coding process in this tutorial applies to other classifiers in sklearn (Decision Tree, K-Nearest Neighbors etc). In this tutorial, we use Logistic Regression to predict digit labels based on images. The image above shows a bunch of training digits (observations) from the MNIST dataset whose category membership is known (labels 0–9). After training a model with logistic regression, it can be used to predict an image label (labels 0–9) given an image.
The first part of this tutorial post goes over a toy dataset (digits dataset) to show quickly illustrate scikit-learn's 4 step modeling pattern and show the behavior of the logistic regression algorthm. The second part of the tutorial goes over a more realistic dataset (MNIST dataset) to briefly show how changing a model's default parameters can effect performance (both in timing and accuracy of the model).
With that, lets get started. If you get lost, I recommend opening the video above in a separate tab. The code used in this tutorial is available in the table below.
Digits Dataset | MNIST |
---|---|
Digits Logistic Regression | MNIST Logistic Regression |
Getting Started (Prerequisites)
If you already have anaconda installed, skip to the next section. I recommend having anaconda installed (either Python 2 or 3 works well for this tutorial) so you won't have any issue importing libraries.
You can either download anaconda from the official site and install on your own or you can follow these anaconda installation tutorials below to set up anaconda on your operating system.
Operating System | Blog Post | Youtube Video |
---|---|---|
Mac | Install Anaconda on Mac | Youtube Video |
Windows | Install Anaconda on Windows | Youtube Video |
Ubuntu | Install Anaconda on Ubuntu | Youtube Video |
All | Environment Management with Conda (Python 2 + 3, Configuring Jupyter Notebooks) | Youtube Video |
Logistic Regression on Digits Dataset
Loading the Data (Digits Dataset)
The digits dataset is one of datasets scikit-learn comes with that do not require the downloading of any file from some external website. The code below will load the digits dataset.
from sklearn.datasets import load_digits
digits = load_digits()
Now that you have the dataset loaded you can use the commands below
# Print to show there are 1797 images (8 by 8 images for a dimensionality of 64)
print("Image Data Shape" , digits.data.shape)
# Print to show there are 1797 labels (integers from 0-9)
print("Label Data Shape", digits.target.shape)
to see that there are 1797 images and 1797 labels in the dataset
Showing the Images and the Labels (Digits Dataset)
This section is really just to show what the images and labels look like. It usually helps to visualize your data to see what you are working with.
import numpy as np
import matplotlib.pyplot as plt
plt.figure(figsize=(20,4))
for index, (image, label) in enumerate(zip(digits.data[0:5], digits.target[0:5])):
plt.subplot(1, 5, index + 1)
plt.imshow(np.reshape(image, (8,8)), cmap=plt.cm.gray)
plt.title('Training: %i\n' % label, fontsize = 20)
Visualizing the Images and Labels in our Dataset
Splitting Data into Training and Test Sets (Digits Dataset)
We make training and test sets to make sure that after we train our classification algorithm, it is able to generalize well to new data.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.25, random_state=0)
Scikit-learn 4-Step Modeling Pattern (Digits Dataset)
Step 1. Import the model you want to use
In sklearn, all machine learning models are implemented as Python classes
from sklearn.linear_model import LogisticRegression
Step 2. Make an instance of the Model
# all parameters not specified are set to their defaults
logisticRegr = LogisticRegression()
Step 3. Training the model on the data, storing the information learned from the data
Model is learning the relationship between digits (x_train) and labels (y_train)
logisticRegr.fit(x_train, y_train)
Step 4. Predict the labels of new data (new images)
Uses the information the model learned during the model training process
# Returns a NumPy Array
# Predict for One Observation (image)
logisticRegr.predict(x_test[0].reshape(1,-1))
# Predict for Multiple Observations (images) at Once
logisticRegr.predict(x_test[0:10])
# Make predictions on entire test data
predictions = logisticRegr.predict(x_test)
Measuring Model Performance (Digits Dataset)
While there are other ways of measuring model performance, we are going to keep this simple and use accuracy as our metric.
To do this are going to see how the model performs on the new data (test set)
accuracy is defined as:
(fraction of correct predictions): correct predictions / total number of data points
# Use score method to get accuracy of model
score = logisticRegr.score(x_test, y_test)
print(score)
Our accuracy was 95.3%.
Confusion Matrix (Digits Dataset)
A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known. In this section, I am just showing two python packages (Seaborn and Matplotlib) for making confusion matrices.
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
Method 1 (Seaborn)
cm = metrics.confusion_matrix(y_test, predictions)
plt.figure(figsize=(9,9))
sns.heatmap(cm, annot=True, fmt=".3f", linewidths=.5, square = True, cmap = 'Blues_r');
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
all_sample_title = 'Accuracy Score: {0}'.format(score)
plt.title(all_sample_title, size = 15);
Confusion Matrix using Seaborn
Method 2 (Matplotlib)
This method is clearly a lot more code. I just wanted to show people how to do it in matplotlib as well.
cm = metrics.confusion_matrix(y_test, predictions)
plt.figure(figsize=(9,9))
plt.imshow(cm, interpolation='nearest', cmap='Pastel1')
plt.title('Confusion matrix', size = 15)
plt.colorbar()
tick_marks = np.arange(10)
plt.xticks(tick_marks, ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"], rotation=45, size = 10)
plt.yticks(tick_marks, ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"], size = 10)
plt.tight_layout()
plt.ylabel('Actual label', size = 15)
plt.xlabel('Predicted label', size = 15)
width, height = cm.shape
for x in xrange(width):
for y in xrange(height):
plt.annotate(str(cm[x][y]), xy=(y, x),
horizontalalignment='center',
verticalalignment='center')
Confusion Matrix using Matplotlib
Logistic Regression (MNIST)
One important point to emphasize that the digit dataset contained in sklearn is too small to be representative of a real world machine learning task.
We are going to use the MNIST dataset because it is for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal efforts on preprocessing and formatting. One of the things we will notice is that parameter tuning can greatly speed up a machine learning algorithm's training time.
Downloading the Data (MNIST)
The digits dataset is one of datasets scikit-learn comes with that do not require the downloading of any file from some external website. The code below will load the digits dataset.
from sklearn.datasets import fetch_mldata
mnist = fetch_mldata('MNIST original')
Now that you have the dataset loaded you can use the commands below
# These are the images
# There are 70,000 images (28 by 28 images for a dimensionality of 784)
print(mnist.data.shape)
# These are the labels
print(mnist.target.shape)
to see that there are 70000 images and 70000 labels in the dataset
Splitting Data into Training and Test Sets (MNIST)
The code below splits the data into training and test data sets. The test_size = 1/7.0
makes the training set size 60,000 images and the test set size of 10,000.
from sklearn.model_selection import train_test_split
train_img, test_img, train_lbl, test_lbl = train_test_split(
mnist.data, mnist.target, test_size=1/7.0, random_state=0)
Showing the Images and Labels (MNIST)
import numpy as np
import matplotlib.pyplot as plt
plt.figure(figsize=(20,4))
for index, (image, label) in enumerate(zip(train_img[0:5], train_lbl[0:5])):
plt.subplot(1, 5, index + 1)
plt.imshow(np.reshape(image, (28,28)), cmap=plt.cm.gray)
plt.title('Training: %i\n' % label, fontsize = 20)
Visualizing the Images and Labels in our Dataset
Scikit-learn 4-Step Modeling Pattern (MNIST)
One thing I like to mention is the importance of parameter tuning. While it may not have mattered much for the smaller digits dataset, it makes a bigger difference on larger and more complex datasets. While usually one adjusts parameters for the sake of accuracy, in the case below, we are adjusting the parameter solver to speed up the fitting of the model.
Step 1. Import the model you want to use
In sklearn, all machine learning models are implemented as Python classes
from sklearn.linear_model import LogisticRegression
Step 2. Make an instance of the Model
Please see the documentation if you are curious what changing solver does. Essentially, we are changing the optimization algorithm.
# all parameters not specified are set to their defaults
# default solver is incredibly slow thats why we change it
logisticRegr = LogisticRegression(solver = 'lbfgs')
Step 3. Training the model on the data, storing the information learned from the data
Model is learning the relationship between x (digits) and y (labels)
logisticRegr.fit(train_img, train_lbl)
Step 4. Predict the labels of new data (new images)
Uses the information the model learned during the model training process
# Returns a NumPy Array
# Predict for One Observation (image)
logisticRegr.predict(test_img[0].reshape(1,-1))
# Predict for Multiple Observations (images) at Once
logisticRegr.predict(test_img[0:10])
# Make predictions on entire test data
predictions = logisticRegr.predict(test_img)
Measuring Model Performance (MNIST)
While there are other ways of measuring model performance, we are going to keep this simple and use accuracy as our metric.
To do this are going to see how the model performs on the new data (test set)
accuracy is defined as:
(fraction of correct predictions): correct predictions / total number of data points
score = logisticRegr.score(test_img, test_lbl)
print(score)
One thing I briefly want to mention is that is the default optimization algorithm parameter was set to solver = 'liblinear'
. The table and image below is just showing that some optimization algorithms take longer. It had a minor effect on accuracy, but at least it was a lot faster.
Optimization Algorithm Parameter | Time on my Macbook | Accuracy |
---|---|---|
liblinear | 2893.1 seconds | 91.45 |
lbfgs | 52.86 seconds | 91.3 |
This gif just shows that some optimization algorithms take longer image source
Display Misclassified images with Predicted Labels (MNIST)
While I could show another confusion matrix, I figured people would rather see misclassified images on the off chance someone finds it interesting.
import numpy as np
import matplotlib.pyplot as plt
index = 0
misclassifiedIndexes = []
for label, predict in zip(test_lbl, predictions):
if label != predict:
misclassifiedIndexes.append(index)
index +=1
plt.figure(figsize=(20,4))
for plotIndex, badIndex in enumerate(misclassifiedIndexes[0:5]):
plt.subplot(1, 5, plotIndex + 1)
plt.imshow(np.reshape(test_img[badIndex], (28,28)), cmap=plt.cm.gray)
(https://github.com/mGalarnyk/Python_Tutorials/tree/master/Scrapy/fundrazr/fundrazr). The file is called MiniMorningScrape.csv (it is a large file). plt.title('Predicted: {}, Actual: {}'.format(predictions[badIndex], test_lbl[badIndex]), fontsize = 15)
Showing Misclassified Digits
Closing Thoughts
The important thing to note here is that making a machine learning model in scikit-learn is not a lot of work. I hope this post helps you with whatever you are working on. My next machine learning tutorial goes over PCA using Python. Please let me know if you have any questions either here or on the youtube video page!
This article originally appeared on my medium blog
Hi Michael.
Can you help me for change the code from Pl Php db to Pl Perl.
Hi Michael.
Can you help me for change the code from pl php db to pl perl db.
Very thorough post and super helpful… thanks Michael!
I am glad you enjoyed it :)
wait are you the author of this blog post? https://nicoletteshasky.com/2017/11/02/the-triple-threat-starter-pack-for-the-newbie-freelancer/?lipi=urn%3Ali%3Apage%3Ad_flagship3_feed%3BqEdWqZuYT5%2BdgbgbCepfkA%3D%3D
My friends are using it to start their side gig life. So many of my friends have so many side programming jobs. Bet you didnt know your post would be useful for programmers
Damn, that’s some powerful stalking skills!