Evaluate the Classification Performance of Machine Learning Models using TensorFlow: A Computer Vision Example
Classification of visual data is one of the fundamental tasks in computer vision. From a machine learning and deep learning perspective, evaluating a trained model's classification performance is critical, since it demonstrates how the model can be expected to perform in a real-world application. Several evaluation measures are reported in the literature; for this tutorial, I am using the ones most commonly used by researchers: loss, accuracy, precision score, recall score, F1 score, Jaccard index, Type I error and Type II error. A brief definition of each is presented as follows:
Loss is the quantity the model is trained to minimize and is the simplest measure for tracking training and testing performance. It measures how far the model's predicted outputs are from the true labels; for classification this is typically the categorical cross-entropy, which penalizes confident wrong predictions more heavily than uncertain ones. A lower loss value indicates better performance.
Accuracy: In contrast to loss, accuracy measures the percentage of data instances that are classified correctly. It is the ratio of the number of correct predictions to the total number of predictions. A higher accuracy value represents better performance.
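As a minimal sketch (the labels and predicted probabilities below are made up purely for illustration), both measures can be computed directly with numpy and scikit-learn:
import numpy as np
from sklearn.metrics import accuracy_score, log_loss
# Hypothetical ground-truth labels and model outputs for a 2-class problem
y_true = np.array([0, 1, 1, 0, 1])                  # actual classes
y_prob = np.array([[0.9, 0.1],                      # predicted class probabilities
                   [0.2, 0.8],
                   [0.4, 0.6],
                   [0.7, 0.3],
                   [0.1, 0.9]])
y_pred = y_prob.argmax(axis=1)                      # predicted classes
print('Accuracy:', accuracy_score(y_true, y_pred))  # fraction of correct predictions
print('Loss:', log_loss(y_true, y_prob))            # categorical cross-entropy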
Precision Score: Precision measures the ability of a model not to classify a negative instance as positive. It answers the question: of all the instances the model predicted as positive, how many were actually positive? The equation below presents the expression for the precision score.
Precision Score=(True Positive (TP))/(True Positive (TP)+False Positive (FP))
Recall Score: Recall answers the question: of all the actual positive instances, how many were correctly classified by the model? The expression for the recall score is given as follows.
Recall Score=(True Positive (TP))/(True Positive (TP)+False Negative (FN))
F1 Score: The F1 score is a single measure that combines precision and recall via their harmonic mean and ranges between 0 and 1. A higher F1 score indicates better performance of the model. The expression for the F1 score is given as follows.
F1 Score=2×(Precision×Recall)/(Precision+Recall)
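As a quick illustration with made-up labels, all three scores can be obtained from scikit-learn:
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])            # hypothetical actual labels
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])            # hypothetical predicted labels
print('Precision:', precision_score(y_true, y_pred))   # TP / (TP + FP)
print('Recall:   ', recall_score(y_true, y_pred))      # TP / (TP + FN)
print('F1 score: ', f1_score(y_true, y_pred))          # harmonic mean of precision and recall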
Jaccard Index: In the context of classification, the Jaccard similarity index measures the similarity between the predicted labels and the actual labels. Mathematically, let ŷ denote the predicted labels and y denote the actual labels; the J index can then be expressed as follows. A higher J index indicates better performance of the model.
J Index Score=(|ŷ∩y|)/(|ŷ∪y|)=(|ŷ∩y|)/(|ŷ|+|y|-|ŷ∩y|)
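A minimal sketch with the same made-up labels as above, using scikit-learn's jaccard_score:
from sklearn.metrics import jaccard_score
y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # hypothetical actual labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # hypothetical predicted labels
# |intersection| / |union| for the positive class
print('Jaccard index:', jaccard_score(y_true, y_pred))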
Type I and Type II Error: Type I (false positive) and Type II (false negative) errors are commonly used terms in machine learning, and a model is often tuned to minimize one of the two, depending on which error is more critical in the given task. By definition, a Type I error is concluding that a relationship exists when in fact it does not, e.g., when detecting cats in images, classifying an image as "cat" when there is no cat. Similarly, a Type II error is rejecting a relationship that does in fact exist, e.g., classifying an image as "dog" when it actually contains a cat.
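For a binary problem, both error types can be read directly off the confusion matrix. A small sketch with made-up labels (treating class 1, "cat", as the positive class):
from sklearn.metrics import confusion_matrix
y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # hypothetical actual labels (1 = cat)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # hypothetical predicted labels
# For binary labels, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print('Type I errors (false positives): ', fp)
print('Type II errors (false negatives):', fn)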
DogsVsCats Example
Using the standard "DogsVsCats" dataset, the following code portions demonstrate how to evaluate a trained model on the above-mentioned measures. This is straightforward as long as the essential libraries are included and the dataset is prepared correctly. For this tutorial, I trained a MobileNet CNN, initialized with ImageNet weights, on the DogsVsCats dataset. The dataset split I used is 60% for training, 20% for validation and 20% for testing. Training and validation performance was acceptable for the MobileNet model; in this tutorial, I evaluate the trained model on the test dataset.
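For context, a transfer-learning model of this kind is typically assembled by freezing the pretrained backbone and adding a small classification head. The sketch below shows the general recipe only; it is not the exact architecture or training configuration used for the model evaluated in this tutorial.
from tensorflow.keras.applications import MobileNet
from tensorflow.keras.layers import GlobalAveragePooling2D, Dense
from tensorflow.keras.models import Model
# Load MobileNet with ImageNet weights and without its classification head
base = MobileNet(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
base.trainable = False                       # freeze the pretrained feature extractor
# Attach a small head for the two DogsVsCats classes
x = GlobalAveragePooling2D()(base.output)
outputs = Dense(2, activation='softmax')(x)
model = Model(inputs=base.input, outputs=outputs)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])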
Step 1: Load the relevant libraries required to run the script. For this tutorial, the essential libraries are tensorflow (including tf.keras), numpy, sklearn, and matplotlib.
#LOAD LIBRARIES
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

# Keras components are accessed through tf.keras throughout, to avoid mixing the
# standalone keras package with tensorflow.keras (which produces incompatible objects)
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing import image
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.utils import to_categorical

# Evaluation utilities from scikit-learn
from sklearn.metrics import confusion_matrix, classification_report, jaccard_score
Step 2: Define the test dataset directory and the data generator functions, then load the test dataset and its labels. The functions get_batches, get_data and get_classes are defined to load the test data for evaluation. Other approaches for loading the dataset would work equally well; the choice is purely subjective.
test_dir = PATH_TO_TEST_DIR #'C:/Users/ui010/Desktop/DogsVsCats/DogsvsCats_Dataset/test'
# One-hot encode integer class labels (e.g. 0/1 -> [1,0]/[0,1])
def onehot(x):
    return to_categorical(x)

# Function for generating batches from a directory of images
def get_batches(dirname, gen=image.ImageDataGenerator(), shuffle=False, batch_size=32,
                class_mode='categorical', target_size=(224, 224)):
    return gen.flow_from_directory(dirname, target_size=target_size,
                                   class_mode=class_mode, shuffle=shuffle, batch_size=batch_size)

# Function for getting the test classes (one-hot labels) and filenames
def get_classes():
    test_batches = get_batches(test_dir, shuffle=False, batch_size=1)
    return onehot(test_batches.classes), test_batches.filenames

# Load all images from a directory into a single array
def get_data(path, target_size=(224, 224)):
    batches = get_batches(path, shuffle=False, batch_size=1, class_mode=None, target_size=target_size)
    return np.concatenate([next(batches) for i in range(batches.samples)])

#LOAD TEST DATA
test_gen = get_batches(test_dir, batch_size=32)
test_img = get_data(test_dir)
#TEST LABELS
(test_labels, test_filenames) = get_classes()
Step 3: Review the training performance of the model to assess and compare with the test performance. For the transfer-learned MobileNet model, the training and validation plots show a typical picture of overfitting, which usually happens either when the data is not sufficient for training or when the data is too easy for the model to learn. In this case, since the images are cropped and contain only the single target object, it can be assumed that the overfitting is happening because the dataset is too easy for the model. If the test performance is similar to the validation performance, that will confirm the assumption.
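The training curves themselves are not reproduced here, but if the History object returned by model.fit during training is available (called history below, which is an assumption about how the model was trained; depending on the Keras version and compile settings the keys may be 'acc'/'val_acc' instead of 'accuracy'/'val_accuracy'), they can be plotted as follows:
# Assumes `history` is the History object returned by model.fit
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'], label='train')
plt.plot(history.history['val_accuracy'], label='validation')
plt.title('Accuracy')
plt.xlabel('Epoch')
plt.legend()
plt.subplot(1, 2, 2)
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='validation')
plt.title('Loss')
plt.xlabel('Epoch')
plt.legend()
plt.show()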
Step 4: Load the trained model and evaluate it on the test data. From the test results, around 98% accuracy and 0.04 loss are achieved, which indicates very good performance of the model.
#LOAD TRAINED MODEL AND EVALUATE
from tensorflow.keras.models import load_model
sm_model = load_model('MobileNet_Best.hdf5')
test_score = sm_model.evaluate(test_img, test_labels, verbose=1)
print('Test Loss:', test_score[0])
print('Test Accuracy:', test_score[1])
Step 5: Plot the confusion matrix to determine the false positives and false negatives, i.e., the Type I and Type II errors. From the confusion matrix, it can be observed that both Type I and Type II errors are small and the model performed quite well.
from sklearn.metrics import confusion_matrix
import numpy as np
import itertools
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues,
                          verbose=False):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        if verbose:
            print("Normalized confusion matrix")
    else:
        if verbose:
            print('Confusion matrix, without normalization')
    if verbose:
        print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes, rotation=45)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    #plt.rcParams["figure.figsize"] = [4, 4]
    plt.savefig('CM_MobileNet_Best.pdf', bbox_inches='tight')
    plt.show()
# Predicted and actual class indices
predicted = sm_model.predict(test_img)
y_classes = predicted.argmax(axis=-1)
real = test_labels.argmax(axis=-1)
# sklearn's confusion_matrix expects (y_true, y_pred)
cm = confusion_matrix(real, y_classes)
class_names = ["Cat", "Dog"]
plot_confusion_matrix(cm, class_names, normalize=True, cmap=plt.cm.Blues)
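To report the Type I and Type II errors as explicit counts rather than reading them off the plot, they can be unpacked from the same (unnormalized) confusion matrix. This assumes the usual alphabetical label ordering from flow_from_directory, i.e., class index 1 ("Dog") is treated as the positive class.
# Unpack counts from the 2x2 confusion matrix: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = cm.ravel()
print('Type I errors (false positives): ', fp)
print('Type II errors (false negatives):', fn)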
Step 6: Determine the F1 Score, Precision Score, Recall Score and Jaccard Score for the trained model.
from sklearn.metrics import classification_report
print(classification_report(real, y_classes, target_names=class_names))
from sklearn.metrics import jaccard_score
j_index = jaccard_score(real, y_classes, average='micro')
print('Jaccard index:', round(j_index, 2))
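If the individual scores are needed as numbers rather than a printed report, they can also be computed directly. The calls below use scikit-learn's default binary averaging, which again assumes class index 1 is treated as the positive class.
from sklearn.metrics import precision_score, recall_score, f1_score
print('Precision:', precision_score(real, y_classes))
print('Recall:   ', recall_score(real, y_classes))
print('F1 score: ', f1_score(real, y_classes))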
Conclusion
In machine learning and deep learning, it is always important to investigate the performance of a trained model on an unseen test dataset. Overfitting during training is common and can only be properly identified when the model fails to generalize to unseen data. Therefore, evaluating trained models using standard measures is an essential stage in the pipeline. In this article, I set out to inform and guide data scientists about standard evaluation measures and demonstrated the implementation of each using a DogsVsCats example prepared specifically for this tutorial, providing a step-by-step guide in Python that covers loading the dataset, loading the model, and evaluating it with multiple measures.