The Building Blocks of AI
A few weeks ago, I wrote about how and why I was learning Machine Learning, mainly through Andrew Ng’s Coursera course. Now I’m checking back in with 9 weeks under my belt.
Machine Learning is built on prerequisites, so much so that learning by first principles seems overwhelming. Do you really need to spend a month learning linear algebra? Or weeks brushing up on calculus and statistics? Absolutely not. You’ll be okay if you have some math and programming experience. You really just have to be familiar with Sigma notation and be able to express it in a for loop. Sure, your assignments will take longer to complete and the first few times you see those giant equations your head will spin, but you can do this! Calculus is not even required. Professor Ng walks you through iteratively and makes a point of avoiding prerecs.
You’re going to learn some very important concepts from the ground up. ML is powered by the triforce of Statistics, Calculus, and Linear Algebra. Stats is the core of everything you do in ML. Calculus tells us how to learn and optimize our model. Linear Algebra makes executing these algorithms feasible on massive data sets.
We’re going to talk about each of the three parts, then I’ll put it all together to show you how a neural network works.
Statistics
It turns out that Data Science and Machine Learning are the same thing, they just refer to different levels of scale. A data science tends to focus on smaller projects and visualization/communication support, as opposed to machine learning which is used with large data sets and massive data engineering efforts. So keep in mind that you’re building a big scalable version of something that could be put together in a few days using R, Matlab, or GASP — Excel.
I had a great stats teacher in business school who bore an uncanny resemblance to Mario Van Peebles. So I came in understanding regressions despite working with a weak toolkit (Excel + a statistical add-on). It was hard to figure out which features were important, so I mostly flailed, eyeballed it and used correlation analysis. Then I was supposed to eyeball the scatterplots and have intuition about the quality of my high R-squared value. With machine learning, I turned in an old musket for a plasma rifle.
So where do we start with statistics? Regressions, of course!
Linear Regression
A linear regression is simple. Picture a scatterplot of any two related items that are quantifiable. There’s often a pattern between them. Do the math, and you end up with a best-fit trendline of the form Y = MX + B
, with M being a weight you apply to each X value and B being a constant.
Professor Ng’s textbook example is on real estate prices in Portland, OR. We use the price per square foot as the X value and the price of the house as the Y value. We use a learning algorithm called Gradient Descent to determine a line that best fits our data. Gradient Descent is an optimization algorithm that tries to minimize the difference in the predicted and actual price of a house in our training set.
Living Area (sq. feet) | Price (000s)
-----------------------------------------------------
1124 | 192
2480 | 471
1678 | 275
1970 | 318
This is cool, since Linear Regression allows us to estimate the value of a new house. We simply plug our new X into the equation and out comes an estimated price for the house. After about 3 weeks in a machine learning class and you can create the next Zillow.
However, our linear model is very simplistic. We all know that more factors affect the price of a home, so they should factor into our algorithm, as well. That leads us to multivariate regression, in which multiple variables affect the output.
Multivariate Regression
Multivariate sounds fancy and complicated, right? It’s just a subset of linear regressions where there are multiple input variables, which we call features. Our training set now becomes a M x I
matrix of M examples that have I features. Instead of a single variable with a weight, each of the features has a weight. We call each weight theta
. We use our learning algorithm to compute theta, which is a vector of optimal weights that best fits our training set.
We can extend our Portland house prices example to include new features like the number of bedrooms, number of bathrooms, and the average price of homes within a mile. Then we end up with an equation like this:
y = x0 + x1 * theta1 + x2 * theta2 + x3 * theta3 + x4 * theta4
These three new features can improve our predictions, but they can also introduce new problems. First, a really high number like square footage can overpower something small like the number of bedrooms. Next, we can add so many features that we overfit the model to the training set and it poorly predicts new data in a test set. We can also add so many features that we exhaust our system resources and can’t process the data with our learning algorithm.
Luckily, there are solutions to these problems. We can normalize our data and put it on a level playing field where each X fits between 0 and 1. We can reduce the impact of a model that has too many features using a coefficient called lambda. And we can use normal data processing or tricks like principal component analysis to reduce our set of features into something manageable. Think about how much easier it is to process a grayscale 150x150 image than a full color 1080p one.
Still, linear regressions give us a function that spits out a quantitative value. What about classifying if an email is or isn’t spam?
Logistic Regression
A logistic regression is used when data is categorical (E.g., true/false, positive/negative/neutral, etc). It is typically used when data is binary (yes/no), but can also be used to classify across a group of items. Professor Ng starts us with a simple model that predicts if a tumor is malignant or benign based on the size of the tumor. Rather than predict a value, we’re predicting the probability of an occurrence. Since probability goes between 0% and 100%, we can no longer use a line that stretches infinitely and we’re left with a threshold. Past point X, we are more likely than not looking at a tumor.
The other big idea about a logistic regression is the logarithmic scale. This is the log in logistic. In practice, most learning algorithms use the Sigmoid function for a logistic regression. It gives us a S-shaped curve that much more accurately describes our errors than a straight line function. If we aren’t very confident, say a tumor is 60% likely and we get it wrong and it is benign, that’s understandable. However, if we say there is a 2% chance and it’s actually a tumor, then we made a huge mistake. Using a log scale function, we capture this intuition and punish the model for being confident and wrong.
Our learning algorithm runs over our training set multiple times and determine weights for our sigmoid function. We can use it to make predictions on the likelihood of something. If we are trying to classify a farm animal as a horse, cow, pig or sheep based on a picture, we end up running 4 different logistic regressions (one for each category) and then choose the one with the highest probability.
Now that we’ve covered the statistics, I’ll talk about how the learning algorithms work.
Calculus
Calculus powers our learning algorithms. Statistics tells us our goals, but it doesn’t do the learning. And linear algebra quickly does the calculations for our learning algorithms.
I had a good bit of calculus fifteen years ago in college and I didn’t do that great in it then. I have a hard time remembering the lower case names for greek letters, so the partial derivative symbol was particularly scary. I now know how to use it, but still don’t understand it. And that’s okay! If you want to pick up ML without a MS/Phd, you’re not going to know everything. However, if you are rusty or want to pick up a little more math, just go to Khan Academy.
Most of you have used a solver before. We provide the solver a function, plus the input data. Calculus is how the solver works. You can run a linear regression through one and get the best fit. However they are slow, so we use a different approach and iterate repeatedly over our training set and calculate the best fit using simpler functions.
Cost Function
The cost function, also called the error function or loss function, is the most important concept in supervised machine learning. In order to learn you have to have to know how you are performing. Are your predictions good or bad? And if they’re bad, how bad are they?
With a linear regression we use a technique called least-squares to see how far off the actual Y value is from our predicted Y value. We iterate through our training set and sum up the (predicted — actual) ^ 2
. Formally our cost function is:
The wrinkle here is the hypothesis function, noted by H sub Theta
. In a linear regression, this is your regression function, typically Prediction = Theta0 + Theta1 * x
. We iterate through every example in our training set and sum up the squared error in order to get our cost for this set of theta parameters. Theta parameters are your weights inside your linear regression equation. The goal of our learning algorithm is to minimize the cost function, so we tinker with the theta parameters (E.g., weights) in order to optimize the model.
Thankfully, there is a great way to tinker with the model and minimize our cost function.
Gradient Descent
With a simple linear regression, we have two theta parameters, M*x
and B in the line equation, and then J(theta), the cost function. A 3-dimensional plot can show us how changes in the two weights affect our cost function. Hence, we can either use calculus to draw the full bowl shaped gradient and solve for the minimum, or we can start at a random point and use an iterative algorithm called Gradient Descent to find the minimum.
And that’s exactly how gradient descent works. We initialize our theta parameters, then we learn by following the partial derivitive of the cost function when applied to a specific theta parameter. This gives us a function:
Here we iterate through all of our theta parameters and sum up the result of the error times the X value of that theta param. Alpha is the learning rate, which lets us set the speed of our descent to the bottom of the bowl. We can’t set alpha too high or we’ll overshoot our goal. We have to iterate since the angle of our descent will change as we approach the global minimum.
And that’s how machine learning works. It’s not crazy or magical, it’s just a lot of applied math. We can use the same exact technique on a multivariate linear regression and iterate over our data set to learn the optimum value for each of the features. We’ll always find the minimum when using a linear regression. However, if we make things more complicated and use polynomial coefficients, the 3d plot is no longer a bowl and we might head downward towards a local minimum instead of a global minimum. For situations like that, we can use techniques like support vector machines.
We can also use gradient descent for logistic regressions. However, we have to change up our hypothesis function and thus cost and learning functions, since we are no longer aiming for a straight line. We use the sigmoid function as our hypothesis function and it looks like this:
We do the same thing in our cost function, where we compare the hypothesis against the actual value. However, we have a binary goal of 1 or 0 in our actual Y values. We have to test to see how close we came to a true positive and how close we came to a true negative. In the cost function, we can reduce those two parts to:
Our learning algorithm is still gradient descent and it looks almost identical to that of linear regression. We iterate over all the theta parameters and descend towards a global minimum. We merely use a different hypothesis function that is based on the log scale.
Linear Algebra
Anyone can run machine learning algorithms without using linear algebra. Sigma notation is everywhere, so the algorithms are just a bunch of for loops. However, nested for loops are slow when you run over them thousands or millions of times. Fortunately, there is a fix for that: Linear Algebra.
The linear algebra knowledge you’ll need is matrix addition and matrix multiplication. Addition, subtraction and other elementwise operations (multiply by 5, get the square root, etc) are easy to pick up. Matrixes must be the same size and you do that operation between corresponding cells in each matrix. For instance, [1 2; 3 4]+ [1 1; 1 1] = [2 3; 4 5]
Matrix multiplication, on the other hand, requires a lot more intuition. This might be a good time to watch some videos on Khan Academy or read up on a guide to linear algebra for programmers if you’re a coder. It’s a bit too long for me to explain the details, so here were my two AHA learnings:
-
The rows and columns of your matrices must be aligned. If you are multiplying A * B, then the columns of your first matrix must match the rows of your second matrix. Much of your time will spent arranging them backwards and forwards and alternating which one is transposed.
-
Most expressions in Sigma notation can be reduced to matrix multiplication. Matrix math is similar to the reduce/inject iterators. Iterate over a series of numbers. At each step multiply two elements together and keep adding each sum to the previous one. Then you’re left with a a single sum. That’s how we can reduce the learning algorithms down to linear algebra.
Putting it all together
Everything you learn in the Stanford Machine Learning class builds on itself. Later on you’ll learn specialized algorithms to improve your data quality, speed up your algorithms (principal component analysis), and find interesting things from unstructured data. But the most powerful thing you’ll learn is how to create and learn from a neural network.
Neural Networks
Neural Networks are nothing more than logistic regressions connected to each other. We refer to each individual logit (logistic regression unit) as a node and group the nodes into layers. When we have multiple hidden layers we call this deep learning. That’s also why we can’t explain what is happening in the hidden layers in the middle.
We are doing a lot more calculations than before, since we’re computing a single logistic regression for each node, plus we’re also pushing the results of that node forwards or backwards depending if we’re predicting (forward from inputs) or learning (backwards from output and cost function). Those actions are called forward propagation and back-propagation.
Speed is extremely important and that’s why linear algebra is required. That’s also why ML is done on high end graphics cards or Tensor Processing Units which are specialized for matrix ops instead of standard CPUs. In the past 10–15 years we have just gotten our hardware and software fast enough to run large scale neural networks. That’s the reason the AI winter ended and we are now in an explosion of growth in machine learning.
Thank you for the post!
Great article! I have a question regarding the linear regression. Is a linear regression appropriate for all data or just some? Additionally, when data are dependent on one another, would trying to force a linear regression be inappropriate? Is our cost functions meant to visualize our variance from the linear regression? This sounds a bit like principle component analysis