Chapter 4: Training Models

Will Toth
Jun 12, 2021


A Review of Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow by Aurélien Géron

A visualization of Non-Convex Gradient Descent from ResearchGate

Linear Regression

Out of all of the topics discussed so far, this is the one you are most likely familiar with. At its essence, linear regression is about finding the line of best fit by minimizing the total error between the data points and that line. The book derives the closed-form solution to this Ordinary Least Squares problem, the Normal Equation, and Scikit-Learn's LinearRegression solves the same problem under the hood (using an SVD-based approach). While extremely simple, this learning method is surprisingly powerful and is usually a good starting place when deciding what method to use.
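To make this concrete, here is a minimal sketch (my own toy example, not code from the chapter) that fits the same line two ways: directly via the Normal Equation, theta = (X^T X)^(-1) X^T y, and with Scikit-Learn:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: y = 4 + 3x plus Gaussian noise
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.normal(size=100)

# Normal Equation: theta = (X_b^T X_b)^(-1) X_b^T y
X_b = np.c_[np.ones((100, 1)), X]  # prepend a bias column of 1s
theta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
print(theta)  # approximately [4, 3]

# The same fit via Scikit-Learn
lin_reg = LinearRegression()
lin_reg.fit(X, y)
print(lin_reg.intercept_, lin_reg.coef_)
```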

Ridge Regression

Ridge Regression is a regularized version of Linear Regression. This means the model uses a regularization hyperparameter, alpha. So, when we create a model in sklearn we specify an alpha value that determines how strongly the model is regularized (think flattened). This works because a regularization term is added directly to the cost function, so when we minimize the cost during training, the line of best fit is less affected by any individual data point.
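In sklearn this is just a different estimator with an alpha argument. A quick sketch (the alpha values are illustrative, and X and y are reused from the example above):

```python
from sklearn.linear_model import Ridge

# A larger alpha puts a stronger penalty on the weights,
# "flattening" the fit and making it less sensitive to any one point.
for alpha in (0.1, 1.0, 100.0):
    ridge_reg = Ridge(alpha=alpha)
    ridge_reg.fit(X, y)  # X, y from the Linear Regression example above
    print(alpha, ridge_reg.intercept_, ridge_reg.coef_)
```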

Lasso Regression

Lasso stands for Least Absolute Shrinkage and Selection Operator Regression and is another regularized version of Linear Regression. The main difference is the penalty term: Lasso penalizes the l1 norm of the weight vector (the sum of the absolute weights), whereas Ridge penalizes half the square of its l2 norm. The effect is that Lasso shrinks, or completely zeroes out, the weights of the least important features. Because of this, Lasso effectively performs automatic feature selection and outputs a sparse model.
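You can see the sparsity directly. In this sketch (again my own toy data, not the book's), only two of the five features actually influence the target, and Lasso drives the other weights to zero:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Five features, but only the first two actually influence y
rng = np.random.default_rng(0)
X_multi = rng.normal(size=(200, 5))
y_multi = 3 * X_multi[:, 0] - 2 * X_multi[:, 1] + rng.normal(scale=0.5, size=200)

lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X_multi, y_multi)
print(lasso_reg.coef_)  # the three irrelevant features get weights of (almost) exactly 0
```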

Gradient Descent

Gradient Descent is a generic optimization algorithm and one of the most important concepts in Machine Learning. It is the basis for many learning algorithms and is what powers the training of neural networks.

So what is it? It is simply the process of minimizing a cost function by taking incremental steps in the direction of steepest descent, i.e., opposite the gradient. For a convex cost function like linear regression's MSE this finds the global minimum; for a non-convex function (like the one pictured at the top of this post) it may settle in a local one.

Three-Dimensional Gradient Descent from Wikipedia

The above image shows an example of gradient descent on a cost function drawn as a topographic map, with the outermost ring being the highest cost and the innermost ring the lowest. In this example we see the algorithm progressively working its way toward the central ring at each step. As it progresses, the path self-corrects and curves toward the center rather than just continuing along its original trajectory.
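In code, batch gradient descent for linear regression comes down to a few lines. A minimal sketch (reusing the bias-augmented X_b and y from the Linear Regression example; the learning rate and epoch count are illustrative):

```python
import numpy as np

eta = 0.1        # learning rate: too small converges slowly, too large diverges
n_epochs = 1000
m = len(X_b)     # X_b and y come from the Normal Equation example above

theta = np.random.randn(2)  # random initialization
for epoch in range(n_epochs):
    gradients = (2 / m) * X_b.T @ (X_b @ theta - y)  # gradient of the MSE cost
    theta -= eta * gradients  # step opposite the gradient

print(theta)  # converges to roughly the same [4, 3] as the Normal Equation
```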

Polynomial Regression

So, what happens if our line of best fit isn't a straight line? We can still use Linear Regression, as long as we preprocess our data by adding polynomial features (powers of the original features). Then, we can use Linear Regression to optimize our cost function on this expanded data. However, it is important to choose the right degree: too high a degree leads to overfitting, and too low a degree to underfitting. So we will have to use the evaluation methods that were talked about in the first few chapters of the book.
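In sklearn this is a two-step pipeline: expand the features, then fit a plain linear model. A sketch on made-up quadratic data (my own example):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Quadratic data: y = 0.5x^2 + x + 2 plus noise
rng = np.random.default_rng(1)
X_q = 6 * rng.random((100, 1)) - 3
y_q = 0.5 * X_q[:, 0] ** 2 + X_q[:, 0] + 2 + rng.normal(size=100)

# degree=2 expands each x into [x, x^2]; LinearRegression then fits a curve
poly_reg = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                         LinearRegression())
poly_reg.fit(X_q, y_q)
print(poly_reg.predict([[1.5]]))  # roughly 0.5*1.5^2 + 1.5 + 2 = 4.625
```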

Logistic Regression

Logistic or Logit Regression turns a linear model into a binary classifier. Instead of outputting a number directly, it passes the linear model's output t through the logistic function (1/(1+e^-t)), a sigmoid that squashes any value into a probability between 0 and 1.

Sigmoid Function from Wikipedia

From this image we can see that all outputs lie between 0 and 1, with the value 0.5 occurring where the input equals 0. So, in a Logistic Regression learning method we classify any instance whose estimated probability falls below 0.5 as a 0 and any instance at or above 0.5 as a 1, representing any sort of binary class, like spam versus not spam.
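The sklearn estimator exposes both the raw sigmoid probabilities and the thresholded classes. A sketch on the classic iris dataset, predicting whether a flower is a virginica from its petal width:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X_petal = iris.data[:, 3:]                     # petal width (cm)
y_virginica = (iris.target == 2).astype(int)   # 1 if virginica, else 0

log_reg = LogisticRegression()
log_reg.fit(X_petal, y_virginica)

# predict_proba gives the sigmoid probabilities; predict applies the 0.5 cutoff
print(log_reg.predict_proba([[1.7]]))
print(log_reg.predict([[1.7]]))
```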

My Thoughts

A lot of this was review for me from the Applied Machine Learning Coursera course I took earlier this year, but I am happy to have read this chapter because it actually got into the equations that support these algorithms. Even though I have taken linear algebra in the past, I don't recall learning about the Normal Equation. I feel that I have a much stronger grasp of these learning algorithms now and will be able to apply them better in the future (this will also be a good chapter to use as a reference for these concepts).

Thanks for reading!

If you have any questions or feedback, please reach out to me on Twitter @wtothdev or leave a comment!

Additionally, I wanted to give a huge thanks to Aurélien Géron for writing such an excellent book. You can purchase said book here (non-affiliate).

Disclaimer: I don't make any money from any of the services referenced and chose to read and review this book of my own free will.
