A Review of Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow by Aurélien Géron
Summary
This chapter focuses on Deep Learning and the techniques we can use to keep neural networks from getting out of hand as they grow deeper and more complex. Traditionally, Deep Learning is defined as a neural network that contains 3 or more layers. But with these additional layers comes additional complexity, and with complexity come more ways for a project to break. Most of this chapter introduces the techniques we can use to minimize these breakages when training deep models.
Vanishing/Exploding Gradients
Neural networks are trained through backpropagation, using gradient descent to adjust their weights so that we get the intended result. However, as neural nets get deeper, the gradients flowing backward through the layers can shrink so small that the lower layers barely get updated (vanishing gradients), or grow so large that training diverges (exploding gradients). Because this issue is so common, there are several techniques we can use to prevent it as our models get deeper and deeper.
Batch Normalization is the most common way to address the vanishing/exploding gradients problem. It works by adding a batch normalization layer, usually just before or after each hidden layer's activation function, to re-center and re-scale its inputs. This keeps the values flowing through the network normalized even as the network gets deeper, preventing any single layer's outputs from growing so large or so small that the model diverges. In building your models you add keras.layers.BatchNormalization() around each hidden layer (although whether it goes before or after the activation function is somewhat disputed).
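As a minimal sketch of what this looks like in practice (the layer sizes and input shape here are just placeholders I picked for illustration), a model using Batch Normalization might look like this:
from tensorflow import keras

# Batch Normalization after each hidden layer
# (placing it before the activation is the other common variant)
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, activation="relu"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation="softmax"),
])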
Gradient Clipping is a technique where we quite literally clip the gradients at a certain value during backpropagation. For example, keras.optimizers.SGD(clipvalue=1.0) will clip every component of the gradient vector to between -1.0 and 1.0. These bounds help prevent the exploding gradients problem, in Recurrent Neural Networks in particular.
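As a sketch, assuming you already have a model like the one above, gradient clipping is just an argument on the optimizer (the loss function here is only a placeholder):
from tensorflow import keras

# Clip each component of the gradient vector to the range [-1.0, 1.0]
optimizer = keras.optimizers.SGD(clipvalue=1.0)

# Alternatively, clip the gradient's overall L2 norm instead, which
# preserves its direction:
# optimizer = keras.optimizers.SGD(clipnorm=1.0)

model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer)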
Reusing Pretrained Layers
This technique is exactly what it sounds like: you reuse layers that others have already trained on similar problems, either to cut down the time it takes to build your solution or because your own model lacks a large amount of training data. One example might be that your company asks you to create a model to check for the difference between good highway infrastructure and bad highway infrastructure. Rather than building a model completely from scratch, you could borrow Tesla's or Waymo's first few layers that focus on road conditions and then build your detection of potholes and other safety hazards on top of those layers. This particular example wouldn't work because that data is proprietary, but there are many open source resources out there, like TensorFlow Hub and PyTorch Hub, that offer pretrained models.
To actually implement this in TensorFlow 2.0, you first need to load the model you want to reuse with
model1 = keras.models.load_model(<model name>)
and then create a new model from those layers with
model2 = keras.models.Sequential(model1.layers[<slice the layers you need>])
and then add your own layers on top:
model2.add(<your own layer>)
Finally, you have to turn off the trainable attribute of the reused layers so that only the new layers you added get trained:
for layer in model2.layers[<slice the model 1 layers>]:
    layer.trainable = False
From here you will be able to train your new model on top of the pretrained layers!
But first, you should take a clone of your original model so that you don't lose its original architecture and weights, since model2 shares model1's layers. Note that clone_model() only copies the architecture, so you have to copy the weights over separately:
model1_clone = keras.models.clone_model(model1)
model1_clone.set_weights(model1.get_weights())
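Putting all of those steps together, a minimal end-to-end sketch might look like this (the file name, layer slice, new output layer, and loss are all hypothetical placeholders to adjust for your own problem):
from tensorflow import keras

# Load the pretrained model (hypothetical file name) and clone it
# so its original weights are kept safe
model1 = keras.models.load_model("pretrained_model.h5")
model1_clone = keras.models.clone_model(model1)
model1_clone.set_weights(model1.get_weights())

# Reuse every layer except the original output layer
model2 = keras.models.Sequential(model1.layers[:-1])

# Add your own output layer for the new task (binary classification here)
model2.add(keras.layers.Dense(1, activation="sigmoid"))

# Freeze the reused layers so only the new layer is trained at first
for layer in model2.layers[:-1]:
    layer.trainable = False

# Remember to compile again after freezing or unfreezing layers
model2.compile(loss="binary_crossentropy", optimizer="sgd",
               metrics=["accuracy"])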
Faster Optimizers
While the previous methods will speed up training, we can also swap in a faster optimizer. Regular Gradient Descent is the basis for most of these algorithms, but we can use faster optimizers such as momentum optimization, Nesterov Accelerated Gradient, AdaGrad, RMSProp, and Adam or Nadam optimization. These are currently among the best stable optimizers, but more will surely appear as Machine Learning continues its rapid growth.
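As a quick sketch, each of these is available in keras.optimizers and can be dropped in when compiling a model (the hyperparameter values below are just typical illustrative choices, not recommendations):
from tensorflow import keras

# Momentum optimization and its Nesterov variant
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)

# Adaptive optimizers
optimizer = keras.optimizers.Adagrad(learning_rate=0.001)
optimizer = keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)
optimizer = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
optimizer = keras.optimizers.Nadam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)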
My Thoughts
This chapter was our first introduction to deep learning, and the author chose to focus on the errors we will encounter when building deep learning models rather than teaching us how to create them first and explaining the issues afterward. I think this was an interesting approach, and hopefully one that helps me keep these techniques in mind as we continue on to more and more complex methods. The only thing I wished for from this chapter was more clarity about what the actual problems are. I understood them all at a high level, but I would have liked to see examples of some of them as they occur in practice.
Thanks for reading!
If you have any questions or feedback, please reach out to me on Twitter @wtothdev or leave a comment!
Additionally, I wanted to give a huge thanks to Aurélien Géron for writing such an excellent book. You can purchase said book here (non-affiliate).
Disclaimer: I don’t make any money from any of the services referenced and chose to read and review this book under my own free will.