Chapter 13: Loading and Preprocessing Data with TensorFlow

Will Toth
4 min read · Jun 22, 2021

A Review of Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow by Aurélien Géron

[Figure: Orthogonal projection of a 10-dimensional hypercube]

Summary

This chapter focused on the different ways that TensorFlow allows you to store and process data using built-in functionality. There is good reason this chapter runs 40 pages — it could easily have been much longer.

The Data API

Data handling in TensorFlow centers around the Data API, whose basic unit is the dataset (tf.data.Dataset). A dataset lets us process our data easily even when it is too large to fit in memory. To transform a dataset we can chain built-in methods like repeat, batch, map, and apply. The last two let us run functions over our data: map transforms each element individually, while apply transforms the dataset as a whole.
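As a minimal sketch of this chaining style (the tiny integer dataset here is just toy data standing in for something that would normally stream from disk):

```python
import tensorflow as tf

# Build a small dataset from an in-memory range; in practice the same
# pipeline would stream from files far too large to fit in RAM.
dataset = tf.data.Dataset.range(10)

# Chain transformations: repeat the data 3 times, then group it into
# batches of 7, then double every element with map().
dataset = dataset.repeat(3).batch(7).map(lambda x: x * 2)

for item in dataset:
    print(item)  # each item is one batch tensor
```

Each transformation returns a new dataset, so the whole pipeline reads as one fluent chain.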

As we learned in previous chapters, shuffling our data is one of the best ways to ensure that Gradient Descent trains evenly and efficiently. Because this is so essential to neural networks, the Data API has a shuffle function built in (dataset.shuffle). Going beyond the basic shuffle, the Data API also has a built-in function called interleave, used when importing data to interleave (join in alternating rows) multiple files together. Combining shuffle and interleave when working with neural nets helps ensure that your data is independent and identically distributed before you start training.
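A small sketch of how the two combine — here three in-memory lists stand in for three files on disk (with real files you would start from something like tf.data.Dataset.list_files and TextLineDataset instead):

```python
import tensorflow as tf

# Three toy "files", each a short run of records.
files = [[0, 1, 2], [10, 11, 12], [20, 21, 22]]
ds = tf.data.Dataset.from_tensor_slices(files)

# interleave() reads cycle_length sources at once, pulling one element
# from each in turn — so records from the three sources alternate.
ds = ds.interleave(tf.data.Dataset.from_tensor_slices, cycle_length=3)

# shuffle() keeps a buffer of buffer_size elements and draws from it
# at random, adding a second layer of mixing on top of interleaving.
ds = ds.shuffle(buffer_size=9, seed=42)
```

Before the shuffle, the interleaved order would be 0, 10, 20, 1, 11, 21, …, one record from each source in rotation.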

This chapter also introduced prefetching, a form of parallelism. Because training processes large amounts of data, you break your data into smaller batches that fit in memory. Traditionally this runs in a linear fashion: fetch a batch with the CPU, train on it with the GPU, fetch the next batch with the CPU, train with the GPU, and so on. Prefetching makes this more efficient through CPU/GPU parallelism: while the GPU trains on batch n, the CPU (rather than waiting) fetches batch n+1, so that when the GPU finishes batch n it can immediately move on. This cuts out the idle time for each processor and uses both far more efficiently.

Encoding

Encoding is an essential topic that we touched on earlier in the book: it is how we transform non-numeric data into numeric data so that Machine Learning algorithms can correctly interpret it. Like sklearn, TensorFlow has built-in functionality for one-hot encoding, primarily the tf.one_hot() function, which lets us easily encode our values given a list of categories.
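A short sketch using the ocean-proximity categories from the book's housing dataset; the specific index values here are made up for illustration:

```python
import tensorflow as tf

# The categorical feature's vocabulary.
vocab = ["<1H OCEAN", "INLAND", "NEAR OCEAN", "NEAR BAY", "ISLAND"]

# Integer indices into the vocabulary (e.g. from a lookup table).
indices = tf.constant([3, 1, 1, 0])

# depth = number of categories; each index becomes a one-hot row.
one_hot = tf.one_hot(indices, depth=len(vocab))
print(one_hot)
```

Each row is all zeros except for a single 1 at the category's index, so the result has shape (4, 5) here.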

TensorFlow also gives us an alternative to one-hot encoding that works very differently than anything in sklearn: embeddings. Instead of sparse vectors, each category is represented as a dense point in a multi-dimensional space. To do this you use the keras.layers.Embedding() layer, which maps each category to a trainable vector in that space.
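A minimal sketch of the layer in isolation (the 5 categories and 2 dimensions are arbitrary choices for illustration; the vectors start random and are adjusted during training):

```python
import tensorflow as tf
from tensorflow import keras

# 5 possible categories, each mapped to a trainable 2-D vector.
embedding = keras.layers.Embedding(input_dim=5, output_dim=2)

# Look up the embeddings for a batch of category indices.
vectors = embedding(tf.constant([3, 1, 1, 0]))
print(vectors.shape)  # one 2-D vector per input index
```

Because the vectors are trainable weights, categories that behave similarly tend to drift toward each other in the embedding space as training proceeds.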

My Thoughts

This chapter was honestly a bit of a tough read: many pages of dense text covering one of the less glamorous topics in Machine Learning, how to work with data efficiently. But even if reading it was a bit of a slog, I think this is a chapter I will look back on time and time again because of its wealth of knowledge. It breaks down how TensorFlow stores and processes data at a low level and explains many concepts, like how Protobufs work, in real depth. It also serves as a primer for some of the more complex GPU and parallel processing that will become extremely important down the road. Overall I think there was great information here, even if I had difficulty grinding through it in one go (I ended up taking a walk about halfway through so that I could refocus).

Thanks for reading!

If you have any questions or feedback, please reach out to me on Twitter @wtothdev or leave a comment!

Additionally, I wanted to give a huge thanks to Aurélien Géron for writing such an excellent book. You can purchase said book here (non-affiliate).

Disclaimer: I don’t make any money from any of the services referenced and chose to read and review this book under my own free will.
