A Review of Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow by Aurélien Géron
Summary
This chapter focuses on working through a full machine learning project. More specifically, the author proposes a fictional situation in which you are assigned to a project to predict the value of houses in California so that your company can improve its current investment strategy.
Getting and Evaluating Your Data
So, to do this, you start by setting up a pipeline so that you can reproducibly collect the data you will be working with from the US Census and import it into a pandas DataFrame (for those unfamiliar, the author also provides an introduction to Jupyter notebooks and virtual environments).
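The book builds this as a small download-and-load helper; the sketch below is an illustrative version of that idea, assuming the census data is published as a compressed archive at some URL. The URL and paths here are placeholders, not the book's exact ones.

```python
import os
import tarfile
import urllib.request

import pandas as pd

# Placeholder location for the housing archive -- swap in the real URL you use.
HOUSING_URL = "https://example.com/datasets/housing/housing.tgz"
HOUSING_PATH = os.path.join("datasets", "housing")


def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    """Download and extract the census housing data so the step is repeatable."""
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)


def load_housing_data(housing_path=HOUSING_PATH):
    """Read the extracted CSV into a pandas DataFrame."""
    return pd.read_csv(os.path.join(housing_path, "housing.csv"))
```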
Then an extremely important step is undertaken: dividing the data into a training set and a test set to be used later on for evaluation (sklearn.model_selection.train_test_split()). From here, we are introduced to a variety of statistical and visual methods we can use to evaluate our data during preprocessing. Some of the most important are Matplotlib histograms (plt.hist()), the standard correlation coefficient (pd.DataFrame.corr()), and pandas scatter matrices (pandas.plotting.scatter_matrix()).
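A hedged sketch of that exploration step, assuming the CSV produced by the previous snippet and the book's column names (median_house_value, median_income); adjust the path and columns for your own data:

```python
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import scatter_matrix
from sklearn.model_selection import train_test_split

# Path matches the earlier download sketch (a placeholder).
housing = pd.read_csv("datasets/housing/housing.csv")

# Set aside a test set up front so evaluation stays honest and repeatable.
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

# Histogram of a single numeric column.
plt.hist(train_set["median_income"], bins=50)
plt.xlabel("median_income")
plt.show()

# Standard correlation coefficients against the target column.
corr_matrix = train_set.corr(numeric_only=True)
print(corr_matrix["median_house_value"].sort_values(ascending=False))

# Scatter matrix for a couple of promising attributes.
scatter_matrix(train_set[["median_house_value", "median_income"]], figsize=(8, 8))
plt.show()
```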
Data Cleaning and Preprocessing
The biggest techniques covered are null handling, encoding text/categorical data, and feature scaling.
- Here we are introduced to the DataFrame.dropna(), DataFrame.drop(), and DataFrame.fillna() methods, which do exactly what they sound like: they either drop or fill null (or, in NumPy's case, NaN, "not a number") values.
- Encoding is a bit more interesting. Machine learning algorithms can only interpret quantifiable data, so we must convert text or categorical values like "Los Angeles" and other cities into numbers. We do this through encoding, and in this case the author uses sklearn.preprocessing.OrdinalEncoder() to translate values into numbers. For example, if we had "Los Angeles", "Burbank", and "Palo Alto", they might be encoded as 0, 1, and 2 so that our algorithm could interpret them. After this the author introduces the concept of one-hot encoding. This is a form of binary encoding that works well for small numbers of categories. For example, if we had a backyard column with the possible values "no backyard", "grass", and "pool", we would turn it into three true/false columns: each house would get a "No Backyard" column with a value of 0 or 1, a "Grass" column with a value of 0 or 1, and a "Pool" column with a value of 0 or 1, creating three binary columns rather than one categorical column.
- Feature scaling is when you standardize the values of your features to a similar scale. This can make machine learning models more accurate because of how the underlying algorithms interact with the data. Géron presents two types of feature scaling here. First is min-max scaling (aka normalization), which is exactly what it sounds like: scaling the possible values based on the min and max of the data (a column in this case). You take the value you are scaling, subtract the column minimum, and divide the result by the max minus the min (use MinMaxScaler in scikit-learn). The other type is standardization (use StandardScaler in scikit-learn): subtract the mean from the current value and divide by the standard deviation. The benefit of this approach is that the values are less affected by outliers, but it can cause issues in algorithms, like some neural networks, that expect values between 0 and 1. (See the sketch after this list for all three steps.)
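Here is a minimal sketch tying the three preprocessing steps together with scikit-learn; the toy DataFrame and its column names are made up for illustration, not the book's exact features.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import (MinMaxScaler, OneHotEncoder,
                                   OrdinalEncoder, StandardScaler)

# Toy frame standing in for the housing data (columns are illustrative).
df = pd.DataFrame({
    "total_bedrooms": [2.0, np.nan, 4.0, 3.0],
    "median_income": [1.5, 3.2, 8.7, 2.4],
    "city": ["Los Angeles", "Burbank", "Palo Alto", "Burbank"],
})

# 1. Null handling: fill missing numbers with the column median (or drop them).
df["total_bedrooms"] = df["total_bedrooms"].fillna(df["total_bedrooms"].median())

# 2. Encoding: integer codes per category vs. one binary column per category.
ordinal_codes = OrdinalEncoder().fit_transform(df[["city"]])        # e.g. 0, 1, 2
one_hot = OneHotEncoder().fit_transform(df[["city"]]).toarray()     # 0/1 columns

# 3. Feature scaling: min-max to [0, 1], or standardization (mean 0, std 1).
minmax_scaled = MinMaxScaler().fit_transform(df[["median_income"]])
standardized = StandardScaler().fit_transform(df[["median_income"]])
```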
Select, Train, and Tune the Model
In the last section of chapter two we get to the meat and potatoes of machine learning: the actual learning. Having spent so much time evaluating and visualizing our data, we should have a pretty good understanding of what type of algorithm to use (something we will learn more about in future chapters) and can apply it. Scikit-Learn makes this process extremely easy to implement; it usually just involves declaring which model we are going to use and then calling fit and predict on our training data.
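For example, fitting a linear regression really is just a few lines. The data below is a tiny synthetic stand-in fabricated purely to make the snippet runnable; it is not the book's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Tiny synthetic stand-in for the prepared housing features and labels.
rng = np.random.default_rng(42)
X_train = rng.normal(size=(100, 3))
y_train = X_train @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=100)

# Declare the model, fit it, and predict -- that is most of the scikit-learn API.
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
predictions = lin_reg.predict(X_train)

rmse = np.sqrt(mean_squared_error(y_train, predictions))
print(f"training RMSE: {rmse:.3f}")
```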
Now the fun starts. Here we use some of the evaluation techniques learned in the previous chapter, like cross-validation (we will spend a great deal of time on this in future chapters). Once we have evaluated our model, we can decide whether we need to try another because it is under- or over-fitting, and fine-tune the model with techniques like grid search or randomized search.
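A sketch of both steps on the same synthetic stand-in data: cross-validation to estimate generalization error, then a small grid search over random forest hyperparameters (the grid values are arbitrary, chosen only to keep the example quick).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score

# Same synthetic stand-in data as the previous sketch.
rng = np.random.default_rng(42)
X_train = rng.normal(size=(100, 3))
y_train = X_train @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=100)

forest = RandomForestRegressor(random_state=42)

# 10-fold cross-validation; convert negated MSE scores back to RMSE.
scores = cross_val_score(forest, X_train, y_train,
                         scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)
print("CV RMSE mean:", rmse_scores.mean())

# Grid search over a small hyperparameter grid, refitting the best combination.
param_grid = {"n_estimators": [10, 30], "max_features": [2, 3]}
grid_search = GridSearchCV(forest, param_grid,
                           scoring="neg_mean_squared_error", cv=5)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
```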
Major Takeaways
The Focus on Data
This is among the longer chapters in Hands-On Machine Learning, coming in at 49 pages, and yet only 3 pages are dedicated to selecting and training your model, followed by 4 on fine-tuning it. So in a chapter dedicated to the complete machine-learning process, only about one-seventh is dedicated to models and training.
Reproducibility
It was stressed throughout this chapter that to do good machine learning, everything we do needs to be reproducible: from the data pipeline to train_test_split (hint: use the random_state parameter) and eventually the models themselves. Even if you get trash results, you should save that model, because at the absolute worst you can eventually reuse the code elsewhere on another project where it will be useful.
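One cheap way to follow that advice is to persist every model you train. Below is a minimal sketch using joblib for model persistence; the file name and throwaway data are placeholders.

```python
import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Fit a throwaway model on synthetic data just to have something to save.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(50, 2)), rng.normal(size=50)
model = LinearRegression().fit(X, y)

# Persist the trained model so the experiment can be revisited or compared later.
joblib.dump(model, "lin_reg_model.pkl")
restored = joblib.load("lin_reg_model.pkl")
```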
Your Results Won’t Always be Lifechanging
Depending on how you evaluate it, the project tackled in this chapter was not a success. The goal was to predict the value of a house in California better than the current analysts could, but we were only able to get results comparable to what the analysts were achieving. Even though we weren't able to exceed human capability on this problem, we did create a program that does the job in a faster, cheaper, and more reproducible way, which could likely save the company money.
My Thoughts
The Focus on Data (my thoughts)
I think Géron makes this the focus of the first two chapters to instill in readers, who I believe are, like me, early in their machine learning careers, that data is just as important as the algorithms. A great deal of time is spent gathering the data, which can be hard and/or expensive to obtain, but also evaluating and preparing that data for machine learning.
The Whole Process
There is a great deal of statistics involved in what we have learned in these last two chapters, and honestly a lot of what I have seen so far relates to what I learned in my Econometrics class in college. The methodologies at this stage seem extremely similar to how economists use data in their analysis (possibly why so many Econ PhDs go on to work in data science).
Thanks for reading!
If you have any questions or feedback, please reach out to me on Twitter @wtothdev or leave a comment!
Additionally, I wanted to give a huge thanks to Aurélien Géron for writing such an excellent book. You can purchase said book here (non-affiliate).
Disclaimer: I don’t make any money from any of the services referenced and chose to read and review this book under my own free will.