Stay organized with collections
Save and categorize content based on your preferences.
Machine learning helps us find patterns in data—patterns we then use to
make predictions about new data points. To get those predictions right,
we must construct the data
set
and transform the data correctly.
This course covers these two key steps. We'll also see how
training/serving considerations play into these steps.
Why Learn About Data Preparation and
Feature Engineering?
You can think of feature engineering as helping the model to
understand the data set in the same way you do. Learners often come to a machine
learning course focused on model building, but end up spending much more time
focusing on data.
For the following question,
click the desired arrow to check your answer:
If you had to prioritize improving one of the areas below in your machine
learning project, which would have the most impact?
The quality and size of your data
Data trumps all. It's true that updating your learning algorithm or
model architecture will let you learn different types of patterns, but if
your data is bad, you will end up building functions that fit the wrong
thing. The quality and size of the data set matters much more than which
shiny algorithm you use.
Using the latest optimization algorithm
You could definitely see some gains in pushing optimizers, but it
wouldn't have as significant an impact on your model as another item
in this list.
A deeper network
While a deeper network may improve your model, the impact
won't be as significant as another item in this list.
A more clever loss function
Close! A better loss function can give you a big win, but
it's still second to another item in this list.
Why is Collecting a Good Data Set Important?
Google Translate
"...one of our most impactful quality advances since neural machine translation
has been in identifying the best subset of our training data to use"
- Software Engineer, Google Translate
The Google Translate team has more training data than they can use.
Rather than tuning their model, the team has
earned bigger wins by using the best features in their data.
"...most of the times when I tried to manually debug interesting-looking
errors they could be traced back to issues with the training data."
- Software Engineer, Google Translate
"Interesting-looking" errors are typically caused by
the data. Faulty data may cause your model to learn the wrong patterns,
regardless of what modeling techniques you try.
Brain's Diabetic Retinopathy Project
Google Brain's diabetic retinopathy project employed a neural network
architecture, known as Inception.
to detect disease by classifying images. The team didn't tweak models.
Instead, they succeeded by creating a data set of 120,000 examples labeled
by ophthalmologists. (Learn more at
https://research.google.com/pubs/pub43022.html.)