Transforming Your Data: Check Your Understanding

For each of the following questions, choose your answer, then read the explanation that follows each option:

You’re preprocessing data for a regression model. What transformations are mandatory? Check all that apply.
Converting all non-numeric features into numeric features.
Correct. This is a mandatory transformation. You must convert strings to some numeric representation because you can’t do matrix multiplication on a string.
Normalizing numeric data.
Normalizing numeric data could help, but it’s an optional quality transformation.
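For concreteness, here’s a minimal Python sketch of both transformations. The feature names ("city", "sqft") and the values are hypothetical, not from the course:

    import pandas as pd

    df = pd.DataFrame({
        "city": ["Oslo", "Lima", "Oslo", "Kyoto"],  # non-numeric feature
        "sqft": [55.0, 120.0, 80.0, 240.0],         # numeric feature
    })

    # Mandatory: convert the string feature to a numeric representation
    # (one-hot encoding here).
    encoded = pd.get_dummies(df, columns=["city"])

    # Optional quality transformation: z-score normalize the numeric feature.
    encoded["sqft"] = (encoded["sqft"] - encoded["sqft"].mean()) / encoded["sqft"].std()
    print(encoded)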


Consider the chart below. Which data transformation technique would likely be the most productive to start with and why? Assume your goal is to find a linear relationship between roomsPerPerson and house price.
Z-score
Z-score is a good choice when the outliers aren’t extreme; however, this data set contains extreme outliers.
Clipping
Clipping is a good choice here because the data set contains extreme outliers. You should fix extreme outliers before applying other normalizations.
Log Scaling
Log scaling is a good choice if your data conforms to a power law distribution. However, this data conforms to a normal distribution rather than a power law distribution.
Bucketing (binning) with quantile boundaries
Quantile bucketing can be a good approach for skewed data, but in this case the skew is due in part to a few extreme outliers. Also, you want the model to learn a linear relationship, so you should keep roomsPerPerson numeric rather than transforming it into categories, which is what bucketing does. Instead, try a normalization technique.

Figure: A chart showing the relative frequency of roomsPerPerson, where roomsPerPerson is the number of rooms in a residence divided by the number of people living there. Most of the data falls between 0 and 5, with a smattering of points from 5 to 55.
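A minimal sketch of clipping with NumPy; the values and the cap of 4.0 are hypothetical, chosen only to show an extreme outlier being tamed before further normalization:

    import numpy as np

    # Hypothetical roomsPerPerson values, including one extreme outlier.
    rooms_per_person = np.array([1.2, 0.8, 2.5, 3.0, 54.0, 1.1])

    # Clip values above a chosen cap; in practice, pick the cap by
    # inspecting the distribution.
    clipped = np.clip(rooms_per_person, None, 4.0)
    print(clipped)  # the 54.0 outlier becomes 4.0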


Consider the chart below. Which data transformation technique would likely be the most productive to start with and why?
Z-score
Z-score is a good choice when the outliers aren’t so extreme that you need clipping, but that’s not the issue here. The way the data is skewed should be a hint that a different technique fits better.
Clipping
Clipping is a good choice when there are extreme outliers. This chart, however, shows a power law distribution, and another normalization technique is better suited to it.
Log Scaling
Log scaling is a good choice here because the data conforms to a power law distribution.
Bucketing (binning) with quantile boundaries
Quantile bucketing can be a good approach for skewed data. However, you are looking for the model to learn a linear relationship. Therefore, you should keep your data numeric and avoid putting it in buckets. Try a normalization technique instead.

Figure: A bar graph whose bars are heavily concentrated at the low end. The first bar has a magnitude of 1,200, the second 460, and the third 300. By the 15th bar the magnitude falls to about 30, and a long tail continues for another 90 bars without ever rising above 10.
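A minimal sketch of log scaling with NumPy, using hypothetical counts shaped roughly like the bar graph above:

    import numpy as np

    # Hypothetical counts with a long tail.
    values = np.array([1200.0, 460.0, 300.0, 30.0, 8.0, 1.0])

    # log1p computes log(1 + x), which also handles zeros gracefully.
    log_scaled = np.log1p(values)
    print(log_scaled)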


Consider the chart below. Would a linear model make a good prediction about the relationship between compression-ratio and city-mpg? If not, how might you transform the data to better train the model?
Yes, the model would probably find a linear relationship and make pretty accurate predictions.
While the model would find a linear relationship, it wouldn’t make very accurate predictions. You can try training a model on this data set in the Data Modeling exercise to better understand why.
No. The model would probably be more accurate after scaling.
You could apply linear scaling, but the slope of the relationship between compression-ratio and city-mpg would look the same. What would help more is modeling two separate slopes: one for the cluster of points at the lower compression-ratio values and another for the cluster at the higher values.
No. There seem to be two different behaviors here. Setting a threshold in the middle and using a bucketized feature might help you better understand what's happening in those two regions.
Correct. It’s important to be clear about why and how you are setting the boundaries. In the Data Modeling exercise, you’ll learn more about exactly how this approach can help you create a better model.

Figure: A scatterplot of city-mpg against compression-ratio. Two distinct clumps of data, one much bigger than the other, appear at opposite ends of the compression-ratio axis. The bigger clump covers the compression-ratio range 7-12; the smaller clump covers the range 21-23. The city-mpg values are generally a little lower in the bigger clump than in the smaller one.
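A minimal sketch of the thresholding idea with pandas; the values and the boundary of 16 are hypothetical illustrations, not course data:

    import numpy as np
    import pandas as pd

    # Hypothetical compression-ratio values drawn from the two clusters.
    compression_ratio = pd.Series([8.5, 9.0, 10.1, 7.6, 21.5, 22.0])

    # Split at a threshold between the clusters; 16 is an illustrative
    # midpoint, not a value from the course.
    bucketized = pd.cut(compression_ratio,
                        bins=[-np.inf, 16, np.inf],
                        labels=["low_compression", "high_compression"])
    print(bucketized)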


A peer team is telling you about the progress they’ve made on their ML project. They computed a vocabulary and trained a model offline. They want to avoid staleness issues, however, so they’re now about to train a different model online. What might happen next?
The model will stay up to date as new data arrives. The other team will need to continually monitor the input data.
Although avoiding model staleness is the main benefit of dynamic training, pairing a vocabulary computed offline with a model trained online will lead to problems.
They may find that the indices they’re using don’t correspond to the vocabulary.
Correct. Warn your colleagues about the perils of training/serving skew, and then recommend that they take Google’s course on Data Preparation and Feature Engineering for ML to learn more.
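A toy sketch of the mismatch your colleagues might hit; the tokens, vocabulary layout, and out-of-vocabulary handling are all hypothetical:

    # Vocabulary computed offline, before online training began.
    offline_vocab = {"red": 0, "green": 1, "blue": 2}

    def to_index(token, vocab, oov_index=-1):
        # Unseen tokens fall back to an out-of-vocabulary index.
        return vocab.get(token, oov_index)

    print(to_index("blue", offline_vocab))     # 2
    # Online training later encounters a token the offline vocabulary
    # never saw, so its index no longer corresponds to anything the
    # model learned.
    print(to_index("magenta", offline_vocab))  # -1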