Numerical data: Qualities of good numerical features

Page Summary

Good feature vectors require features that are clearly named and have obvious meanings to anyone on the project.
Data should be checked for bad values, such as outliers or inappropriate entries, before being used for training.
Features should be sensible: a continuous feature should not use a "magic value" to represent the absence of a measurement, because that creates a discontinuity. Use a separate Boolean feature to indicate missing data instead.
Discrete numerical features with missing values should be assigned a new value within the finite set, enabling the model to learn weights for each value including missing features.

This unit has explored ways to map raw data into suitable feature vectors. Good numerical features share the qualities described in this section.

Clearly named

Each feature should have a clear, sensible, and obvious meaning to any human on the project. For example, the meaning of the following feature value is confusing:

Not recommended

house_age: 851472000

In contrast, the following feature name and value are far clearer:

Recommended

house_age_years: 27
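
The renaming above can be sketched in code. This is a minimal illustration that assumes the raw house_age value is an age in seconds; the source does not state the unit, which is exactly why the original name is confusing. The helper name and the 365-day year are assumptions made for this example.

```python
# Assumption: a year is 365 days; leap days are ignored for simplicity.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def house_age_seconds_to_years(age_seconds: float) -> float:
    """Convert an age in seconds to an age in years (hypothetical helper)."""
    return age_seconds / SECONDS_PER_YEAR


raw_feature = {"house_age": 851472000}  # unit is not obvious from the name
clean_feature = {
    # The new name states both the meaning and the unit.
    "house_age_years": house_age_seconds_to_years(raw_feature["house_age"])
}
print(clean_feature)  # {'house_age_years': 27.0}
```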

Checked or tested before training

Although this module has devoted a lot of time to outliers, the topic is important enough to warrant one final mention. In some cases, bad data (rather than bad engineering choices) causes unclear values. For example, the following user_age_in_years came from a source that didn't check for appropriate values:

Not recommended

user_age_in_years: 224

But people can be 24 years old:

Recommended

user_age_in_years: 24

Check your data!
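
One simple way to check data is a range test run before training. The sketch below flags ages outside a plausible range; the bounds of 0 and 120 and the function name are assumptions chosen for illustration, not values from this course.

```python
def find_out_of_range(values, low, high):
    """Return (index, value) pairs for entries outside the range [low, high]."""
    return [(i, v) for i, v in enumerate(values) if not (low <= v <= high)]


# Assumed plausible bounds for a human age; adjust them for your own data.
MIN_AGE_YEARS = 0
MAX_AGE_YEARS = 120

user_age_in_years = [24, 224, 31, 57]
suspicious = find_out_of_range(user_age_in_years, MIN_AGE_YEARS, MAX_AGE_YEARS)
print(suspicious)  # [(1, 224)]
```

A check like this only reports suspicious values. Whether to drop, correct, or investigate them depends on the data source.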

Sensible

A "magic value" is a purposeful discontinuity in an otherwise continuous feature. For example, suppose a continuous feature named watch_time_in_seconds can hold any floating-point value between 0 and 30 but represents the absence of a measurement with the magic value -1:

Not recommended

watch_time_in_seconds: -1

A watch_time_in_seconds of -1 would force the model to try to figure out what it means to watch a movie backwards in time. The resulting model would probably not make good predictions.

A better technique is to create a separate Boolean feature that indicates whether or not a watch_time_in_seconds value is supplied. For example:

Recommended

watch_time_in_seconds: 4.82
is_watch_time_in_seconds_defined: True

watch_time_in_seconds: 0
is_watch_time_in_seconds_defined: False
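
As a rough sketch, the conversion from a magic value to a value plus a Boolean indicator might look like the following. It assumes the raw data marks a missing measurement with -1, as in the example above; the function name is hypothetical, and the placeholder of 0 for missing values follows the example rather than any fixed rule.

```python
MAGIC_MISSING = -1  # sentinel that the raw data uses for "no measurement"


def split_magic_value(raw_watch_time):
    """Replace the -1 sentinel with a placeholder value plus a Boolean indicator."""
    is_defined = raw_watch_time != MAGIC_MISSING
    return {
        # When the measurement is missing, store a placeholder of 0.0.
        "watch_time_in_seconds": float(raw_watch_time) if is_defined else 0.0,
        "is_watch_time_in_seconds_defined": is_defined,
    }


rows = [split_magic_value(v) for v in (4.82, -1)]
for row in rows:
    print(row)
```

The first row keeps its measured value with the indicator set to True. The second row stores the placeholder with the indicator set to False, so the model can tell a missing measurement apart from a genuine watch time of zero.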

This is one way to handle a continuous feature with missing values. Now consider a discrete numerical feature, like product_category, whose values must belong to a finite set. In this case, signify a missing value with a new value in that finite set. With a discrete feature, the model learns a different weight for each value, including a weight for the value that marks missing data.

For example, the possible values might form the following set:

{0: 'electronics', 1: 'books', 2: 'clothing', 3: 'missing_category'}
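
The following sketch shows one way to apply this mapping. It assumes missing raw values arrive as None, and it uses a one-hot encoding so that each value in the set, including the missing one, gets its own input and therefore its own weight. The function names are hypothetical.

```python
CATEGORY_NAMES = {0: "electronics", 1: "books", 2: "clothing", 3: "missing_category"}
MISSING_CATEGORY_ID = 3


def encode_product_category(raw_category):
    """Map a raw category ID to the finite set; None (assumed missing) maps to ID 3."""
    if raw_category is None:
        return MISSING_CATEGORY_ID
    return raw_category


def one_hot(category_id):
    """One-hot encode a category ID so each value can receive its own weight."""
    return [1 if category_id == i else 0 for i in sorted(CATEGORY_NAMES)]


for raw in (1, None):
    category_id = encode_product_category(raw)
    print(CATEGORY_NAMES[category_id], one_hot(category_id))
# books [0, 1, 0, 0]
# missing_category [0, 0, 0, 1]
```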
