Representation

A machine learning model can't directly see, hear, or sense input examples. Instead, you must create a representation of the data to provide the model with a useful vantage point into the data's key qualities. That is, in order to train a model, you must choose the set of features that best represent the data.

From Raw Data to Features

The idea is to map each part of the vector on the left to one or more fields in the feature vector on the right.

Raw data is mapped to a feature vector through a process called feature engineering.

An example of a feature that can be copied directly from the raw data

An example of a string feature (street name) that cannot be copied directly from the raw data

A dictionary maps each street name to an int in {0, ..., V-1}.
The one-hot vector above can then be represented by its index, <i>.
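
The mapping from street names to indices, and from an index to a one-hot vector, can be sketched as follows. This is a minimal illustration only: the helper names build_vocabulary and one_hot and the street names are hypothetical and do not come from the slides.

```python
def build_vocabulary(street_names):
    """Map each distinct street name to an integer index in {0, ..., V-1}."""
    vocab = {}
    for name in street_names:
        if name not in vocab:
            # The next unused index equals the current vocabulary size.
            vocab[name] = len(vocab)
    return vocab


def one_hot(index, size):
    """Return a list of `size` zeros with a single 1.0 at position `index`."""
    vector = [0.0] * size
    vector[index] = 1.0
    return vector


# Hypothetical street names, made up for illustration.
streets = ["Main Street", "Shorebird Way", "Charleston Road", "Main Street"]
vocab = build_vocabulary(streets)

# The compact representation is the index; the one-hot vector is derived from it.
index = vocab["Shorebird Way"]
vector = one_hot(index, len(vocab))
```

Storing only the index keeps the representation small when V is large, since the one-hot vector can always be rebuilt from it.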

Properties of a Good Feature

Each feature value should occur, with a non-zero value, more than a small handful of times in the dataset.

my_device_id:8SK982ZZ1242Z

device_model:galaxy_s6
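
One way to check this property is to count how often each value of a feature occurs and flag the values that are too rare. The sketch below assumes a threshold of five occurrences, which is an arbitrary choice for illustration; the function name rare_values and the sample data are hypothetical.

```python
from collections import Counter


def rare_values(values, min_count=5):
    """Return the set of values that occur fewer than `min_count` times."""
    counts = Counter(values)
    return {value for value, count in counts.items() if count < min_count}


# Hypothetical data: a device model shared by many examples,
# and a device ID that is unique to each example.
device_models = ["galaxy_s6"] * 6 + ["hypothetical_model_x"]
device_ids = [f"id_{i}" for i in range(7)]

rare_models = rare_values(device_models)
rare_ids = rare_values(device_ids)
```

Here every device ID is flagged as rare because each appears only once, while the common device model is not flagged.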

Features should have a clear, obvious meaning.

user_age:23

user_age:123456789

Features shouldn't take on "magic" values.

(Use an additional boolean feature like watch_time_is_defined instead!)

watch_time: -1.0

watch_time: 1.023

watch_time_is_defined: 1.0
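
Replacing a magic value with a separate boolean feature might look like the sketch below. It assumes that a missing watch time arrives as None in the raw data and that 0.0 is an acceptable placeholder when the value is undefined; both are assumptions made for illustration, and encode_watch_time is a hypothetical helper.

```python
def encode_watch_time(raw_watch_time):
    """Split a possibly missing watch time into a value and a defined flag."""
    if raw_watch_time is None:
        # Assumption: 0.0 is a neutral placeholder; the flag carries the meaning.
        return {"watch_time": 0.0, "watch_time_is_defined": 0.0}
    return {"watch_time": float(raw_watch_time), "watch_time_is_defined": 1.0}


defined = encode_watch_time(1.023)
undefined = encode_watch_time(None)
```

The model can then learn from watch_time_is_defined directly instead of having to discover that -1.0 is a special value.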

The definition of a feature shouldn't change over time.

(Beware of depending on other ML systems!)

city_id:"br/sao_paulo"

inferred_city_cluster_id:219

The distribution should not have extreme outliers.

Ideally, all features are transformed to a similar range, like (-1, 1) or (0, 5).

Distribution with outliers and a distribution with a cap
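
Capping outliers and then rescaling to a common range can be sketched as follows. The cap bounds, the target range, and the sample values are assumptions chosen for illustration, and min-max scaling is only one of several possible transformations.

```python
def clip(values, lower, upper):
    """Cap every value so that it lies within [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


def scale_to_range(values, new_min=-1.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; place them at the middle of the target range.
        midpoint = (new_min + new_max) / 2
        return [midpoint for _ in values]
    span = hi - lo
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]


# Hypothetical feature values with one extreme outlier.
raw = [1.0, 2.0, 3.0, 4.0, 50.0]
capped = clip(raw, 0.0, 5.0)
scaled = scale_to_range(capped)
```

Capping first matters: without it, the single outlier would compress all other values into a narrow band after scaling.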

The Binning Trick

Graph showing a distribution with a fitting curve based on location

Create several boolean bins, each mapping to a new unique feature.
This allows the model to fit a different value for each bin.
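
A minimal sketch of binning a numeric value into boolean features is shown below. It assumes the location is expressed as a latitude and uses made-up bin boundaries; bin_features is a hypothetical helper.

```python
from bisect import bisect_right


def bin_features(value, boundaries):
    """Return one boolean feature per bin; exactly one of them is 1.0.

    `boundaries` must be sorted in ascending order. With N boundaries
    there are N + 1 bins, including the open-ended bins at both ends.
    """
    index = bisect_right(boundaries, value)
    return [1.0 if i == index else 0.0 for i in range(len(boundaries) + 1)]


# Hypothetical latitude boundaries, chosen only for illustration.
latitude_boundaries = [33.0, 34.0, 35.0]
features = bin_features(34.2, latitude_boundaries)
```

Because each bin is its own feature, a linear model can assign a separate weight to each bin rather than a single weight to the raw value.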

Good Habits

KNOW YOUR DATA

Visualize: Plot histograms, rank most to least common.
Debug: Duplicate examples? Missing values? Outliers? Does the data agree with dashboards? Are training and validation data similar?
Monitor: Feature quantiles, number of examples over time?
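
A few of these checks can be sketched with the standard library alone. The example below assumes each example is a flat dictionary with hashable values and that missing values appear as None; summarize is a hypothetical helper, and the data is made up.

```python
from collections import Counter


def summarize(examples, feature):
    """Report basic health statistics for one feature across a list of examples."""
    values = [example.get(feature) for example in examples]
    present = [v for v in values if v is not None]
    # Two examples count as duplicates when all of their key/value pairs match.
    distinct = {tuple(sorted(example.items())) for example in examples}
    return {
        "num_examples": len(examples),
        "num_missing": len(values) - len(present),
        "num_duplicates": len(examples) - len(distinct),
        "most_common": Counter(present).most_common(3),
    }


# Hypothetical examples, made up for illustration.
examples = [
    {"user_age": 23, "device_model": "galaxy_s6"},
    {"user_age": 23, "device_model": "galaxy_s6"},
    {"user_age": None, "device_model": "galaxy_s6"},
    {"user_age": 41, "device_model": "galaxy_s6"},
]
summary = summarize(examples, "user_age")
```

Running the same summary on the training and validation sets, and again over time, gives a simple basis for the comparisons listed above.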
