Representation

A machine learning model can't directly see, hear, or sense input examples. Instead, you must create a representation of the data to provide the model with a useful vantage point into the data's key qualities. That is, in order to train a model, you must choose the set of features that best represent the data.

From Raw Data to Features

The idea is to map each part of the raw data on the left to one or more fields in the feature vector on the right.

Raw data is mapped to a feature vector through a process called feature engineering.

An example of a feature that can be copied directly from the raw data
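
As a minimal sketch of such a direct copy, assume a hypothetical raw record with a numeric `num_rooms` field. The record, its field names, and the helper `numeric_feature` are illustrative assumptions, not taken from the slides.

```python
# Hypothetical raw record; the field names and values are illustrative.
raw_record = {"num_rooms": 6, "street_name": "Main Street"}


def numeric_feature(record, key):
    """Copy a numeric raw value into a one-element float feature."""
    return [float(record[key])]


# The numeric value passes through unchanged apart from the cast to float.
num_rooms_feature = numeric_feature(raw_record, "num_rooms")
```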

An example of a string feature (street name) that cannot be copied directly from the raw data

Mapping a string value ("Main Street") to a sparse vector via one-hot encoding.
  • A dictionary maps each street name to an int in {0, ..., V-1}, where V is the number of distinct street names.
  • The one-hot vector above can then be represented compactly by its index <i>.
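
The two steps above can be sketched in plain Python. The street names other than "Main Street" are made up for illustration, and sorting the names before numbering them is just one possible way to assign indices.

```python
def build_vocabulary(street_names):
    """Map each distinct street name to an int in {0, ..., V-1}."""
    return {name: i for i, name in enumerate(sorted(set(street_names)))}


def one_hot(index, size):
    """Dense one-hot vector: 1.0 at `index`, 0.0 everywhere else."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


# Hypothetical street names (illustrative only).
streets = ["Main Street", "Elm Avenue", "Oak Lane", "Main Street"]
vocab = build_vocabulary(streets)

i = vocab["Main Street"]        # sparse form: store only the index
dense = one_hot(i, len(vocab))  # dense form: V entries, exactly one is 1.0
```

Storing only the index `i` is what keeps the representation sparse when V is large.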

Properties of a Good Feature

A feature value should appear with a non-zero value more than a handful of times in the dataset.

Bad (almost every value is unique): my_device_id:8SK982ZZ1242Z

Good (many examples share each value): device_model:galaxy_s6
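
One rough way to check this property is to count how often each value occurs. The sketch below is an assumption-laden illustration: the threshold `min_count=5`, the helper name `rare_value_fraction`, and the sample data are all hypothetical.

```python
from collections import Counter


def rare_value_fraction(values, min_count=5):
    """Fraction of distinct values that occur fewer than `min_count` times."""
    counts = Counter(values)
    if not counts:
        return 0.0
    rare = sum(1 for c in counts.values() if c < min_count)
    return rare / len(counts)


# Hypothetical columns: every device id is unique, device models repeat.
device_ids = [f"id_{n}" for n in range(100)]
device_models = ["galaxy_s6"] * 60 + ["other_model"] * 40

id_rarity = rare_value_fraction(device_ids)        # all values are rare
model_rarity = rare_value_fraction(device_models)  # no value is rare
```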

Features should have a clear, obvious meaning.

Good: user_age:23

Bad (not a plausible age): user_age:123456789

Features shouldn't take on "magic" values.

(Use an additional boolean feature like watch_time_is_defined instead!)

Bad (-1.0 is a magic value): watch_time: -1.0

Good (a real value plus a boolean that says whether it is defined):

watch_time: 1.023

watch_time_is_defined: 1.0
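
A minimal sketch of this split, assuming a missing watch time arrives as `None` and that 0.0 is an acceptable filler when the value is undefined. Both are assumptions for illustration, as is the helper name `split_watch_time`.

```python
def split_watch_time(raw_watch_time):
    """Return (watch_time, watch_time_is_defined) for an optional raw value."""
    if raw_watch_time is None:
        # Filler value; the boolean tells the model to disregard it.
        return 0.0, 0.0
    return float(raw_watch_time), 1.0


defined = split_watch_time(1.023)
undefined = split_watch_time(None)
```

The model then sees two features and never has to learn that one particular number means "missing".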

The definition of a feature shouldn't change over time.

(Beware of depending on other ML systems!)

Good: city_id:"br/sao_paulo"

Bad (produced by another system, so its meaning can shift): inferred_city_cluster_id:219

The distribution should not have extreme outliers.

Ideally, all features are transformed to a similar range, such as (-1, 1) or (0, 5).

Distribution with outliers and a distribution with a cap
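
Capping and rescaling can be sketched as follows. The cap of 5.0 and the sample values are arbitrary illustrations, and this min-max version maps the smallest and largest values exactly onto the endpoints -1 and 1, which is one of several reasonable choices.

```python
def clip(values, upper):
    """Cap every value at `upper` to tame outliers."""
    return [min(v, upper) for v in values]


def scale_to_range(values, low=-1.0, high=1.0):
    """Linearly rescale values so min maps to `low` and max to `high`."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant feature: place every value at the middle of the range.
        return [(low + high) / 2 for _ in values]
    return [low + (v - lo) * (high - low) / (hi - lo) for v in values]


raw = [1.0, 2.0, 3.0, 4.0, 1000.0]  # hypothetical data with one outlier
capped = clip(raw, upper=5.0)
scaled = scale_to_range(capped)
```

Without the cap, the single outlier would squeeze the other four values into a tiny slice of the target range.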

The Binning Trick

Graph showing a distribution with a fitting curve based on location

  • Create several boolean bins, each mapping to a new, unique feature.
  • This allows the model to fit a different value for each bin.
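
A minimal sketch of the binning trick, assuming the location value is a number such as a latitude. The bin boundaries below are hypothetical and would normally be chosen from the data.

```python
import bisect


def bin_one_hot(value, boundaries):
    """Boolean bins: exactly one entry is 1.0, marking the bin holding `value`.

    `boundaries` must be sorted; they split the number line into
    len(boundaries) + 1 bins, including the two open-ended outer bins.
    """
    idx = bisect.bisect_right(boundaries, value)
    bins = [0.0] * (len(boundaries) + 1)
    bins[idx] = 1.0
    return bins


boundaries = [33.0, 34.0, 35.0]  # hypothetical latitude boundaries
inside = bin_one_hot(34.5, boundaries)
below = bin_one_hot(30.0, boundaries)
```

Each bin becomes its own feature with its own weight, so the model is no longer forced to treat the effect of location as linear.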

Good Habits

KNOW YOUR DATA

  • Visualize: Plot histograms and rank values from most to least common.
  • Debug: Are there duplicate examples, missing values, or outliers? Does the data agree with dashboards? Are the training and validation data similar?
  • Monitor: Track feature quantiles and the number of examples over time.
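
Some of these checks can be sketched with the Python standard library alone. The helper names, the use of `None` for missing values, and the sample data are assumptions made for illustration.

```python
from collections import Counter
import statistics


def summarize_feature(values):
    """Summarize one feature column; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "num_examples": len(values),
        "num_missing": len(values) - len(present),
        "most_to_least_common": Counter(present).most_common(),
        "quartiles": statistics.quantiles(present, n=4),
    }


def count_duplicate_examples(rows):
    """Number of rows that repeat an earlier, identical row."""
    return len(rows) - len(set(rows))


# Hypothetical data for illustration.
watch_times = [1.0, 2.0, 2.0, 3.0, None, 4.0]
summary = summarize_feature(watch_times)

rows = [("galaxy_s6", 23), ("galaxy_s6", 23), ("other_model", 41)]
duplicates = count_duplicate_examples(rows)
```

Running the same summary on training and validation data, or on successive days of data, gives a simple basis for the comparison and monitoring steps.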
