Representation

A machine learning model can't directly see, hear, or sense input examples. Instead, you must create a representation of the data to provide the model with a useful vantage point into the data's key qualities. That is, in order to train a model, you must choose the set of features that best represent the data.

The idea is to map each part of the raw data on the left to one or more fields in the feature vector on the right.

Raw data is mapped to a feature vector through a process called feature engineering.
An example of a feature that can be copied directly from the raw data
An example of a string feature (street name) that cannot be copied directly from the raw data
Mapping a string value ("Main Street") to a sparse vector, via one-hot encoding.
  • A dictionary maps each street name to an int in {0, ..., V-1}
  • The one-hot vector above can now be represented compactly as its index <i>
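The street-name mapping can be sketched in a few lines of Python. This is a minimal illustration, not a production encoder: the vocabulary below is hypothetical (only "Main Street" comes from the text), and unknown street names are not handled.

```python
# Hypothetical vocabulary; in practice it would be built from the training data.
street_names = ["Charleston Road", "Main Street", "Shoreline Boulevard"]

# Dictionary mapping each street name to an int in {0, ..., V-1}.
vocab = {name: index for index, name in enumerate(street_names)}


def one_hot(street):
    """Return the dense one-hot vector for a street name in the vocabulary."""
    vector = [0.0] * len(vocab)
    vector[vocab[street]] = 1.0
    return vector


def sparse_index(street):
    """Return the compact form: the index of the single non-zero entry."""
    return vocab[street]


dense = one_hot("Main Street")
index = sparse_index("Main Street")
```

Storing only the index avoids materializing a length-V vector that is zero everywhere except one position.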

Feature values should appear with non-zero value more than a small handful of times in the dataset.

my_device_id:8SK982ZZ1242Z

device_model:galaxy_s6
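One rough way to check this property is to count how often each value of a feature occurs. The sketch below assumes examples are stored as dictionaries; apart from the two values shown above, the data, the `rare_value_fraction` helper, and the `min_count` threshold are all hypothetical.

```python
from collections import Counter

# Hypothetical examples: each is a dict of feature name -> value.
examples = [
    {"my_device_id": "8SK982ZZ1242Z", "device_model": "galaxy_s6"},
    {"my_device_id": "7QX113AA9001B", "device_model": "galaxy_s6"},
    {"my_device_id": "3LM550CC7342K", "device_model": "galaxy_s6"},
    {"my_device_id": "9TR204DD1180P", "device_model": "pixel_2"},
]


def rare_value_fraction(examples, feature, min_count):
    """Fraction of distinct values of `feature` seen fewer than `min_count` times."""
    counts = Counter(example[feature] for example in examples)
    rare = [value for value, count in counts.items() if count < min_count]
    return len(rare) / len(counts)


# Every device ID is unique, so all of its values are rare.
device_id_rare = rare_value_fraction(examples, "my_device_id", min_count=2)
# Only one of the two device models is rare.
device_model_rare = rare_value_fraction(examples, "device_model", min_count=2)
```

A feature whose values are almost all rare, like a unique ID, gives the model little to generalize from.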

Features should have a clear, obvious meaning.

user_age:23

user_age:123456789

Features shouldn't take on "magic" values.

(Use an additional boolean feature like watch_time_is_defined instead!)

watch_time: -1.0

watch_time: 1.023

watch_time_is_defined: 1.0
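A minimal sketch of this split follows. It assumes the raw data uses -1.0 as the sentinel for "not recorded", as in the example above. The `encode_watch_time` helper and the choice of 0.0 as the placeholder value are assumptions for illustration.

```python
MAGIC_UNDEFINED = -1.0  # sentinel assumed to mean "watch time was not recorded"


def encode_watch_time(raw_watch_time):
    """Split a raw watch time into a numeric value plus a boolean 'defined' flag."""
    if raw_watch_time == MAGIC_UNDEFINED:
        # Placeholder value; the flag tells the model to disregard it.
        return {"watch_time": 0.0, "watch_time_is_defined": 0.0}
    return {"watch_time": float(raw_watch_time), "watch_time_is_defined": 1.0}


defined = encode_watch_time(1.023)
undefined = encode_watch_time(-1.0)
```

With the flag in place, the numeric feature no longer mixes real measurements with a value that only means "missing".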

The definition of a feature shouldn't change over time.

(Beware of depending on other ML systems!)

city_id:"br/sao_paulo"

inferred_city_cluster_id:219

The distribution should not have extreme outliers.

Ideally, all features are transformed to a similar range, like (-1, 1) or (0, 5).

Distribution with outliers and a distribution with a cap
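Capping followed by linear scaling could look like the sketch below. The feature name, the sample values, and the cap of 4.0 are hypothetical; in practice the bounds would be chosen by inspecting the data.

```python
def clip(value, lower, upper):
    """Cap a value so it lies within [lower, upper]."""
    return max(lower, min(value, upper))


def scale_to_unit_range(value, lower, upper):
    """Linearly map the interval [lower, upper] onto [-1, 1]."""
    return 2.0 * (value - lower) / (upper - lower) - 1.0


# Hypothetical feature values with one extreme outlier.
rooms_per_person = [1.0, 2.0, 3.0, 55.0]

capped = [clip(value, 0.0, 4.0) for value in rooms_per_person]
scaled = [scale_to_unit_range(value, 0.0, 4.0) for value in capped]
```

Capping before scaling keeps the single outlier from compressing all the ordinary values into a narrow band.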
Graph showing a distribution with a fitting curve based on location
  • Create several boolean bins, each mapping to a new unique feature
  • This allows the model to fit a different value for each bin
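The binning idea can be sketched as follows, assuming the location feature is a latitude. The bin boundaries are hypothetical, and `bucketize` is an illustrative helper rather than a library function.

```python
from bisect import bisect_right


def bucketize(value, boundaries):
    """Return one boolean (0.0 or 1.0) feature per bin.

    `boundaries` must be sorted; N boundaries define N + 1 bins.
    A value equal to a boundary falls into the bin above it.
    """
    bins = [0.0] * (len(boundaries) + 1)
    bins[bisect_right(boundaries, value)] = 1.0
    return bins


# Hypothetical latitude boundaries: <33, [33, 34), [34, 35), >=35.
latitude_boundaries = [33.0, 34.0, 35.0]

binned = bucketize(34.5, latitude_boundaries)
```

Because each bin becomes its own feature, a linear model can learn a separate weight per latitude band instead of a single slope.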

KNOW YOUR DATA

  • Visualize: Plot histograms, rank most to least common.
  • Debug: Duplicate examples? Missing values? Outliers? Does the data agree with dashboards? Are training and validation data similar?
  • Monitor: Feature quantiles, number of examples over time?
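Two of the debugging checks, duplicate examples and missing values, can be sketched with the standard library alone. The example data is hypothetical, and the sketch assumes missing values are stored as None.

```python
from collections import Counter

# Hypothetical dataset; None marks a missing value.
examples = [
    {"user_age": 23, "watch_time": 1.023},
    {"user_age": 23, "watch_time": 1.023},  # exact duplicate of the first
    {"user_age": None, "watch_time": 0.5},  # missing user_age
    {"user_age": 41, "watch_time": 7.25},
]


def count_duplicates(examples):
    """Count examples that exactly repeat an earlier example."""
    counts = Counter(tuple(sorted(example.items())) for example in examples)
    return sum(count - 1 for count in counts.values())


def count_missing(examples):
    """Count missing (None) values per feature."""
    missing = Counter()
    for example in examples:
        for feature, value in example.items():
            missing[feature] += value is None
    return dict(missing)


duplicates = count_duplicates(examples)
missing = count_missing(examples)
```

Running checks like these on both the training and validation sets is one simple way to see whether the two look alike.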