A machine learning model can't directly see, hear, or sense input examples. Instead, you must create a representation of the data to provide the model with a useful vantage point into the data's key qualities. That is, in order to train a model, you must choose the set of features that best represent the data.


The idea is to map each part of the raw data on the left into one or more fields in the feature vector on the right.

Raw data is mapped to a feature vector through a process called feature engineering.
Some features, such as numeric values, can be copied directly from the raw data into the feature vector. A string feature like street name cannot be copied directly; instead it is mapped to a numeric representation such as a one-hot encoding:
  • A dictionary maps each street name to an int in {0, ..., V-1}
  • The resulting one-hot vector can then be represented sparsely as just the single index i
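As a minimal sketch, the dictionary lookup plus one-hot representation can be written in plain Python (the street names and helper names below are hypothetical examples, not part of any particular library):

```python
# Minimal sketch of one-hot encoding a string feature (street name).
# Vocabulary values here are hypothetical examples.

def build_vocab(values):
    """Map each distinct value to an int in {0, ..., V-1}."""
    return {v: i for i, v in enumerate(sorted(set(values)))}

def one_hot(value, vocab):
    """Return a V-length vector with a single 1 at the value's index."""
    vec = [0] * len(vocab)
    vec[vocab[value]] = 1
    return vec

streets = ["Main St", "Oak Ave", "Main St", "Shorebird Way"]
vocab = build_vocab(streets)            # {"Main St": 0, "Oak Ave": 1, "Shorebird Way": 2}
print(one_hot("Shorebird Way", vocab))  # [0, 0, 1]
```

Since only one element is 1, storing just the index i is the sparse representation mentioned above.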

Feature values should appear with a non-zero value more than a small handful of times in the dataset; a value that occurs only once or twice gives the model nothing reliable to learn from.
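One way to check this guideline is to count how often each value occurs and flag the rare ones. The threshold and sample values below are hypothetical:

```python
from collections import Counter

def rare_values(values, min_count=5):
    """Return feature values that appear fewer than min_count times."""
    counts = Counter(values)
    return {v for v, c in counts.items() if c < min_count}

samples = ["house"] * 100 + ["apartment"] * 50 + ["houseboat"]
print(rare_values(samples, min_count=5))  # {'houseboat'}
```

Rare values like these are often dropped or grouped into an "other" bucket before training.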



Features should have a clear, obvious meaning (for example, an age expressed in years, not in raw seconds since the epoch).



Features shouldn't take on "magic" values.

(Use an additional boolean feature like watch_time_is_defined instead!)

Bad (the -1.0 is a magic value standing in for "no watch time recorded"):

watch_time: -1.0

Good (the missing case gets its own boolean feature):

watch_time: 1.023

watch_time_is_defined: 1.0
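The contrast above can be sketched as a small encoding helper. The -1.0 sentinel and the feature names are the hypothetical example from above:

```python
def encode_watch_time(raw):
    """Replace a magic value with a (value, is_defined) feature pair.

    Assumes the hypothetical raw data uses -1.0 to mean
    'no watch time recorded'.
    """
    is_defined = raw != -1.0
    return {
        "watch_time": raw if is_defined else 0.0,
        "watch_time_is_defined": 1.0 if is_defined else 0.0,
    }

print(encode_watch_time(1.023))  # {'watch_time': 1.023, 'watch_time_is_defined': 1.0}
print(encode_watch_time(-1.0))   # {'watch_time': 0.0, 'watch_time_is_defined': 0.0}
```

With the indicator feature present, the model no longer has to learn that -1.0 is "special" within an otherwise continuous range.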

The definition of a feature shouldn't change over time.

(Beware of depending on other ML systems!)



Feature distributions should not have extreme outliers.

Ideally, all features are transformed to a similar range, such as (-1, 1) or (0, 5).

[Figures: a distribution with extreme outliers vs. the same distribution with values capped; a distribution with a fitted curve based on location]
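Capping outliers and linearly scaling into a fixed range can be sketched together. The cap of 4.0 and the rooms-per-person values are hypothetical:

```python
def clip(x, lo, hi):
    """Cap extreme outliers to a fixed range."""
    return max(lo, min(hi, x))

def scale(x, x_min, x_max):
    """Linearly scale x from [x_min, x_max] into [0, 1]."""
    return (x - x_min) / (x_max - x_min)

rooms_per_person = [0.5, 1.0, 2.0, 55.0]        # 55.0 is an extreme outlier
capped = [clip(x, 0.0, 4.0) for x in rooms_per_person]
scaled = [scale(x, 0.0, 4.0) for x in capped]
print(capped)  # [0.5, 1.0, 2.0, 4.0]
print(scaled)  # [0.125, 0.25, 0.5, 1.0]
```

Clipping first keeps the single outlier from squashing every other value toward zero after scaling.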
  • Create several boolean bins, each mapping to a new unique feature
  • Allows model to fit a different value for each bin
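The binning idea above can be sketched as follows, assuming evenly spaced bins over a hypothetical latitude feature:

```python
def bin_boundaries(lo, hi, n_bins):
    """Evenly spaced interior bin edges between lo and hi."""
    step = (hi - lo) / n_bins
    return [lo + i * step for i in range(1, n_bins)]

def to_boolean_bins(x, boundaries):
    """One boolean feature per bin; exactly one is 1 for any x."""
    vec = [0] * (len(boundaries) + 1)
    idx = sum(1 for b in boundaries if x >= b)
    vec[idx] = 1
    return vec

# Hypothetical latitude feature binned into 4 ranges between 32 and 42 degrees.
edges = bin_boundaries(32.0, 42.0, 4)   # [34.5, 37.0, 39.5]
print(to_boolean_bins(38.1, edges))     # [0, 0, 1, 0]
```

Because each bin becomes its own feature, a linear model can learn an independent weight per bin instead of a single slope across all latitudes.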


  • Visualize: Plot histograms, rank most to least common.
  • Debug: Duplicate examples? Missing values? Outliers? Data agrees with dashboards? Training and Validation data similar?
  • Monitor: Feature quantiles and the number of examples over time.
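The debug checks above can be prototyped in a few lines of plain Python; the rows and field names are hypothetical:

```python
from collections import Counter

def basic_data_checks(rows):
    """Lightweight debugging checks on a hypothetical list of example dicts:
    duplicate examples, missing values, and a per-feature value histogram."""
    keys = [tuple(sorted(r.items())) for r in rows]
    return {
        "duplicates": len(keys) - len(set(keys)),
        "missing": sum(1 for r in rows for v in r.values() if v is None),
        "histogram": {f: Counter(r.get(f) for r in rows) for f in rows[0]},
    }

rows = [
    {"city": "SF", "rooms": 2},
    {"city": "SF", "rooms": 2},   # exact duplicate
    {"city": "LA", "rooms": None},  # missing value
]
report = basic_data_checks(rows)
print(report["duplicates"], report["missing"])  # 1 1
```

Running checks like these on both training and validation splits also makes it easy to compare their distributions.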