Representation

A machine learning model can't directly see, hear, or sense input examples. Instead, you must create a representation of the data to provide the model with a useful vantage point into the data's key qualities. That is, in order to train a model, you must choose the set of features that best represent the data.

From Raw Data to Features

The idea is to map each part of the raw data on the left into one or more fields in the feature vector on the right.

Raw data is mapped to a feature vector through a process called feature engineering.

Figure: a raw-data record on the left, e.g.

0: { house_info: { num_rooms: 6, num_bedrooms: 3, street_name: "Main Street", num_basement_rooms: -1, ... } }

is mapped to the feature vector on the right, e.g. [6.0, 1.0, 0.0, 0.0, 0.0, 9.321, -2.20, 1.01, 0.0, ...]. Raw data doesn't come to us as feature vectors; the process of creating features from raw data is feature engineering.

From Raw Data to Features

An example of a feature that can be copied directly from the raw data: real-valued features carry over unchanged, e.g. num_rooms: 6 becomes num_rooms_feature = [6.0].
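A minimal sketch of the direct copy above; the raw_data dict mirrors the house_info record from the earlier slide:

```python
# A real-valued raw field can be copied into the feature vector unchanged.
# raw_data mirrors the house_info example from the slides.
raw_data = {"num_rooms": 6, "num_bedrooms": 3,
            "street_name": "Main Street", "num_basement_rooms": -1}

num_rooms_feature = [float(raw_data["num_rooms"])]  # [6.0]
```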

From Raw Data to Features

Mapping a string value ("Main Street") to a sparse vector via one-hot encoding: string features can be handled with a one-hot vector of length V, the number of unique vocabulary items (streets), e.g.

street_name_feature = [0, 0, ..., 0, 1, 0, ..., 0]

with a 1 in the slot for "Main Street" and 0 for all others.
  • A dictionary maps each street name to an int in {0, ..., V-1}
  • The one-hot vector above can then be represented simply by that integer index
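The mapping above can be sketched as follows; the vocabulary here is a made-up three-street list, while in practice it is built from the dataset:

```python
# One-hot encode a string feature. vocab is an assumed example vocabulary;
# real vocabularies are collected from the training data.
vocab = ["Elm Street", "Main Street", "Oak Avenue"]
street_to_index = {name: i for i, name in enumerate(vocab)}  # str -> int in {0..V-1}

def one_hot(street_name):
    """Return a length-V vector with a single 1 at the street's index."""
    vec = [0] * len(vocab)
    vec[street_to_index[street_name]] = 1
    return vec

street_name_feature = one_hot("Main Street")  # [0, 1, 0]
```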

Properties of a Good Feature

Feature values should appear with a non-zero value more than a small handful of times in the dataset.

Avoid: my_device_id:8SK982ZZ1242Z (appears only once)

Prefer: device_model:galaxy_s6

Properties of a Good Feature

Features should have a clear, obvious meaning.

Prefer: user_age:23

Avoid: user_age:123456789

Properties of a Good Feature

Features shouldn't take on "magic" values.

(Use an additional boolean feature like watch_time_is_defined instead!)

Avoid: watch_time: -1.0

Prefer: watch_time: 1.023 together with watch_time_is_defined: 1.0
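A sketch of the replacement above, assuming -1.0 is the sentinel that marks a missing watch time:

```python
# Replace the "magic" value -1.0 with an explicit indicator feature,
# so the model never confuses "missing" with a real watch time.
def encode_watch_time(raw_watch_time):
    if raw_watch_time < 0:  # -1.0 means "not defined" in the raw data
        return {"watch_time": 0.0, "watch_time_is_defined": 0.0}
    return {"watch_time": raw_watch_time, "watch_time_is_defined": 1.0}

encode_watch_time(-1.0)   # {'watch_time': 0.0, 'watch_time_is_defined': 0.0}
encode_watch_time(1.023)  # {'watch_time': 1.023, 'watch_time_is_defined': 1.0}
```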

Properties of a Good Feature

The definition of a feature shouldn't change over time.

(Beware of depending on other ML systems!)

Prefer: city_id:"br/sao_paulo"

Avoid: inferred_city_cluster_id:219 (the output of another ML system, which may change)

Properties of a Good Feature

The distribution of feature values should not contain extreme outliers.

Ideally, all features are transformed to a similar range, like (-1, 1) or (0, 5).

Figure: a roomsPerPerson distribution with an extreme outlier (50 rooms per person!?) next to the same feature capped to a maximum of 4.0.
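The capping in the figure can be sketched in a couple of lines; the cap of 4.0 and the sample values are taken from, or invented to match, the roomsPerPerson example:

```python
# Cap an outlier-prone feature at a fixed maximum (4.0, as in the slide).
def cap(value, maximum=4.0):
    return min(value, maximum)

rooms_per_person = [0.8, 1.5, 2.0, 50.0]     # 50 rooms per person!?
capped = [cap(v) for v in rooms_per_person]  # [0.8, 1.5, 2.0, 4.0]
```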

The Binning Trick

Figure: a distribution of examples plotted against latitude, with a fitted curve.

The Binning Trick

Figure: the same latitude distribution split into bins, e.g. LatitudeBin1 = 32 < latitude <= 33 through LatitudeBin6 = 37 < latitude <= 38.
  • Create several boolean bins, each mapping to a new unique feature
  • Allows the model to fit a different value for each bin
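The binning trick above can be sketched as follows; the 1-degree bin edges from 32 to 38 follow the latitude figure:

```python
# The binning trick: turn a continuous latitude into boolean bin features.
# Bin edges (32..38) match the LatitudeBin1..LatitudeBin6 example.
BIN_EDGES = list(range(32, 39))  # bins: (32,33], (33,34], ..., (37,38]

def latitude_bins(latitude):
    """Return one boolean feature per bin; at most one bin is hot."""
    return [1 if lo < latitude <= hi else 0
            for lo, hi in zip(BIN_EDGES[:-1], BIN_EDGES[1:])]

latitude_bins(37.7)  # LatitudeBin6 is hot: [0, 0, 0, 0, 0, 1]
```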

Good Habits

KNOW YOUR DATA

  • Visualize: Plot histograms, rank most to least common.
  • Debug: Duplicate examples? Missing values? Outliers? Data agrees with dashboards? Training and Validation data similar?
  • Monitor: Feature quantiles, number of examples over time?
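A minimal sketch of the "visualize and debug" habit using only the standard library; the sample street_names list is invented for illustration:

```python
# Simple know-your-data checks: rank feature values by frequency
# and count missing values.
from collections import Counter

street_names = ["Main Street", "Main Street", "Oak Avenue", None, "Main Street"]

counts = Counter(street_names)
print(counts.most_common())  # values ranked most to least common
missing = sum(1 for v in street_names if v is None)
print(f"missing values: {missing}")
```

In practice the same checks are usually done with histograms and quantile plots over the full training and validation sets.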

Machine Learning Crash Course