# Representation

A machine learning model can't directly see, hear, or sense input examples. Instead, you must create a representation of the data to provide the model with a useful vantage point into the data's key qualities. That is, in order to train a model, you must choose the set of features that best represent the data.

## From Raw Data to Features

The idea is to map each part of the raw data record on the left to one or more fields of the feature vector on the right.

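• A dictionary maps each street name to an integer in {0, ..., V-1}, where V is the number of distinct street names.
• The one-hot vector above can then be represented compactly by its single non-zero index, `<i>`.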
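
The dictionary-plus-index idea can be sketched in a few lines of Python. This is a minimal illustration, not a production encoder: the street names and the helper names `build_vocabulary` and `one_hot` are made up for the example.

```python
def build_vocabulary(street_names):
    """Map each distinct street name to an integer index in {0, ..., V-1}."""
    vocab = {}
    for name in street_names:
        if name not in vocab:
            vocab[name] = len(vocab)  # next unused index
    return vocab


def one_hot(index, vocab_size):
    """Expand an index i into a length-V vector with a single 1 at position i."""
    vector = [0] * vocab_size
    vector[index] = 1
    return vector


# Hypothetical street names, for illustration only.
streets = ["Main Street", "Shorebird Way", "Rengstorff Avenue", "Main Street"]
vocab = build_vocabulary(streets)

index = vocab["Shorebird Way"]      # compact representation: just the index
dense = one_hot(index, len(vocab))  # equivalent one-hot vector
```

Storing only the index is what keeps the representation practical when V is large, since the full one-hot vector is almost entirely zeros.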

## Properties of a Good Feature

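Each feature value should appear with a non-zero value more than a small handful of times in the dataset; a value seen only once or twice gives the model nothing general to learn from.

Bad: my_device_id:8SK982ZZ1242Z (unique to a single device)

Good: device_model:galaxy_s6 (shared by many examples)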
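
One way to check this property is to count how often each value of a feature occurs and flag the values that fall below a threshold. The sketch below assumes the feature values fit in memory; the data, the threshold of 2, and the helper name `rare_values` are hypothetical.

```python
from collections import Counter


def rare_values(values, min_count):
    """Return the set of feature values that occur fewer than min_count times."""
    counts = Counter(values)
    return {value for value, count in counts.items() if count < min_count}


# Hypothetical columns, for illustration only.
device_models = ["galaxy_s6", "galaxy_s6", "pixel_2", "galaxy_s6", "pixel_2"]
device_ids = ["8SK982ZZ1242Z", "A1", "B2", "C3", "D4"]

rare_models = rare_values(device_models, min_count=2)  # no rare values
rare_ids = rare_values(device_ids, min_count=2)        # every ID is rare
```

A feature whose values are almost all flagged as rare, like the device ID here, is a poor candidate for a feature.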

## Properties of a Good Feature

Features should have a clear, obvious meaning.

Good: user_age:23

Bad: user_age:123456789 (not recognizable as an age; the unit or encoding is unclear)

## Properties of a Good Feature

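Features shouldn't take on "magic" values. Use a separate boolean feature to indicate whether the value is defined.

Bad: watch_time: -1.0 (here -1.0 is a sentinel for "not defined", not a real watch time)

Good: watch_time: 1.023

Good: watch_time_is_defined: 1.0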
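
A minimal sketch of splitting a possibly missing value into two features follows. It assumes a missing watch time arrives as `None`, and it fills the numeric feature with 0.0 in that case; both choices, and the function name `encode_watch_time`, are assumptions made for the example.

```python
def encode_watch_time(raw_watch_time):
    """Encode a possibly missing watch time as a value plus an indicator.

    raw_watch_time is None when no watch time was recorded (an assumption
    of this sketch; the raw data may mark missing values differently).
    """
    if raw_watch_time is None:
        # 0.0 is only a filler; the indicator tells the model to disregard it.
        return {"watch_time": 0.0, "watch_time_is_defined": 0.0}
    return {"watch_time": float(raw_watch_time), "watch_time_is_defined": 1.0}


defined = encode_watch_time(1.023)
missing = encode_watch_time(None)
```

With the indicator as its own feature, the model no longer has to treat a sentinel such as -1.0 as if it were a real measurement.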

## Properties of a Good Feature

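The definition of a feature shouldn't change over time.

(Beware of depending on the output of other ML systems, which can change when those systems are updated!)

Good: city_id:"br/sao_paulo"

Bad: inferred_city_cluster_id:219 (produced by another model, so its meaning may drift)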

## Properties of a Good Feature

A feature's distribution should not have extreme outliers.

Ideally, all features are transformed to a similar range, such as (-1, 1) or (0, 5).
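
As a rough sketch of both ideas, the snippet below first clips outliers to a chosen cap and then rescales the result linearly to the range -1 to 1. The feature name `rooms_per_person`, the sample values, and the cap of 5.0 are hypothetical, and min-max scaling is only one of several reasonable transformations.

```python
def clip(values, lower, upper):
    """Limit every value to the interval [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


def scale_to_range(values, new_min=-1.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min, the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; place them at the middle of the target range.
        return [(new_min + new_max) / 2 for _ in values]
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]


# Hypothetical feature with one extreme outlier.
rooms_per_person = [1.0, 2.0, 3.0, 55.0]

clipped = clip(rooms_per_person, 0.0, 5.0)
scaled = scale_to_range(clipped)
```

Clipping before scaling matters here: scaling the raw values directly would squeeze the three ordinary examples into a narrow band near -1 because of the single outlier.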

## The Binning Trick

• Create several boolean bins, each mapping to a new unique feature.
• This allows the model to fit a different value for each bin.
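
The binning trick can be sketched as follows, assuming a numeric feature and a sorted list of bin boundaries. The latitude values and boundaries are hypothetical, and `to_bins` is a name chosen for this example.

```python
from bisect import bisect_right


def to_bins(value, boundaries):
    """Turn a numeric value into a one-hot list of boolean bin features.

    boundaries must be sorted in ascending order. They define
    len(boundaries) + 1 bins; a value equal to a boundary falls into
    the bin to the right of that boundary.
    """
    bins = [0] * (len(boundaries) + 1)
    bins[bisect_right(boundaries, value)] = 1
    return bins


# Hypothetical boundaries for a latitude feature.
latitude_boundaries = [33.0, 34.0, 35.0]

low = to_bins(32.0, latitude_boundaries)   # below the first boundary
mid = to_bins(33.5, latitude_boundaries)   # between 33.0 and 34.0
high = to_bins(35.0, latitude_boundaries)  # at or above the last boundary
```

Because each bin is its own feature, a linear model can assign it an independent weight instead of forcing a single slope across the whole numeric range.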