Embeddings

An embedding is a relatively low-dimensional space into which you can translate high-dimensional vectors. Embeddings make it easier to do machine learning on large inputs like sparse vectors representing words. Ideally, an embedding captures some of the semantics of the input by placing semantically similar inputs close together in the embedding space. An embedding can be learned and reused across models.

Embeddings

Motivation From Collaborative Filtering

Input: 1,000,000 movies that 500,000 users have chosen to watch
Task: Recommend movies to users

To solve this problem some method is needed to determine which movies are similar to each other.

Organizing Movies by Similarity (1d)

A list of movies ordered in a single line from left to right. Starting with the left, 'Shrek', 'The Incredibles', 'The Triplets of Belleville', 'Harry Potter', 'Star Wars', 'Bleu', 'The Dark Knight Rises', and 'Memento'

Organizing Movies by Similarity (2d)

The same list of movies in the previous slide but arranged across two dimensions, so for example 'Shrek' is to the left and above of 'The Incredibles

Two-Dimensional Embedding

The same arrangement as the last slide. 'Shrek' and 'Bleu' are highlighted as examples of their coordinates in the 2d embedding plane.

d-Dimensional Embeddings

Assumes user interest in movies can be roughly explained by d aspects
Each movie becomes a d-dimensional point where the value in dimension d represents how much the movie fits that aspect
Embeddings can be learned from data

Learning Embeddings in a Deep Network

No separate training process needed -- the embedding layer is just a hidden layer with one unit per dimension
Supervised information (e.g. users watched the same two movies) tailors the learned embeddings for the desired task
Intuitively the hidden units discover how to organize the items in the d-dimensional space in a way to best optimize the final objective

Input Representation

Each example (a row in this matrix) is a sparse vector of features (movies) that have been watched by the user
Dense representation of this example as: (0, 1, 0, 1, 0, 0, 0, 1)

Is not efficient in terms of space and time.

A table where each column header is a movie and each row represents a user and the movies they have watched.

Input Representation

Build a dictionary mapping each feature to an integer from 0, ..., # movies - 1
Efficiently represent the sparse vector as just the movies the user watched. This might be represented as:

A sparse vector represented as a table with each column representing a movie and each row representing a user. The table contains the movies from the previous diagrams and is numbered from 1 to 999999. Each cell of the table is checked if a user has watched a movie.

An Embedding Layer in a Deep Network

Regression problem to predict home sales prices:

A diagram of a deep neural network used to predict home sale prices

An Embedding Layer in a Deep Network

Regression problem to predict home sales prices:

A diagram of a deep neural network used to predict home sale prices
(sparse vector encoding highlighted)

An Embedding Layer in a Deep Network

Regression problem to predict home sales prices:

A diagram of a deep neural network used to predict home sale prices
(hidden three-dimensional embedding layer highlighted)

An Embedding Layer in a Deep Network

Regression problem to predict home sales prices:

A diagram of a deep neural network used to predict home sale prices
(additional latitude and longitude input features highlighted)

An Embedding Layer in a Deep Network

Regression problem to predict home sales prices:

A diagram of a deep neural network used to predict home sale prices
(input features feeding into multiple hidden layers highlighted)