Embeddings

An embedding is a relatively low-dimensional space into which you can translate high-dimensional vectors. Embeddings make it easier to do machine learning on large inputs like sparse vectors representing words. Ideally, an embedding captures some of the semantics of the input by placing semantically similar inputs close together in the embedding space. An embedding can be learned and reused across models.
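As a minimal concrete sketch (illustrative NumPy with made-up values, not a learned embedding), an embedding can be thought of as a lookup table that maps each item's integer ID to a dense low-dimensional vector, with similar items ending up close together:

```python
import numpy as np

# Hypothetical example: a vocabulary of 5 items embedded in 3 dimensions.
# In practice this matrix is learned; here the values are invented.
embedding_matrix = np.array([
    [0.9, 0.1, 0.0],   # item 0
    [0.8, 0.2, 0.1],   # item 1 (similar to item 0)
    [0.0, 0.9, 0.7],   # item 2
    [0.1, 0.8, 0.8],   # item 3 (similar to item 2)
    [0.5, 0.5, 0.5],   # item 4
])

def embed(item_id):
    """Embedding lookup: translate a high-dimensional one-hot ID
    into a low-dimensional dense vector."""
    return embedding_matrix[item_id]

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Semantically similar items are close together in the embedding space.
print(cosine_similarity(embed(0), embed(1)))  # high
print(cosine_similarity(embed(0), embed(2)))  # low
```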

Embeddings: A Motivating Example

  • Input: 1,000,000 movies that 500,000 users have chosen to watch
  • Task: Recommend movies to users

To solve this problem, we first need some way to determine which movies are similar to each other.

A list of movies arranged along a single line from left to right: 'Shrek', 'The Incredibles', 'The Triplets of Belleville', 'Harry Potter', 'Star Wars', 'Bleu', 'The Dark Knight Rises', and 'Memento'.

The same list of movies as in the previous slide, but arranged across two dimensions, so that, for example, 'Shrek' is to the left of and above 'The Incredibles'.

Similar to the previous diagram, but with axes and labels for each quadrant. The movies are arranged as follows: the first, upper-right quadrant is Adult Blockbusters, containing 'Star Wars' and 'The Dark Knight Rises', with 'Hero' and 'Crouching Tiger, Hidden Dragon' added. The second, lower-right quadrant is Adult Arthouse, containing 'Bleu' and 'Memento', with 'Waking Life' added. The third, lower-left quadrant is Children Arthouse, containing 'The Triplets of Belleville', with 'Wallace and Gromit' added. The fourth, upper-left quadrant is Children Blockbusters, containing 'Shrek', 'The Incredibles', and 'Harry Potter', with 'School of Rock' added.

The same arrangement as the previous diagram, with 'Shrek' and 'Bleu' highlighted as examples, showing their coordinates in the 2-D embedding plane.
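A sketch of that idea in code, with invented 2-D coordinates that roughly follow the quadrant layout above (x: children to adult, y: arthouse to blockbuster):

```python
import math

# Invented coordinates mirroring the quadrant diagram:
# x < 0: children, x > 0: adult; y > 0: blockbuster, y < 0: arthouse.
movies = {
    "Shrek": (-1.0, 0.95),
    "The Incredibles": (-0.85, 0.8),
    "Harry Potter": (-0.3, 0.85),
    "The Triplets of Belleville": (-0.9, -0.8),
    "Star Wars": (0.7, 0.9),
    "The Dark Knight Rises": (0.9, 0.75),
    "Bleu": (0.6, -0.9),
    "Memento": (0.95, -0.7),
}

def distance(a, b):
    (x1, y1), (x2, y2) = movies[a], movies[b]
    return math.hypot(x1 - x2, y1 - y2)

# Nearby points correspond to similar movies.
print(distance("Shrek", "The Incredibles"))  # small
print(distance("Shrek", "Memento"))          # large
```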

  • Assumes user interest in movies can be roughly explained by d aspects
  • Each movie becomes a d-dimensional point, where the value in each dimension represents how strongly the movie exhibits that aspect
  • Embeddings can be learned from data
  • No separate training process needed -- the embedding layer is just a hidden layer with one unit per dimension
  • Supervised information (e.g. users watched the same two movies) tailors the learned embeddings for the desired task
  • Intuitively, the hidden units discover how to organize the items in the d-dimensional space in a way that best optimizes the final objective
  • Each example (a row in the matrix below) is a sparse vector of features (movies) that have been watched by the user
  • A dense representation of this example would be: (0, 1, 0, 1, 0, 0, 0, 1)

A table where each column header is a movie and each row represents a user and the movies they have watched.

A dense representation like this is not efficient in terms of space and time.

  • Build a dictionary mapping each feature to an integer from 0, ..., # movies - 1
  • Efficiently represent the sparse vector as just the movies the user watched (see the sketch below): based on the column positions of the movies in the sparse vector displayed on the right, 'The Triplets of Belleville', 'Wallace and Gromit', and 'Memento' can be represented compactly as (0, 1, 999999)

A sparse vector represented as a table, with each column representing a movie and each row representing a user. The table contains the movies from the previous diagrams, with columns numbered from 0 to 999,999. A cell is checked if the user has watched that movie.
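A minimal sketch of this sparse encoding (the dictionary shows only a tiny illustrative subset of the million-movie catalog):

```python
# Dictionary mapping each movie title to an integer in 0 .. #movies - 1.
# Per the slides, 'The Triplets of Belleville' gets index 0,
# 'Wallace and Gromit' index 1, and 'Memento' index 999999.
movie_to_index = {
    "The Triplets of Belleville": 0,
    "Wallace and Gromit": 1,
    # ... 999,997 more titles ...
    "Memento": 999_999,
}

def encode_watch_history(titles):
    """Represent a user's watch history as a sorted list of movie
    indices instead of a million-entry 0/1 vector."""
    return sorted(movie_to_index[t] for t in titles)

print(encode_watch_history(
    ["Memento", "The Triplets of Belleville", "Wallace and Gromit"]))
# -> [0, 1, 999999]
```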

Regression problem to predict home sales prices:

A diagram of a deep neural network used to predict home sale prices. Successive slides highlight, in turn: the sparse vector encoding of the input, the hidden three-dimensional embedding layer, the additional latitude and longitude input features, the input features feeding into multiple hidden layers, and the output of the network.
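A minimal sketch of that architecture in Keras (all layer sizes, names, and the choice of sparse feature are illustrative assumptions, not from the slides):

```python
import tensorflow as tf

NUM_WORDS = 10_000  # hypothetical vocabulary size for the sparse feature

# Sparse categorical input, e.g. a word ID from the home's listing text.
word_id = tf.keras.Input(shape=(1,), dtype=tf.int32, name="word_id")
# The embedding layer: a learned lookup into a 3-dimensional space.
embedded = tf.keras.layers.Embedding(NUM_WORDS, 3)(word_id)
embedded = tf.keras.layers.Flatten()(embedded)

# Additional dense input features: latitude and longitude.
lat_lng = tf.keras.Input(shape=(2,), name="lat_lng")

# All input features feed into multiple hidden layers.
x = tf.keras.layers.Concatenate()([embedded, lat_lng])
x = tf.keras.layers.Dense(64, activation="relu")(x)
x = tf.keras.layers.Dense(32, activation="relu")(x)

# Single linear output unit: the predicted sale price.
price = tf.keras.layers.Dense(1, name="price")(x)

model = tf.keras.Model(inputs=[word_id, lat_lng], outputs=price)
model.compile(optimizer="adam", loss="mse")
```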

Multiclass Classification to predict a handwritten digit:

Diagram of a deep neural network used to predict handwritten digits. Successive slides highlight, in turn: the input sparse vector encoding, the other features, the three-dimensional embedding, the hidden layers, the logit layer, and the target class layer.
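The digit classifier follows the same pattern; only the output end changes, with one logit per class compared against the target class. A hedged sketch (the pixel encoding and layer sizes are assumptions):

```python
import tensorflow as tf

# Input: a flattened bitmap whose mostly-zero entries make it sparse.
pixels = tf.keras.Input(shape=(784,), name="sparse_pixels")

# Embedding as a plain linear hidden layer with one unit per dimension.
embedding = tf.keras.layers.Dense(3, use_bias=False, name="embedding")(pixels)

# Hidden layers (other features could be concatenated in before these).
x = tf.keras.layers.Dense(64, activation="relu")(embedding)

# Logit layer: one logit per digit, trained against the target class.
logits = tf.keras.layers.Dense(10, name="logits")(x)

model = tf.keras.Model(inputs=pixels, outputs=logits)
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
```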

Collaborative Filtering to predict movies to recommend:

Diagram of a deep neural network used to predict which movies to recommend. Successive slides highlight, in turn: the target class layer, the sparse vector encoding, the three-dimensional embedding, the other features, the hidden layers, and the logit layer.
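For collaborative filtering the same skeleton applies, but the logits range over the movie catalog itself, so the target class is a movie the user watched. One plausible setup, sketched under the same illustrative assumptions:

```python
import tensorflow as tf

NUM_MOVIES = 1_000_000  # from the example: one million movies

# Multi-hot vector of the movies a user has watched.
watched = tf.keras.Input(shape=(NUM_MOVIES,), name="watched")

# Embedding as a linear hidden layer with one unit per dimension.
# (In practice frameworks compute this as a lookup over the sparse
# indices rather than a full million-wide matrix multiply.)
embedding = tf.keras.layers.Dense(3, use_bias=False, name="embedding")(watched)

x = tf.keras.layers.Dense(64, activation="relu")(embedding)

# One logit per movie; the target class is a movie the user watched.
logits = tf.keras.layers.Dense(NUM_MOVIES, name="logits")(x)

model = tf.keras.Model(inputs=watched, outputs=logits)
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
```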

Deep Network

  • Each of the hidden units corresponds to a dimension (latent feature)
  • Edge weights between a movie and the hidden layer are its coordinate values

A tree diagram of a deep neural network with nodes in the lowest layer connected to three points in the next higher layer.

Geometric view of a single movie embedding

A point in 3-dimensional space corresponding to the lower-layer node in the deep neural network diagram.
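Concretely, this means a movie's learned embedding can be read straight off the first-layer weight matrix. A sketch reusing the (hypothetical) collaborative-filtering model and movie_to_index dictionary from the earlier sketches:

```python
# The edge weights from input unit j to the d hidden units are exactly
# the coordinates of movie j in the d-dimensional embedding space.
weights = model.get_layer("embedding").get_weights()[0]  # shape: (NUM_MOVIES, 3)
point = weights[movie_to_index["The Triplets of Belleville"]]  # a 3-d point
```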
  • Higher-dimensional embeddings can more accurately represent the relationships between input values
  • But more dimensions increase the chance of overfitting and lead to slower training
  • Empirical rule of thumb (a good starting point, but it should be tuned on validation data; see the worked example below):
  • $$ \text{dimensions} \approx \sqrt[4]{\text{possible values}} $$
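Applied to the million-movie catalog from the earlier example, the rule suggests a starting point of roughly 30 dimensions:

$$ \sqrt[4]{1{,}000{,}000} = 10^{6/4} \approx 31.6 $$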
  • Embeddings map items (e.g. movies, text, ...) to low-dimensional real vectors in such a way that similar items are close to each other
  • Embeddings can also be applied to dense data (e.g. audio) to create a meaningful similarity metric
  • Jointly embedding diverse data types (e.g. text, images, audio, ...) defines a similarity measure between them