This example shows how to generate the embeddings used in a supervised similarity measure.
Imagine you have the same housing data set that you used when creating a manual similarity measure:
Feature | Type |
---|---|
Price | Positive integer |
Size | Positive floating-point value in units of square meters |
Postal code | Integer |
Number of bedrooms | Integer |
Type of house | A text value from “single_family,” “multi-family,” “apartment,” or “condo” |
Garage | 0/1 for no/yes |
Colors | Multivalent categorical: one or more values from standard colors “white,” “yellow,” “green,” etc. |
Preprocessing Data
Before you use feature data as input, you need to preprocess the data. The preprocessing steps are based on the steps you took when creating a manual similarity measure. Here's a summary:
Feature | Type or Distribution | Action |
---|---|---|
Price | Poisson distribution | Quantize and scale to [0,1]. |
Size | Poisson distribution | Quantize and scale to [0,1]. |
Postal code | Categorical | Convert to longitude and latitude, quantize and scale to [0,1]. |
Number of bedrooms | Integer | Clip outliers and scale to [0,1]. |
Type of house | Categorical | Convert to one-hot encoding. |
Garage | 0 or 1 | Leave as is. |
Colors | Categorical | Convert to RGB values and process as numeric data. |
For more information on one-hot encoding, see Embeddings: Categorical Input Data.
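As a minimal sketch of two of the preprocessing steps above, the snippet below quantizes a price into bins and scales the bin index to [0,1], then one-hot encodes the house type. The price range, bin count, and column names are assumptions for illustration, not values from the dataset.

```python
import numpy as np

# Hypothetical raw values for one house (illustrative assumptions).
price = 450_000          # positive integer
house_type = "condo"     # categorical

# Quantize price into 10 equal-width bins over an assumed observed range,
# then scale the bin index to [0, 1].
bin_edges = np.linspace(100_000, 1_000_000, num=11)   # 10 bins
price_bin = np.digitize(price, bin_edges[1:-1])        # bin index in 0..9
price_scaled = price_bin / 9.0                         # scaled to [0, 1]

# One-hot encode the house type.
house_types = ["single_family", "multi-family", "apartment", "condo"]
type_one_hot = np.eye(len(house_types))[house_types.index(house_type)]
```

The one-hot vector here is `[0, 0, 0, 1]` because “condo” is the last category in the assumed ordering; in practice the category order just has to be consistent across all examples.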
Choose Predictor or Autoencoder
To generate embeddings, you can choose either an autoencoder or a predictor. Remember, your default choice is an autoencoder. You choose a predictor instead if specific features in your dataset determine similarity. For completeness, let's look at both cases.
Train a Predictor
Choose the features that are important in determining similarity between your examples as the training labels for your DNN. Let's assume price is most important in determining similarity between houses.
Choose price as the training label, and remove it from the input feature data to the DNN. Train the DNN by using all other features as input data. For training, the loss function is simply the MSE between predicted and actual price. To learn how to train a DNN, see Training Neural Networks.
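The label split and the MSE loss can be sketched as follows. The feature matrix and prediction values are placeholders, and the assumption that price sits in column 0 is for illustration only.

```python
import numpy as np

# Assumed preprocessed data: one row per house, scaled price in column 0.
data = np.array([
    [0.35, 0.40, 0.21, 0.8, 1.0],
    [0.90, 0.85, 0.67, 0.4, 0.0],
    [0.15, 0.20, 0.33, 0.6, 1.0],
])

labels = data[:, 0]      # price becomes the training label...
features = data[:, 1:]   # ...and is removed from the DNN's input features

# MSE between predicted and actual price (predictions are placeholders
# standing in for the DNN's output).
predictions = np.array([0.30, 0.95, 0.10])
mse = np.mean((predictions - labels) ** 2)
```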
Train an Autoencoder
Train an autoencoder on your dataset by following these steps:
- Ensure the hidden layers of the autoencoder are smaller than the input and output layers.
- Calculate the loss for each output as described in Supervised Similarity Measure.
- Create the loss function by summing the losses for each output. Ensure that the loss is weighted equally for every feature. For example, because color data is processed into three RGB outputs, weight each of those outputs by 1/3.
- Train the DNN.
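The loss-weighting step above can be sketched as a weighted sum over per-output losses. The loss values and the output ordering are assumptions for illustration.

```python
import numpy as np

# Hypothetical per-output squared errors from one autoencoder forward pass.
# Assumed output order: price, size, bedrooms, then the R, G, B channels.
per_output_loss = np.array([0.04, 0.01, 0.09, 0.03, 0.06, 0.12])

# Weight each output so every *feature* contributes equally: the three
# RGB outputs together represent a single feature (color), so each of
# them gets a weight of 1/3 while single-output features get 1.
weights = np.array([1.0, 1.0, 1.0, 1/3, 1/3, 1/3])

total_loss = np.sum(weights * per_output_loss)
```

With these placeholder values, the three single-output features contribute 0.14 and the color feature contributes 0.07, so no single feature dominates just because it spans multiple outputs.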
Extract Embeddings from the DNN
After training your DNN, whether predictor or autoencoder, extract the embedding for an example from the DNN. Extract the embedding by using the feature data of the example as input, and read the outputs of the final hidden layer. These outputs form the embedding vector. Remember, the vectors for similar houses should be closer together than vectors for dissimilar houses.
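A minimal sketch of the extraction step: run a forward pass and read the activations of the final hidden layer. The layer sizes are assumptions, and random weights stand in for trained ones.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

# Hypothetical trained autoencoder: 8 input features -> 4-unit hidden
# layer (the bottleneck) -> 8 outputs. Random values stand in for
# learned weights.
w1, b1 = rng.normal(size=(8, 4)), np.zeros(4)
w2, b2 = rng.normal(size=(4, 8)), np.zeros(8)

def embed(features):
    # The embedding is the activation of the final hidden layer.
    return relu(features @ w1 + b1)

example = rng.random(8)               # one preprocessed house
embedding = embed(example)            # 4-dimensional embedding vector
reconstruction = embedding @ w2 + b2  # output layer; used only in training
```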
Next, you'll see how to quantify the similarity for pairs of examples by using their embedding vectors.