This example shows how to generate the embeddings used in a supervised similarity measure.
Imagine you have the same housing data set that you used when creating a manual similarity measure:
Feature | Type |
---|---|
Price | Positive integer |
Size | Positive floating-point value in units of square meters |
Postal code | Integer |
Number of bedrooms | Integer |
Type of house | A text value from “single_family,” “multi-family,” “apartment,” or “condo” |
Garage | 0/1 for no/yes |
Colors | Multivalent categorical: one or more values from standard colors “white,” “yellow,” “green,” etc. |
Preprocessing Data
Before you use feature data as input, you need to preprocess the data. The preprocessing steps are based on the steps you took when creating a manual similarity measure. Here's a summary:
Feature | Type or Distribution | Action |
---|---|---|
Price | Poisson distribution | Quantize and scale to [0,1]. |
Size | Poisson distribution | Quantize and scale to [0,1]. |
Postal code | Categorical | Convert to longitude and latitude, quantize and scale to [0,1]. |
Number of bedrooms | Integer | Clip outliers and scale to [0,1]. |
Type of house | Categorical | Convert to one-hot encoding. |
Garage | 0 or 1 | Leave as is. |
Colors | Categorical | Convert to RGB values and process as numeric data. |
For more information on one-hot encoding, see Embeddings: Categorical Input Data.
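As a minimal sketch of two of the preprocessing steps above, the snippet below quantizes a price into bins and scales the bin index to [0,1], then one-hot encodes the house type. The price range, bin count, and column names are assumptions for illustration, not values from the dataset.

```python
import numpy as np

# Hypothetical raw values for one house (illustrative assumptions).
price = 450_000          # positive integer
house_type = "condo"     # categorical

# Quantize price into 10 equal-width bins over an assumed observed range,
# then scale the bin index to [0, 1].
bin_edges = np.linspace(100_000, 1_000_000, num=11)   # 10 bins
price_bin = np.digitize(price, bin_edges[1:-1])        # bin index in 0..9
price_scaled = price_bin / 9.0                         # scaled to [0, 1]

# One-hot encode the house type.
house_types = ["single_family", "multi-family", "apartment", "condo"]
type_one_hot = np.eye(len(house_types))[house_types.index(house_type)]
```

The one-hot vector here is `[0, 0, 0, 1]` because “condo” is the last category in the assumed ordering; in practice the category order just has to be consistent across all examples.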
Choose Predictor or Autoencoder
To generate embeddings, you can choose either an autoencoder or a predictor. Remember, your default choice is an autoencoder. You choose a predictor instead if specific features in your dataset determine similarity. For completeness, let's look at both cases.
Train a Predictor
Choose the features that are important in determining similarity between your examples as the training labels for your DNN. Let's assume price is most important in determining similarity between houses.
Choose price as the training label, and remove it from the input feature data to the DNN. Train the DNN by using all other features as input data. For training, the loss function is simply the MSE between predicted and actual price. To learn how to train a DNN, see Training Neural Networks.
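The label split and the MSE loss can be sketched as follows. The feature matrix and prediction values are placeholders, and the assumption that price sits in column 0 is for illustration only.

```python
import numpy as np

# Assumed preprocessed data: one row per house, scaled price in column 0.
data = np.array([
    [0.35, 0.40, 0.21, 0.8, 1.0],
    [0.90, 0.85, 0.67, 0.4, 0.0],
    [0.15, 0.20, 0.33, 0.6, 1.0],
])

labels = data[:, 0]      # price becomes the training label...
features = data[:, 1:]   # ...and is removed from the DNN's input features

# MSE between predicted and actual price (predictions are placeholders
# standing in for the DNN's output).
predictions = np.array([0.30, 0.95, 0.10])
mse = np.mean((predictions - labels) ** 2)
```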
Train an Autoencoder
Train an autoencoder on your dataset by following these steps:
- Ensure the hidden layers of the autoencoder are smaller than the input and output layers.
- Calculate the loss for each output as described in Supervised Similarity Measure.
- Create the loss function by summing the losses for each output. Ensure that the loss is weighted equally for every feature. For example, because color data is processed into three RGB outputs, weight each of those outputs by 1/3.
- Train the DNN.
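The loss-weighting step above can be sketched as a weighted sum over per-output losses. The loss values and the output ordering are assumptions for illustration.

```python
import numpy as np

# Hypothetical per-output squared errors from one autoencoder forward pass.
# Assumed output order: price, size, bedrooms, then the R, G, B channels.
per_output_loss = np.array([0.04, 0.01, 0.09, 0.03, 0.06, 0.12])

# Weight each output so every *feature* contributes equally: the three
# RGB outputs together represent a single feature (color), so each of
# them gets a weight of 1/3 while single-output features get 1.
weights = np.array([1.0, 1.0, 1.0, 1/3, 1/3, 1/3])

total_loss = np.sum(weights * per_output_loss)
```

With these placeholder values, the three single-output features contribute 0.14 and the color feature contributes 0.07, so no single feature dominates just because it spans multiple outputs.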
Extract Embeddings from the DNN
After training your DNN, whether predictor or autoencoder, extract the embedding for an example from the DNN. Extract the embedding by using the feature data of the example as input, and read the outputs of the final hidden layer. These outputs form the embedding vector. Remember, the vectors for similar houses should be closer together than vectors for dissimilar houses.
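A minimal sketch of the extraction step: run a forward pass and read the activations of the final hidden layer. The layer sizes are assumptions, and random weights stand in for trained ones.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

# Hypothetical trained autoencoder: 8 input features -> 4-unit hidden
# layer (the bottleneck) -> 8 outputs. Random values stand in for
# learned weights.
w1, b1 = rng.normal(size=(8, 4)), np.zeros(4)
w2, b2 = rng.normal(size=(4, 8)), np.zeros(8)

def embed(features):
    # The embedding is the activation of the final hidden layer.
    return relu(features @ w1 + b1)

example = rng.random(8)               # one preprocessed house
embedding = embed(example)            # 4-dimensional embedding vector
reconstruction = embedding @ w2 + b2  # output layer; used only in training
```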
Next, you'll see how to quantify the similarity for pairs of examples by using their embedding vectors.