Create a Manual Similarity Measure

To calculate the similarity between two examples, you need to combine all the feature data for those two examples into a single numeric value.

For instance, consider a shoe data set with only one feature: shoe size. You can quantify how similar two shoes are by calculating the difference between their sizes. The smaller the numerical difference between sizes, the greater the similarity between shoes. Such a handcrafted similarity measure is called a manual similarity measure.

What if you wanted to find similarities between shoes by using both size and color? Color is categorical data, and is harder to combine with the numerical size data. We will see that as data becomes more complex, creating a manual similarity measure becomes harder. When your data becomes complex enough, you won’t be able to create a manual measure. That’s when you switch to a supervised similarity measure, where a supervised machine learning model calculates the similarity.

We’ll leave the supervised similarity measure for later and focus on the manual measure here. For now, remember that you switch to a supervised similarity measure when you have trouble creating a manual similarity measure.

To understand how a manual similarity measure works, let's return to the shoe example. Suppose the data set has two features: shoe size and shoe price. Since both features are numeric, you can combine them into a single number representing similarity as follows.

  • Size (s): Shoe size probably forms a Gaussian distribution. Confirm this. Then normalize the data.
  • Price (p): Price probably follows a Poisson distribution. Confirm this. If you have enough data, convert the data to quantiles and scale to \([0,1]\).
  • Combine the data by using root mean squared error (RMSE). Here, the similarity is \(\sqrt{\frac{s^2+p^2}{2}}\), where \(s\) and \(p\) are the differences in processed size and price between the two shoes.
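
The two preparation steps above can be sketched in Python as follows. This is a minimal illustration using only the standard library; the function names `z_score_normalize` and `quantile_scale` and the sample values are assumptions made for this sketch, and the quantile step is approximated here by the empirical fraction of values at or below each point.

```python
import statistics
from bisect import bisect_right


def z_score_normalize(values):
    """Normalize roughly Gaussian data to zero mean and unit variance."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    # Note: this sketch does not handle the case where all values are equal.
    return [(v - mean) / stdev for v in values]


def quantile_scale(values):
    """Map each value to the fraction of the data that is <= it."""
    ordered = sorted(values)
    n = len(ordered)
    return [bisect_right(ordered, v) / n for v in values]


# Hypothetical sample data, for illustration only.
sizes = [7, 8, 9, 10, 11]
prices = [60, 80, 120, 150, 400]

print(z_score_normalize(sizes))
print(quantile_scale(prices))
```

Note that the quantile step makes the outlying price of 400 no more extreme than any other top-ranked value, which is one reason quantiles suit skewed data.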

For a simplified example, let’s calculate similarity for two shoes with US sizes 8 and 11, and prices 120 and 150. Since we don’t have enough data to understand the distribution, we’ll simply scale the data without normalizing or using quantiles.

Action | Method
Scale the size. | Assume a maximum possible shoe size of 20. Divide 8 and 11 by the maximum size 20 to get 0.4 and 0.55.
Scale the price. | Divide 120 and 150 by the maximum price 150 to get 0.8 and 1.
Find the difference in size. | \(0.55 - 0.4 = 0.15\)
Find the difference in price. | \(1 - 0.8 = 0.2\)
Find the RMSE. | \(\sqrt{\frac{0.2^2+0.15^2}{2}} \approx 0.18\)

Intuitively, your measured similarity should increase as the feature data becomes more similar. Instead, the RMSE decreases, because it measures how far apart the two examples are. Make your measured similarity follow your intuition by subtracting the RMSE from 1.

\[\text{Similarity} = 1 - 0.18 = 0.82\]
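
The arithmetic above can be reproduced with a short Python sketch. The function name `manual_similarity` is hypothetical, and the default maxima are the ones assumed in this simplified example rather than a general recipe.

```python
import math


def manual_similarity(shoe_a, shoe_b, max_size=20, max_price=150):
    """Return 1 - RMSE of the scaled size and price differences.

    Each shoe is a (size, price) tuple. The maxima are the values assumed
    in the simplified example and are used only to scale into [0, 1].
    """
    size_diff = shoe_b[0] / max_size - shoe_a[0] / max_size
    price_diff = shoe_b[1] / max_price - shoe_a[1] / max_price
    rmse = math.sqrt((size_diff ** 2 + price_diff ** 2) / 2)
    return 1 - rmse


similarity = manual_similarity((8, 120), (11, 150))
print(similarity)
```

Computed without intermediate rounding, the RMSE is about 0.177 and the similarity is about 0.82. Two identical shoes give a similarity of exactly 1.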

In general, you can prepare numerical data as described in Prepare data, and then combine the data by using Euclidean distance.

What if you have categorical data? Categorical data can either be:

  • Single valued (univalent), such as a car's color ("white" or "blue" but never both)
  • Multi-valued (multivalent), such as a movie's genre (can be "action" and "comedy" simultaneously, or just "action")

If univalent data matches, the similarity is 1; otherwise, it's 0. Multivalent data is harder to deal with. For example, a movie can belong to several genres at once. To handle this problem, suppose movies are assigned genres from a fixed set of genres. Calculate similarity as the ratio of the number of values the two examples have in common to the total number of distinct values across both examples. This ratio is called Jaccard similarity.

Examples:

  • ["comedy", "action"] and ["comedy", "action"] = 1
  • ["comedy", "action"] and ["action"] = ½
  • ["comedy", "action"] and ["action", "drama"] = ⅓
  • ["comedy", "action"] and ["non-fiction", "biographical"] = 0
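
The examples above can be checked with a small Python sketch of both categorical cases. The function names are hypothetical, and treating two empty genre lists as identical is an assumption of this sketch rather than something the text specifies.

```python
def univalent_similarity(a, b):
    """Return 1 if two single-valued categories match, otherwise 0."""
    return 1 if a == b else 0


def jaccard_similarity(a, b):
    """Return |A ∩ B| / |A ∪ B| for two collections of category labels."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumption for this sketch: two empty label sets count as identical.
        return 1.0
    return len(set_a & set_b) / len(union)


print(univalent_similarity("white", "blue"))                           # 0
print(jaccard_similarity(["comedy", "action"], ["comedy", "action"]))  # 1.0
print(jaccard_similarity(["comedy", "action"], ["action"]))            # 0.5
print(jaccard_similarity(["comedy", "action"], ["action", "drama"]))   # 0.3333333333333333
print(jaccard_similarity(["comedy", "action"], ["non-fiction", "biographical"]))  # 0.0
```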

The following table provides a few more examples of how to deal with categorical data.

Examples
Postal code | Postal codes representing areas that are close to each other should have a higher similarity. To encode the info required to calculate this similarity accurately, you can convert the postal codes into latitude and longitude. For a pair of postal codes, separately calculate the difference between their latitude and their longitude. Then add the differences to get a single numeric value.
Color | Assume you have color data as text. Convert the textual values into numeric RGB values. Now you can find the difference in red, green, and blue values for two colors, and combine the differences into a numeric value by using the Euclidean distance.
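
Both conversions can be sketched in Python as follows. The lookup tables, the placeholder postal codes "A", "B", and "C", their coordinates, and the function names are all hypothetical and exist only to make the sketch runnable; a real data set would need a complete color-name mapping and a postal-code geocoding source.

```python
import math

# Hypothetical lookup tables, for illustration only.
COLOR_RGB = {
    "red": (255, 0, 0),
    "orange": (255, 165, 0),
    "blue": (0, 0, 255),
}
POSTAL_LAT_LONG = {
    "A": (40.0, -75.0),   # placeholder code and coordinates
    "B": (40.1, -75.2),
    "C": (34.0, -118.0),
}


def color_distance(name_a, name_b):
    """Euclidean distance between two named colors in RGB space."""
    return math.dist(COLOR_RGB[name_a], COLOR_RGB[name_b])


def postal_distance(code_a, code_b):
    """Sum of the absolute latitude and longitude differences."""
    lat_a, long_a = POSTAL_LAT_LONG[code_a]
    lat_b, long_b = POSTAL_LAT_LONG[code_b]
    return abs(lat_a - lat_b) + abs(long_a - long_b)


print(color_distance("red", "orange"), color_distance("red", "blue"))
print(postal_distance("A", "B"), postal_distance("A", "C"))
```

Both functions return distances, so smaller values mean more similar examples. To turn them into similarities, scale each distance to \([0,1]\) and subtract it from 1, as in the shoe example.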

In general, your similarity measure must directly correspond to the actual similarity. If your metric does not, then it isn’t encoding the necessary information. The preceding example converted postal codes into latitude and longitude because postal codes by themselves did not encode the necessary information.

Before creating your similarity measure, process your data carefully. Although the examples on this page relied on a small, simple data set, most real-world data sets are far bigger and far more complex. Remember that quantiles are a good default choice for processing numeric data.