Manual Similarity Measure Exercise

The following exercise walks you through the process of manually creating a similarity measure.

Imagine you have a simple dataset on houses as follows:

Feature: Type
Price: Positive integer
Size: Positive floating-point value in units of square meters
Postal code: Integer
Number of bedrooms: Integer
Type of house: A text value from “single_family,” “multi-family,” “apartment,” “condo”
Garage: 0/1 for no/yes
Colors: Multivalent categorical; one or more values from standard colors “white,” “yellow,” “green,” etc.

Preprocessing

The first step is preprocessing the numerical features: price, size, number of bedrooms, and postal code. For each of these features you will have to perform a different operation. For example, in this case, assume that pricing data follows a bimodal distribution. What should you do next?

Which action should you take if your data follows a bimodal distribution?
Create quantiles from the data and scale to [0,1] (correct): This is the step to take when data follows a bimodal distribution.
Log transform and scale to [0,1] (incorrect): This is the step to take when data follows a power-law distribution.
Normalize and scale to [0,1] (incorrect): This is the step to take when data follows a Gaussian distribution.
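
The three options above can be sketched in code. The following is a minimal illustration using NumPy, not a definitive implementation: the function names, the choice of four quantile buckets, and the sample prices are all assumptions made for this example.

```python
import numpy as np

def min_max(values):
    """Scale values linearly so the smallest becomes 0 and the largest 1."""
    values = np.asarray(values, dtype=float)
    return (values - values.min()) / (values.max() - values.min())

def quantile_scale(values, n_quantiles=4):
    """Replace each value by its quantile bucket index, scaled to [0, 1].

    Assumption: 4 buckets by default; suited to bimodal data.
    """
    values = np.asarray(values, dtype=float)
    # Interior bucket boundaries (the 25th, 50th, 75th percentiles for 4 buckets).
    edges = np.quantile(values, np.linspace(0, 1, n_quantiles + 1)[1:-1])
    buckets = np.searchsorted(edges, values, side="right")  # 0 .. n_quantiles - 1
    return buckets / (n_quantiles - 1)

def log_scale(values):
    """Log transform, then min-max scale to [0, 1]; suited to power-law data."""
    return min_max(np.log(np.asarray(values, dtype=float)))

def normalize(values):
    """Z-score normalization; suited to roughly Gaussian data.

    Apply min_max afterwards if a [0, 1] range is required.
    """
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

# Made-up bimodal prices: one group near 100k, another near 500k.
prices = [100_000, 110_000, 120_000, 130_000, 500_000, 510_000, 520_000, 530_000]
print(quantile_scale(prices))
```

With quantiles, the large gap between the two price groups no longer dominates the scaled values; each house is placed by its rank bucket instead.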

How would you process size data?

How would you process data on the number of bedrooms?

How should you represent postal codes? Convert postal codes to longitude and latitude. Then process those values as you would process other numeric values.
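
As a rough sketch of the postal code step, the example below looks up coordinates and then min-max scales them. The lookup table, its postal codes, and its coordinates are hypothetical values invented for illustration, and min-max scaling is only one assumed way to "process those values as other numeric values."

```python
# Hypothetical lookup table: postal code -> (latitude, longitude).
# These codes and coordinates are made up; a real pipeline would need
# an actual geocoding dataset.
POSTAL_TO_LATLON = {
    11111: (10.0, 20.0),
    22222: (20.0, 40.0),
    33333: (30.0, 30.0),
}

def min_max(values):
    """Scale a list of numbers linearly to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: no spread to scale, so map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def postal_to_scaled_latlon(postal_codes, lookup):
    """Convert postal codes to (latitude, longitude) pairs scaled to [0, 1]."""
    coords = [lookup[code] for code in postal_codes]
    lats = min_max([lat for lat, _ in coords])
    lons = min_max([lon for _, lon in coords])
    return list(zip(lats, lons))

print(postal_to_scaled_latlon([11111, 22222, 33333], POSTAL_TO_LATLON))
```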

Calculating Similarity per Feature

Now it is time to calculate the similarity per feature. For numeric features, you simply find the difference. For binary features, such as whether a house has a garage, you can also find the difference to get 0 or 1. But what about categorical features? Answer the questions below to find out.

Which of these features is multivalent (can have multiple values)?
Color (correct): A given residence can be more than one color, for example, blue with white trim. Therefore, color is a multivalent feature.
Postal code (incorrect): Any dwelling can have only one postal code. This is a univalent feature.
Type (incorrect): A home can be only one type (house, apartment, condo, etc.), which means it is a univalent feature.

Which type of similarity measure should you use for calculating the similarity for a multivalent feature?
Jaccard similarity (correct): Suppose homes are assigned colors from a fixed set of colors. Then, calculate similarity using the ratio of common values (Jaccard similarity).
Euclidean distance (incorrect)

For the features “postal code” and “type” that have only one value (univalent features), if the feature matches, the similarity measure is 0; otherwise, the similarity measure is 1. As with the numeric and binary features, this value behaves as a difference: 0 means the two houses are identical on that feature.
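
The per-feature calculations described in this section can be sketched as follows. This is a minimal illustration with assumed function names. Note that Jaccard similarity is 1 for identical sets, while the other measures are 0 for identical values; the `jaccard_difference` helper, which flips the Jaccard value so that all features share the "0 means identical" convention, is an assumption of this sketch and is not stated in the exercise.

```python
def numeric_difference(a, b):
    """Absolute difference between two numeric (or 0/1 binary) values.

    Assumes numeric values were already scaled to [0, 1] during preprocessing.
    """
    return abs(a - b)

def univalent_difference(a, b):
    """0 if the two single-valued categorical features match, 1 otherwise."""
    return 0 if a == b else 1

def jaccard_similarity(a, b):
    """Ratio of common values to all values across two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # Assumption: two empty sets are treated as identical.
    return len(a & b) / len(a | b)

def jaccard_difference(a, b):
    """Assumed helper: 1 - Jaccard similarity, so that 0 means identical."""
    return 1.0 - jaccard_similarity(a, b)

# Two made-up houses compared feature by feature.
print(numeric_difference(1, 0))                    # garage: yes vs. no
print(univalent_difference("condo", "apartment"))  # type of house
print(jaccard_similarity({"white", "blue"}, {"white", "green"}))  # colors
```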

Calculating Overall Similarity

You have numerically calculated the similarity for every feature. But the clustering algorithm requires the overall similarity to cluster houses. Calculate the overall similarity between a pair of houses by combining the per-feature similarities using root mean squared error (RMSE). That is, where \(s_1,s_2,\ldots,s_N\) represent the similarities for \(N\) features:

\[\text{RMSE} = \sqrt{\frac{s_1^2+s_2^2+\ldots+s_N^2}{N}}\]
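
The RMSE formula above translates directly into a few lines of code. This is a small sketch; the function name and the sample per-feature values are assumptions for illustration.

```python
import math

def overall_similarity(per_feature):
    """Combine per-feature values s_1..s_N with root mean squared error."""
    n = len(per_feature)
    if n == 0:
        raise ValueError("at least one per-feature value is required")
    return math.sqrt(sum(s * s for s in per_feature) / n)

# Made-up per-feature values for one pair of houses, each in [0, 1].
print(overall_similarity([0.6, 0.8]))
```

Every feature contributes with equal weight in this formula, whatever its practical importance.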

Limitations of Manual Similarity Measure

As this exercise demonstrated, when data gets complex, it is increasingly hard to process and combine the data to accurately measure similarity in a semantically meaningful way. Consider the color data. Should color really be categorical? Or should we assign colors like red and maroon to have higher similarity than black and white? And regarding combining data, we just weighted the garage feature equally with house price. However, house price is far more important than having a garage. Does it really make sense to weigh them equally?

If you create a similarity measure that doesn’t truly reflect the similarity between examples, your derived clusters will not be meaningful. This is often the case with categorical data and brings us to a supervised measure.