Data preparation

This section reviews the data preparation steps most relevant to clustering from the Working with numerical data module in Machine Learning Crash Course.

In clustering, you calculate the similarity between two examples by combining all the feature data for those examples into a numeric value. This requires the features to have the same scale, which can be accomplished by normalizing, transforming, or creating quantiles. If you want to transform your data without inspecting its distribution, you can default to quantiles.
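
To make this concrete, here is a minimal NumPy sketch (the feature values are invented) that combines two features into a single number using Euclidean distance, one common similarity measure; a smaller distance means the examples are more similar:

```python
import numpy as np

# Two examples, each described by two features on comparable scales.
example_a = np.array([0.2, 1.3])
example_b = np.array([0.4, 0.9])

# Euclidean distance collapses all feature differences into one value:
# the smaller the distance, the more similar the examples.
distance = np.linalg.norm(example_a - example_b)  # ~0.447
```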

Normalizing data

You can transform data for multiple features to the same scale by normalizing the data.

Z-scores

Whenever you see a dataset roughly shaped like a Gaussian distribution, you should calculate z-scores for the data. Z-scores are the number of standard deviations a value is from the mean. You can also use z-scores when the dataset isn't large enough for quantiles.

See Z-score scaling to review the steps.
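
As a rough sketch with made-up values, you can compute z-scores directly in NumPy (scikit-learn's StandardScaler applies the same formula to each feature):

```python
import numpy as np

# Made-up raw values for a single feature.
feature = np.array([120.0, 150.0, 180.0, 200.0, 350.0])

# A z-score is the number of standard deviations a value
# lies from the mean of its feature.
z_scores = (feature - feature.mean()) / feature.std()
```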

Here is a visualization of two features of a dataset before and after z-score scaling:

Two graphs comparing feature data before and after normalization
Figure 1: A comparison of feature data before and after normalization.

In the unnormalized dataset on the left, Feature 1 and Feature 2, graphed on the x- and y-axes respectively, don't have the same scale, so the red example appears closer, or more similar, to blue than to yellow. On the right, after z-score scaling, the two features have the same scale, and the red example appears closer to the yellow example. The normalized dataset gives a more accurate measure of similarity between points.

Log transforms

When a dataset perfectly conforms to a power-law distribution, where data is heavily clumped at the lowest values, use a log transform. See Log scaling to review the steps.
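
As a sketch with made-up values clumped at the low end, a log transform in NumPy might look like this:

```python
import numpy as np

# Made-up power-law-like values: most are small, a few are huge.
values = np.array([1.0, 2.0, 2.0, 3.0, 5.0, 40.0, 1000.0])

# log1p computes log(1 + x), which also tolerates zero values.
log_values = np.log1p(values)
```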

Here is a visualization of a power-law dataset before and after a log transform:

A bar chart with the majority of data at the low end
Figure 2: A power-law distribution.
A graph showing a normal (Gaussian) distribution
Figure 3: A log transform of Figure 2.

Before log scaling (Figure 2), the red example appears more similar to yellow. After log scaling (Figure 3), red appears more similar to blue.

Quantiles

Binning the data into quantiles works well when the dataset does not conform to a known distribution. Take this dataset, for example:

A graph showing a data distribution prior to any preprocessing
Figure 4: An uncategorizable distribution prior to any preprocessing.

Intuitively, two examples are more similar if only a few examples fall between them, irrespective of their values, and more dissimilar if many examples fall between them. The visualization above makes it difficult to see the total number of examples that fall between red and yellow, or between red and blue.

This understanding of similarity can be brought out by dividing the dataset into quantiles, or intervals that each contain equal numbers of examples, and assigning the quantile index to each example. See Quantile bucketing to review the steps.
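
For example, here is one way to bucket made-up values into quantiles with pandas; qcut divides the data into equal-count intervals, and labels=False returns each value's quantile index:

```python
import pandas as pd

# Made-up feature values with no recognizable distribution.
values = pd.Series([1, 2, 3, 4, 8, 9, 15, 40, 41, 100])

# Divide the values into 5 equal-count intervals and replace
# each value with the index of its interval.
quantile_index = pd.qcut(values, q=5, labels=False)
print(quantile_index.tolist())  # [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
```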

Here is the previous distribution divided into quantiles, showing that red is one quantile away from yellow and three quantiles away from blue:

A graph showing the data after conversion into quantiles. The lines represent 20 intervals.
Figure 5: The distribution in Figure 4 after conversion into 20 quantiles.

You can choose any number \(n\) of quantiles. However, for the quantiles to meaningfully represent the underlying data, your dataset should have at least \(10n\) examples. If you don't have enough data, normalize instead.
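
As a quick check of this rule of thumb (the dataset size here is hypothetical):

```python
num_examples = 1_000                # hypothetical dataset size
max_quantiles = num_examples // 10  # at least 10 examples per quantile
```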

Check your understanding

For the following questions, assume you have enough data to create quantiles.

Question one

A plot displaying three data distributions
How should you process the data distribution shown in the preceding graph?
Create quantiles.
Correct. Because the distribution does not match a standard data distribution, you should default to creating quantiles.
Normalize.
You typically normalize data if:
  • The data distribution is Gaussian.
  • You have some insight into what the data represents in the real world that suggests the data shouldn't be transformed nonlinearly.
Neither case applies here. The data distribution isn't Gaussian because it isn't symmetric. And you don't know what these values represent in the real world.
Log transform.
This isn't a perfect power-law distribution, so don't use a log transform.

Question two

A plot displaying three data distributions
How would you process this data distribution?
Normalize.
Correct. This is a Gaussian distribution.
Create quantiles.
Incorrect. Since this is a Gaussian distribution, the preferred transform is normalization.
Log transform.
Incorrect. Only apply a log transform to power-law distributions.

Missing data

If your dataset has examples with missing values for a certain feature, but those examples occur rarely, you can remove these examples. If those examples occur frequently, you can either remove that feature altogether, or you can predict the missing values from other examples using a machine learning model. For example, you can impute missing numerical data by using a regression model trained on existing feature data.
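
As a sketch of the regression approach, with hypothetical column names ("age", "tenure", and "income"), you could use scikit-learn's LinearRegression to fill in the gaps:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: "income" is missing for some examples.
df = pd.DataFrame({
    "age":    [25, 32, 47, 51, 62, 38],
    "tenure": [1, 4, 10, 12, 20, 7],
    "income": [30.0, 45.0, None, 80.0, None, 52.0],
})

known = df[df["income"].notna()]
missing = df[df["income"].isna()]

# Train a regression model on the examples where the feature is present...
model = LinearRegression().fit(known[["age", "tenure"]], known["income"])

# ...then predict (impute) the missing values from the other features.
df.loc[df["income"].isna(), "income"] = model.predict(missing[["age", "tenure"]])
```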