Bucketing

Let's start with a quick review of a key idea from Machine Learning Crash Course. Look at the distribution in the chart below.

A plot of houses per latitude. The plot is highly irregular, containing doldrums around latitude 36 and huge spikes around latitudes 34 and 38.

Figure 1: House prices versus latitude.

Consider Figure 1. If you think latitude might be a good predictor of housing values, should you leave latitude as a floating-point value? Why or why not? (Assume this is a linear model.)

Option 1: Yes. If latitude is a floating-point value in the dataset, you shouldn't change it.
Feedback (incorrect): If you feed those floating-point values into your model, it will try to learn a linear relationship between the feature and the label. But a linear relationship isn't likely for latitude. A one-degree increase in latitude (say, from 34 to 35 degrees) may correspond to one amount of change in housing values, whereas a different one-degree increase (say, from 35 to 36 degrees) may correspond to a different amount. That's non-linear behavior, which a linear model can't capture from a single floating-point feature.

Option 2: No. There's no linear relationship between latitude and the housing values.
Feedback (correct): You suspect that individual latitudes and housing values are related, but the relationship is not linear.

In cases like the latitude example, you need to divide the latitudes into buckets to learn something different about housing values for each bucket. This transformation of numeric features into categorical features, using a set of thresholds, is called bucketing (or binning). In this bucketing example, the boundaries are equally spaced.

The same plot of latitude vs. housing prices as the previous figure. This time, however, the plot is divided into 11 "bins" between whole-number latitudes.

Figure 2: House prices versus latitude, now divided into buckets.
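
As a minimal sketch of equally spaced bucketing, the following Python snippet (standard library only) maps a latitude to a bucket index and then to a one-hot vector. The 32 to 43 degree range, the sample latitudes, and the helper names are assumptions chosen for illustration; they are not taken from the course dataset.

```python
from bisect import bisect_right

NUM_BUCKETS = 11  # matches the 11 bins in Figure 2


def equal_width_boundaries(low, high, num_buckets):
    """Return the interior thresholds that split [low, high] into equal-width buckets."""
    width = (high - low) / num_buckets
    return [low + i * width for i in range(1, num_buckets)]


def bucket_index(value, boundaries):
    """Return the 0-based bucket for a value: the number of thresholds <= value."""
    return bisect_right(boundaries, value)


def one_hot(index, num_buckets):
    """Encode a bucket index as a vector with a single 1.0."""
    return [1.0 if i == index else 0.0 for i in range(num_buckets)]


# Assumed latitude range (hypothetical): 32 to 43 degrees, one degree per bucket.
boundaries = equal_width_boundaries(32.0, 43.0, NUM_BUCKETS)

latitudes = [33.2, 34.05, 36.9, 38.5]  # made-up sample values
indices = [bucket_index(lat, boundaries) for lat in latitudes]
features = [one_hot(i, NUM_BUCKETS) for i in indices]

for lat, idx in zip(latitudes, indices):
    print(f"latitude {lat} -> bucket {idx}")
```

Values outside the assumed range fall into the first or last bucket. Each latitude now activates exactly one of the 11 features, so a linear model can learn a separate weight for each bucket.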

Quantile Bucketing

Let's revisit our car price dataset with buckets added. With one feature per bucket, the model uses as much capacity for a single example in the >45000 range as for all the examples in the 5000-10000 range. This seems wasteful. How might we improve this situation?

A plot of the number of cars sold at each price. The plot is divided into 10 equally sized buckets, each spanning a price range of 5000. The first three buckets contain many examples, but the final seven buckets contain very few examples.

Figure 3: Number of cars sold at different prices.
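
To see the imbalance concretely, here is a small sketch that counts examples per equally spaced bucket. The prices are synthetic and deliberately right-skewed (an assumption for illustration, not the car price dataset), and the thresholds follow the 5000-wide buckets described above.

```python
from bisect import bisect_right
from collections import Counter

# Synthetic, right-skewed prices (assumption): most cars are cheap, a few are expensive.
prices = [2000 + 48000 * (i / 999) ** 4 for i in range(1000)]

# Equally spaced thresholds every 5000: buckets are <5000, 5000-10000, ..., >=45000.
boundaries = list(range(5000, 50000, 5000))  # 9 thresholds -> 10 buckets

# Count how many examples land in each bucket.
counts = Counter(bisect_right(boundaries, price) for price in prices)

for bucket in range(len(boundaries) + 1):
    print(f"bucket {bucket}: {counts.get(bucket, 0)} examples")
```

With data like this, the cheapest bucket holds far more examples than the most expensive one, yet each bucket still gets its own feature and its own weight in the model.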

The problem is that equally spaced buckets don’t capture this distribution well. The solution lies in creating buckets that each have the same number of points. This technique is called quantile bucketing. For example, the following figure divides car prices into quantile buckets. In order to get the same number of examples in each bucket, some of the buckets encompass a narrow price span while others encompass a very wide price span.

Same as Figure 3, except with quantile buckets. That is, the buckets now have different sizes. The smallest bucket has a range of about 1000 dollars and the largest bucket has a range of about 25000 dollars. The number of cars in each bucket is now about the same.

Figure 4: Quantile bucketing gives each bucket about the same number of cars.
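
One way to sketch quantile bucketing is to take the bucket boundaries from the data's quantiles instead of spacing them evenly. The snippet below uses `statistics.quantiles` from the Python standard library (Python 3.8 or later); the synthetic, right-skewed prices are an assumption for illustration, not the car price dataset.

```python
from bisect import bisect_right
from collections import Counter
from statistics import quantiles

# Synthetic, right-skewed prices (assumption): most cars are cheap, a few are expensive.
prices = [2000 + 48000 * (i / 999) ** 4 for i in range(1000)]

NUM_BUCKETS = 10

# quantiles() returns the NUM_BUCKETS - 1 cut points that divide the data
# into groups of (roughly) equal size.
boundaries = quantiles(prices, n=NUM_BUCKETS)

# Assign each price to a bucket and count the examples per bucket.
counts = Counter(bisect_right(boundaries, price) for price in prices)

# Price span covered by each interior bucket (between consecutive cut points).
widths = [high - low for low, high in zip(boundaries, boundaries[1:])]

for bucket in range(NUM_BUCKETS):
    print(f"bucket {bucket}: {counts[bucket]} examples")
print(f"narrowest interior span: {min(widths):.0f}, widest: {max(widths):.0f}")
```

With 1,000 distinct prices and 10 buckets, every bucket ends up with 100 examples, while the price spans differ widely: narrow where the data is dense and wide where it is sparse.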

Bucketing Summary

If you choose to bucketize your numerical features, be clear about how you are setting the boundaries and which type of bucketing you’re applying:

Buckets with equally spaced boundaries: the boundaries are fixed and encompass the same range (for example, 0-4 degrees, 5-9 degrees, and 10-14 degrees, or $5,000-$9,999, $10,000-$14,999, and $15,000-$19,999). Some buckets could contain many points, while others could have few or none.
Buckets with quantile boundaries: each bucket has the same number of points. The boundaries are not fixed and could encompass a narrow or wide span of values.
