Sampling and Splitting: Check Your Understanding


Imagine that you have a dataset with a 1:1000 positive-negative ratio. Unfortunately, your model is always predicting the majority class. What technique would best help you deal with this problem? Note that you want the model to report a calibrated probability.

Just downsample the negative examples.

That's a good start, but you'll alter the model's base rate, so it is no longer calibrated.

Downsample the negative examples (the majority class). Then upweight the downsampled class by the same factor.

This is an effective way to deal with imbalanced data while still recovering the real distribution of labels. Whether the upweighting step matters depends on whether the model needs to report a calibrated probability. If it doesn't need to be calibrated, you don't need to worry about changing the base rate.
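As a minimal sketch of downsampling followed by upweighting, the following uses only the Python standard library. The function name `downsample_and_upweight`, the `(features, label)` record layout, and the factor of 100 are assumptions made for illustration; they do not come from any particular library.

```python
import random

def downsample_and_upweight(examples, factor, seed=0):
    """Keep every positive; keep 1/factor of the negatives, each weighted by factor.

    `examples` is a list of (features, label) pairs, where label 1 is the
    minority (positive) class and label 0 is the majority (negative) class.
    Returns a list of (features, label, weight) triples.
    """
    rng = random.Random(seed)
    positives = [(x, y) for x, y in examples if y == 1]
    negatives = [(x, y) for x, y in examples if y == 0]

    # Downsample: keep a random 1/factor slice of the majority class.
    kept = rng.sample(negatives, len(negatives) // factor)

    # Upweight: each surviving negative stands in for `factor` originals.
    weighted = [(x, y, 1.0) for x, y in positives]
    weighted += [(x, y, float(factor)) for x, y in kept]
    return weighted

# Hypothetical data with the 1:1000 ratio from the question.
data = [(i, 1) for i in range(10)] + [(i, 0) for i in range(10_000)]
sampled = downsample_and_upweight(data, factor=100)

weighted_pos = sum(w for _, y, w in sampled if y == 1)
weighted_neg = sum(w for _, y, w in sampled if y == 0)
```

The sampled set holds far fewer negatives, so the model sees positives more often during training, while the weighted totals still match the original counts. Passing the weights to a loss function that supports per-example weights is what preserves the base rate.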

Which techniques lose data from the tail of a dataset? Check all that apply.

PII filtering

Filtering PII from your data can remove information in the tail, skewing your distribution.

Weighting

Example weighting changes the importance of different examples, but it doesn't lose information. In fact, adding weight to the tail examples can help your model learn behavior about the tail.

Downsampling

Downsampling loses information from the tail of the feature distributions. However, since we typically downsample the majority class, this loss isn't usually a big problem.

Normalization

Normalization operates on individual examples, so it doesn't cause sampling bias.

You are working on a classification problem, and you randomly split the data into training, evaluation, and testing sets. Your classifier looks like it’s working perfectly! But in production, the classifier is a total failure. You later discover that the problem was caused by the random split. What kinds of data are susceptible to this problem?

Time series data

Random splitting divides each cluster of closely timed examples across the test/train split, providing a “sneak preview” to the model that won’t be available in production.

Data that doesn't change much over time

If your data doesn't change very much over time, you'll have better chances with a random split. For example, you might want to identify the breed of dog in photos, or predict which patients are at risk of a heart defect based on past biometric data. In both cases, the data generally doesn't change over time, so random splitting shouldn't cause a problem.

Groupings of data

The test set will always be too similar to the training set because clusters of similar data are in both sets. The model will appear to have better predictive power than it does.

Data with burstiness (data arriving in intermittent bursts as opposed to a continuous stream)

Clusters of similar data (the bursts) will show up in both training and testing. The model will make better predictions in testing than with new data.
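The alternatives to a random split for these kinds of data can be sketched as follows: split by group so that a whole cluster lands on one side, or split by time so that the test set is strictly later than the training set. The helper names `group_split` and `time_split` and the `(day, story_id)` record layout are hypothetical, chosen only for this illustration.

```python
import random

def group_split(records, group_key, test_fraction=0.2, seed=0):
    """Assign whole groups to train or test so no group appears in both."""
    groups = sorted({group_key(r) for r in records})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, int(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [r for r in records if group_key(r) not in test_groups]
    test = [r for r in records if group_key(r) in test_groups]
    return train, test

def time_split(records, time_key, test_fraction=0.2):
    """Train on the earliest records and test on the most recent ones."""
    ordered = sorted(records, key=time_key)
    cutoff = len(ordered) - int(len(ordered) * test_fraction)
    return ordered[:cutoff], ordered[cutoff:]

# Hypothetical records: (day, story_id), with five similar items per story.
records = [(day, f"story-{day // 5}") for day in range(50)]

train_g, test_g = group_split(records, group_key=lambda r: r[1])
train_t, test_t = time_split(records, time_key=lambda r: r[0])

# No story is shared between the grouped train and test sets.
shared_stories = {r[1] for r in train_g} & {r[1] for r in test_g}
```

With either split, the test set no longer contains near-copies of training examples, so the measured performance should be closer to what the model achieves on new data.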
