Introduction to Constructing Your Dataset

Steps to Constructing Your Dataset

To construct your dataset (and before doing data transformation), you should:

1. Collect the raw data.
2. Identify feature and label sources.
3. Select a sampling strategy.
4. Split the data.
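The four steps can be sketched end to end in a few lines of Python. This is only an illustration under stated assumptions: the log records, the field names (`sessions`, `purchased`), the 10% uniform sampling rate, and the 80/20 random split are all made up for the example, and a real project would choose each of them based on how the problem is framed.

```python
import random

# Step 1: collect the raw data. Made-up log records stand in for a real source.
raw_logs = [{"user_id": i, "sessions": i % 7, "purchased": i % 5 == 0}
            for i in range(10_000)]

# Step 2: identify feature and label sources (field names are hypothetical).
def to_example(record):
    features = {"sessions": record["sessions"]}
    label = int(record["purchased"])
    return record["user_id"], features, label

examples = [to_example(r) for r in raw_logs]

# Step 3: select a sampling strategy. Uniform random sampling at 10% is assumed.
rng = random.Random(0)  # fixed seed keeps the sketch reproducible
sampled = [ex for ex in examples if rng.random() < 0.10]

# Step 4: split the data. A random 80/20 train/test split is assumed.
rng.shuffle(sampled)
cut = int(len(sampled) * 0.8)
train, test = sampled[:cut], sampled[cut:]

print(len(sampled), len(train), len(test))
```

Each step here is a single line or function on purpose. In practice, the sampling strategy and the split are the steps most likely to need something more careful than a uniform random choice.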

These steps depend heavily on how you’ve framed your ML problem. Use the self-check below to refresh your memory of problem framing and to check your assumptions about data collection.

Self-check of Problem Framing and Data Collection Concepts

For each of the following questions, every candidate answer is followed by its explanation:

You’re on a brand new machine learning project, about to select your first features. How many features should you pick?

Pick 1-3 features that seem to have strong predictive power.

It’s best for your data collection pipeline to start with only one or two features. This will help you confirm that the ML model works as intended. Also, when you build a baseline from a couple of features, you'll feel like you're making progress!

Pick 4-6 features that seem to have strong predictive power.

You might eventually use this many features, but it's still better to start with fewer. Fewer features usually means fewer unnecessary complications.

Pick as many features as you can, so you can start observing which features have the strongest predictive power.

Start smaller. Every new feature adds a new dimension to your training dataset. As the dimensionality increases, the volume of the feature space grows so fast that the available training data become sparse. The sparser your data, the harder it is for a model to learn the relationship between the features that actually matter and the label. This phenomenon is called "the curse of dimensionality."
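A rough way to see how quickly the data become sparse is to divide each feature into a fixed number of bins and count the resulting grid cells. The sketch below assumes 10 bins per feature and 1,000 training examples; both numbers, and the function name `max_occupied_fraction`, are chosen only for illustration. Because each example falls in exactly one cell, the number of examples divided by the number of cells is an upper bound on the fraction of cells that contain any data.

```python
def max_occupied_fraction(num_examples, num_features, bins_per_feature=10):
    """Upper bound on the fraction of grid cells holding at least one example."""
    # Each added feature multiplies the number of cells by bins_per_feature.
    num_cells = bins_per_feature ** num_features
    return min(1.0, num_examples / num_cells)

for d in (1, 2, 3, 6, 10):
    frac = max_occupied_fraction(num_examples=1_000, num_features=d)
    print(f"{d:>2} features: at most {frac:.6%} of cells can contain data")
```

With three features, 1,000 examples could in principle touch every cell. With six features there are a million cells, so at least 99.9% of them are empty no matter how the examples are distributed.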

Your friend Sam is excited about the initial results of his statistical analysis. He says that the data show a positive correlation between the number of app downloads and the number of app review impressions. But he's not sure whether users would have downloaded the app anyway, without seeing the review. What response would be most helpful to Sam?

You could run an experiment to compare the behavior of users who didn't see the review with similar users who did.

Correct! If Sam observes that users who saw the positive review were more likely to download the app than those who didn't, then he has reasonable evidence to suggest that the positive review is encouraging people to get the app.

Trust the data. It's clear that the great review is the reason users are downloading the app.

Incorrect. This response wouldn't lead Sam in the right direction. You can't determine causation from only observational data. Sam is seeing a correlation (that is, a statistical dependency between the numbers) that may or may not indicate causation. Don't let your analyses join the ranks of spurious correlations.
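If Sam does run the experiment, one simple way to compare the two groups is a two-proportion z-test on their download rates. The sketch below is a minimal illustration, not a prescribed method: the counts are invented for the example, the function name `two_proportion_z` is hypothetical, and the causal reading only holds if users were randomly assigned to see or not see the review.

```python
import math

def two_proportion_z(downloads_a, users_a, downloads_b, users_b):
    """Compare download rates of two groups with a two-proportion z-test."""
    rate_a = downloads_a / users_a
    rate_b = downloads_b / users_b
    # Pooled rate under the null hypothesis that both groups behave the same.
    pooled = (downloads_a + downloads_b) / (users_a + users_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    z = (rate_a - rate_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_a, rate_b, z, p_value

# Invented counts: group A saw the review, group B did not.
rate_a, rate_b, z, p_value = two_proportion_z(120, 1_000, 90, 1_000)
print(f"saw review: {rate_a:.1%}, did not: {rate_b:.1%}, "
      f"z = {z:.2f}, p = {p_value:.3f}")
```

A small p-value suggests the difference in download rates is unlikely to be chance alone. It says nothing about causation unless the assignment to groups was random.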
