To construct your dataset (and before doing data transformation), you should:
1. Collect the raw data.
2. Identify feature and label sources.
3. Select a sampling strategy.
4. Split the data.
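As a rough illustration of the last two steps, here is a minimal Python sketch. It assumes uniform random sampling and a random train/test split, which are only one possible choice; the text above does not prescribe a particular strategy. The function name `sample_and_split` and its parameters are hypothetical, chosen for this example.

```python
import random


def sample_and_split(examples, sample_fraction=0.5, train_fraction=0.8, seed=0):
    """Uniformly sample a subset of examples, then split it into train and test.

    Assumption: uniform random sampling and a random split are appropriate
    for the problem. Other framings may need a different strategy.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    sample_size = int(len(examples) * sample_fraction)
    # random.sample draws without replacement and returns items in random order.
    sampled = rng.sample(examples, sample_size)
    cut = int(len(sampled) * train_fraction)
    return sampled[:cut], sampled[cut:]


raw_examples = list(range(100))  # stand-in for collected raw data
train, test = sample_and_split(raw_examples)
print(len(train), len(test))
```

Fixing the seed keeps the split stable between runs, which makes later comparisons between models easier to interpret.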
These steps depend a lot on how you’ve framed your ML problem. Use the
self-check below to refresh your memory about problem framing and to check your
assumptions about data collection.
Self-check of Problem Framing and Data Collection
You’re on a brand new machine learning project, about to select your
first features. How many features should you pick?
Pick 1-3 features that seem to have strong predictive power.
It’s best for your data collection pipeline to start with only one or
two features. This will help you confirm that the ML model works as intended.
Also, when you build a baseline from a couple of features,
you'll feel like you're making progress!
Pick 4-6 features that seem to have strong predictive power.
You might eventually use this many features, but it's still better to
start with fewer. Fewer features usually means fewer unnecessary
complications.
Pick as many features as you can, so you can start observing which
features have the strongest predictive power.
Start smaller. Every new feature adds a new dimension to your training
dataset. As the dimensionality increases, the volume of the feature space
grows so fast that the available training data become sparse. The
sparser your data, the harder it is for a model to learn the relationship
between the features that actually matter and the label. This phenomenon
is called "the curse of dimensionality."
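To make the sparsity argument concrete, here is a small Python sketch. It assumes each feature is divided into 10 equal bins, so the feature space becomes a grid of 10^d cells for d features. The function name `max_cell_coverage` and the bin count are illustrative choices, not part of the original lesson. With a fixed number of examples, the fraction of cells that can contain at least one example collapses as d grows.

```python
def max_cell_coverage(num_examples, num_features, bins_per_feature=10):
    """Upper bound on the fraction of grid cells holding at least one example.

    Assumption: every feature is cut into `bins_per_feature` equal bins,
    so the grid has bins_per_feature ** num_features cells in total.
    """
    num_cells = bins_per_feature ** num_features
    # Each example occupies at most one cell, so coverage cannot exceed
    # num_examples / num_cells (and never exceeds 1).
    return min(1.0, num_examples / num_cells)


for num_features in (1, 2, 3, 6):
    coverage = max_cell_coverage(1000, num_features)
    print(f"{num_features} features: at most {coverage:.4%} of cells covered")
```

With 1,000 examples, three features can still cover every cell in principle, but six features leave at least 99.9% of the cells empty.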
Your friend Sam is excited about the initial results of his statistical
analysis. He says that the data show a positive correlation between the
number of app downloads and the number of app review impressions. But
he's not sure whether users would have downloaded the app anyway, without
seeing the review. What response would be most helpful to Sam?
You could run an experiment to compare the behavior of users who
didn't see the review with similar users who did.
Correct! If Sam observes that users who saw the positive review were
more likely to download the app than those who didn't, then he has
reasonable evidence to suggest that the positive review is encouraging
people to get the app.
Trust the data. It's clear that the great review is the reason users
are downloading the app.
Incorrect. This response wouldn't lead Sam in the right direction.
You can't determine causation from only observational data. Sam is
seeing a correlation (that is, a statistical dependency between the
numbers) that may or may not indicate causation. Don't let your analyses
join the ranks of