To understand the problem, perform the following tasks:
- State the goal for the product you are developing or refactoring.
- Determine whether the goal is best solved using ML.
- Verify you have the data required to train a model.
State the goal
Begin by stating your goal in non-ML terms. The goal is the answer to the question, "What am I trying to accomplish?"
The following table states clear goals for hypothetical apps:
| Application | Goal |
|---|---|
| Weather app | Calculate precipitation in six-hour increments for a geographic region. |
| Video app | Recommend useful videos. |
| Mail app | Detect spam. |
| Map app | Calculate travel time. |
| Banking app | Identify fraudulent transactions. |
| Dining app | Identify cuisine by a restaurant's menu. |
Clear use case for ML
Some view ML as a universal tool that can be applied to all problems. In reality, ML is a specialized tool suitable only for particular problems. You don't want to implement a complex ML solution when a simpler non-ML solution will work.
To confirm that ML is the right approach, first verify that your current non-ML solution is optimized. If you don't have a non-ML solution implemented, try solving the problem manually using a heuristic.
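For instance, a heuristic for the mail app's spam goal might be a handful of hand-written keyword rules. The keywords and threshold below are made up; this is a minimal sketch of the kind of non-ML baseline you could benchmark an ML model against:

```python
# A minimal non-ML spam baseline. The keyword list and threshold are
# hypothetical; the point is to have a measurable benchmark before
# investing in an ML solution.
SPAM_KEYWORDS = {"free", "winner", "urgent", "prize", "click here"}

def is_spam_heuristic(subject: str, body: str, threshold: int = 2) -> bool:
    """Flags a message as spam if it contains enough suspicious keywords."""
    text = f"{subject} {body}".lower()
    hits = sum(keyword in text for keyword in SPAM_KEYWORDS)
    return hits >= threshold
```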
The non-ML solution is the benchmark you'll use to determine whether ML is a good use case for your problem. Consider the following questions when comparing a non-ML approach to an ML one:
- Quality. How much better do you think an ML solution can be? If you think an ML solution might be only a small improvement, that might indicate the current solution is the best one.
- Cost and maintenance. How expensive is the ML solution in both the short and long term? In some cases, it costs significantly more in terms of compute resources and time to implement ML. Consider the following questions:
  - Can the ML solution justify the increase in cost? Note that small improvements in large systems can easily justify the cost and maintenance of implementing an ML solution.
  - How much maintenance will the solution require? In many cases, ML implementations need dedicated long-term maintenance.
  - Does your product have the resources to support training or hiring people with ML expertise?
Data
Data is the driving force of ML. To make good predictions, you need data that contains features with predictive power. Your data should have the following characteristics:
- Abundant. The more relevant and useful examples in your dataset, the better your model will be.
- Consistent and reliable. Having data that's consistently and reliably collected will produce a better model. For example, an ML-based weather model will benefit from data gathered over many years from the same reliable instruments.
- Trusted. Understand where your data will come from. Will the data be from trusted sources you control, like logs from your product, or will it be from sources you don't have much insight into, like the output from another ML system?
- Available. Make sure all inputs are available at prediction time in the correct format. If it will be difficult to obtain certain feature values at prediction time, omit those features from your datasets.
- Correct. In large datasets, it's inevitable that some labels will have incorrect values, but if more than a small percentage of labels are incorrect, the model will produce poor predictions.
- Representative. The datasets should be as representative of the real world as possible. In other words, the datasets should accurately reflect the events, user behaviors, and/or phenomena of the real world being modeled. Training on unrepresentative datasets can cause poor performance when the model is asked to make real-world predictions.
If you can't get the data you need in the required format, your model will make poor predictions.
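A few of these characteristics can be spot-checked programmatically. Here's a minimal sketch, assuming the data lives in a pandas DataFrame and the label column name is known (both hypothetical here):

```python
import pandas as pd

def spot_check_dataset(df: pd.DataFrame, label_col: str) -> None:
    """Prints quick signals for a few of the characteristics above."""
    # Abundant: how many examples do we have?
    print(f"Examples: {len(df)}")
    # Available: which columns are frequently missing?
    print(df.isna().mean().sort_values(ascending=False))
    # Representative: is the label distribution plausible for the real world?
    print(df[label_col].value_counts(normalize=True))
```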
Predictive power
For a model to make good predictions, the features in your dataset should have predictive power. The more correlated a feature is with the label, the more likely it is to help predict the label.
Some features will have more predictive power than others. For example, in a weather dataset, features such as `cloud_coverage`, `temperature`, and `dew_point` would be better predictors of rain than `moon_phase` or `day_of_week`. For the video app example, you could hypothesize that features such as `video_description`, `length`, and `views` might be good predictors for which videos a user would want to watch.
Be aware that a feature's predictive power can change because the context or domain changes. For example, in the video app, a feature like `upload_date` might, in general, be weakly correlated with the label. However, in the sub-domain of gaming videos, `upload_date` might be strongly correlated with the label.
Determining which features have predictive power can be a time-consuming process. You can manually explore a feature's predictive power by removing and adding it while training a model. You can also automate the search by using algorithms such as Pearson correlation, adjusted mutual information (AMI), and Shapley values, which provide a numerical assessment of a feature's predictive power.
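Pearson correlation and mutual information are both easy to compute with standard libraries. Here's a minimal sketch using pandas and scikit-learn, with made-up weather data:

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Hypothetical weather examples with a binary "rained" label.
df = pd.DataFrame({
    "cloud_coverage": [0.9, 0.2, 0.7, 0.1, 0.8, 0.3, 0.95, 0.15],
    "dew_point":      [18.0, 5.0, 15.0, 3.0, 17.0, 6.0, 19.0, 4.0],
    "day_of_week":    [0, 1, 2, 3, 4, 5, 6, 0],
    "rained":         [1, 0, 1, 0, 1, 0, 1, 0],
})
features = df.drop(columns="rained")

# Pearson correlation of each feature with the label.
print(features.corrwith(df["rained"]))

# Mutual information between each feature and the (discrete) label.
print(mutual_info_classif(features, df["rained"], random_state=0))
```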
For more guidance on analyzing and preparing your datasets, see Data Preparation and Feature Engineering for Machine Learning.
Predictions vs. actions
There's no value in predicting something if you can't turn the prediction into an action that helps users. That is, your product should take action from the model's output.
For example, a model that predicts whether a user will find a video useful should feed into an app that recommends useful videos. A model that predicts whether it will rain should feed into a weather app.
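As an illustrative sketch of that glue between prediction and action: the model interface (`predict_usefulness`) and the threshold below are hypothetical, but the pattern of turning scores into a ranked recommendation list is the essential step.

```python
def recommend_videos(model, candidate_videos, user, threshold=0.8):
    """Turns 'useful video' predictions into a concrete action: a ranked list."""
    scored = [(video, model.predict_usefulness(user, video))
              for video in candidate_videos]
    # Act on the prediction: only surface videos the model scores highly,
    # best-scoring first.
    return [video for video, score in sorted(scored, key=lambda x: -x[1])
            if score >= threshold]
```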
Check Your Understanding
Based on the following scenario, determine if using ML is the best approach to the problem.
An engineering team at a large organization is responsible for managing incoming phone calls.
The goal: To inform callers how long they'll wait on hold given the current call volume.
They don't have any solution in place, but they think a heuristic would be to divide the current number of customers on hold by the number of employees answering phones, and then multiply by 10 minutes. However, they know that some customers have their issues resolved in two minutes, while others can take up to 45 minutes or longer.
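Expressed as code, the heuristic is only a few lines; here's a minimal sketch, treating the 10 minutes as an assumed average handle time:

```python
def estimated_hold_time_minutes(customers_on_hold: int,
                                employees_answering: int,
                                avg_handle_minutes: float = 10.0) -> float:
    """Call-center heuristic: queue depth per agent times average handle time."""
    if employees_answering == 0:
        return float("inf")  # No one answering; the wait is unbounded.
    return (customers_on_hold / employees_answering) * avg_handle_minutes
```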
Their heuristic probably won't get them a precise enough number. They can create a dataset with the following columns: `number_of_callcenter_phones`, `user_issue`, `time_to_resolve`, `call_time`, `time_on_hold`.
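If the data checks out, one way to sanity-test the ML route is a quick regression baseline on those columns. Here's a minimal sketch, assuming `call_time` has already been converted to a number (for example, hour of day); note that `time_to_resolve` is omitted as an input because it isn't known when a new caller dials in:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def train_hold_time_model(df: pd.DataFrame) -> RandomForestRegressor:
    """Fits a regressor that predicts a caller's time on hold."""
    # Per the "Available" guideline, time_to_resolve is excluded:
    # it's only known after a call ends, not at prediction time.
    features = ["number_of_callcenter_phones", "call_time", "user_issue"]
    X = pd.get_dummies(df[features], columns=["user_issue"])  # encode categories
    y = df["time_on_hold"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
    print(f"R^2 on held-out data: {model.score(X_test, y_test):.2f}")
    return model
```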