Understand the problem

To understand the problem, perform the following tasks:

  • State the goal for the product you are developing or refactoring.
  • Determine whether the goal is best solved using predictive ML, generative AI, or a non-ML solution.
  • Verify you have the data required to train a model if you're using a predictive ML approach.

State the goal

Begin by stating your goal in non-ML terms. The goal is the answer to the question, "What am I trying to accomplish?"

The following table clearly states goals for hypothetical apps:

Application      Goal
Weather app      Calculate precipitation in six-hour increments for a geographic region.
Fashion app      Generate a variety of shirt designs.
Video app        Recommend useful videos.
Mail app         Detect spam.
Financial app    Summarize financial information from multiple news sources.
Map app          Calculate travel time.
Banking app      Identify fraudulent transactions.
Dining app       Identify cuisine by a restaurant's menu.
Ecommerce app    Reply to reviews with helpful answers.

Clear use case for ML

Some view ML as a universal tool that can be applied to all problems. In reality, ML is a specialized tool suitable only for particular problems. You don't want to implement a complex ML solution when a simpler non-ML solution will work.

ML systems can be divided into two broad categories: predictive ML and generative AI. The following table lists their defining characteristics:

Predictive ML
  Input: Text, image, audio, video, numerical.
  Output: Makes a prediction, for example, classifying an email as spam or not spam, guessing tomorrow's rainfall, or predicting the price of a stock. The output can typically be verified against reality.
  Training technique: Typically uses lots of data to train a supervised, unsupervised, or reinforcement learning model to perform a specific task.

Generative AI
  Input: Text, image, audio, video, numerical.
  Output: Generates output based on the user's intent, for example, summarizing an article, or producing an audio clip or short video.
  Training technique: Typically uses lots of unlabeled data to train a large language model or image generator to fill in missing data. The model can then be used for tasks that can be framed as fill-in-the-blank tasks, or it can be fine-tuned by training it on labeled data for some specific task, like classification.

To confirm that ML is the right approach, first verify that your current non-ML solution is optimized. If you don't have a non-ML solution implemented, try solving the problem manually using a heuristic.

The non-ML solution is the benchmark you'll use to determine whether ML is a good use case for your problem. Consider the following questions when comparing a non-ML approach to an ML one:

  • Quality. How much better do you think an ML solution can be? If you think an ML solution might be only a small improvement, that might indicate the current solution is the best one.

  • Cost and maintenance. How expensive is the ML solution in both the short and long term? In some cases, implementing ML costs significantly more in compute resources and time. Consider the following questions:

    • Can the ML solution justify the increase in cost? Note that small improvements in large systems can easily justify the cost and maintenance of implementing an ML solution.
    • How much maintenance will the solution require? In many cases, ML implementations need dedicated long-term maintenance.
    • Does your product have the resources to support training or hiring people with ML expertise?
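The quality comparison above can be made concrete by scoring the heuristic and a candidate model against the same ground truth. The following is a minimal sketch, not a prescribed method: the metric (mean absolute error), the function names, and all of the numbers are assumptions made up for illustration, loosely based on the map app's travel-time goal.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def relative_improvement(baseline_error, candidate_error):
    """Fraction by which the candidate reduces the baseline's error."""
    if baseline_error == 0:
        raise ValueError("baseline error is zero; there is nothing to improve on")
    return (baseline_error - candidate_error) / baseline_error


# Made-up travel times in minutes, for illustration only.
actual_minutes = [12.0, 30.0, 45.0, 20.0]
heuristic_minutes = [15.0, 25.0, 40.0, 25.0]  # hypothetical non-ML baseline
candidate_minutes = [13.0, 28.0, 44.0, 22.0]  # hypothetical ML model output

baseline_mae = mean_absolute_error(actual_minutes, heuristic_minutes)
candidate_mae = mean_absolute_error(actual_minutes, candidate_minutes)
improvement = relative_improvement(baseline_mae, candidate_mae)

print(f"heuristic MAE: {baseline_mae:.2f} min")
print(f"candidate MAE: {candidate_mae:.2f} min")
print(f"relative improvement: {improvement:.0%}")
```

Whether a given improvement justifies the added cost and maintenance is still a product decision; the numbers only make the trade-off visible.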

Check Your Understanding

Why is it important to have a non-ML solution or heuristic in place before analyzing an ML solution?
  • A non-ML solution is the benchmark to measure an ML solution against.
  • Non-ML solutions help you determine how much an ML solution will cost.

Predictive ML and data

Data is the driving force of predictive ML. To make good predictions, you need data that contains features with predictive power. Your data should have the following characteristics:

  • Abundant. The more relevant and useful examples in your dataset, the better your model will be.

  • Consistent and reliable. Having data that's consistently and reliably collected will produce a better model. For example, an ML-based weather model will benefit from data gathered over many years from the same reliable instruments.

  • Trusted. Understand where your data will come from. Will the data be from trusted sources you control, like logs from your product, or will it be from sources you don't have much insight into, like the output from another ML system?

  • Available. Make sure all inputs are available at prediction time in the correct format. If it will be difficult to obtain certain feature values at prediction time, omit those features from your datasets.

  • Correct. In large datasets, it's inevitable that some labels will have incorrect values, but if more than a small percentage of labels are incorrect, the model will produce poor predictions.

  • Representative. The datasets should be as representative of the real world as possible. In other words, the datasets should accurately reflect the events, user behaviors, and/or the phenomena of the real world being modeled. Training on unrepresentative datasets can cause poor performance when the model is asked to make real-world predictions.

If you can't get the data you need in the required format, your model will make poor predictions.
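As one way to act on the "Available" characteristic, you could compare the features used for training against the features your product can actually supply at prediction time. This is a minimal sketch under stated assumptions: the helper name is hypothetical, and observed_rainfall is an invented example of a feature that is only known after the fact.

```python
def split_by_availability(training_features, serving_features):
    """Separate features available at prediction time from those that are not."""
    serving = set(serving_features)
    usable = [name for name in training_features if name in serving]
    dropped = [name for name in training_features if name not in serving]
    return usable, dropped


# Feature names follow the weather example; observed_rainfall is hypothetical
# and would not be known when the prediction is made.
training_features = ["cloud_coverage", "temperature", "dew_point", "observed_rainfall"]
serving_features = ["cloud_coverage", "temperature", "dew_point"]

usable, dropped = split_by_availability(training_features, serving_features)
print("train on:", usable)
print("omit from datasets:", dropped)
```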

Predictive power

For a model to make good predictions, the features in your dataset should have predictive power. The more strongly a feature is correlated with the label, the more useful that feature is likely to be for predicting the label.

Some features will have more predictive power than others. For example, in a weather dataset, features such as cloud_coverage, temperature, and dew_point would be better predictors of rain than moon_phase or day_of_week. For the video app example, you could hypothesize that features such as video_description, length, and views might be good predictors for which videos a user would want to watch.

Be aware that a feature's predictive power can change when the context or domain changes. For example, in the video app, a feature like upload_date might, in general, be weakly correlated with the label. However, in the sub-domain of gaming videos, upload_date might be strongly correlated with the label.

Determining which features have predictive power can be a time-consuming process. You can manually explore a feature's predictive power by removing and adding it while training a model. You can automate the search by using measures such as Pearson correlation, adjusted mutual information (AMI), and Shapley values, each of which provides a numerical assessment of a feature's predictive power.
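To illustrate one of these measures, the sketch below computes the Pearson correlation between each feature and the label using only the standard library. The feature names come from the weather example; the values and the rainfall_mm label are made up for illustration. Keep in mind that Pearson correlation only captures linear relationships, so a low score does not by itself prove a feature is useless.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("inputs must have the same length and at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    if variance_x == 0 or variance_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(variance_x * variance_y)


# Made-up values, for illustration only.
features = {
    "cloud_coverage": [0.1, 0.4, 0.6, 0.9, 0.8],
    "day_of_week": [3, 1, 5, 2, 4],
}
rainfall_mm = [0.0, 1.0, 3.0, 8.0, 6.0]  # hypothetical label

scores = {
    name: pearson_correlation(values, rainfall_mm)
    for name, values in features.items()
}
# Rank features by the strength of the correlation, ignoring its sign.
for name, score in sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True):
    print(f"{name}: {score:+.2f}")
```

In this toy data, cloud_coverage tracks the label far more closely than day_of_week, which matches the intuition in the weather example.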

Check Your Understanding

When analyzing your datasets, what are three key attributes you should look for?
  • Representative of the real world.
  • Contains correct values.
  • Features have predictive power for the label.
  • Small enough to load onto a local machine.
  • Gathered from a variety of unpredictable sources.

For more guidance on analyzing and preparing your datasets, see Data Preparation and Feature Engineering for Machine Learning.

Predictions vs. actions

There's no value in predicting something if you can't turn the prediction into an action that helps users. That is, your product should take action based on the model's output.

For example, a model that predicts whether a user will find a video useful should feed into an app that recommends useful videos. A model that predicts whether it will rain should feed into a weather app.
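The video example might be sketched as follows: the model's output is a usefulness score per video, and the product's action is to recommend the highest-scoring videos. The function name, threshold, limit, video IDs, and scores are all hypothetical and chosen only for illustration.

```python
def recommend_videos(predicted_usefulness, threshold=0.5, limit=3):
    """Return up to `limit` video IDs scoring at or above `threshold`, best first."""
    ranked = sorted(
        predicted_usefulness.items(), key=lambda item: item[1], reverse=True
    )
    return [video_id for video_id, score in ranked if score >= threshold][:limit]


# Hypothetical model output: predicted probability that the user finds each video useful.
predictions = {
    "video_a": 0.91,
    "video_b": 0.32,
    "video_c": 0.67,
    "video_d": 0.55,
    "video_e": 0.80,
}

recommendations = recommend_videos(predictions)
print(recommendations)
```

The prediction alone (a score) does nothing for the user; the recommendation list is the action the product takes with it.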

Check Your Understanding

Based on the following scenario, determine if using ML is the best approach to the problem.

An engineering team at a large organization is responsible for managing incoming phone calls.

The goal: To inform callers how long they'll wait on hold given the current call volume.

They don't have any solution in place, but they think a heuristic would be to divide the current number of customers on hold by the number of employees answering phones, and then multiply by 10 minutes. However, they know that some customers have their issues resolved in two minutes, while others can take up to 45 minutes or longer.
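The team's heuristic can be written out directly. This is a minimal sketch; the function and parameter names are assumptions, and the queue sizes are made up. Varying the per-call time between the two-minute and 45-minute cases mentioned above shows how wide the range of estimates can be for the same queue.

```python
def estimate_hold_minutes(customers_on_hold, employees_answering, minutes_per_call=10.0):
    """Heuristic wait estimate: customers per employee times minutes per call."""
    if employees_answering <= 0:
        raise ValueError("at least one employee must be answering phones")
    if customers_on_hold < 0:
        raise ValueError("customers_on_hold cannot be negative")
    return customers_on_hold / employees_answering * minutes_per_call


# Made-up queue: 12 callers on hold, 4 employees answering.
typical = estimate_hold_minutes(12, 4)
quick_calls = estimate_hold_minutes(12, 4, minutes_per_call=2.0)
long_calls = estimate_hold_minutes(12, 4, minutes_per_call=45.0)

print(f"heuristic estimate: {typical:.0f} minutes")
print(f"if every call took 2 minutes: {quick_calls:.0f} minutes")
print(f"if every call took 45 minutes: {long_calls:.0f} minutes")
```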

Their heuristic probably won't get them a precise enough number. They can create a dataset with the following columns: number_of_callcenter_phones, user_issue, time_to_resolve, call_time, time_on_hold.

  • Use ML. The engineering team has a clearly defined goal. Their heuristic won't be good enough for their use case. The dataset appears to have predictive features for the label, time_on_hold.
  • Don't use ML. Although they have a clearly defined goal, they should implement and optimize a non-ML solution first. Also, their dataset doesn't appear to contain enough features with predictive power.