Datasets, generalization, and overfitting

Learning objectives

Identify four different characteristics of data and datasets.
Identify at least four different causes of data unreliability.
Determine when to discard missing data and when to impute it.
Differentiate between direct and derived labels.
Identify two different ways to improve the quality of human-rated labels.
Explain why to subdivide a dataset into a training set, validation set, and test set; identify a potential problem in data splits.
Explain overfitting and identify three possible causes for it.
Explain the concept of regularization. In particular, explain the following:
- Bias versus variance (adaptation to outliers…)
- L₂ regularization, including Lambda (regularization rate)
- Early stopping
Interpret different kinds of loss curves; detect convergence and overfitting in loss curves.
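
As a preview of the data-splitting objective above, here is a minimal sketch of dividing a dataset into a training set, validation set, and test set. The function name `split_dataset`, the 80/10/10 proportions, and the fixed seed are illustrative assumptions, not values prescribed by this module.

```python
import random


def split_dataset(examples, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle examples, then split them into training, validation, and test sets.

    The fractions and seed are hypothetical defaults chosen for illustration.
    """
    shuffled = list(examples)
    # Shuffle with a seeded generator so the split is reproducible.
    random.Random(seed).shuffle(shuffled)

    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)

    test_set = shuffled[:n_test]
    val_set = shuffled[n_test:n_test + n_val]
    train_set = shuffled[n_test + n_val:]
    return train_set, val_set, test_set


train_set, val_set, test_set = split_dataset(range(100))
print(len(train_set), len(val_set), len(test_set))
```

Shuffling before slicing keeps the three sets from inheriting whatever ordering the raw data happened to have. The sketch does not check for duplicate examples, which would end up in more than one set if the input contains them.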

Introduction

This module begins with a leading question:

If you had to prioritize improving one of the following areas in your machine learning project, which would have the most impact? Choose one of the following answers.

Improving the quality of your dataset

Data trumps all. The quality and size of the dataset matter much more than which shiny algorithm you use to build your model.

Applying a more clever loss function when training your model

True, a better loss function can help a model train faster, but it's still a distant second to another item in this list.

And here's an even more leading question:

Take a guess: In your machine learning project, how much time do you typically spend on data preparation and transformation?

More than half of the project time

Yes, ML practitioners spend the majority of their time constructing datasets and doing feature engineering.

Less than half of the project time

Plan for more! Typically, 80% of the time on a machine learning project is spent constructing datasets and transforming data.

In this module, you'll learn more about the characteristics of machine learning datasets, and how to prepare your data to ensure high-quality results when training and evaluating your model.
