Validation Set: Another Partition

The previous module introduced partitioning a data set into a training set and a test set. This partitioning enabled you to train on one set of examples and then to test the model against a different set of examples. With two partitions, the workflow could look as follows:

A workflow diagram consisting of three stages. 1. Train model on training set. 2. Evaluate model on test set. 3. Tweak model according to results on test set. Iterate on 1, 2, and 3, ultimately picking the model that does best on the test set.

Figure 1. A possible workflow?

In the figure, "Tweak model" means adjusting anything about the model you can dream up—from changing the learning rate, to adding or removing features, to designing a completely new model from scratch. At the end of this workflow, you pick the model that does best on the test set.
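
As a concrete point of reference, here is a minimal sketch of that two-partition setup. The pandas DataFrame, the file name examples.csv, the "label" column, and scikit-learn's train_test_split are all assumptions for illustration; the module itself doesn't prescribe any particular tooling.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data set: any feature columns plus a "label" column.
data = pd.read_csv("examples.csv")  # assumed file name
features = data.drop(columns=["label"])
labels = data["label"]

# Two partitions: train on 80% of the examples, test on the other 20%.
train_x, test_x, train_y, test_y = train_test_split(
    features, labels, test_size=0.2, random_state=42
)
```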

Dividing the data set into two subsets is a good idea, but not a panacea. Every iteration of the workflow in Figure 1 evaluates against the test set, so after many rounds of tweaking the model can implicitly fit the peculiarities of that particular test set. You can greatly reduce your chances of overfitting by partitioning the data set into the three subsets shown in the following figure:

A horizontal bar divided into three pieces: 70% for the training set, 15% for the validation set, and 15% for the test set

Figure 2. Slicing a single data set into three subsets.
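
One way to produce the 70%/15%/15% split in Figure 2 is to call train_test_split twice, continuing the hedged sketch from above: first peel off the training set, then divide the remaining 30% evenly between validation and test.

```python
from sklearn.model_selection import train_test_split

# First split: 70% training set, 30% held out for further division.
train_x, holdout_x, train_y, holdout_y = train_test_split(
    features, labels, test_size=0.3, random_state=42
)

# Second split: divide the 30% holdout in half, so the validation
# and test sets each get 15% of the original examples.
val_x, test_x, val_y, test_y = train_test_split(
    holdout_x, holdout_y, test_size=0.5, random_state=42
)
```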

Use the validation set to evaluate results from the training set. Then, use the test set to double-check your evaluation after the model has "passed" the validation set. The following figure shows this new workflow:

Similar workflow to Figure 1, except that instead of evaluating the model against the test set, the workflow evaluates the model against the validation set. Then, once results on the training set and validation set more or less agree, confirm the model against the test set.

Figure 3. A better workflow.

In this improved workflow:

  1. Pick the model that does best on the validation set.
  2. Double-check that model against the test set.

This is a better workflow because it creates fewer exposures to the test set.
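
Sketched in code, the improved workflow might look like the following. The candidate models and the accuracy metric are hypothetical stand-ins for whatever tweaks you try; the key structure is that every tweak is scored on the validation set, and the test set is touched exactly once at the end.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Hypothetical candidates: each "tweak" from Figure 3 yields one entry.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "shallow_tree": DecisionTreeClassifier(max_depth=3),
    "deeper_tree": DecisionTreeClassifier(max_depth=10),
}

# Train each candidate on the training set; evaluate on the
# validation set only, never on the test set.
val_scores = {}
for name, model in candidates.items():
    model.fit(train_x, train_y)
    val_scores[name] = accuracy_score(val_y, model.predict(val_x))

# Step 1: pick the model that does best on the validation set.
best_name = max(val_scores, key=val_scores.get)

# Step 2: a single, final exposure to the test set.
best_model = candidates[best_name]
test_score = accuracy_score(test_y, best_model.predict(test_x))
print(f"{best_name}: validation={val_scores[best_name]:.3f}, "
      f"test={test_score:.3f}")
```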