Machine Learning | Google for Developers

Training Sets and Test Sets

We return to Playground to experiment with training sets and test sets.

As a reminder of what the orange and blue dots mean in the visualization:

Each blue dot signifies one example of one class of data (for example, spam).
Each orange dot signifies one example of another class of data (for example, not spam).
The background color represents the model's prediction of where examples of that color should be found. A blue background around a blue dot means that the model is correctly predicting that example. Conversely, an orange background around a blue dot means that the model is making an incorrect prediction for that example.

This exercise provides both a test set and a training set, both drawn from the same data set. By default, the visualization shows only the training set. If you'd like to also see the test set, click the Show test data checkbox just below the visualization. In the visualization, note the following distinction:

The training examples have a white outline.
The test examples have a black outline.
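The idea of drawing a training set and a test set from one data set can be sketched in a few lines of Python. This is only an illustration, not Playground's actual code: the function name `split_examples`, the `train_fraction` and `seed` parameters, and the placeholder data are all assumptions made for this sketch.

```python
import random

def split_examples(examples, train_fraction, seed=0):
    """Shuffle the examples, then split them into (training set, test set)."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    shuffled = list(examples)              # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Placeholder data: 100 integers standing in for 100 labeled examples.
data = list(range(100))
train_set, test_set = split_examples(data, train_fraction=0.5)
print(len(train_set), len(test_set))
```

Shuffling before cutting matters: if the data were sorted by class, an unshuffled split could put most of one class in the training set and most of the other in the test set.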

Task 1: Run Playground with the given settings by doing the following:

Click the Run/Pause button.
Watch the Test loss and Training loss values change.
When the Test loss and Training loss values stop changing or only change once in a while, press the Run/Pause button again to pause Playground.

Note the delta between the Test loss and Training loss. We'll try to reduce this delta in the following tasks.
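To make the delta concrete, the following sketch computes a loss on each set and subtracts them. It assumes mean squared error as the loss, which may not match the loss Playground reports, and the predictions and labels are made-up numbers chosen only for illustration.

```python
def mean_squared_error(predictions, labels):
    """Average of the squared differences between predictions and labels."""
    return sum((p - y) ** 2 for p, y in zip(predictions, labels)) / len(labels)

# Made-up values: the hypothetical model fits its training examples well...
train_predictions = [0.9, 0.1, 0.8, 0.2]
train_labels = [1, 0, 1, 0]

# ...but does worse on examples it did not train on.
test_predictions = [0.6, 0.4, 0.3, 0.7]
test_labels = [1, 0, 1, 0]

train_loss = mean_squared_error(train_predictions, train_labels)
test_loss = mean_squared_error(test_predictions, test_labels)
delta = test_loss - train_loss   # a large positive delta suggests overfitting
print(train_loss, test_loss, delta)
```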

Task 2: Do the following:

Press the Reset button.
Modify the Learning rate.
Press the Run/Pause button.
Let Playground run for at least 150 epochs.

Is the delta between Test loss and Training loss lower or higher with this new Learning rate? What happens if you modify both Learning rate and Batch size?
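The same kind of experiment can be run outside Playground. The sketch below is a stand-in, not a reproduction: it fits a straight line with mini-batch gradient descent instead of training Playground's model, and the data set, the function names, and the particular learning rates and batch sizes are assumptions chosen for illustration.

```python
import random

def make_points(n, rng):
    """Noisy samples of y = 2x + 1 (a made-up data set for illustration)."""
    points = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        points.append((x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.1)))
    return points

def mse(w, b, points):
    """Mean squared error of the line y = w*x + b on the given points."""
    return sum((w * x + b - y) * (w * x + b - y) for x, y in points) / len(points)

def train(points, learning_rate, batch_size, epochs=150, seed=0):
    """Fit w and b with mini-batch gradient descent on the squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        order = list(points)
        rng.shuffle(order)                      # new example order every epoch
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            grad_w = sum(2.0 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2.0 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b

rng = random.Random(42)
train_set, test_set = make_points(100, rng), make_points(100, rng)

results = {}
for learning_rate in (0.01, 0.3):
    for batch_size in (1, 20):
        w, b = train(train_set, learning_rate, batch_size)
        train_loss, test_loss = mse(w, b, train_set), mse(w, b, test_set)
        results[(learning_rate, batch_size)] = (train_loss, test_loss)
        print(learning_rate, batch_size, round(train_loss, 4), round(test_loss, 4))
```

Each printed row gives one combination of settings with its training loss and test loss, so the delta for each combination can be compared directly. The learning rates here are kept small on purpose; much larger values can make this simple model diverge.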

Optional Task 3: A slider labeled Training data percentage lets you control the proportion of training data to test data. For example, when the slider is set to 90%, 90% of the data is used for the training set and the remaining 10% is used for the test set.

Do the following:

Reduce the Training data percentage from 50% to 10%.
Experiment with Learning rate and Batch size, taking notes on your findings.

Does altering the training data percentage change the optimal learning settings that you discovered in Task 2? If so, why?

Answer to Task 1:

With Learning rate set to 3 (the initial setting), Test loss is significantly higher than Training loss.

Answer to Task 2:

Reducing Learning rate (for example, to 0.001) brings Test loss down to a value much closer to Training loss. In most runs, increasing Batch size does not significantly affect Training loss or Test loss. However, in a small percentage of runs, increasing Batch size to 20 or greater causes Test loss to drop slightly below Training loss.

Playground's data sets are randomly generated. Consequently, our answers may not always agree exactly with yours.

Answer to Task 3:

Reducing the Training data percentage from 50% to 10% dramatically lowers the number of data points in the training set. With so little data, a high Batch size and a high Learning rate cause the model to jump around chaotically during training, repeatedly overshooting the minimum point.
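The jumping behavior can be seen in a toy setting. The sketch below runs plain gradient descent on the one-variable function f(w) = w², whose minimum is at w = 0. It is an assumption-laden simplification: it models only the learning rate, not the batch size or the amount of training data, and the function `descend` and the specific rates are illustrative choices.

```python
def descend(learning_rate, start=1.0, steps=5):
    """Run gradient descent on f(w) = w**2 and return every visited w."""
    w = start
    path = [w]
    for _ in range(steps):
        gradient = 2 * w                  # derivative of w**2
        w = w - learning_rate * gradient
        path.append(w)
    return path

# A small learning rate creeps toward the minimum at w = 0.
small_steps = descend(learning_rate=0.1)

# A learning rate that is too high jumps over the minimum on every step
# and lands farther away on the opposite side each time.
overshooting = descend(learning_rate=1.1)

print(small_steps)
print(overshooting)
```

In the second run the sign of w flips at every step while its magnitude grows, which is the one-dimensional version of repeatedly jumping over the minimum.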