Machine learning would be a breeze if all our loss curves
looked like this the first time we trained our model:

But in reality, loss curves can be quite challenging to interpret. Use your
understanding of loss curves to answer the following questions.

1. My Model Won't Train!

You and your friend Mel continue working on a unicorn appearance predictor.
Here's your first loss curve.

Describe the problem and how Mel could fix it:

Your model is not converging. Try these debugging steps:

Check whether your features can predict the labels by following the
steps in Model Debugging.

Check your data against a data schema to detect bad examples.

If training looks unstable, as in this plot, then reduce your
learning rate to prevent the model from bouncing around in parameter
space.

Reduce your dataset to 10 examples that you know your model can
predict. Obtain a very low loss on the reduced dataset, then
continue debugging your model on the full dataset.

Simplify your model and ensure the model outperforms
your baseline. Then incrementally add complexity to the model.
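
Two of the steps above, checking data against a schema and fitting a tiny dataset, can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the course's own tooling: the feature names, the ranges in `SCHEMA`, and the helpers `find_bad_rows` and `train_logistic` are all hypothetical, and a plain logistic regression stands in for your real model.

```python
import numpy as np

# Hypothetical schema: feature name -> (minimum, maximum) allowed value.
SCHEMA = {"horn_length_cm": (0.0, 200.0), "age_years": (0.0, 100.0)}


def find_bad_rows(features, schema):
    """Return indices of rows holding a non-finite or out-of-range value."""
    n_rows = len(next(iter(features.values())))
    bad = np.zeros(n_rows, dtype=bool)
    for name, (low, high) in schema.items():
        column = features[name]
        bad |= ~np.isfinite(column) | (column < low) | (column > high)
    return np.flatnonzero(bad)


def train_logistic(x, y, learning_rate, steps):
    """Run gradient descent on log loss; return the loss before each update."""
    weights = np.zeros(x.shape[1])
    bias = 0.0
    losses = []
    for _ in range(steps):
        prob = 1.0 / (1.0 + np.exp(-(x @ weights + bias)))
        safe = np.clip(prob, 1e-7, 1.0 - 1e-7)  # keeps log() away from zero
        losses.append(float(-np.mean(y * np.log(safe) + (1 - y) * np.log(1 - safe))))
        error = prob - y
        weights -= learning_rate * (x.T @ error) / len(y)
        bias -= learning_rate * error.mean()
    return losses


# Made-up examples: rows 1, 2, and 3 each violate the schema.
features = {
    "horn_length_cm": np.array([30.0, np.nan, 45.0, -5.0]),
    "age_years": np.array([3.0, 7.0, 250.0, 4.0]),
}
bad_rows = find_bad_rows(features, SCHEMA)

# Ten easy, linearly separable examples: the loss should approach zero.
positives = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0], [1.5, 1.5]])
x_small = np.vstack([positives, -positives])
y_small = np.array([1.0] * 5 + [0.0] * 5)
losses = train_logistic(x_small, y_small, learning_rate=0.5, steps=500)
```

If the loss on ten easy examples does not fall close to zero, the problem is more likely in the model or training code than in the full dataset.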

2. My Loss Exploded!

Mel shows you another curve. What’s going wrong here and how can she fix it?

A large increase in loss is typically caused by anomalous values in
input data. Possible causes are:

NaNs in input data.

Exploding gradient due to anomalous data.

Division by zero.

Logarithm of zero or negative numbers.

To fix an exploding loss, check for anomalous data in your batches,
and in your engineered data. If the anomaly appears problematic, then
investigate the cause. Otherwise, if the anomaly looks
like outlying data, then ensure the outliers are evenly
distributed between batches by shuffling your data.
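
The checks described above can be sketched as follows. This is an illustration only: the helper names, the batch size of 100, and the outlier cutoff of 10.0 are assumptions chosen for the synthetic data, not values from your pipeline.

```python
import numpy as np


def batches_with_non_finite(data, batch_size):
    """Return the indices of batches that contain a NaN or infinite value."""
    bad = []
    for start in range(0, len(data), batch_size):
        if not np.isfinite(data[start:start + batch_size]).all():
            bad.append(start // batch_size)
    return bad


def outliers_per_batch(values, batch_size, cutoff):
    """Count the values in each batch whose absolute value exceeds cutoff."""
    flags = np.abs(values) > cutoff
    return [int(flags[i:i + batch_size].sum())
            for i in range(0, len(values), batch_size)]


rng = np.random.default_rng(0)
values = rng.normal(size=1000)
values[-20:] = 50.0  # synthetic outliers, all clustered in the last batch

before = outliers_per_batch(values, 100, cutoff=10.0)
after = outliers_per_batch(rng.permutation(values), 100, cutoff=10.0)

# A single NaN at position 250 lands in batch 2 when batches hold 100 values.
corrupted = values.copy()
corrupted[250] = np.nan
bad_batches = batches_with_non_finite(corrupted, 100)
```

Before shuffling, one batch carries all 20 outliers. After shuffling, the same 20 outliers are spread across batches, so no single update is dominated by them.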

3. My Metrics are Contradictory!

Mel wants your take on another curve. What’s going wrong and
how can she fix it?

Recall is stuck at 0 because the predicted probability for your examples
never exceeds the threshold for positive classification. This situation
often occurs in problems with a large class imbalance. Remember that
ML libraries, such as TF Keras, typically use a default threshold of 0.5 to
calculate classification metrics.

Try these steps:

Lower your classification threshold.

Check threshold-invariant metrics, such as AUC.
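
Both steps can be illustrated on a small, made-up imbalanced dataset. The labels, scores, and the helpers `recall_at` and `roc_auc` below are assumptions for illustration. AUC is computed here as the fraction of positive-negative pairs in which the positive example scores higher, with ties counting half.

```python
import numpy as np


def recall_at(labels, scores, threshold):
    """Fraction of positive examples whose score exceeds the threshold."""
    predicted_positive = scores > threshold
    true_positives = np.sum(predicted_positive & (labels == 1))
    return float(true_positives / np.sum(labels == 1))


def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((wins + 0.5 * ties) / (len(pos) * len(neg)))


# Eight negatives and two positives; no score comes close to 0.5.
labels = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
scores = np.array([0.02, 0.05, 0.03, 0.10, 0.04, 0.08, 0.06, 0.01, 0.30, 0.09])

recall_default = recall_at(labels, scores, 0.5)   # 0.0
recall_lowered = recall_at(labels, scores, 0.08)  # 1.0
auc = roc_auc(labels, scores)                     # 0.9375
```

Recall is 0 at the default threshold even though the model ranks positives above almost every negative, which the AUC of 0.9375 makes visible. In Keras, the `thresholds` argument of `tf.keras.metrics.Recall` plays the role of the `threshold` parameter here.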

4. Testing Loss is Too Damn High!

Mel shows you the loss curves for the training and testing datasets and asks,
"What's wrong?"

Your model is overfitting to the training data. Try these steps:

Reduce model capacity.

Add regularization.

Check that the training and test splits are
statistically equivalent.
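
One rough way to run the last check is to compare per-feature statistics across the two splits. The sketch below uses the absolute difference in means divided by the pooled standard deviation. This metric, the helper name `standardized_mean_gap`, and the synthetic data are assumptions chosen for illustration; a formal two-sample test would be a stricter alternative.

```python
import numpy as np


def standardized_mean_gap(train, test):
    """Per-feature |mean difference| divided by the pooled standard deviation."""
    pooled_std = np.sqrt((train.var(axis=0) + test.var(axis=0)) / 2.0)
    return np.abs(train.mean(axis=0) - test.mean(axis=0)) / pooled_std


rng = np.random.default_rng(0)
data = rng.normal(size=(2000, 3))
data = data[np.argsort(data[:, 0])]  # rows arrive sorted by feature 0

cut = len(data) * 4 // 5  # 80% train, 20% test

# Splitting in arrival order puts the largest feature-0 values in the test set.
sequential_gap = standardized_mean_gap(data[:cut], data[cut:])

# Splitting after a random permutation gives two comparable samples.
order = rng.permutation(len(data))
random_gap = standardized_mean_gap(data[order[:cut]], data[order[cut:]])
```

A gap well above zero on any feature, as in `sequential_gap[0]`, suggests the splits were not drawn from the same distribution, so a high test loss may reflect the split rather than overfitting alone.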

5. My Model Gets Stuck

You're patient when Mel returns a few days later with yet another curve. What's
going wrong here and how can Mel fix it?

Your loss is showing repetitive, step-like behavior. It's probable
that the input data seen by your model is itself exhibiting
repetitive behavior. Ensure that shuffling is removing repetitive
behavior from input data.
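
To see whether shuffling actually removes the repetition, you can compare a simple per-batch statistic before and after shuffling. The sketch below uses the fraction of positive labels in each batch. The label layout, the batch size of 50, and the helper name are assumptions made for illustration.

```python
import numpy as np


def positive_fraction_per_batch(labels, batch_size):
    """Mean label in each full batch; a trailing partial batch is dropped."""
    usable = len(labels) - len(labels) % batch_size
    return labels[:usable].reshape(-1, batch_size).mean(axis=1)


# 500 labels stored in alternating blocks of 50 negatives and 50 positives.
labels = np.tile(np.repeat([0, 1], 50), 5)

# Reading the data in stored order gives batches that are all 0s or all 1s.
ordered = positive_fraction_per_batch(labels, 50)

# Apply one permutation to the whole dataset (features would use the same one).
rng = np.random.default_rng(0)
order = rng.permutation(len(labels))
shuffled = positive_fraction_per_batch(labels[order], 50)
```

In stored order the statistic alternates between 0.0 and 1.0, which is the kind of input pattern that can produce a step-like loss. After a full shuffle, every batch sits near 0.5. If you shuffle with a buffer, as with `tf.data.Dataset.shuffle(buffer_size)`, a buffer much smaller than the dataset shuffles only partially and can leave some of the pattern in place.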

It's Working!

"It's working perfectly now!" Mel exclaims. She leans back in her chair
triumphantly and heaves a big sigh. The curve looks great and you beam with
accomplishment. You and Mel take a moment to discuss the following
additional checks for validating your model.