Model Debugging

The first step in debugging your model is data debugging. After debugging your data, continue debugging your model by following these steps, each detailed in a section below:

  1. Check that the data can predict the labels.
  2. Establish a baseline.
  3. Write and run tests.
  4. Adjust your hyperparameter values.

Check that the Data can Predict Labels

Before debugging your model, try to determine whether your features encode predictive signals. You can find linear correlations between individual features and labels by using correlation matrices. For an example of using correlation matrices, see this Colab.
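
As a rough illustration of the correlation check described above, the following sketch computes the linear correlation between each feature and the label with NumPy. The feature names and the synthetic data are assumptions made for the example, not part of any real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical features: x_signal drives the label, x_noise does not.
x_signal = rng.normal(size=n)
x_noise = rng.normal(size=n)
label = 3.0 * x_signal + rng.normal(scale=0.5, size=n)

# Each row passed to corrcoef is one variable; corr[i, j] is the
# Pearson correlation between variable i and variable j.
corr = np.corrcoef(np.vstack([x_signal, x_noise, label]))

# The last column holds each variable's correlation with the label.
corr_with_label = corr[:-1, -1]
print(dict(zip(["x_signal", "x_noise"], corr_with_label.round(2))))
```

A feature whose correlation with the label is close to zero may still carry nonlinear signal, so a low value here is a prompt for further checks rather than proof that the feature is useless.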

However, correlation matrices will not detect nonlinear correlations between features and labels. To check for signal that a correlation matrix misses, choose 10 examples from your dataset that your model can easily learn from. Alternatively, use synthetic data that is easily learnable. For instance, a classifier can easily learn linearly separable examples, while a regressor can easily learn labels that correlate highly with a feature cross. Then, ensure your model can achieve very small loss on these 10 easily learnable examples.

Using a few examples that are easily learnable simplifies debugging by reducing the opportunities for bugs. You can further simplify your model by switching to the simpler gradient descent algorithm instead of a more advanced optimization algorithm.
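
The check above can be sketched as follows, assuming a logistic regression classifier trained with plain gradient descent. The ten data points, the learning rate, and the step count are illustrative choices, not values from this guide.

```python
import numpy as np

# Ten linearly separable examples: class 1 lies in the upper-right
# quadrant and class 0 in the lower-left quadrant.
X = np.array([[-2.0, -1.0], [-1.5, -2.0], [-1.0, -1.0], [-2.5, -0.5], [-1.0, -3.0],
              [2.0, 1.0], [1.5, 2.0], [1.0, 1.0], [2.5, 0.5], [1.0, 3.0]])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float)

def predict(w, b):
    """Sigmoid of the linear score for every example."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

def log_loss(w, b):
    """Mean binary cross-entropy, clipped for numerical safety."""
    p = np.clip(predict(w, b), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

w = np.zeros(2)
b = 0.0
learning_rate = 0.5  # assumed value for this toy problem

initial_loss = log_loss(w, b)
for _ in range(500):
    error = predict(w, b) - y
    # Full-batch gradient descent step on the mean log loss.
    w -= learning_rate * (X.T @ error) / len(y)
    b -= learning_rate * np.mean(error)
final_loss = log_loss(w, b)

print(f"loss before: {initial_loss:.3f}, loss after: {final_loss:.4f}")
```

If the loss on such a small, easy set does not approach zero, the problem is more likely in the training code than in the data.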

Establish a Baseline

Comparing your model against a baseline is a quick test of the model's quality. When developing a new model, define a baseline by using a simple heuristic to predict the label. If your trained model performs worse than its baseline, you need to improve your model.

Examples of baselines are:

  • Using a linear model trained solely on the most predictive feature.
  • In classification, always predicting the most common label.
  • In regression, always predicting the mean value.

Once you validate a version of your model in production, you can use that model version as a baseline for newer model versions. Therefore, you can have multiple baselines of different complexities. Testing against baselines helps justify adding complexity to your model. A more complex model should perform better than a less complex model or baseline; if it does not, the added complexity is not justified.
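
A minimal sketch of the two simplest baselines listed above, using NumPy. The function names and the toy label arrays are hypothetical and exist only to show the comparison.

```python
import numpy as np

def majority_class_baseline(train_labels):
    """Return the most common label in the training set."""
    values, counts = np.unique(train_labels, return_counts=True)
    return values[np.argmax(counts)]

def mean_baseline(train_targets):
    """Return the mean training target, used as a constant prediction."""
    return float(np.mean(train_targets))

# Classification: always predict the most common training label.
train_labels = np.array([0, 0, 0, 1, 0, 1, 0, 0])
test_labels = np.array([0, 1, 0, 0])
most_common = majority_class_baseline(train_labels)
baseline_accuracy = float(np.mean(test_labels == most_common))

# Regression: always predict the mean training target.
train_targets = np.array([2.0, 4.0, 6.0, 8.0])
test_targets = np.array([3.0, 7.0])
baseline_mse = float(np.mean((test_targets - mean_baseline(train_targets)) ** 2))

print(baseline_accuracy)  # 0.75
print(baseline_mse)       # 4.0
```

A trained model evaluated on the same test split would need to beat these numbers before its extra complexity is worth keeping.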

Implement Tests for ML Code

The testing process to catch bugs in ML code is similar to the testing process in traditional debugging. You'll write unit tests to detect bugs. Examples of code bugs in ML are:

  • Hidden layers that are configured incorrectly.
  • Data normalization code that returns NaNs.
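
As an example of the second kind of bug, the sketch below tests a hypothetical `normalize` function for NaNs. The function, its `eps` parameter, and the test names are assumptions for illustration; the tests are written as plain functions that a runner such as pytest could collect.

```python
import numpy as np

def normalize(features, eps=1e-8):
    """Z-score each column; eps avoids dividing by zero for constant columns."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + eps)

def test_constant_column_produces_no_nans():
    # The second column has zero variance, a common source of NaNs.
    features = np.array([[1.0, 5.0], [2.0, 5.0], [3.0, 5.0]])
    assert not np.isnan(normalize(features)).any()

def test_normalized_columns_have_zero_mean():
    features = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
    assert np.allclose(normalize(features).mean(axis=0), 0.0)
```

Without the `eps` term, the constant column would produce a 0/0 division, and the first test would fail.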

A sanity check for the presence of code bugs is to include your label in your features and train your model. Because the label is then available as an input, the model should predict it almost perfectly. If it does not, then your code has a bug.
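
The sanity check above might look like the following sketch, which uses ordinary least squares as a stand-in for "train your model". The random data and the choice of least squares are assumptions; substitute your own training routine.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 3))
labels = rng.normal(size=200)  # labels deliberately unrelated to the features

# Append the label itself as an extra feature column.
leaky_features = np.column_stack([features, labels])

# Stand-in training step: fit a linear model by least squares.
weights, *_ = np.linalg.lstsq(leaky_features, labels, rcond=None)
leaky_mse = float(np.mean((leaky_features @ weights - labels) ** 2))

# With the label available as an input, the error should be close to zero.
print(f"MSE with label included as a feature: {leaky_mse:.2e}")
```

A training pipeline that cannot drive the error toward zero in this setting is not passing information from inputs to predictions correctly.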

Adjust Hyperparameter Values

The entries below explain how to adjust the value of each hyperparameter.

Learning Rate

Typically, ML libraries will automatically set the learning rate. For example, in TensorFlow, most TF Estimators use the AdagradOptimizer, which sets the learning rate at 0.05 and then adaptively modifies the learning rate during training. The other popular optimizer, AdamOptimizer, uses an initial learning rate of 0.001.

However, if your model does not converge with the default values, then manually choose a value between 0.0001 and 1.0, and increase or decrease the value on a logarithmic scale until your model converges. Remember that the more difficult your problem, the more epochs your model must train for before loss starts to decrease.

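
A minimal sketch of searching the learning rate on a logarithmic scale between 0.0001 and 1.0. The quadratic objective and the step count are assumptions chosen so the example runs instantly; in practice `final_loss` would train your actual model.

```python
import numpy as np

def final_loss(learning_rate, steps=100):
    """Minimize f(w) = 5 * w**2 with gradient descent and return the final loss."""
    w = 1.0
    for _ in range(steps):
        w -= learning_rate * 10.0 * w  # the gradient of 5 * w**2 is 10 * w
    return 5.0 * w ** 2

# Candidates spaced by factors of 10: 0.0001, 0.001, 0.01, 0.1, 1.0.
candidates = np.logspace(-4, 0, num=5)
losses = {float(lr): final_loss(lr) for lr in candidates}
best_learning_rate = min(losses, key=losses.get)

for lr, loss in losses.items():
    print(f"learning rate {lr:g}: final loss {loss:.3g}")
```

On this toy objective the smallest rates barely move, the largest rate diverges, and an intermediate rate converges, which is the pattern a log-scale sweep is meant to expose.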
Regularization

First, ensure your model can predict without regularization on the training data. Then add regularization only if your model is overfitting on training data. Regularization methods differ for linear and nonlinear models.

For linear models, choose L1 regularization if you need to reduce your model's size. Choose L2 regularization if you prefer increased model stability. Increasing your model's stability makes your model training more reproducible. Find the correct value of the regularization rate, \(\lambda\), by starting at 1e-5 and tuning that value through trial and error.
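
To illustrate tuning the regularization rate for a linear model, the sketch below fits L2-regularized least squares in closed form for several values of the rate, starting at 1e-5. The synthetic data, the grid of rates, and the `ridge_fit` helper are assumptions for the example; it covers only L2, and a real search would compare validation loss rather than weight norms.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_w = np.array([2.0, -1.0, 0.0, 0.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=50)

def ridge_fit(X, y, lam):
    """Closed-form L2-regularized least squares: solve (X^T X + lam * I) w = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Regularization rates on a logarithmic grid, starting at 1e-5.
lambdas = [1e-5, 1e-3, 1e-1, 1e1, 1e3]
weight_norms = [float(np.linalg.norm(ridge_fit(X, y, lam))) for lam in lambdas]

for lam, norm in zip(lambdas, weight_norms):
    print(f"lambda {lam:g}: weight norm {norm:.4f}")
```

Larger rates shrink the weights toward zero, so the useful range is usually the one where validation loss improves before the model starts to underfit.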

To regularize a deep neural network model, use Dropout regularization. Dropout removes a random selection of a fixed percentage of the neurons in a network layer for a single gradient step. Typically, dropout will improve generalization at a dropout rate of between 10% and 50% of neurons.
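
The following is a sketch of one common variant, inverted dropout, written in NumPy. It is an assumption that your framework behaves this way. Note that this version zeroes each neuron independently with probability `rate`, so the dropped fraction is approximately, not exactly, the requested percentage.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return activations  # dropout is disabled at inference time
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    # Dividing by keep_prob keeps the expected activation unchanged.
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
layer_output = np.ones((1000, 100))
dropped = dropout(layer_output, rate=0.3, rng=rng)
fraction_zeroed = float(np.mean(dropped == 0.0))

print(f"fraction of units zeroed: {fraction_zeroed:.3f}")
```

A fresh mask is drawn on every call, which corresponds to dropping a different random set of neurons at each gradient step.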

Training epochs

You should train for at least one epoch, and continue to train so long as you are not overfitting.
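
One way to apply this rule is early stopping on a validation loss, sketched below. The `patience` parameter and the list of losses are hypothetical; the guide itself does not prescribe a specific stopping criterion.

```python
def epochs_to_train(validation_losses, patience=2):
    """Return the epoch with the best validation loss, stopping the scan once
    the loss has failed to improve for `patience` consecutive epochs."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(validation_losses, start=1):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            break  # validation loss is rising: the model is starting to overfit
    return best_epoch

# Hypothetical per-epoch validation losses: improvement stops after epoch 4.
validation_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
stop_at = epochs_to_train(validation_losses)
print(stop_at)  # 4
```
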

Batch size

Typically, the batch size of a mini-batch is between 10 and 1000. For SGD, the batch size is 1. The upper bound on your batch size is limited by the amount of data that can fit in your machine's memory. The lower bound on batch size depends on your data and algorithm. However, using a smaller batch size lets your gradient update more often per epoch, which can result in a larger decrease in loss per epoch. Furthermore, models trained using smaller batches tend to generalize better. For details, see "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima," N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang, ICLR, 2017. Prefer using the smallest batch sizes that result in stable training.
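
The relationship between batch size and the number of gradient updates per epoch can be sketched as follows. The `iterate_minibatches` helper and the 1000-example dataset are assumptions for illustration.

```python
import numpy as np

def iterate_minibatches(features, labels, batch_size, rng):
    """Yield shuffled (features, labels) mini-batches covering one epoch."""
    order = rng.permutation(len(features))
    for start in range(0, len(features), batch_size):
        batch = order[start:start + batch_size]
        yield features[batch], labels[batch]

rng = np.random.default_rng(0)
features = np.arange(1000, dtype=float).reshape(-1, 1)
labels = np.zeros(1000)

# One gradient update per mini-batch, so smaller batches mean more updates.
updates_per_epoch = {
    batch_size: sum(1 for _ in iterate_minibatches(features, labels, batch_size, rng))
    for batch_size in (1, 10, 100, 1000)
}
print(updates_per_epoch)  # {1: 1000, 10: 100, 100: 10, 1000: 1}
```
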

Depth and width of layers

In a neural network, depth refers to the number of layers, and width refers to the number of neurons per layer. Increase depth and width as the complexity of the corresponding problem increases. Adjust your depth and width by following these steps:
  1. Start with 1 fully-connected hidden layer with the same width as your input layer.
  2. For regression, set the output layer's width to 1. For classification, set the output layer's width to the number of classes.
  3. If your model does not work, and you think your model needs to be deeper to learn your problem, then increase depth linearly by adding a fully-connected hidden layer at a time. The hidden layer's width depends on your problem. A commonly-used approach is to use the same width as the previous hidden layer, and then discover the appropriate width through trial-and-error.
The change in width of successive layers also depends on your problem. A practice drawn from common observation is to set a layer's width equal to or less than the width of the previous layer. Remember, the depth and width don't have to be exactly right. You'll tune their values later when you optimize your model.
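
The starting configuration described in the steps above can be expressed as a list of layer widths. The `layer_widths` helper below is hypothetical and follows only the simplest option, where every hidden layer reuses the input width.

```python
def layer_widths(input_width, num_hidden_layers, num_classes=None):
    """Return widths for [input, hidden..., output] following the steps above."""
    # Each hidden layer starts with the same width as the input layer.
    hidden = [input_width] * num_hidden_layers
    # Regression uses a single output; classification uses one output per class.
    output_width = 1 if num_classes is None else num_classes
    return [input_width] + hidden + [output_width]

regression_widths = layer_widths(input_width=8, num_hidden_layers=1)
classifier_widths = layer_widths(input_width=8, num_hidden_layers=3, num_classes=4)

print(regression_widths)  # [8, 8, 1]
print(classifier_widths)  # [8, 8, 8, 8, 4]
```

Increasing `num_hidden_layers` by one at a time mirrors the advice to grow depth linearly, and the hidden widths can then be adjusted through trial and error.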