Reducing Loss

To train a model, we need a good way to reduce the model’s loss. An iterative approach is one widely used method for reducing loss, and is as easy and efficient as walking down a hill.

  • Hyperparameters are the configuration settings used to tune how the model is trained.
  • The derivative of (y - y')² with respect to the weights and biases tells us how loss changes for a given example
    • Simple to compute and convex
  • So we repeatedly take small steps in the direction that minimizes loss
    • We call these Gradient Steps (but they're really negative Gradient Steps)
    • This strategy is called Gradient Descent (a minimal sketch follows below)
[Figure: the cycle of moving from features and labels to models and predictions]
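As a concrete illustration, here is a minimal sketch of those gradient steps for a one-feature linear model y' = wx + b under squared loss. The learning rate, step count, and the single (x, y) example are illustrative assumptions, not values from this material.

```python
def gradient_step(w, b, x, y, learning_rate=0.01):
    """Take one negative-gradient step for a single (x, y) example."""
    y_pred = w * x + b
    error = y - y_pred
    # Partial derivatives of the squared loss (y - y')^2 with respect to w and b.
    grad_w = -2.0 * error * x
    grad_b = -2.0 * error
    # Step in the direction that reduces loss: the negative of the gradient.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
    return w, b

# Repeatedly take small gradient steps; for a convex problem the starting
# point can be anywhere (here, all zeros).
w, b = 0.0, 0.0
for _ in range(200):
    w, b = gradient_step(w, b, x=2.0, y=5.0)
print(w, b)  # converges to a (w, b) with 2*w + b ≈ 5
```

Each iteration moves (w, b) a small amount opposite the gradient, so the squared loss on the example shrinks toward zero.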
  • For convex problems, weights can start anywhere (say, all 0s)
    • Convex: think of a bowl shape
    • Just one minimum
[Figure: convex, bowl-shaped loss curve]
  • Foreshadowing: not true for neural nets
    • Non-convex: think of an egg crate
    • More than one minimum
    • Strong dependency on initial values (see the sketch below)
[Figure: convex bowl-shaped loss curve vs. non-convex surface with multiple local minima]
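To make the initialization point concrete, here is a small sketch that runs the same descent loop from several starting points on a convex loss and on a non-convex one. The one-dimensional loss functions, learning rate, and step count are chosen purely for illustration (a 1-D sine wave stands in for the "egg crate" surface).

```python
import math

def descend(grad, start, learning_rate=0.1, steps=200):
    """Run plain gradient descent from a given starting point."""
    x = start
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x

convex_grad = lambda x: 2 * (x - 3)     # gradient of (x - 3)^2: a bowl with one minimum
nonconvex_grad = lambda x: math.cos(x)  # gradient of sin(x): many local minima

# Convex: any starting point reaches the same minimum (x = 3).
print([round(descend(convex_grad, s), 2) for s in (-10.0, 0.0, 10.0)])

# Non-convex: the minimum reached depends strongly on the initial value.
print([round(descend(nonconvex_grad, s), 2) for s in (-1.0, 2.0, 8.0)])
```

On the convex loss every start lands at the same minimum; on the non-convex loss the endpoint depends on where descent begins.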
  • Could compute gradient over entire data set on each step, but this turns out to be unnecessary
  • Computing gradient on small data samples works well
    • On every step, get a new random sample
  • Stochastic Gradient Descent: one example at a time
  • Mini-Batch Gradient Descent: batches of 10-1000
    • Loss & gradients are averaged over the batch (see the sketch below)
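Below is a minimal sketch of mini-batch gradient descent on a small synthetic dataset. The data, batch size, learning rate, and step count are illustrative assumptions; setting batch_size to 1 would give stochastic gradient descent (one example at a time).

```python
import random

random.seed(0)
# Synthetic examples following y = 2x + 1, with x spread over [0, 10).
data = [(i / 10.0, 2.0 * (i / 10.0) + 1.0) for i in range(100)]

w, b = 0.0, 0.0
learning_rate, batch_size = 0.005, 10

for step in range(3000):
    batch = random.sample(data, batch_size)  # new random sample on every step
    grad_w = grad_b = 0.0
    for x, y in batch:
        error = y - (w * x + b)
        grad_w += -2.0 * error * x           # d/dw of (y - y')^2
        grad_b += -2.0 * error               # d/db of (y - y')^2
    # Gradients are averaged over the batch before taking the step.
    w -= learning_rate * grad_w / batch_size
    b -= learning_rate * grad_b / batch_size

print(round(w, 2), round(b, 2))  # approaches w ≈ 2, b ≈ 1
```

Averaging over a small random batch gives a noisy but much cheaper estimate of the full-dataset gradient, which is why computing the gradient over the entire dataset on each step is unnecessary.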