Reducing Loss

To train a model, we need a good way to reduce the model’s loss. An iterative approach is one widely used method for reducing loss, and is as easy and efficient as walking down a hill.

How do we reduce loss?

  • Hyperparameters are the configuration settings used to tune how the model is trained.
  • Derivative of (y - y')² with respect to the weights and biases tells us how loss changes for a given example
    • Simple to compute and convex
  • So we repeatedly take small steps in the direction that minimizes loss
    • We call these gradient steps (but they're really negative gradient steps)
    • This strategy is called Gradient Descent (see the sketch below)
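
As a rough illustration (not code from the course), the sketch below applies this idea to a linear model y' = wx + b: it computes the partial derivatives of (y - y')² for a single example and takes one small step along the negative gradient. The function name and the learning-rate value are assumptions made for this example.

```python
def gradient_step(w, b, x, y, learning_rate=0.01):
    """One (negative) gradient step for squared loss on a single example."""
    y_pred = w * x + b                      # the model's prediction y'
    # Partial derivatives of (y - y')^2 with respect to w and b.
    dloss_dw = -2 * x * (y - y_pred)
    dloss_db = -2 * (y - y_pred)
    # Step in the direction that minimizes loss: the negative gradient.
    w = w - learning_rate * dloss_dw
    b = b - learning_rate * dloss_db
    return w, b
```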

Block Diagram of Gradient Descent

Diagram: the cycle of moving from features and labels to models and predictions. Features and a label feed the model (prediction function); the model makes predictions (inference), the loss is computed, parameter updates are computed, and the updated model makes the next round of predictions.
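
The loop below is a minimal sketch of that cycle for a linear model trained with squared loss: predict, compute the loss, compute parameter updates, and repeat. The toy data, learning rate, and step count are illustrative assumptions, not values from the course.

```python
features = [1.0, 2.0, 3.0, 4.0]
labels = [3.0, 5.0, 7.0, 9.0]          # generated by y = 2x + 1

w, b = 0.0, 0.0                        # model parameters
learning_rate = 0.05
n = len(features)

for step in range(1000):
    # Inference: make predictions with the current model.
    predictions = [w * x + b for x in features]
    # Compute loss: mean squared error over the data set.
    loss = sum((y - p) ** 2 for y, p in zip(labels, predictions)) / n
    # Compute parameter updates from the averaged gradients.
    dloss_dw = sum(-2 * x * (y - p) for x, y, p in zip(features, labels, predictions)) / n
    dloss_db = sum(-2 * (y - p) for y, p in zip(labels, predictions)) / n
    w -= learning_rate * dloss_dw
    b -= learning_rate * dloss_db

print(w, b, loss)                      # w and b should end up near 2 and 1
```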

Weight Initialization

  • For convex problems, weights can start anywhere (say, all 0s); see the sketch below
    • Convex: think of a bowl shape
    • Just one minimum
  • Foreshadowing: not true for neural nets
    • Non-convex: think of an egg crate
    • More than one minimum
    • Strong dependency on initial values
Diagram: a convex, bowl-shaped loss surface alongside a non-convex surface with multiple local minima.
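
As a small illustration of the convex case (not code from the course), the sketch below runs gradient descent on a one-weight squared-loss problem from three very different starting weights; all of them reach the same single minimum. The data and starting values are assumptions made for this example.

```python
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]                   # labels follow y = 2x, so the single minimum is at w = 2

def descend(w, learning_rate=0.05, steps=500):
    """Plain gradient descent on mean squared loss for a one-weight model."""
    for _ in range(steps):
        grad = sum(-2 * x * (y - w * x) for x, y in zip(xs, ys)) / len(xs)
        w -= learning_rate * grad
    return w

# Very different initializations all converge to the same minimum (~2.0),
# which is why initialization is uncritical for convex problems.
for w0 in (0.0, -5.0, 10.0):
    print(w0, "->", round(descend(w0), 4))
```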

SGD & Mini-Batch Gradient Descent

  • Could compute gradient over entire data set on each step, but this turns out to be unnecessary
  • Computing gradient on small data samples works well
    • On every step, get a new random sample
  • Stochastic Gradient Descent (SGD): one example at a time
  • Mini-Batch Gradient Descent: batches of 10 to 1,000 examples
    • Loss and gradients are averaged over the batch (see the sketch below)
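
The sketch below is one way to write mini-batch gradient descent for the same linear model: on every step it draws a new random batch and averages the loss gradients over that batch (a batch size of 1 would be plain SGD). The synthetic data, batch size, learning rate, and step count are illustrative assumptions.

```python
import random

random.seed(0)
# 100 synthetic (feature, label) examples that follow y = 2x + 1.
data = [(i / 10, 2.0 * (i / 10) + 1.0) for i in range(100)]

w, b = 0.0, 0.0
learning_rate = 0.01
batch_size = 10                        # batch_size = 1 would be plain SGD

for step in range(5000):
    batch = random.sample(data, batch_size)        # a new random sample on every step
    # Average the gradients of (y - y')^2 over the batch.
    dloss_dw = sum(-2 * x * (y - (w * x + b)) for x, y in batch) / batch_size
    dloss_db = sum(-2 * (y - (w * x + b)) for x, y in batch) / batch_size
    w -= learning_rate * dloss_dw
    b -= learning_rate * dloss_db

print(round(w, 2), round(b, 2))        # should be close to 2 and 1
```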
