Reducing Loss
To train a model, we need a good way to reduce the model’s loss. An iterative approach is one widely used method for reducing loss, and is as easy and efficient as walking down a hill.
How do we reduce loss?
- Hyperparameters are the configuration settings used to tune how the model is trained.
- Derivative of (y - y')² with respect to the weights and biases tells us how loss changes for a given example
- Simple to compute and convex
- So we repeatedly take small steps in the direction that minimizes loss
- We call these Gradient Steps (but they're really negative Gradient Steps)
- This strategy is called Gradient Descent (the derivative and a code sketch are worked out below)
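To make the derivative concrete, here it is worked out for a one-feature linear model y' = wx + b with squared loss; the model form and the symbols w, b, and x are illustrative assumptions, not anything specified on the slide.

```latex
% Assumed one-feature linear model (illustrative): y' = w x + b
% Squared loss for a single example:
\[ L = (y - y')^2 = \bigl(y - (w x + b)\bigr)^2 \]
% Derivatives with respect to the weight and the bias:
\[ \frac{\partial L}{\partial w} = -2x\,\bigl(y - (w x + b)\bigr),
   \qquad
   \frac{\partial L}{\partial b} = -2\,\bigl(y - (w x + b)\bigr) \]
```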
Block Diagram of Gradient Descent
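The loop in the diagram can also be written out in code. Here is a minimal sketch in Python under the same assumed linear model; the data set, learning rate, and step count are made up for illustration.

```python
# Minimal sketch of gradient descent on squared loss for a one-feature
# linear model y' = w * x + b. Data, learning rate, and step count are
# illustrative assumptions, not values from the slides.

def predict(w, b, x):
    return w * x + b

def gradients(w, b, x, y):
    """Gradient of (y - y')^2 for one example:
    d/dw = -2 * x * (y - y'),  d/db = -2 * (y - y')."""
    error = y - predict(w, b, x)
    return -2.0 * x * error, -2.0 * error

# Tiny synthetic data set drawn from y = 3x + 1 (assumed).
examples = [(0.0, 1.0), (1.0, 4.0), (2.0, 7.0), (3.0, 10.0)]

w, b = 0.0, 0.0           # convex problem: starting at all 0s is fine
learning_rate = 0.05      # a hyperparameter we tune by hand

for step in range(500):
    # Average the gradient over all examples (full-batch version).
    grads = [gradients(w, b, x, y) for x, y in examples]
    dw = sum(g[0] for g in grads) / len(examples)
    db = sum(g[1] for g in grads) / len(examples)
    # Take a small step in the negative gradient direction.
    w -= learning_rate * dw
    b -= learning_rate * db

print(w, b)  # approaches w ≈ 3, b ≈ 1
```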
- Try the Gradient Descent Exercise
Weight Initialization
- For convex problems, weights can start anywhere (say, all 0s)
- Convex: think of a bowl shape
- Just one minimum
- Foreshadowing: not true for neural nets
- Non-convex: think of an egg crate
- More than one minimum
- Strong dependency on initial values (see the sketch below)
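To illustrate that dependency, a minimal sketch: gradient descent on an assumed non-convex, egg-crate-style loss in one dimension, where two different starting weights settle into two different minima. The loss function, starting points, and learning rate are all illustrative assumptions.

```python
# Minimal sketch: initialization sensitivity on a non-convex loss.
# The loss function and starting weights are illustrative assumptions.
import math

def loss(w):
    # A 1-D "egg crate"-style loss: cosine ripples on a shallow bowl.
    return 0.01 * w ** 2 + math.cos(w)

def grad(w):
    return 0.02 * w - math.sin(w)

def descend(w, learning_rate=0.1, steps=1000):
    # Plain gradient descent from the given starting weight.
    for _ in range(steps):
        w -= learning_rate * grad(w)
    return w

# Same algorithm, different starting weights, different minima.
print(descend(1.0))   # settles near one local minimum (w ≈ 3.1)
print(descend(8.0))   # settles near another local minimum (w ≈ 9.2)
```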

SGD & Mini-Batch Gradient Descent
- Could compute gradient over entire data set on each step, but this turns out to be unnecessary
- Computing gradient on small data samples works well
- On every step, get a new random sample
- Stochastic Gradient Descent: one example at a time
- Mini-Batch Gradient Descent: batches of 10-1000
- Loss & gradients are averaged over the batch (see the sketch below)
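A minimal sketch of mini-batch gradient descent for the same assumed linear model; the synthetic data, batch size, learning rate, and step count are illustrative. Setting batch_size to 1 turns this into Stochastic Gradient Descent, and setting it to len(data) recovers the full-batch version.

```python
# Minimal sketch of mini-batch gradient descent (batch_size = 1 => SGD).
# Data, batch size, learning rate, and step count are assumptions.
import random

def gradients(w, b, x, y):
    # Gradient of (y - y')^2 for one example of the linear model y' = w*x + b.
    error = y - (w * x + b)
    return -2.0 * x * error, -2.0 * error

# Synthetic data drawn from y = 3x + 1 plus a little noise (assumed).
random.seed(0)
xs = [i / 10.0 for i in range(100)]
data = [(x, 3.0 * x + 1.0 + random.gauss(0, 0.1)) for x in xs]

w, b = 0.0, 0.0
learning_rate = 0.01
batch_size = 10            # 1 => SGD; len(data) => full-batch

for step in range(5000):
    # On every step, draw a new random sample of examples.
    batch = random.sample(data, batch_size)
    # Loss & gradients are averaged over the batch.
    grads = [gradients(w, b, x, y) for x, y in batch]
    dw = sum(g[0] for g in grads) / batch_size
    db = sum(g[1] for g in grads) / batch_size
    w -= learning_rate * dw
    b -= learning_rate * db

print(w, b)  # approaches w ≈ 3, b ≈ 1
```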