The iterative approach diagram (Figure 1) contained a green hand-wavy box entitled "Compute parameter updates." We'll now replace that algorithmic fairy dust with something more substantial.
Suppose we had the time and the computing resources to calculate the loss for all possible values of \(w_1\). For the kind of regression problems we've been examining, the resulting plot of loss vs. \(w_1\) will always be convex. In other words, the plot will always be bowl-shaped, kind of like this:
Figure 2. Regression problems yield convex loss vs weight plots.
Convex problems have only one minimum; that is, only one place where the slope is exactly 0. That minimum is where the loss function converges.
Calculating the loss function for every conceivable value of \(w_1\) over the entire data set would be an inefficient way of finding the convergence point. Let's examine a better mechanism—very popular in machine learning—called gradient descent.
The first stage in gradient descent is to pick a starting value (a starting point) for \(w_1\). The starting point doesn't matter much; therefore, many algorithms simply set \(w_1\) to 0 or pick a random value. The following figure shows that we've picked a starting point slightly greater than 0:
Figure 3. A starting point for gradient descent.
The gradient descent algorithm then calculates the gradient of the loss curve at the starting point. Here in Figure 3, the gradient of loss is equal to the derivative (slope) of the curve, and tells you which way is "warmer" or "colder." When there are multiple weights, the gradient is a vector of partial derivatives with respect to the weights.
Note that a gradient is a vector, so it has both of the following characteristics:
- a direction
- a magnitude
The gradient always points in the direction of steepest increase in the loss function. The gradient descent algorithm takes a step in the direction of the negative gradient in order to reduce loss as quickly as possible.
Figure 4. Gradient descent relies on negative gradients.
To determine the next point along the loss function curve, the gradient descent algorithm adds some fraction of the gradient's magnitude to the starting point as shown in the following figure:
Figure 5. A gradient step moves us to the next point on the loss curve.
The gradient descent then repeats this process, edging ever closer to the minimum.