Reducing Loss: Gradient Descent

The iterative approach diagram (Figure 1) contained a green hand-wavy box entitled "Compute parameter updates." We'll now replace that algorithmic fairy dust with something more substantial.

Suppose we had the time and the computing resources to calculate the loss for all possible values of \(w_1\). For the kind of regression problems we've been examining, the resulting plot of loss vs. \(w_1\) will always be convex. In other words, the plot will always be bowl-shaped, kind of like this:

A plot of a U-shaped curve, with the vertical axis labeled 'loss' and the horizontal axis labeled 'value of weight \(w_1\)'.

Figure 2. Regression problems yield convex loss vs. weight plots.


Convex problems have only one minimum; that is, only one place where the slope is exactly 0. That minimum is where the loss function converges.
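To see why there is only one such point, consider a simplified concrete case (an assumption for this sketch, not a new result): a linear model \(y' = w_1 x_1 + b\) trained with squared-error loss, with the bias \(b\) held fixed. The loss is then a quadratic function of \(w_1\):

\[
L(w_1) = \frac{1}{N} \sum_{i=1}^{N} \bigl(w_1 x_i + b - y_i\bigr)^2
\]

The coefficient of \(w_1^2\) is \(\frac{1}{N}\sum_i x_i^2 \geq 0\), so the curve is an upward-opening parabola (assuming at least one nonzero \(x_i\)) with exactly one point where the slope is 0.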

Calculating the loss function for every conceivable value of \(w_1\) over the entire data set would be an inefficient way of finding the convergence point. Let's examine a better mechanism—very popular in machine learning—called gradient descent.

The first stage in gradient descent is to pick a starting value (a starting point) for \(w_1\). The starting point doesn't matter much; therefore, many algorithms simply set \(w_1\) to 0 or pick a random value. The following figure shows that we've picked a starting point slightly greater than 0:

A plot of a U-shaped curve. A point about halfway up the left side of the curve is labeled 'Starting Point'.

Figure 3. A starting point for gradient descent.

The gradient descent algorithm then calculates the gradient of the loss curve at the starting point. Here in Figure 3, the gradient of the loss is equal to the derivative (slope) of the curve, and tells you which way is "warmer" or "colder." When there are multiple weights, the gradient is a vector of partial derivatives with respect to the weights.
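As a concrete sketch of what those partial derivatives look like, take the squared-error loss above, now viewed as a function of both \(w_1\) and \(b\):

\[
\frac{\partial L}{\partial w_1} = \frac{2}{N} \sum_{i=1}^{N} x_i \bigl(w_1 x_i + b - y_i\bigr), \qquad
\frac{\partial L}{\partial b} = \frac{2}{N} \sum_{i=1}^{N} \bigl(w_1 x_i + b - y_i\bigr)
\]

In the single-weight picture of Figure 3 only the first component matters; with more weights, the gradient simply collects one such partial derivative per weight.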

Note that a gradient is a vector, so it has both of the following characteristics:

  • a direction
  • a magnitude

The gradient always points in the direction of steepest increase in the loss function. The gradient descent algorithm takes a step in the direction of the negative gradient in order to reduce loss as quickly as possible.

A plot of a U-shaped curve. A point on the left side of the curve is labeled 'Starting Point'. An arrow labeled 'negative gradient' points from this point to the right.

Figure 4. Gradient descent relies on negative gradients.

To determine the next point along the loss curve, the gradient descent algorithm steps from the starting point by some fraction of the gradient's magnitude, in the direction of the negative gradient, as shown in the following figure:

A plot of a U-shaped curve. A point on the left side of the curve is labeled 'Starting Point'. An arrow labeled 'negative gradient' points from this point to the right. Another arrow points from the tip of the first arrow down to a second point on the curve. The second point is labeled 'next point'.

Figure 5. A gradient step moves us to the next point on the loss curve.
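In symbols, and sticking with the single-weight sketch above, this step is the update rule: the new value is the old value minus the gradient scaled by a small factor \(\alpha\), commonly called the learning rate:

\[
w_1 \leftarrow w_1 - \alpha \, \frac{\partial L}{\partial w_1}
\]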

Gradient descent then repeats this process, edging ever closer to the minimum.
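
Putting the pieces together, here is a minimal sketch of the whole loop in Python. It assumes the single-feature linear model and squared-error loss used in the sketches above; the toy data, the learning rate, and the number of steps are made-up values for illustration, not part of the original example.

```python
import numpy as np

def gradient_descent(x, y, learning_rate=0.1, num_steps=300):
    """Minimal gradient descent sketch for y' = w1 * x + b with squared-error loss."""
    w1, b = 0.0, 0.0  # starting point: many algorithms simply start at 0

    for _ in range(num_steps):
        predictions = w1 * x + b
        error = predictions - y

        # Gradient of the mean squared error with respect to each parameter.
        grad_w1 = 2.0 * np.mean(x * error)
        grad_b = 2.0 * np.mean(error)

        # Step in the direction of the negative gradient.
        w1 -= learning_rate * grad_w1
        b -= learning_rate * grad_b

    return w1, b

# Toy data (made up): y is roughly 3x + 1 plus a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=50)
y = 3.0 * x + 1.0 + rng.normal(scale=0.1, size=50)

w1, b = gradient_descent(x, y)
print(f"w1 ≈ {w1:.2f}, b ≈ {b:.2f}")
```

Because this loss is convex, the repeated updates settle toward the single minimum; how quickly they get there depends on the step size \(\alpha\).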