Logistic Regression: Loss and Regularization

Loss function for Logistic Regression

The loss function for linear regression is squared loss. The loss function for logistic regression is Log Loss, which is defined as follows:

$$\text{Log Loss} = \sum_{(x,y)\in D} -y\log(y') - (1 - y)\log(1 - y')$$


  • \((x,y)\in D\) is the data set containing many labeled examples, which are \((x,y)\) pairs.
  • \(y\) is the label in a labeled example. Since this is logistic regression, every value of \(y\) must either be 0 or 1.
  • \(y'\) is the predicted value (somewhere between 0 and 1), given the set of features in \(x\).
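The formula above can be computed directly. Here is a minimal sketch (the function name and the example dataset are illustrative, not from the course):

```python
import math

def log_loss(examples):
    """Sum of per-example log losses over a dataset.

    Each element of `examples` is a (y, y_prime) pair, where y is the
    true label (0 or 1) and y_prime is the predicted probability.
    """
    total = 0.0
    for y, y_prime in examples:
        total += -y * math.log(y_prime) - (1 - y) * math.log(1 - y_prime)
    return total

# A confident correct prediction contributes little loss; a less
# confident one contributes more.
dataset = [(1, 0.9), (0, 0.1), (1, 0.4)]
loss = log_loss(dataset)
```

Note that a perfectly confident wrong prediction (`y_prime` of exactly 0 or 1 on the wrong side) would make the loss infinite, which is why predictions stay strictly between 0 and 1.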

The equation for Log Loss is closely related to Shannon's Entropy measure from Information Theory. It is also the negative logarithm of the likelihood function, assuming a Bernoulli distribution of \(y\). Indeed, minimizing the loss function yields a maximum likelihood estimate.
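To make the likelihood connection concrete: under a Bernoulli model, the probability of observing a single label \(y\) given the prediction \(y'\) is

$$P(y \mid x) = (y')^{y}\,(1 - y')^{1 - y}$$

Taking the negative logarithm of this expression gives \(-y\log(y') - (1 - y)\log(1 - y')\), and summing over the data set recovers exactly the Log Loss formula above. Minimizing Log Loss is therefore equivalent to maximizing the likelihood of the observed labels.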

Regularization in Logistic Regression

Regularization is extremely important in logistic regression modeling. Without regularization, the asymptotic nature of logistic regression would keep driving loss towards 0 in high dimensions. Consequently, most logistic regression models use one of the following two strategies to dampen model complexity:

  • L2 regularization.
  • Early stopping, that is, limiting the number of training steps or the learning rate.

(We'll discuss a third strategy—L1 regularization—in a later module.)
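As a sketch of how L2 regularization dampens model complexity, the gradient update gains a shrinkage term proportional to each weight. The function name and the `l2_lambda` parameter below are illustrative, not from the course:

```python
import math

def sgd_step_l2(w, x, y, learning_rate=0.1, l2_lambda=0.01):
    """One stochastic gradient step on L2-regularized log loss.

    w and x are parallel lists of weights and feature values; y is 0 or 1.
    The l2_lambda * wj term pulls every weight back toward zero on every
    step, even when the corresponding feature is absent from the example.
    """
    y_prime = 1 / (1 + math.exp(-sum(wj * xj for wj, xj in zip(w, x))))
    return [wj - learning_rate * ((y_prime - y) * xj + l2_lambda * wj)
            for wj, xj in zip(w, x)]
```

Because the shrinkage term grows with the weight itself, no weight can drift toward infinity: the regularization penalty eventually balances the loss gradient.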

Imagine that you assign a unique id to each example, and map each id to its own feature. If you don't specify a regularization function, the model will become completely overfit: it will try to drive loss to zero on all examples and never get there, driving the weight for each indicator feature to +infinity or -infinity. This kind of overfitting is common in high-dimensional data with feature crosses, where a huge number of rare crosses occur on only one example each.
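This runaway behavior is easy to reproduce. In the toy sketch below (labels and learning rate are made up for illustration), each example gets its own one-hot indicator feature, so the data is perfectly separable and unregularized gradient descent grows the weights without bound:

```python
import math

# Four examples, each with its own one-hot "unique id" feature,
# so each weight is updated by exactly one example.
labels = [1, 0, 1, 1]
learning_rate = 1.0

def max_weight_magnitude(steps):
    """Run plain (unregularized) gradient descent and return the
    largest absolute weight reached."""
    w = [0.0] * len(labels)
    for _ in range(steps):
        for i, y in enumerate(labels):
            y_prime = 1 / (1 + math.exp(-w[i]))   # sigmoid of w_i * 1
            w[i] -= learning_rate * (y_prime - y)  # log-loss gradient
    return max(abs(v) for v in w)
```

Because the sigmoid never quite reaches 0 or 1, every step still has a nonzero gradient, so more training steps always mean larger weights: loss keeps falling toward zero but never gets there.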

Fortunately, using L2 regularization or early stopping will prevent this problem.
