Logistic Regression: Calculating a Probability

Many problems require a probability estimate as output. Logistic regression is an extremely efficient mechanism for calculating probabilities. Practically speaking, you can use the returned probability in either of the following two ways:

  • "As is"
  • Converted to a binary category.

Let's consider how we might use the probability "as is." Suppose we create a logistic regression model to predict the probability that a dog will bark during the middle of the night. We'll call that probability:

\[p(bark | night)\]

If the logistic regression model predicts \(p(bark | night) = 0.05\), then over a year, the dog's owners should be startled awake approximately 18 times:

\[\begin{aligned} startled &= p(bark | night) \cdot nights \\ &= 0.05 \cdot 365 \\ &\approx 18 \end{aligned}\]
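As a quick sanity check, here is a minimal Python sketch of that arithmetic; the probability and number of nights are the values from the example above:

```python
# Expected number of nights per year the owners are startled awake,
# using the example probability p(bark | night) = 0.05.
p_bark_given_night = 0.05
nights_per_year = 365

expected_startles = p_bark_given_night * nights_per_year
print(expected_startles)  # 18.25, i.e. roughly 18 nights
```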

In many cases, you'll map the logistic regression output into the solution to a binary classification problem, in which the goal is to correctly predict one of two possible labels (e.g., "spam" or "not spam"). A later module covers binary classification in detail.

You might be wondering how a logistic regression model can guarantee that its output always falls between 0 and 1. As it happens, the sigmoid function, defined as follows, produces output with exactly those characteristics:

$$y = \frac{1}{1 + e^{-z}}$$

The sigmoid function yields the following plot:

Figure 1: Sigmoid function. The x-axis is the raw inference value; the y-axis ranges from 0 to 1, exclusive.
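For a concrete illustration, here is a minimal Python sketch of the sigmoid function defined above; the function name `sigmoid` and the sample inputs are chosen only for illustration:

```python
import math

def sigmoid(z):
    """Map any real-valued z to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Large negative z approaches 0, z = 0 gives exactly 0.5,
# and large positive z approaches 1.
for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"sigmoid({z:+.1f}) = {sigmoid(z):.5f}")
```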

If \(z\) represents the output of the linear layer of a model trained with logistic regression, then \(sigmoid(z)\) will yield a value (a probability) between 0 and 1. In mathematical terms:

$$y' = \frac{1}{1 + e^{-z}}$$

where:

  • \(y'\) is the output of the logistic regression model for a particular example.
  • \(z = b + w_1x_1 + w_2x_2 + \ldots + w_Nx_N\)
    • The \(w\) values are the model's learned weights, and \(b\) is the bias.
    • The \(x\) values are the feature values for a particular example.
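To make the two steps concrete, here is a minimal sketch of logistic regression inference, assuming a small, made-up set of weights, bias, and feature values (none of these numbers come from a real trained model):

```python
import math

def predict_probability(weights, bias, features):
    """Logistic regression inference: linear layer followed by sigmoid."""
    # z = b + w_1*x_1 + w_2*x_2 + ... + w_N*x_N  (the log-odds)
    z = bias + sum(w * x for w, x in zip(weights, features))
    # y' = sigmoid(z), a probability between 0 and 1
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical learned parameters and one example's feature values.
weights = [0.8, -0.5, 1.2]
bias = -0.3
features = [1.0, 2.0, 0.5]

print(predict_probability(weights, bias, features))  # ~0.52
```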

Note that \(z\) is also referred to as the log-odds because the inverse of the sigmoid function shows that \(z\) equals the log of the probability of the \(1\) label (e.g., "dog barks") divided by the probability of the \(0\) label (e.g., "dog doesn't bark"):

$$ z = \log\left(\frac{y}{1-y}\right) $$
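Here is a short sketch of that inverse relationship in Python, reusing the `sigmoid` helper from the earlier example; the value of \(z\) is illustrative:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(y):
    """Inverse of the sigmoid: recover z from a probability y."""
    return math.log(y / (1.0 - y))

# Round trip: pushing z through sigmoid and then log_odds recovers z.
z = 1.5
y = sigmoid(z)      # probability of the "1" label, e.g. "dog barks"
print(log_odds(y))  # ~1.5, the original log-odds
```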

Here is the same sigmoid curve, with the axes relabeled in machine learning terms:

Figure 2: Logistic regression output. The x-axis is the weighted sum of the features plus the bias (\(z\)); the y-axis is the probability output.