From a line to a probability: logistic regression

Linear regression gave us a way to predict a number. But step back and notice how many real questions are not numbers at all; they are yes-or-no. Will this customer cancel? Is this email spam? Does this transaction look like fraud? These are classification problems, the other half of supervised learning, and they need a different kind of answer: not a quantity, but a probability and a decision.

The natural first instinct is to reuse the line. Code the answer as 1 for yes and 0 for no, fit a straight line, and read off the prediction. It is a reasonable thought, and it fails in an instructive way. Understanding why it fails is the fastest route to understanding logistic regression, which is the fix.

Why a straight line fails for yes/no

Suppose you label spam as 1 and not-spam as 0, and you fit a line to predict that label from some feature. Two problems show up immediately.

First, a line does not stay in bounds. It keeps rising forever to the right and falling forever to the left, so it will happily predict 1.4 or negative 0.3. As a probability, those are nonsense; a probability has to live between 0 and 1.

Second, the line is the wrong shape. Real yes/no data tends to be mostly 0 for a while, then transition, then mostly 1. A straight line cannot capture that flat-rise-flat shape, and a few extreme points drag the whole line around. We need an output that is bounded between 0 and 1 and that bends smoothly from one to the other.
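To see the out-of-bounds problem concretely, here is a minimal sketch that fits an ordinary least-squares line to 0/1 labels. The eight data points are made up for illustration, and the name line_prediction is a placeholder of ours, not part of any library.

```python
# Toy data (made up for illustration): one feature x, labels 0 or 1.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form least-squares slope and intercept for a single feature.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
intercept = mean_y - slope * mean_x


def line_prediction(x):
    """Prediction of the straight line fitted to the 0/1 labels."""
    return intercept + slope * x


# Inside the data the line looks plausible; outside it leaves the 0..1 range.
for x in (0, 4.5, 12):
    print(x, round(line_prediction(x), 3))
```

On this toy data the prediction at x = 0 comes out below 0 and the prediction at x = 12 comes out above 1, which is exactly the "nonsense probability" problem described above.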

The squashing function

Logistic regression keeps the useful part of the line and fixes the output. It still computes the same kind of weighted sum of the inputs that linear regression did:

z = intercept + (coefficient * feature)     (the linear part, same as before)

Then, instead of using the linear score z directly, it passes z through an S-shaped curve called the sigmoid (or logistic function) that squashes any number into the range 0 to 1:

probability = sigmoid(z)

  z very negative  ->  near 0
  z = 0            ->  exactly 0.5
  z very positive  ->  near 1

That is the whole trick. The linear part decides how strongly the inputs point toward “yes.” The sigmoid turns that strength into a proper probability. A big positive z becomes a confident “almost certainly yes” near 1, a big negative z becomes “almost certainly no” near 0, and z equal to zero is the model on the fence at 0.5.
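The sigmoid itself is one line of arithmetic: 1 / (1 + e^(-z)). The sketch below is one way to write it in plain Python. The branch on the sign of z is our own choice, made so that exp() never receives a large positive argument; it is not something the lesson requires.

```python
import math


def sigmoid(z):
    """Squash any real number z into the range 0 to 1."""
    # Branch on the sign so exp() is only ever called on a non-positive
    # number, which avoids overflow for very large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Very negative -> near 0, zero -> 0.5, very positive -> near 1.
for z in (-10, -2, 0, 2, 10):
    print(z, round(sigmoid(z), 4))
```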

From a probability to a decision: the boundary

The model outputs a probability, but eventually you have to commit to a yes or a no. You do that with a threshold, usually 0.5: if the predicted probability is at least 0.5, predict yes, otherwise predict no.

Here is the key geometric fact. The probability equals exactly 0.5 at the point where z equals zero, that is, where the linear part (intercept plus coefficient times feature) equals zero. That set of points is the decision boundary: the line (or, with more features, the flat surface) that separates the region the model calls “yes” from the region it calls “no.” On one side z is positive and the probability is above 0.5; on the other side it is below. So underneath the curved probabilities, logistic regression is still drawing a straight boundary between the classes.
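With two features the boundary is a straight line in the plane, and you can solve for it directly by setting the linear part to zero. The sketch below assumes a hypothetical fitted model with intercept -6 and coefficients 1 and 2; those numbers, and the names linear_score, predict, and boundary_x2, are ours and chosen only for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical fitted values, chosen only for illustration.
INTERCEPT = -6.0
W1, W2 = 1.0, 2.0


def linear_score(x1, x2):
    """The linear part z = intercept + w1*x1 + w2*x2."""
    return INTERCEPT + W1 * x1 + W2 * x2


def predict(x1, x2, threshold=0.5):
    """Return the decision ("yes" or "no") and the probability behind it."""
    p = sigmoid(linear_score(x1, x2))
    return ("yes" if p >= threshold else "no"), p


def boundary_x2(x1):
    """Solve intercept + w1*x1 + w2*x2 = 0 for x2: the decision boundary."""
    return -(INTERCEPT + W1 * x1) / W2


x1 = 2.0
print("boundary at x2 =", boundary_x2(x1))   # point where z is zero
print("just above:", predict(x1, 3.0))       # z positive, probability above 0.5
print("just below:", predict(x1, 1.0))       # z negative, probability below 0.5
```

The probabilities curve, but boundary_x2 is a straight-line formula in x1, which is the sense in which the boundary stays straight.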

Worked example: studying for an exam

Take one feature, hours studied, predicting whether a student passes. Suppose the fitted model is:

z = -4 + (1 * hours)
probability of passing = sigmoid(z)

Run a few students through it:

hours = 2  ->  z = -4 + 2 = -2   ->  sigmoid(-2) ~ 0.12   ->  predict FAIL
hours = 4  ->  z = -4 + 4 =  0   ->  sigmoid(0)  = 0.50   ->  exactly on the fence
hours = 6  ->  z = -4 + 6 =  2   ->  sigmoid(2)  ~ 0.88   ->  predict PASS

Two hours of study gives only a 12 percent chance of passing; six hours gives 88 percent. The decision boundary sits at exactly 4 hours, the point where z equals zero and the probability is 0.50. Below 4 hours the model predicts fail; above it, the model predicts pass. (At exactly 4 hours the probability is 0.5, which the “at least 0.5” rule counts as a pass.) The S-curve gives you a smooth, bounded probability at every point, and the threshold turns it into a clean decision.
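The three students above can be checked in a few lines. This sketch hard-codes the example model z = -4 + (1 * hours); the function names are ours. Note that at exactly 4 hours the "at least 0.5" rule from earlier comes down on the pass side.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def pass_probability(hours):
    """Probability of passing under the example model z = -4 + 1 * hours."""
    z = -4 + 1 * hours
    return sigmoid(z)


def decision(hours, threshold=0.5):
    """Apply the threshold: at least 0.5 means predict PASS."""
    return "PASS" if pass_probability(hours) >= threshold else "FAIL"


for hours in (2, 4, 6):
    print(hours, round(pass_probability(hours), 2), decision(hours))
```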

How it is fit, and reading the coefficients

There is no tidy closed-form formula for the best logistic regression, the way least squares gave one for a straight line. So this is the first place we cash in the previous lesson: logistic regression is fit by an iterative search such as gradient descent, looking for the coefficients that minimize a loss built for probabilities (cross-entropy loss, which is equivalent to maximum-likelihood estimation; it rewards confident-correct predictions and punishes confident-wrong ones especially hard). The procedure is exactly the downhill walk from lesson 3, applied to a different loss.
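As a rough sketch of what that downhill walk looks like for one feature, the code below runs plain gradient descent on the average cross-entropy loss. The data, the learning rate of 0.1, and the 5000 steps are assumptions picked for illustration; real libraries typically use faster optimizers and add safeguards this sketch leaves out.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cross_entropy(b, w, xs, ys):
    """Average cross-entropy loss of the model sigmoid(b + w * x)."""
    eps = 1e-12  # keep log() away from exactly 0
    total = 0.0
    for x, y in zip(xs, ys):
        p = min(max(sigmoid(b + w * x), eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(xs)


def fit(xs, ys, learning_rate=0.1, steps=5000):
    """Gradient descent on the cross-entropy loss, starting from b = w = 0."""
    b, w = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # For this loss the gradient is the average of (prediction - label),
        # multiplied by the feature for the coefficient.
        errors = [sigmoid(b + w * x) - y for x, y in zip(xs, ys)]
        grad_b = sum(errors) / n
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        b -= learning_rate * grad_b
        w -= learning_rate * grad_w
    return b, w


# Made-up data: hours studied and whether the student passed (1) or not (0).
hours = [1, 2, 3, 4, 5, 6, 7, 8]
passed = [0, 0, 1, 0, 1, 0, 1, 1]

b, w = fit(hours, passed)
print("intercept:", round(b, 3), "coefficient:", round(w, 3))
print("loss before:", round(cross_entropy(0.0, 0.0, hours, passed), 4))
print("loss after: ", round(cross_entropy(b, w, hours, passed), 4))
```

The labels here deliberately overlap (a pass at 3 hours, a fail at 6), so no straight boundary separates them perfectly and the coefficients settle at finite values.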

The coefficients read much like linear regression’s, with one twist. A positive coefficient means that as the feature increases, z increases, which pushes the probability of “yes” up. A negative coefficient pushes it down. The size still measures strength. (Strictly, a coefficient changes the log-odds rather than the probability directly, which is why the effect on the probability is largest in the middle of the S-curve and smaller out at the flat ends, but the direction reads the same: positive pushes toward yes.)
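The log-odds remark can be checked numerically. The sketch below reuses the exam model from the worked example (intercept -4, coefficient 1) and compares the effect of one extra hour in the middle of the curve with the effect out at the flat end. The helper names are ours.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def pass_probability(hours, intercept=-4.0, coefficient=1.0):
    return sigmoid(intercept + coefficient * hours)


def odds(p):
    """Odds of yes versus no for a probability p."""
    return p / (1.0 - p)


def effect_of_one_more_hour(hours):
    """Return (change in probability, ratio of odds) for one extra hour."""
    before = pass_probability(hours)
    after = pass_probability(hours + 1)
    return after - before, odds(after) / odds(before)


mid_change, mid_ratio = effect_of_one_more_hour(4)    # middle of the S-curve
tail_change, tail_ratio = effect_of_one_more_hour(8)  # out at the flat end

print("probability change, 4 -> 5 hours:", round(mid_change, 3))
print("probability change, 8 -> 9 hours:", round(tail_change, 3))
print("odds ratio in both cases:", round(mid_ratio, 3), round(tail_ratio, 3))
```

The odds are multiplied by the same factor, e raised to the coefficient, wherever you start, while the change in probability is much larger near the boundary than far from it.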

More than two classes

Everything so far decides between two outcomes, yes or no. Plenty of problems have more: which of five products, which of ten handwritten digits. Logistic regression extends in two standard ways. The simpler is one-vs-rest: train one yes/no logistic model per class (“is it a 3, or not?”, “is it a 7, or not?”), then pick the class whose model is most confident. The direct generalization is the softmax, which produces a probability for every class at once, with all of them summing to 1. The two-class sigmoid you just met is simply the special case of softmax with two outcomes. This matters beyond classical models: the softmax is exactly what sits at the output of most neural-network classifiers, turning a row of raw scores into a clean probability per class.
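A minimal softmax sketch is below, together with a check of the claim that the two-class case reduces to the sigmoid: give one class the score z and the other the score 0. The example scores are arbitrary, and subtracting the largest score before exponentiating is a common stability trick we have assumed; it does not change the result.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Three classes with arbitrary raw scores.
probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs], "sum =", round(sum(probs), 6))

# Two classes with scores (z, 0): the first probability equals sigmoid(z).
z = 1.5
print(round(softmax([z, 0.0])[0], 6), round(sigmoid(z), 6))
```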

Why this matters when you use AI

Logistic regression is one of the simplest classifiers, and it is hiding inside much bigger systems. The final layer of many neural-network classifiers does essentially this: it computes scores and squashes them into probabilities (with the sigmoid, or its multi-class cousin the softmax). When a model tells you it is “85 percent confident,” that number was very likely produced by this kind of squashing. And the threshold idea matters in practice: 0.5 is a default, not a law. For a cancer screen you might predict “positive” at a probability of 0.2, accepting more false alarms to avoid missing a real case. Moving that threshold is exactly the trade-off the evaluation phase will make precise.
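The threshold trade-off is easy to see by counting errors at two cutoffs. The probabilities and true labels below are invented for illustration (they do not come from any real screening data), and count_errors is a helper name of ours.

```python
# Invented model outputs and true labels (1 = real case, 0 = no case).
probabilities = [0.05, 0.10, 0.15, 0.25, 0.30, 0.35, 0.45, 0.60, 0.80, 0.90]
labels = [0, 0, 0, 0, 1, 0, 1, 0, 1, 1]


def count_errors(threshold):
    """Return (false_positives, false_negatives) at the given threshold."""
    false_positives = 0
    false_negatives = 0
    for p, y in zip(probabilities, labels):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 0:
            false_positives += 1   # a false alarm
        elif predicted == 0 and y == 1:
            false_negatives += 1   # a missed real case
    return false_positives, false_negatives


for threshold in (0.5, 0.2):
    fp, fn = count_errors(threshold)
    print("threshold", threshold, "-> false alarms:", fp, "missed cases:", fn)
```

On this toy data, lowering the cutoff from 0.5 to 0.2 removes the missed cases at the price of more false alarms.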

Common pitfalls

Thinking “regression” means it predicts a number. The name is historical. Logistic regression is a classifier; its output is a probability of a class.
Treating the probability as gospel. A logistic regression’s 0.7 is a model estimate, not a calibrated truth. It can be confidently wrong.
Assuming 0.5 is always the right threshold. When errors have unequal costs or the classes are imbalanced, the right cutoff is rarely 0.5.
Forgetting the boundary is straight. Logistic regression separates classes with a straight line or flat surface. If the true boundary curves, it needs help (engineered features) or a different model.

What you should remember

Logistic regression is a line plus a squash: compute the same weighted sum as linear regression, then pass it through the sigmoid to get a probability between 0 and 1.
The decision boundary is where the probability is 0.5, which is where the linear part equals zero. Underneath, it is still a straight boundary.
It is fit by gradient descent, not a closed-form formula, minimizing a loss suited to probabilities.
A positive coefficient pushes the probability of “yes” up; the threshold (default 0.5) turns the probability into a decision and can be moved when costs are unequal.

Logistic regression draws a single straight boundary through the data. That is elegant and often enough, but plenty of problems cannot be split by one straight line. The next lesson takes a completely different approach to classification: instead of one boundary, it asks a sequence of simple yes/no questions, carving the space into boxes. That is the decision tree.