Lesson: Fitting a line: linear regression

You have a scatter of points: house sizes on one axis, prices on the other. The dots trend upward (bigger houses cost more), and you want to capture that trend with a single straight line so you can predict the price of a house you have not seen. Here is the problem: there are infinitely many lines you could draw through that cloud. Which one is right?

Linear regression answers that question with a precise definition of “best,” and once you have the definition, the rest is mechanics. It is one of the simplest supervised learning algorithms, and it is worth knowing deeply, because the idea at its core (a model is a set of numbers chosen to fit the data as closely as possible) is the same idea that scales all the way up to a neural network with billions of weights.

A straight line is described by two numbers:

prediction = intercept + slope * feature
y = b + m * x

The slope m is how much the prediction changes for each one-unit step in the input. The intercept b is the prediction when the input is zero. Feed in a house size and the line gives back a predicted price. Change the two numbers and you get a different line, which is a different prediction machine. Linear regression is the procedure that picks the two numbers that fit your data best.
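As a minimal sketch, the line can be written as a tiny function. The names `predict`, `slope`, and `intercept`, and the numbers passed in, are illustrative assumptions, not values from any fitted model.

```python
def predict(x, slope, intercept):
    """Return the line's prediction for input x: y = b + m * x."""
    return intercept + slope * x


# Hypothetical parameters, chosen only for illustration:
# a base of 50,000 plus 150 per square foot.
size = 1200
price_a = predict(size, slope=150.0, intercept=50_000.0)

# Changing the two numbers gives a different prediction machine.
price_b = predict(size, slope=200.0, intercept=10_000.0)

print(price_a)  # 230000.0
print(price_b)  # 250000.0
```

The function itself never changes. Only the two numbers passed to it do, which is why "the model" and "its parameters" are close to the same thing here.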

In machine learning vocabulary, those two numbers are the model’s parameters (sometimes called weights). “Training” a linear regression model means finding the parameter values that fit the data. That framing matters: when you hear that a model “has learned,” this is the simplest possible version of what that means: two numbers tuned to the data.

To pick the best line we need to measure how badly any given line fits, then choose the line with the smallest badness. Linear regression measures badness like this:

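  1. For each data point, look at the residual: the vertical gap between the actual point and the line’s prediction for it, computed as actual value minus predicted value. It is positive when the point sits above the line and negative when it sits below. That is the error on that one point.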
  2. Square each residual. Squaring does two jobs: it makes every error positive (so errors above and below the line cannot cancel out), and it punishes big misses far more than small ones (an error of 2 contributes 4, an error of 4 contributes 16).
  3. Add up all the squared residuals. The total is the sum of squared residuals (SSR). It is one number that says how badly this particular line fits the whole dataset.

The best-fit line is the one line, out of all possible lines, that makes the sum of squared residuals as small as it can be. That is the entire definition. Because we are minimizing squared residuals, the method is also called least squares.
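In symbols, the three steps collapse into one expression. The index i and the count n are notation introduced here for illustration; b and m are the intercept and slope from earlier, and (x_i, y_i) are the data points.

```latex
\mathrm{SSR}(b, m) = \sum_{i=1}^{n} \bigl( y_i - (b + m x_i) \bigr)^2
```

The term inside the parentheses is the residual for point i, the square is step 2, and the sum is step 3. Least squares picks the b and m that make this quantity smallest.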

Take three data points and two candidate lines, and let the SSR decide between them.

Data points: (1, 2), (2, 4), (3, 5)

Candidate A: y = 1.5x + 0.5
  x=1 -> predicts 2.0   actual 2   residual  0.0   squared 0.00
  x=2 -> predicts 3.5   actual 4   residual  0.5   squared 0.25
  x=3 -> predicts 5.0   actual 5   residual  0.0   squared 0.00
  Sum of squared residuals (A) = 0.25

Candidate B: y = 2x
  x=1 -> predicts 2.0   actual 2   residual  0.0   squared 0.00
  x=2 -> predicts 4.0   actual 4   residual  0.0   squared 0.00
  x=3 -> predicts 6.0   actual 5   residual -1.0   squared 1.00
  Sum of squared residuals (B) = 1.00

Candidate A has an SSR of 0.25; candidate B has an SSR of 1.00. By the least-squares definition, A is the better fit, because its squared errors add up to less. That is the whole comparison: two lines, two SSR numbers, lower wins.
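The same comparison can be reproduced in a few lines of code. This is a sketch rather than a library routine; the function name `sum_squared_residuals` is an assumption made for this example.

```python
def sum_squared_residuals(points, slope, intercept):
    """Sum of squared residuals (actual minus predicted) for one line."""
    total = 0.0
    for x, y in points:
        predicted = intercept + slope * x
        residual = y - predicted
        total += residual ** 2
    return total


points = [(1, 2), (2, 4), (3, 5)]

ssr_a = sum_squared_residuals(points, slope=1.5, intercept=0.5)  # y = 1.5x + 0.5
ssr_b = sum_squared_residuals(points, slope=2.0, intercept=0.0)  # y = 2x

print(ssr_a)  # 0.25
print(ssr_b)  # 1.0
print("A" if ssr_a < ssr_b else "B")  # A
```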

Note what we did not do. We did not prove A is the single best line of all. The true best-fit line is whichever one drives the SSR as low as it will go, and finding it (rather than just comparing two guesses) is the subject of the next lesson. For now the point is the criterion: best means smallest sum of squared residuals.
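To make that concrete, here is a sketch that scores one more guess against the same three points. Candidate C (y = 1.5x + 2/3) is a hypothetical line introduced only for this illustration: it keeps A's slope and nudges the intercept up slightly.

```python
def sum_squared_residuals(points, slope, intercept):
    """Sum of squared residuals (actual minus predicted) for one line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in points)


points = [(1, 2), (2, 4), (3, 5)]

ssr_a = sum_squared_residuals(points, slope=1.5, intercept=0.5)
# Candidate C is a hypothetical third guess: same slope, intercept raised to 2/3.
ssr_c = sum_squared_residuals(points, slope=1.5, intercept=2 / 3)

print(round(ssr_a, 4))  # 0.25
print(round(ssr_c, 4))  # 0.1667
```

C misses every point a little, yet its total is lower than A's, so A cannot be the best-fit line. Winning a comparison between two guesses says nothing about the lines you did not try.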

Once you have the best-fit line, its two numbers are not just machinery; they are the answer, and they are readable in plain language. Suppose a fitted line for predicting monthly spending from income comes out as:

spending = 200 + 0.30 * income

  • The slope, 0.30, says: for every extra dollar of income, predicted spending rises by 30 cents. The slope is the relationship: it gives both the direction (positive here) and the strength per unit.
  • The intercept, 200, says: the predicted spending when income is zero is 200 dollars. Sometimes the intercept is meaningful, sometimes it is just where the line happens to cross the axis.
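Both readings can be checked directly from the fitted line. In this sketch the function name `predicted_spending` and the starting income of 1000 are assumptions chosen for illustration; the coefficients are the ones from the text.

```python
def predicted_spending(income):
    """Fitted line from the text: spending = 200 + 0.30 * income."""
    return 200 + 0.30 * income


# Intercept: the prediction at zero income.
print(predicted_spending(0))  # 200.0

# Slope: the change in prediction for one extra dollar of income.
# The income level 1000 is an arbitrary starting point.
change = predicted_spending(1001) - predicted_spending(1000)
print(round(change, 2))  # 0.3
```

The change is the same 30 cents whatever income you start from, which is what it means for the relationship to be a straight line.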

This readability is linear regression’s superpower. Unlike many models you will meet later, a linear regression hands you a coefficient you can interpret and explain to someone else. That is often worth more than a small gain in accuracy from a model nobody can read.

Real problems rarely depend on a single feature. Predict house price from size and number of bedrooms and age, and the model simply gives each feature its own slope:

price = b + m1 * size + m2 * bedrooms + m3 * age

This is multiple regression, and the idea does not change: find the set of coefficients that minimizes the sum of squared residuals. Each coefficient reads the same way: it is the predicted change in the output for a one-unit change in that feature, holding the others fixed. The line just becomes a flat surface (a plane) in more dimensions, which is hard to picture but easy to compute.
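A short sketch of the "holding the others fixed" reading. The coefficient values below are hypothetical, made up for illustration and not fitted to any data; the function name `predict_price` is likewise an assumption.

```python
def predict_price(size, bedrooms, age, b, m1, m2, m3):
    """price = b + m1 * size + m2 * bedrooms + m3 * age"""
    return b + m1 * size + m2 * bedrooms + m3 * age


# Hypothetical coefficients for illustration only (not fitted to any data).
coef = dict(b=20_000.0, m1=120.0, m2=8_000.0, m3=-500.0)

base = predict_price(size=1500, bedrooms=3, age=10, **coef)
one_more_bedroom = predict_price(size=1500, bedrooms=4, age=10, **coef)

print(base)                     # 219000.0
print(one_more_bedroom - base)  # 8000.0
```

Size and age are identical in the two calls, so the difference in predictions is exactly the bedrooms coefficient, m2.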

SSR tells you which line is better, but on its own a raw SSR number is hard to interpret (is 0.25 good?). A common companion measure is R-squared, which reports the fraction of the variation in the data that the line explains. For a least-squares line with an intercept, it falls between 0 and 1. An R-squared of 0.7 means the line accounts for about 70 percent of the variation, with the rest unexplained. It is a quick, scale-free way to say how much of the story the line captures. We will treat evaluation properly in Phase 4; for now, read R-squared as “higher means the line explains more.”
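One common way to compute R-squared is 1 - SSR / SST, where SST is the total squared variation of the actual values around their mean. The sketch below applies that formula to the two candidate lines from the worked example; the function name `r_squared` is an assumption for this illustration.

```python
def r_squared(points, slope, intercept):
    """R^2 = 1 - SSR / SST, with SST measured around the mean of y."""
    ys = [y for _, y in points]
    mean_y = sum(ys) / len(ys)
    ssr = sum((y - (intercept + slope * x)) ** 2 for x, y in points)
    sst = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ssr / sst


points = [(1, 2), (2, 4), (3, 5)]

r2_a = r_squared(points, slope=1.5, intercept=0.5)  # candidate A
r2_b = r_squared(points, slope=2.0, intercept=0.0)  # candidate B

print(round(r2_a, 3))  # 0.946
print(round(r2_b, 3))  # 0.786
```

The ranking matches the SSR comparison, as it must: SST depends only on the data, so a lower SSR always gives a higher R-squared on the same points.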

Linear regression is the seed of the whole field. Every weight in a neural network is a descendant of the slope you just met: a number tuned so the model’s predictions fit the training data. When you hear that a model has “parameters,” picture these two numbers, then multiply by a few billion. And the interpretability lesson carries forward as a warning: the larger models get, the more their coefficients stop being readable, which is exactly why “why did it predict that?” becomes a hard question for big models and an easy one here.

A few pitfalls to watch for:

  • Extrapolating past your data. A line fit on house sizes from 800 to 3,000 square feet says nothing reliable about a 20,000-square-foot mansion. The relationship may not hold where you have no data.
  • Forcing a line onto a curve. If the true relationship bends, a straight line will fit it poorly no matter how you minimize SSR. Look at the data first.
  • Reading a coefficient as a cause. A slope describes association, not causation. “Income predicts spending” is not proof that income causes spending; a lurking third factor can drive both.
  • Letting outliers dominate. Because residuals are squared, one extreme point can pull the whole line toward it. Squaring is what makes least squares sensitive to outliers.

Key points to take away:

  • A linear regression is two numbers, a slope and an intercept, that together form a prediction machine.
  • “Best fit” means smallest sum of squared residuals. Square each error, add them up, choose the line that minimizes the total. That is what “least squares” means.
  • The slope is the relationship: the predicted change in the output per one-unit change in the input. It is readable, which is the method’s great advantage.
  • The same idea scales, from one slope to multiple coefficients to the billions of weights in a large model.
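The outlier pitfall comes down to arithmetic, and a small sketch shows its size. The residual values here are hypothetical, invented for illustration: three small misses and one large one.

```python
# Hypothetical residuals: three small misses and one outlier.
residuals = [1.0, -1.0, 1.0, 10.0]

absolute = [abs(r) for r in residuals]
squared = [r ** 2 for r in residuals]

# Share of the total error that comes from the outlier alone.
share_absolute = absolute[-1] / sum(absolute)
share_squared = squared[-1] / sum(squared)

print(round(share_absolute, 2))  # 0.77
print(round(share_squared, 2))   # 0.97
```

Measured by plain distance, the outlier is about three quarters of the total error. After squaring, it is nearly all of it, so a procedure that minimizes the squared total will bend the line mostly to reduce that one miss.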

We now know what the best-fit line is: the one that minimizes the sum of squared residuals. What we have not shown is how you actually find it when you cannot just compare a couple of guesses. For a simple line there is a direct formula, but for most models there is not, and you have to search for the answer step by step. That search is the subject of the next lesson: gradient descent.