
When one event tells you about another: conditional probability and independence

The previous lesson’s multiplication rule came with fine print: it only works when events are independent, that is, when one happening tells you nothing about the other. But the events that matter most in AI are exactly the opposite. A test result changes your belief about being sick. The previous word changes the odds of the next word. Yesterday’s purchase changes the chance of tomorrow’s. These events are dependent, and the tool for reasoning about them is conditional probability. It is the single most important idea in this phase and the foundation for both the next lesson (Bayes) and most of machine-learning classification.

Conditional probability asks: given that one thing has happened, how likely is another? It is written

P(A | B) = the probability of A, given that B happened

and the vertical bar reads as “given.” The key mental move is that learning B narrows the world. You stop considering every possible outcome and consider only the outcomes where B is true, then ask what fraction of those also have A. Formally, whenever the probability of B is greater than zero:

P(A | B) = P(A and B) / P(B)

The denominator, the probability of B, is the new, smaller world (the cases where B happened); the numerator, the probability of A and B, is the slice of that world where A also happened. Dividing gives the fraction of B-cases that are also A-cases.
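
As a minimal sketch of the definition, the snippet below uses a hypothetical example that is not part of the lesson: one roll of a fair die, with A as "greater than 3" and B as "even". It computes the conditional both ways, once from the formula and once by literally narrowing the world to the B-outcomes, and the two agree.

```python
from fractions import Fraction

# Hypothetical illustration (not from the lesson): one roll of a fair die.
outcomes = {1, 2, 3, 4, 5, 6}
A = {o for o in outcomes if o > 3}       # event A: roll is greater than 3
B = {o for o in outcomes if o % 2 == 0}  # event B: roll is even


def prob(event):
    """Probability of an event when every outcome is equally likely."""
    return Fraction(len(event), len(outcomes))


# The definition: P(A | B) = P(A and B) / P(B)
p_a_given_b = prob(A & B) / prob(B)

# "Narrowing the world": the fraction of B-outcomes that are also in A.
narrowed = Fraction(len(A & B), len(B))

print(p_a_given_b, narrowed)  # 2/3 2/3
```

Exact fractions are used here only to keep the arithmetic readable; floats would work just as well.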

Conditional probability is easiest to see in a table that cross-classifies a group two ways. Take 1,000 people screened for a condition that 10% of them actually have. The screening test catches 80% of true cases, and wrongly flags 10% of healthy people. Fill in the counts:

|               | Test positive | Test negative | Total |
|---------------|---------------|---------------|-------|
| Has condition | 80            | 20            | 100   |
| Healthy       | 90            | 810           | 900   |
| Total         | 170           | 830           | 1,000 |

(Of the 100 with the condition, 80% = 80 test positive. Of the 900 healthy, 10% = 90 test positive. The rest are negatives.)
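
The fill-in arithmetic can be sketched in a few lines. The function name and the parameter names (`prevalence`, `sensitivity`, `false_positive_rate`) are labels chosen for this illustration, not terms the lesson uses; the numbers are the ones given above.

```python
def screening_counts(population, prevalence, sensitivity, false_positive_rate):
    """Fill in the two-way table from the three rates (names are illustrative)."""
    with_condition = round(population * prevalence)
    healthy = population - with_condition
    true_positive = round(with_condition * sensitivity)        # real cases caught
    false_positive = round(healthy * false_positive_rate)      # healthy but flagged
    return {
        "has condition": {
            "positive": true_positive,
            "negative": with_condition - true_positive,
        },
        "healthy": {
            "positive": false_positive,
            "negative": healthy - false_positive,
        },
    }


# 1,000 people, 10% have the condition, 80% of cases caught, 10% of healthy flagged.
counts = screening_counts(1000, 0.10, 0.80, 0.10)
for status, row in counts.items():
    print(status, row["positive"], row["negative"], sum(row.values()))
```

The two printed rows should match the "Has condition" and "Healthy" rows of the table.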

Now read conditional probabilities straight off the table by restricting to a row or a column:

P(test positive | has condition) = 80 / 100 = 0.80 (restrict to the "has condition" row)
P(has condition | test positive) = 80 / 170 = 0.47 (restrict to the "test positive" column)

Look hard at those two numbers. The test catches 80% of real cases, yet a positive result means only a 47% chance of actually having the condition. They are different questions with different denominators: one divides by everyone with the condition, the other by everyone who tested positive. Confusing them is the costliest mistake in the whole subject, and the next section is about why.
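
To make the "same numerator, different denominator" point concrete, here is a small sketch that stores the table's four cells and divides the shared cell by a row total and then by a column total. The dictionary layout and helper names are assumptions made for this example.

```python
# Counts copied from the screening table; keys are (status, test result).
table = {
    ("condition", "positive"): 80,
    ("condition", "negative"): 20,
    ("healthy", "positive"): 90,
    ("healthy", "negative"): 810,
}


def row_total(status):
    """Everyone with the given status, whatever their test result."""
    return sum(n for (s, _), n in table.items() if s == status)


def column_total(result):
    """Everyone with the given test result, whatever their status."""
    return sum(n for (_, r), n in table.items() if r == result)


both = table[("condition", "positive")]  # the shared numerator: 80

p_pos_given_cond = both / row_total("condition")    # 80 / 100
p_cond_given_pos = both / column_total("positive")  # 80 / 170

print(f"{p_pos_given_cond:.2f} {p_cond_given_pos:.2f}")  # 0.80 0.47
```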

[Figure: the two-way table of 1,000 people, with the "has condition" row highlighted to show P(test positive | has condition) = 80 / 100 = 0.80 and the "test positive" column highlighted to show P(has condition | test positive) = 80 / 170 = 0.47.]
Two conditional probabilities live in the same table, share the same numerator (80 people who have the condition AND test positive), but use different denominators. P(test positive given the condition) divides by the 100 condition-positive people: 0.80. P(condition given a positive test) divides by the 170 test-positive people: 0.47. The "given" word picks which row or column you stay inside.

The chance of A given B is not the chance of B given A


The probability of A given B and the probability of B given A are different numbers, and swapping them is so common it has names (the “prosecutor’s fallacy,” “base-rate neglect”). The table just showed it: the probability of a positive test given the condition is 0.80, but the probability of the condition given a positive test is 0.47. Flipping the condition flips the denominator, and the answer changes completely.

The damage is real. “Ninety percent of sick people test positive” is not “ninety percent of people who test positive are sick.” A courtroom claim that “there is a one-in-a-million chance of this match if the defendant is innocent” is not “there is a one-in-a-million chance the defendant is innocent.” Whenever you meet a conditional claim, pin down which way the bar points before you act on it. The next lesson, Bayes’ theorem, is precisely the machine for converting one direction into the other correctly.
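
A rough sketch of why the two directions can drift so far apart: hold the test fixed at the screening example's rates (80% of cases caught, 10% of healthy people flagged) and vary only how common the condition is. The 10% prevalence comes from the lesson; the 1% and 50% values are hypothetical, added for contrast. The calculation is the same count-and-divide reasoning as the table, expressed as fractions of the population.

```python
def p_condition_given_positive(prevalence, catch_rate=0.80, false_flag_rate=0.10):
    """Share of positive tests that are real cases, for a given prevalence."""
    true_positive = prevalence * catch_rate               # has condition AND positive
    false_positive = (1 - prevalence) * false_flag_rate   # healthy AND positive
    return true_positive / (true_positive + false_positive)


# P(positive | condition) stays at 0.80 throughout; only the base rate changes.
for prevalence in (0.01, 0.10, 0.50):
    answer = p_condition_given_positive(prevalence)
    print(f"prevalence {prevalence:.2f} -> P(condition | positive) = {answer:.2f}")
```

The middle line reproduces the table's 0.47. The rarer the condition, the further the flipped conditional falls below 0.80.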

The general multiplication rule, and independence revisited


Rearranging the definition gives a multiplication rule that works for any events, dependent or not:

P(A and B) = P(B) x P(A | B)

This generalizes the previous lesson. There, for independent events, we multiplied the probability of A by the probability of B; here, for dependent events, the second factor becomes the conditional probability of A given B. The classic case is drawing without replacement:

Two aces drawn from a deck, no replacement:
P(first ace) = 4/52
P(second ace | first ace) = 3/51 (one ace gone, 51 cards left)
P(two aces) = 4/52 x 3/51 = 12/2652 = 1/221 (about 0.45%)
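
The same calculation can be checked in code. This sketch multiplies the two factors with exact fractions, then cross-checks the result by brute force, counting every ordered pair of distinct cards; the deck representation is an assumption made for the example.

```python
from fractions import Fraction
from itertools import permutations

# General multiplication rule: P(first ace) x P(second ace | first ace)
p_first_ace = Fraction(4, 52)
p_second_given_first = Fraction(3, 51)  # one ace gone, 51 cards left
p_two_aces = p_first_ace * p_second_given_first

# Brute-force check: every ordered draw of two different cards from the deck.
deck = ["ace"] * 4 + ["other"] * 48
draws = list(permutations(range(52), 2))  # 52 x 51 = 2652 ordered pairs
both_aces = sum(1 for i, j in draws if deck[i] == "ace" and deck[j] == "ace")
p_by_counting = Fraction(both_aces, len(draws))

# What you would get by wrongly treating the two draws as independent.
p_if_independent = Fraction(4, 52) * Fraction(4, 52)

print(p_two_aces, p_by_counting, p_if_independent)  # 1/221 1/221 1/169
```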

The second draw’s odds depend on the first, which is why you cannot just multiply 4/52 by 4/52. And this gives a clean definition of independence: A and B are independent exactly when knowing B does not change the probability of A, that is, when

P(A | B) = P(A)

If that holds, the general rule collapses back to the probability of A and B equals the probability of A times the probability of B, the simple rule from last lesson. Independence is not a separate fact; it is the special case where the conditional equals the unconditional.
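
The independence test can be sketched as a direct comparison of the conditional with the unconditional. The helper name is hypothetical. The first two checks reuse numbers from this lesson; the "with replacement" case is an added contrast, where the first card goes back into the deck before the second draw.

```python
from fractions import Fraction


def is_independent(p_a_and_b, p_a, p_b):
    """True when P(A | B) equals P(A); assumes P(B) > 0."""
    return p_a_and_b / p_b == p_a


# Screening table: A = tests positive, B = has the condition.
print(is_independent(Fraction(80, 1000), Fraction(170, 1000), Fraction(100, 1000)))  # False

# Two aces without replacement: A = second card is an ace, B = first card is an ace.
# (Before anything is drawn, the second card is an ace with probability 4/52.)
print(is_independent(Fraction(12, 2652), Fraction(4, 52), Fraction(4, 52)))  # False

# Contrast (not in the lesson): the same draws with replacement.
print(is_independent(Fraction(4, 52) * Fraction(4, 52), Fraction(4, 52), Fraction(4, 52)))  # True
```

Exact fractions make the equality test safe; with floats, a tolerance-based comparison such as `math.isclose` would be the more cautious choice.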

Conditional probability is not a side topic for AI; it is close to the whole game.

  • Classifiers compute conditionals. A spam filter is estimating the probability of spam given the words in this email. A medical-triage model estimates the probability of a condition given the symptoms. Classification is the art of computing the probability of a label given the inputs, which is a conditional probability.
  • Language models are conditional probability machines. A language model generates text by repeatedly computing the probability of the next word given all the words so far and sampling from it. That “given the words so far” is exactly why the words are not independent, the dependency the previous lesson flagged and this lesson names.
  • The flip-the-conditional trap is everywhere. When you read the output of a model or a test, it is dangerously easy to hear the probability of a positive test given being sick and act as if it were the probability of being sick given a positive test. Knowing those are different, and that the base rate is what separates them, is what keeps you from over-trusting a confident-sounding detector. It is the lesson-1 base-rate point, now with the machinery to see exactly where the two numbers diverge.

Common mistakes

  • Swapping the direction of the bar. The probability of A given B is not the probability of B given A. “Most sick people test positive” tells you almost nothing about “most positives are sick” without the base rate. This is the error to fear most.
  • Assuming independence when events are dependent. Multiplying unconditional probabilities for dependent events (drawing without replacement, consecutive words) gives the wrong answer; use the conditional factor.
  • Reading conditional probability as causation. A high probability of A given B means B is informative about A, not that B causes A. Conditioning measures association; this is the correlation-is-not-causation caution again.
  • Forgetting the denominator changed. A conditional probability divides by the new, smaller world (the B-cases), not the whole sample space. Losing track of the denominator is how the flip-the-bar error sneaks in.

Key takeaways

  • Conditional probability, the probability of A given B, is the chance of A given that B happened: the probability of A and B divided by the probability of B. Learning B shrinks the world to the B-cases and asks what fraction are also A.
  • A two-way table makes conditionals concrete: restrict to a row or column and divide. The screening table gave the probability of a positive test given the condition as 0.80 but the probability of the condition given a positive test as 0.47.
  • The probability of A given B is generally not the probability of B given A. Swapping the direction (base-rate neglect, the prosecutor’s fallacy) changes the denominator and the answer; the next lesson, Bayes, converts one into the other correctly.
  • The general multiplication rule, the probability of A and B equals the probability of B times the probability of A given B, works for dependent events; independence is just the special case where the probability of A given B equals the probability of A, which returns the simple rule.
  • In AI, classifiers compute the probability of a label given the inputs and language models compute the probability of the next word given the previous words; conditional probability is the form most machine-learning prediction takes.
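
The language-model point can be illustrated with a deliberately tiny sketch. It estimates the probability of the next word given only the single previous word, from pair counts in a made-up corpus. Both the corpus and the one-word context are simplifying assumptions for this example; a real language model conditions on all the words so far and is not built this way.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus, invented for this illustration.
corpus = "the cat sat on the mat the cat ate".split()

# Count how often each word follows each previous word.
next_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    next_counts[prev][nxt] += 1


def p_next_given_prev(nxt, prev):
    """Estimate P(next word | previous word) by restricting to cases where prev occurred."""
    total = sum(next_counts[prev].values())
    return next_counts[prev][nxt] / total if total else 0.0


def p_word(word):
    """Unconditional estimate: the share of all words in the corpus that are this word."""
    return corpus.count(word) / len(corpus)


print(f"P(cat) = {p_word('cat'):.2f}")                                # 0.22
print(f"P(cat | previous word is 'the') = {p_next_given_prev('cat', 'the'):.2f}")  # 0.67
```

Knowing the previous word triples the estimate for "cat" in this toy corpus, which is the dependence between words that the independence rule cannot capture.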