Probability foundations

The opening lesson of this track made the case that AI systems speak in probabilities. This phase takes that language seriously and learns its grammar. Before you can update a belief with evidence (Bayes, two lessons from now) or work with the distributions that AI relies on (the next phase), you need the basics: what a probability actually is, and the handful of rules for combining them. There are only three rules that matter, and once you have them, a surprising amount of reasoning about uncertainty becomes simple arithmetic.

What a probability is

A probability is a number between 0 and 1 that measures how likely something is. A probability of 0 means it cannot happen; a probability of 1 means it is certain; 0.5 means it is as likely as not. That is the whole scale, and anything reported outside it (a “120% chance”) is a mistake.

There are two ways to read that number, and both are useful:

As a long-run frequency. If you flip a fair coin thousands of times, heads comes up close to half the time, so the probability of heads is 0.5. Probability here is what the fraction settles toward as you repeat something many times.
As a degree of belief. “There is a 70% chance of rain tomorrow” is not something you can repeat thousands of times; it is a calibrated statement of confidence. This is the reading from lesson 1, and it is how a model’s confidence score is meant to be understood.

The two readings agree on the math. The rules below work the same whether you think in frequencies or beliefs.
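The long-run frequency reading can be watched directly in a small simulation. This is only a sketch: the function name heads_fraction and the fixed seed are illustrative choices, and the fractions you see will vary with the seed.

```python
import random

def heads_fraction(n_flips, seed=0):
    """Simulate n_flips fair coin flips and return the fraction that were heads."""
    rng = random.Random(seed)  # seeded only so the run is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips

# The fraction wanders for small n and settles toward 0.5 as n grows.
for n in (10, 100, 10_000, 1_000_000):
    print(n, heads_fraction(n))
```

The short runs can land well away from 0.5; the long ones should sit close to it, which is what "settles toward" means.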

Sample spaces and events

To compute a probability, you first lay out what can happen. The sample space is the set of all possible outcomes. For one roll of a six-sided die it is {1, 2, 3, 4, 5, 6}. An event is any outcome or group of outcomes you care about, that is, a subset of the sample space: “rolling an even number” is the event {2, 4, 6}.

When every outcome is equally likely, probability is just counting:

P(event) = (number of outcomes in the event) / (total number of outcomes)

P(even on a die)        = 3 / 6 = 1/2    (outcomes {2, 4, 6})
P(rolling more than 4)  = 2 / 6 = 1/3    (outcomes {5, 6})

That counting definition only works when the outcomes are equally likely (a fair die, a shuffled deck). For unequal cases you reach for the rules below or for data, but the intuition (favorable outcomes over total outcomes) is where it starts.
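The counting definition translates almost word for word into code. The sketch below uses Python sets for the sample space and events; the helper name probability is an illustrative choice, and it assumes every outcome is equally likely.

```python
from fractions import Fraction

def probability(event, sample_space):
    """P(event) by counting; valid only when all outcomes are equally likely."""
    favorable = [outcome for outcome in sample_space if outcome in event]
    return Fraction(len(favorable), len(sample_space))

die = {1, 2, 3, 4, 5, 6}                        # sample space for one roll
even = {2, 4, 6}                                # event: rolling an even number
more_than_four = {o for o in die if o > 4}      # event: rolling a 5 or a 6

print(probability(even, die))            # 1/2
print(probability(more_than_four, die))  # 1/3
```

Using Fraction keeps the answers exact, so they match the hand arithmetic instead of showing rounded decimals.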

Rule one: the complement

The complement of an event is everything in the sample space where the event does not happen. Because something either happens or it does not, their probabilities add to 1:

P(not A) = 1 - P(A)

This looks trivial, but it is one of the most useful tricks in probability, because “the chance it does not happen” is often far easier to compute than “the chance it does.” The classic case is “at least one”: the probability of at least one success is best found as one minus the probability of none.

At least one head in two coin flips:
  P(no heads) = P(tails then tails) = 1/2 x 1/2 = 1/4
  P(at least one head) = 1 - 1/4 = 3/4
Check by listing: HH, HT, TH, TT -> three of the four have a head -> 3/4.

When you see “at least one” in a probability question, reach for the complement first.
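The complement shortcut and the brute-force listing can be compared side by side. This is a sketch with illustrative function names; it assumes fair, independent flips.

```python
from fractions import Fraction
from itertools import product

def at_least_one_head(n_flips):
    """Complement rule: 1 - P(no heads in n_flips fair flips)."""
    p_no_heads = Fraction(1, 2) ** n_flips
    return 1 - p_no_heads

def at_least_one_head_by_listing(n_flips):
    """Brute-force check: list every sequence and count those with a head."""
    sequences = list(product("HT", repeat=n_flips))
    with_head = sum(1 for s in sequences if "H" in s)
    return Fraction(with_head, len(sequences))

print(at_least_one_head(2))             # 3/4
print(at_least_one_head_by_listing(2))  # 3/4
print(at_least_one_head(10))            # 1023/1024
```

The listing grows to 1,024 sequences for ten flips, while the complement stays a one-line calculation, which is the practical reason to reach for it first.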

Rule two: addition (the OR rule)

To find the probability that event A or event B happens, you add their probabilities, then subtract the overlap so you do not count it twice:

P(A or B) = P(A) + P(B) - P(A and B)

The subtraction is the part people forget. Draw one card from a 52-card deck and ask for the probability it is a king or a heart:

P(king)            = 4/52     (four kings)
P(heart)           = 13/52    (thirteen hearts)
P(king and heart)  = 1/52     (the king of hearts, counted in both)
P(king or heart)   = 4/52 + 13/52 - 1/52 = 16/52 = 4/13

Without subtracting that one overlapping card (the king of hearts), you would get 17/52 and be wrong. When two events cannot both happen at once (rolling a 2 and a 5 on a single die), the overlap is zero and the rule simplifies to plain addition.

Figure: The 52-card deck arranged as 4 suits by 13 ranks. The 4 kings (one teal cell per suit) and the 13 hearts (one amber row) overlap at the King of Hearts. Adding kings and hearts naively gives 4 + 13 = 17, but only 16 distinct cards qualify because the King of Hearts was counted twice. The addition rule subtracts the overlap: P(king or heart) = 4/52 + 13/52 - 1/52 = 16/52.
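The king-or-heart calculation can be checked by building the deck and counting. In this sketch the card representation (rank, suit) and the helper p are illustrative choices; set intersection (&) plays the role of "and", and set union (|) the role of "or".

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = set(product(ranks, suits))  # 52 (rank, suit) pairs

kings = {card for card in deck if card[0] == "K"}
hearts = {card for card in deck if card[1] == "hearts"}

def p(event):
    """Probability of drawing a card in `event` from a well-shuffled deck."""
    return Fraction(len(event), len(deck))

by_rule = p(kings) + p(hearts) - p(kings & hearts)  # addition rule
by_counting = p(kings | hearts)                     # count distinct cards directly
forgot_overlap = p(kings) + p(hearts)               # the common mistake

print(by_rule, by_counting)  # 4/13 4/13
print(forgot_overlap)        # 17/52
```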

Rule three: multiplication (the AND rule, for independent events)

To find the probability that two independent events both happen, multiply their probabilities:

P(A and B) = P(A) x P(B)      (only when A and B are independent)

Two events are independent when one happening tells you nothing about the other: two separate coin flips, two rolls of a die. The probability that two flips both land heads is 1/2 x 1/2 = 1/4. For a pipeline of five independent steps that each succeed 90% of the time, the probability that all five succeed is 0.9 x 0.9 x 0.9 x 0.9 x 0.9, which is about 0.59. Five “usually works” steps chain into little better than a coin flip end to end. That drop is bigger than intuition expects, which is why long chains of mostly reliable steps fail more often than you would think.
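The chaining arithmetic is a single product. The sketch below assumes every step is independent and uses a hypothetical helper named chain_success; the 90% figure is the one from the text, and the longer chains are only there to show the trend.

```python
import math

def chain_success(step_probs):
    """P(all steps succeed), assuming the steps are independent."""
    return math.prod(step_probs)

print(round(chain_success([0.5, 0.5]), 4))  # 0.25, two heads in a row

# End-to-end reliability of n steps that each succeed 90% of the time.
for n_steps in (1, 5, 10, 20):
    print(n_steps, round(chain_success([0.9] * n_steps), 4))
```

For five steps this prints about 0.59, and each added step multiplies the total by another 0.9, so the number only goes down as the chain gets longer.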

The crucial caveat: the simple multiplication only holds when the events are independent. If drawing the first card changes the odds for the second (because you did not put it back), the events are dependent, and you need the conditional probability of the next lesson. Multiplying as if independent when the events are not is one of the most common probability errors, and it is exactly the gap the next lesson fills.
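The card example can be checked by plain counting, without any formula from the next lesson. This sketch lists every ordered pair of draws, once with the first card put back and once without, and asks how often both are aces. The deck encoding (rank 1 as the ace) and the helper name are illustrative.

```python
from fractions import Fraction
from itertools import permutations, product

# 52 cards as (rank, suit); rank 1 stands for the ace.
deck = [(rank, suit) for rank in range(1, 14) for suit in "CDHS"]

def p_both_aces(draws):
    """Fraction of equally likely (first, second) draws where both are aces."""
    draws = list(draws)
    hits = sum(1 for first, second in draws if first[0] == 1 and second[0] == 1)
    return Fraction(hits, len(draws))

with_replacement = p_both_aces(product(deck, repeat=2))   # card is put back
without_replacement = p_both_aces(permutations(deck, 2))  # card is kept out
naive_product = Fraction(4, 52) * Fraction(4, 52)         # multiply as if independent

print(with_replacement, naive_product)  # 1/169 1/169
print(without_replacement)              # 1/221
```

The simple product matches the count only when the card goes back into the deck. Without replacement the true answer is smaller, and multiplying as if independent overstates it.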

Why this matters when you use AI

These three rules are not just for dice. They run underneath how AI systems handle uncertainty.

Chaining steps. An agent or pipeline that takes many steps, each with its own chance of working, has an overall success probability governed by the multiplication rule. Five independent 90% steps give about 59% end-to-end, which is why reliable systems shorten chains and add checks.
At least one, via the complement. “What is the chance at least one of these independent checks catches the problem?” is a complement computation: one minus the chance they all miss. The same shape covers “at least one request in a batch fails.”
Scoring a sentence. A language model assigns a probability to a sentence by multiplying the probability of each word given the words before it. That is the multiplication rule applied to a chain, and the “given the words before it” part is the conditional probability the next lesson introduces, which is what makes the words not independent.
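Two of the cases above reduce to a few lines of arithmetic. In this sketch every number is made up for illustration: the catch rates are hypothetical, and the per-word probabilities stand in for values a language model would supply (each one already conditioned on the words before it). The function names are also illustrative.

```python
import math

def at_least_one_catches(catch_rates):
    """Complement rule: 1 - P(every independent check misses)."""
    p_all_miss = math.prod(1 - rate for rate in catch_rates)
    return 1 - p_all_miss

def sentence_probability(word_probs):
    """Product of per-word probabilities, each given the words before it."""
    return math.prod(word_probs)

# Hypothetical: three independent checks that catch a problem 70%, 60%, and 50% of the time.
print(round(at_least_one_catches([0.7, 0.6, 0.5]), 4))  # 0.94

# Hypothetical per-word probabilities for a three-word sentence.
print(round(sentence_probability([0.2, 0.5, 0.9]), 4))  # 0.09
```

The first function leans on independence between the checks. The second does not assume the words are independent, because each input probability is already conditional on the preceding words.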

The grammar of probability is the grammar of how these systems reason about what is likely, so the rules here are the foundation the rest of the phase builds on.

Common pitfalls

Forgetting to subtract the overlap in the OR rule. P(A or B) double-counts the cases where both happen unless you subtract P(A and B). Only when the events cannot co-occur is the overlap zero.
Multiplying dependent events as if independent. P(A and B) = P(A) x P(B) holds only when the events are independent. When one outcome changes the odds of the other, you need conditional probability (the next lesson).
The gambler’s fallacy. Independent events have no memory. After five heads in a row, a fair coin is still 50/50 on the next flip; the coin does not owe you a tails.
Reading a probability outside 0 to 1. Every probability lives in [0, 1]. A computed value above 1 or below 0 means a rule was misapplied.
Confusing AND with OR. “Both happen” (multiply, for independent events) is a different and usually much smaller number than “either happens” (add, minus the overlap).
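The gambler's fallacy can be checked by counting rather than argued. This sketch lists every equally likely sequence of fair flips, keeps only those that open with a streak of heads, and looks at the flip that follows. The function name is an illustrative choice.

```python
from fractions import Fraction
from itertools import product

def p_heads_after_streak(streak_length):
    """Among fair-coin sequences that start with `streak_length` heads,
    return the fraction whose next flip is also heads."""
    sequences = product("HT", repeat=streak_length + 1)
    streak = ("H",) * streak_length
    after_streak = [s for s in sequences if s[:streak_length] == streak]
    heads_next = sum(1 for s in after_streak if s[-1] == "H")
    return Fraction(heads_next, len(after_streak))

print(p_heads_after_streak(5))  # 1/2
```

Only two sequences survive the filter, HHHHHH and HHHHHT, and they are equally likely, so the streak leaves the next flip at 50/50.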

What you should remember

A probability is a number from 0 to 1, readable as a long-run frequency or a degree of belief; the rules work the same either way.
For equally likely outcomes, probability is favorable outcomes over total outcomes, computed from the sample space.
The complement rule (P(not A) = 1 - P(A)) is the shortcut for “at least one”: one minus the probability of none.
The addition rule finds P(A or B) by adding and subtracting the overlap; the multiplication rule finds P(A and B) for independent events by multiplying.
Independence is the fine print. The simple multiplication rule only holds when events do not influence each other; when they do, the next lesson’s conditional probability takes over.