Reinforcement learning: cheatsheet

The one idea that matters

supervised learning: study a dataset of correct answers
reinforcement learning: NO answer key. act in an environment, get rewards,
                        learn which actions pay off (like learning to ride a bike)

The loop

agent observes STATE → takes an ACTION → environment returns REWARD + new state → repeat
goal: maximize TOTAL reward over time

The agent learns a policy: for any state, which action to take.
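
The loop can be sketched in a few lines of Python. This is a minimal illustration, not a library API: the Corridor environment, its reward values, and the names run_episode, always_right and random_policy are all made up for this cheatsheet.

```python
import random


class Corridor:
    """Hypothetical toy environment: positions 0..4, goal at the right end."""

    def __init__(self, length=5):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos  # initial STATE

    def step(self, action):
        # ACTION is -1 (left) or +1 (right); the walls clamp the position.
        self.pos = max(0, min(self.length - 1, self.pos + action))
        done = self.pos == self.length - 1
        reward = 10 if done else -1  # assumed reward scheme
        return self.pos, reward, done  # new state, REWARD, episode finished?


def run_episode(env, policy, max_steps=100):
    """Run observe -> act -> reward until the goal or max_steps; return total reward."""
    state = env.reset()
    total = 0
    for _ in range(max_steps):
        action = policy(state)                  # policy: state -> action
        state, reward, done = env.step(action)  # environment responds
        total += reward
        if done:
            break
    return total


def always_right(state):
    return +1


def random_policy(state):
    return random.choice([-1, +1])


print(run_episode(Corridor(), always_right))   # 3 steps at -1, then +10 -> 7
print(run_episode(Corridor(), random_policy))  # varies per run, never above 7
```

Nothing is learned here yet: the policy is fixed. Learning means changing the policy so that the total returned by run_episode goes up.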

Maze example: state = mouse’s position; actions = up/down/left/right; reward = −1 per step, +10 for cheese. Over many runs, cheese-reaching paths get reinforced until the mouse goes straight there. Nobody gave it the route.
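
One common way to implement "paths get reinforced" is tabular Q-learning, sketched below for a small open grid. The cheatsheet does not name a specific algorithm, so treat this as one possible choice. Assumptions: a 3x4 grid with no inner walls, the cheese in the far corner, the move that reaches the cheese earns +10 in place of −1, and the hyperparameters (alpha, gamma, epsilon, episode count) are illustrative.

```python
import random

ROWS, COLS = 3, 4                 # assumed grid size
START, CHEESE = (0, 0), (2, 3)    # assumed positions
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def step(state, action):
    """Move the mouse; bumping into the outer wall leaves it in place."""
    dr, dc = ACTIONS[action]
    r = min(max(state[0] + dr, 0), ROWS - 1)
    c = min(max(state[1] + dc, 0), COLS - 1)
    nxt = (r, c)
    if nxt == CHEESE:
        return nxt, 10, True      # assumption: +10 replaces the -1 step cost
    return nxt, -1, False


def train(episodes=2000, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning: q[(state, action)] estimates long-run reward."""
    rng = random.Random(seed)
    q = {((r, c), a): 0.0
         for r in range(ROWS) for c in range(COLS) for a in ACTIONS}
    for _ in range(episodes):
        state, done = START, False
        while not done:
            if rng.random() < epsilon:                      # explore
                action = rng.choice(list(ACTIONS))
            else:                                           # exploit
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            nxt, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(nxt, a)] for a in ACTIONS)
            # Nudge the estimate toward reward + discounted future value.
            q[(state, action)] += alpha * (
                reward + gamma * best_next - q[(state, action)])
            state = nxt
    return q


def greedy_path(q, max_steps=50):
    """Follow the highest-valued action from START; return visited positions."""
    state, path = START, [START]
    for _ in range(max_steps):
        action = max(ACTIONS, key=lambda a: q[(state, a)])
        state, _, done = step(state, action)
        path.append(state)
        if done:
            break
    return path


print(greedy_path(train()))  # after training, usually a shortest route to the cheese
```

The route is never written into the code. It emerges from the −1 and +10 rewards alone, which is the point of the maze example.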

RL vs supervised learning

	Supervised	Reinforcement
Signal	correct answer per example	a reward (how good the outcome was)
Tells you	what you should have done	only how good it was, not what to do
Learning style	match the answers	explore, observe rewards, infer what pays off

RL learns from evaluation, not instruction.

Two defining difficulties

Credit assignment: a reward at the end (win the game after 40 moves) must be traced back to the actions that earned it. Delayed, sparse rewards make this hard.
Explore vs exploit: stick with a known-good action (exploit) or try something new that might be better (explore)? This is the restaurant problem: go back to your favorite place, or risk an untried one that could be better or worse. Every agent must balance the two.
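
Each difficulty has a standard textbook handle, sketched here in Python. Discounted returns spread a late reward back over the earlier steps (credit assignment), and an epsilon-greedy rule mixes random tries with the best-known choice (explore vs exploit). These are common techniques chosen for illustration, not the only options; the function names and the gamma and epsilon values are assumptions.

```python
import random


def discounted_returns(rewards, gamma=0.9):
    """For each step t, return r_t + gamma * r_(t+1) + gamma^2 * r_(t+2) + ..."""
    returns, running = [], 0.0
    for r in reversed(rewards):       # work backward from the final reward
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def epsilon_greedy(estimates, epsilon, rng=random):
    """With probability epsilon pick a random option, else the best-known one."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))                         # explore
    return max(range(len(estimates)), key=lambda i: estimates[i])    # exploit


# One reward at the very end: earlier steps get a smaller share of the credit.
print(discounted_returns([0, 0, 0, 1], gamma=0.5))  # [0.125, 0.25, 0.5, 1.0]

# Three restaurants with estimated ratings; epsilon=0 always exploits.
print(epsilon_greedy([1.0, 3.0, 2.0], epsilon=0.0))  # 1
```

With epsilon above 0 the agent occasionally picks a lower-rated option, which is the only way it can discover that its estimates were wrong.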

A note on “agent”

Here (RL): a decision-maker that learns from environment rewards.
Track 20: a language model wired to tools.
Same word, different paradigm. This track means the RL sense.

Where RL shines / strains

Shines: games (AlphaGo beat top Go professional Lee Sedol in 2016) and clean simulations (balancing, walking, steering) where trials are cheap.
Strains: sample-inefficient (can need millions of trials), brittle (a policy often fails outside the narrow conditions it was trained in), and hard to deploy in the messy real world. (More in the next lesson.)
Everyday tie-in: AI assistants are often refined with reinforcement learning from human feedback (RLHF), where human preference judgments supply the reward signal. It is RL's loop applied to an assistant's behavior.

Pitfalls to dodge

“RL needs labeled data.” No. No answer key, just a reward signal.
“The reward says what to do.” No. It says how good the outcome was; the agent infers the rest.
“Credit assignment is a footnote.” No. Tracing a delayed reward to the right actions is central and hard.
“Game wins generalize everywhere.” No. Selective successes; real-world RL is sample-inefficient and brittle.

Words to use precisely

Agent / environment: the learner / the world it acts in.
State, action, reward: what it sees, what it does, the score it gets back.
Policy: the agent’s state-to-action strategy (what it is learning).
Credit assignment: figuring out which earlier actions earned a delayed reward.
Explore vs exploit: trying new actions vs repeating known-good ones.

The one-line version

Supervised learning studies an answer key; reinforcement learning has none, and learns the way living things do, by acting and feeling the consequences.