Cheatsheet: Learning by trial and reward
The one idea that matters
- Supervised learning: study a dataset of correct answers.
- Reinforcement learning: NO answer key. Act in an environment, get rewards, and learn which actions pay off (like learning to ride a bike).
The loop
- Agent observes STATE → takes an ACTION → environment returns REWARD + new state → repeat.
- Goal: maximize TOTAL reward over time.
The agent learns a policy: for any state, which action to take.
Maze example: state = mouse’s position; actions = up/down/left/right; reward = −1 per step, +10 for cheese. Over many runs, cheese-reaching paths get reinforced until the mouse goes straight there. Nobody gave it the route.
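The maze loop can be sketched in a few lines. This is a minimal illustration using tabular Q-learning, one common RL method that the cheatsheet itself does not name. The 4x4 grid, the start and cheese positions, and the values of `alpha`, `gamma`, and `epsilon` are all assumptions chosen for the example.

```python
import random

SIZE = 4                              # assumed 4x4 grid
START, CHEESE = (0, 0), (3, 3)        # assumed positions
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Environment: apply one move, return (new_state, reward, done)."""
    d_row, d_col = ACTIONS[action]
    row = min(max(state[0] + d_row, 0), SIZE - 1)   # walls keep the mouse inside
    col = min(max(state[1] + d_col, 0), SIZE - 1)
    new_state = (row, col)
    if new_state == CHEESE:
        return new_state, 10.0, True                # +10 for cheese
    return new_state, -1.0, False                   # -1 per step

def greedy(q, state):
    """Policy: the action with the highest learned value in this state."""
    return max(ACTIONS, key=lambda a: q.get((state, a), 0.0))

def train(episodes=2000, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = {}                                          # (state, action) -> value
    for _ in range(episodes):
        state = START
        for _step in range(100):                    # cap the length of one run
            if rng.random() < epsilon:
                action = rng.choice(list(ACTIONS))  # occasionally try something new
            else:
                action = greedy(q, state)
            new_state, reward, done = step(state, action)
            best_next = 0.0 if done else max(
                q.get((new_state, a), 0.0) for a in ACTIONS)
            old = q.get((state, action), 0.0)
            # Nudge the estimate toward reward + discounted value of what follows.
            q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
            state = new_state
            if done:
                break
    return q

def greedy_path(q, max_steps=50):
    """Follow the learned policy from START with no exploration."""
    state, path = START, [START]
    for _ in range(max_steps):
        if state == CHEESE:
            break
        state, _reward, _done = step(state, greedy(q, state))
        path.append(state)
    return path

q = train()
path = greedy_path(q)
print(len(path) - 1, "moves:", path)
```

No route is ever written into the code: the environment only hands back a new state and a number, and the policy emerges from the table of learned values.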
RL vs supervised learning
| | Supervised | Reinforcement |
|---|---|---|
| Signal | correct answer per example | a reward (how good the outcome was) |
| Tells you | what you should have done | only how good it was, not what to do |
| Learning style | match the answers | explore, observe rewards, infer what pays off |
RL learns from evaluation, not instruction.
Two defining difficulties
- Credit assignment: a reward at the end (win the game after 40 moves) must be traced back to the actions that earned it. Delayed, sparse rewards make this hard.
- Explore vs exploit: stick with a known-good action (exploit) or try something new that might be better (explore)? The restaurant problem. Every agent must balance the two.
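Both difficulties can be made concrete with a small sketch. The first function uses discounted returns, one simple way of spreading a final reward back over earlier moves. The second uses epsilon-greedy selection, one simple way of balancing exploration and exploitation. Neither is the only option. The restaurant names, their satisfaction rates, and the values of `gamma` and `epsilon` are hypothetical.

```python
import random

def returns_to_go(rewards, gamma=0.9):
    """Discounted return from each step onward: G_t = r_t + gamma * G_(t+1)."""
    returns, running = [], 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    return returns[::-1]

# Credit assignment: 40 moves, with a reward only on the final, winning move.
game_rewards = [0.0] * 39 + [1.0]
credit = returns_to_go(game_rewards)
print(f"credit for move 1: {credit[0]:.3f}, for move 40: {credit[-1]:.3f}")

def epsilon_greedy_dinners(good_night_rate, nights=5000, epsilon=0.1, seed=0):
    """Pick a restaurant each night: mostly the best so far, sometimes a random one."""
    rng = random.Random(seed)
    names = list(good_night_rate)
    visits = {name: 0 for name in names}
    estimate = {name: 0.0 for name in names}
    for _ in range(nights):
        if rng.random() < epsilon:
            choice = rng.choice(names)                 # explore
        else:
            choice = max(names, key=estimate.get)      # exploit
        reward = 1.0 if rng.random() < good_night_rate[choice] else 0.0
        visits[choice] += 1
        # Running average of the rewards seen at this restaurant.
        estimate[choice] += (reward - estimate[choice]) / visits[choice]
    return estimate, visits

# Hypothetical chance of a good dinner at each place; the agent never sees these.
RESTAURANTS = {"usual": 0.5, "new_cafe": 0.8, "diner": 0.3}
estimate, visits = epsilon_greedy_dinners(RESTAURANTS)
favorite = max(estimate, key=estimate.get)
print("favorite:", favorite, "visits:", visits)
```

With `gamma` below 1, early moves receive only a small share of the final reward, which is one reason long, sparse-reward tasks are slow to learn. With `epsilon` set to 0, the agent in this sketch would never leave the first restaurant that gave it a good night.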
A note on “agent”
- Here (RL): a decision-maker that learns from environment rewards.
- Track 20: a language model wired to tools.
- Same word, different paradigm. This track uses the RL sense.
Where RL shines / strains
- Shines: games (AlphaGo beat Go world champion Lee Sedol in 2016) and clean simulations (balancing, walking, steering) where trials are cheap.
- Strains: sample-inefficient (can need millions of trials), brittle (narrow policies), and hard to deploy in the messy real world. (More in the next lesson.)
- Everyday tie-in: AI assistants are often refined using human feedback as the reward signal, which is RL’s loop applied to behavior.
Pitfalls to dodge
- “RL needs labeled data.” No. No answer key, just a reward signal.
- “The reward says what to do.” No. It says how good the outcome was; the agent infers the rest.
- “Credit assignment is a footnote.” No. Tracing a delayed reward to the right actions is central and hard.
- “Game wins generalize everywhere.” No. The successes are selective; real-world RL is sample-inefficient and brittle.
Words to use precisely
- Agent / environment: the learner / the world it acts in.
- State, action, reward: what it sees, what it does, the score it gets back.
- Policy: the agent’s state-to-action strategy (what it is learning).
- Credit assignment: figuring out which earlier actions earned a delayed reward.
- Explore vs exploit: trying new actions vs repeating known-good ones.
The one-line version
Supervised learning studies an answer key; reinforcement learning has none, and learns the way living things do, by acting and feeling the consequences.