Cheatsheet: What reinforcement learning actually is
The one idea
Reinforcement learning is the third learning paradigm, alongside supervised and unsupervised learning: an agent learns by interacting with an environment that returns reward. No oracle supplies the correct action, and the data distribution depends on the agent’s own choices.
The three paradigms
| Paradigm | Inputs | What is learned |
|---|---|---|
| Supervised | (input, label) pairs | The input-to-label mapping (a function) |
| Unsupervised | Unlabeled data | Structure (clusters, factors, compressed codes) |
| Reinforcement | States, actions, rewards from acting | A policy that maximizes total reward over time |
The agent-environment-reward loop
            +---------------+
            |  ENVIRONMENT  |
            +---------------+
              ^    |       |
    action a_t|    |s_(t+1)|r_(t+1)
              |    v       v
            +---------------+
            |     AGENT     |
            +---------------+
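The loop in the diagram above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the track's implementation: the `Corridor` environment, its `reset`/`step` methods, and `random_policy` are all hypothetical names invented here.

```python
import random


class Corridor:
    """Hypothetical toy environment: walk along 5 cells to reach the goal cell."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); the walls clamp the position
        self.state = min(max(self.state + action, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0  # reward arrives only at the goal
        return self.state, reward, done


def random_policy(state, rng):
    """A policy maps a state to an action; this one ignores the state."""
    return rng.choice([-1, 1])


def run_episode(env, policy, rng, max_steps=1000):
    state = env.reset()
    total_reward = 0.0
    trajectory = []
    for _ in range(max_steps):
        action = policy(state, rng)                  # agent picks a_t from s_t
        next_state, reward, done = env.step(action)  # env returns s_(t+1), r_(t+1)
        trajectory.append((state, action, reward))
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward, trajectory


rng = random.Random(0)
total, traj = run_episode(Corridor(), random_policy, rng)
print(f"steps={len(traj)} total_reward={total}")
```

Note that the environment never reports which action would have been correct; the agent only sees the next state and a number.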
state s = information about the situation
action a = a choice the agent can make
reward r = a number expressing how good an action's result was
policy = the agent's rule for choosing actions from a state
goal = MAXIMIZE total reward over time (not the immediate reward)
What makes RL harder than supervised learning
| Difficulty | Why it bites |
|---|---|
| No oracle action | The environment returns a reward, not the action you should have taken |
| Delayed reward (credit assignment) | Important rewards arrive many steps after the actions that caused them |
| Distribution shift from the policy | The states the agent visits depend on its policy; training data drifts as the policy changes |
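To make the delayed-reward row concrete, here is a small sketch that computes the total reward still to come from each step of an episode. It assumes an undiscounted sum, and the episode data is made up for illustration; `reward_to_go` is a hypothetical helper name.

```python
def reward_to_go(rewards):
    """Undiscounted total reward from each step to the end of the episode.

    rewards[t] is the reward received after the action taken at step t.
    """
    totals = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        totals[t] = running
    return totals


# A hypothetical 5-step episode where the only reward arrives at the end.
rewards = [0.0, 0.0, 0.0, 0.0, 1.0]
print(reward_to_go(rewards))  # [1.0, 1.0, 1.0, 1.0, 1.0]
```

Every step ends up with the same total, so the reward signal alone does not say which of the five actions mattered. That ambiguity is the credit assignment problem.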
Exploration vs exploitation (the through-line)
Exploit = pick the action that looks best on current evidence (use what you've learned)
Explore = pick an action you don't yet understand well (gather information)
Pure exploit -> locks in on possibly-suboptimal noisy estimates
Pure explore -> learns but never collects on what was learned
Every method in this track is a precise MIX of the two.
Where RL shows up
| Domain | Example |
|---|---|
| Board / video games | AlphaGo, AlphaZero; DQN on Atari |
| Robotics and control | Walking, manipulation, flight |
| Recommendation / personalization | Long-running engagement loops |
| Resource allocation | Routing, scheduling, power balancing |
| LLM alignment | RLHF (Track 5’s rlhf-and-dpo lesson) — this track teaches the RL mechanics underneath |
Pitfalls to dodge
- Treating RL as supervised learning with hidden labels.
- Thinking RL is only for games.
- Mistaking the reward for an objective property (it is designed; bad design -> agents that game it).
- Equating exploration with pure randomness (principled methods target uncertainty).
- Underestimating distribution shift from the changing policy.
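As one illustration of exploration that targets uncertainty rather than rolling dice, the sketch below uses a UCB1-style rule: each action's estimated mean gets a bonus that shrinks as the action is tried more often. UCB1 is brought in here only as a well-known example, not necessarily a method this track covers, and `ucb_choice` and the numbers are hypothetical.

```python
import math


def ucb_choice(counts, means, c=2.0):
    """Pick the arm with the highest estimated mean plus an uncertainty bonus."""
    # Try every arm once before trusting any estimate.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    total = sum(counts)
    scores = [m + math.sqrt(c * math.log(total) / n) for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)


# Arm 0 looks better but is well understood; arm 1 has barely been tried.
counts = [100, 2]
means = [0.60, 0.50]
print(ucb_choice(counts, means))  # 1
```

The choice is deterministic: arm 1 wins because its estimate rests on only two samples, not because of a random draw.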
Words to use precisely
- Agent / environment / state / action / reward: the five pieces of the loop.
- Policy: the agent’s rule for choosing actions given a state.
- Total / cumulative reward (return): what the agent maximizes over time (formalized next lesson).
- Credit assignment: figuring out which earlier actions deserve credit for a later reward.
- Exploration vs exploitation: the central tension; every method’s design is an answer to it.
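The exploration vs exploitation tension can be seen in a small epsilon-greedy bandit, sketched below. It is an illustration under assumptions, not a method taken from this cheatsheet: the three arms, their success probabilities, and the function name `epsilon_greedy_bandit` are hypothetical. With probability `epsilon` the agent explores a random arm; otherwise it exploits the arm with the best current estimate.

```python
import random


def epsilon_greedy_bandit(true_means, epsilon, steps, seed=0):
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms       # how often each arm was pulled
    estimates = [0.0] * n_arms  # running mean reward per arm
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: pick any arm
        else:
            # exploit: best current estimate (ties go to the lowest index)
            arm = max(range(n_arms), key=estimates.__getitem__)
        reward = 1.0 if rng.random() < true_means[arm] else 0.0  # Bernoulli reward
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # incremental mean
        total_reward += reward
    return total_reward, counts


true_means = [0.2, 0.5, 0.8]  # hypothetical arms; the last one is best
for eps in (0.0, 0.1, 1.0):
    total, counts = epsilon_greedy_bandit(true_means, eps, steps=2000)
    print(f"epsilon={eps}: total_reward={total:.0f} pulls={counts}")
```

With estimates initialized to zero and ties broken toward the lowest index, `epsilon=0.0` never leaves arm 0, which is the "pure exploit" failure in miniature. `epsilon=1.0` samples every arm but never favors the best one.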