
Cheatsheet: What reinforcement learning actually is

Reinforcement learning is the third learning paradigm: an agent learns by interacting with an environment that returns reward. No oracle supplies the correct action, and the data distribution depends on the agent’s own choices.

| Paradigm | Inputs | What is learned |
|---|---|---|
| Supervised | (input, label) pairs | The input-to-label mapping (a function) |
| Unsupervised | Unlabeled data | Structure (clusters, factors, compressed codes) |
| Reinforcement | States, actions, rewards from acting | A policy that maximizes total reward over time |

              +-------------------------+
              |       ENVIRONMENT       |
              +-------------------------+
                 ^         |          |
      action a_t |         | s_(t+1)  | r_(t+1)
                 |         v          v
              +-------------------------+
              |          AGENT          |
              +-------------------------+

state s = information about the situation
action a = a choice the agent can make
reward r = a number expressing how good an action's result was
policy = the agent's rule for choosing actions from a state
goal = MAXIMIZE total reward over time (not the immediate reward)
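
The loop in the diagram can be sketched in a few lines of Python. Everything here is an illustrative assumption rather than part of the lesson: `CorridorEnv`, its `reset`/`step` methods, and the two policies are hypothetical names for a toy corridor where the only reward sits at the far end.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: cells 0..length-1 in a row.

    The agent starts at cell 0. Reaching the last cell pays +1 and ends
    the episode; every other step pays 0.
    """

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); walls clamp the position.
        self.state = min(max(self.state + action, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def random_policy(state, rng):
    # A policy maps a state to an action; this one ignores the state.
    return rng.choice([-1, 1])


def always_right_policy(state, rng):
    return 1


def run_episode(env, policy, rng, max_steps=100):
    """Run one agent-environment loop and return (total reward, steps taken)."""
    state = env.reset()
    total_reward = 0.0
    steps = 0
    for _ in range(max_steps):
        action = policy(state, rng)             # agent chooses a_t from s_t
        state, reward, done = env.step(action)  # environment returns s_(t+1), r_(t+1)
        total_reward += reward                  # the goal is the total, not one step
        steps += 1
        if done:
            break
    return total_reward, steps


if __name__ == "__main__":
    rng = random.Random(0)
    print(run_episode(CorridorEnv(5), always_right_policy, rng))
    print(run_episode(CorridorEnv(5), random_policy, rng))
```

Note that the environment never says which action was correct. It only returns the next state and a number, which is the "no oracle action" point in miniature.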

What makes RL harder than supervised learning

| Difficulty | Why it bites |
|---|---|
| No oracle action | The environment returns a reward, not the action you should have taken |
| Delayed reward (credit assignment) | Important rewards arrive many steps after the actions that caused them |
| Distribution shift from the policy | The states the agent visits depend on its policy; training data drifts as the policy changes |
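
The third row is easy to see in a toy setting. The sketch below is a hypothetical illustration (the corridor, `visit_counts`, and the `p_right` parameter are assumptions, not something from this track): it counts which states get visited when the only thing that changes is the policy's probability of stepping right.

```python
import random
from collections import Counter


def visit_counts(p_right, episodes=200, length=6, max_steps=20, seed=0):
    """Count state visits in a corridor of `length` cells under one policy.

    The policy steps right with probability `p_right`, otherwise left.
    Walls clamp the position to 0..length-1.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _episode in range(episodes):
        state = 0
        for _step in range(max_steps):
            counts[state] += 1
            move = 1 if rng.random() < p_right else -1
            state = min(max(state + move, 0), length - 1)
    return counts


if __name__ == "__main__":
    for p in (0.0, 0.5, 1.0):
        counts = visit_counts(p)
        print(p, [counts[s] for s in range(6)])
```

With `p_right=0.0` the agent only ever sees state 0, while with `p_right=1.0` most of its data comes from the far end. Same environment, different policy, different dataset: whatever the agent learns from one distribution may not transfer once its policy moves on.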

Exploration vs exploitation (the through-line)

Exploit = pick the action that looks best on current evidence (use what you've learned)
Explore = pick an action you don't yet understand well (gather information)
Pure exploit -> locks in on possibly-suboptimal noisy estimates
Pure explore -> learns but never collects on what was learned
Every method in this track is a precise MIX of the two.
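
One simple way to mix the two is epsilon-greedy on a multi-armed bandit: explore with probability epsilon, exploit otherwise. This is a minimal sketch under stated assumptions, not necessarily the method the track introduces first. The function name `run_bandit`, the three success probabilities in `true_means`, and the step count are all hypothetical.

```python
import random


def run_bandit(epsilon, true_means=(0.2, 0.5, 0.8), steps=2000, seed=0):
    """Epsilon-greedy on a Bernoulli bandit.

    Returns (total reward, pulls per arm, estimated mean per arm).
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    estimates = [0.0] * n_arms   # running mean reward per arm
    pulls = [0] * n_arms
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick an arm uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the best current estimate
            # (ties go to the lowest index).
            arm = max(range(n_arms), key=lambda a: estimates[a])
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        pulls[arm] += 1
        # Incremental update of the running mean.
        estimates[arm] += (reward - estimates[arm]) / pulls[arm]
        total += reward
    return total, pulls, estimates


if __name__ == "__main__":
    for eps in (0.0, 0.1, 1.0):
        total, pulls, _ = run_bandit(eps)
        print(f"epsilon={eps}: total={total}, pulls={pulls}")
```

With `epsilon=0.0` the agent never leaves arm 0 in this setup: every estimate starts at zero, ties go to the first arm, and once arm 0 pays out its estimate stays above the untried arms. That is the "locks in" failure of pure exploitation. With `epsilon=1.0` it samples every arm but never favors the best one. Uniform random exploration is the crudest choice; more principled methods direct exploration toward actions whose value is still uncertain.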

Where RL is used

| Domain | Example |
|---|---|
| Board / video games | AlphaGo, AlphaZero; DQN on Atari |
| Robotics and control | Walking, manipulation, flight |
| Recommendation / personalization | Long-running engagement loops |
| Resource allocation | Routing, scheduling, power balancing |
| LLM alignment | RLHF (Track 5’s rlhf-and-dpo lesson); this track teaches the RL mechanics underneath |

Common misconceptions

  • Treating RL as supervised learning with hidden labels.
  • Thinking RL is only for games.
  • Mistaking the reward for an objective property of the world (it is designed, and a badly designed reward produces agents that game it).
  • Equating exploration with pure randomness (principled methods target uncertainty).
  • Underestimating distribution shift from the changing policy.

Key terms

  • Agent / environment / state / action / reward: the five pieces of the loop.
  • Policy: the agent’s rule for choosing actions given a state.
  • Total / cumulative reward (return): what the agent maximizes over time (formalized next lesson).
  • Credit assignment: figuring out which earlier actions deserve credit for a later reward.
  • Exploration vs exploitation: the central tension; every method’s design is an answer to it.
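
The difference between immediate and total reward fits in a few lines. The two reward sequences below are made-up numbers for illustration, and the plain undiscounted sum is an assumption used only until the return is formalized in the next lesson.

```python
# Hypothetical reward sequences for two plans, one number per time step.
plans = {
    "grab_now": [5, 0, 0, 0],       # pays off immediately, then nothing
    "invest_first": [0, 0, 0, 10],  # pays nothing until the last step
}


def pick_by_immediate(plans):
    """Choose the plan with the largest first-step reward."""
    return max(plans, key=lambda name: plans[name][0])


def pick_by_total(plans):
    """Choose the plan with the largest total (cumulative) reward."""
    return max(plans, key=lambda name: sum(plans[name]))


if __name__ == "__main__":
    print("immediate:", pick_by_immediate(plans))
    print("total:", pick_by_total(plans))
```

The two rules disagree here, and an agent that judges only the first step has no way to see why the second plan is better. Tracing the reward at the final step back to the earlier choices that enabled it is the credit assignment problem.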