Reinforcement learning: cheatsheet

The one idea

Reinforcement learning is the third learning paradigm, alongside supervised and unsupervised learning: an agent learns by interacting with an environment that returns rewards. No oracle supplies the correct action, and the data distribution depends on the agent’s own choices.

The three paradigms

Paradigm	Inputs	What is learned
Supervised	(input, label) pairs	The input-to-label mapping (a function)
Unsupervised	Unlabeled data	Structure (clusters, factors, compressed codes)
Reinforcement	States, actions, rewards from acting	A policy that maximizes total reward over time

The agent-environment-reward loop

                  +---------------+
                  |  ENVIRONMENT  |
                  +---------------+
                  ^   |       |
        action a_t|   |s_(t+1)|r_(t+1)
                  |   v       v
                  +---------------+
                  |     AGENT     |
                  +---------------+

state s   = information about the situation
action a  = a choice the agent can make
reward r  = a number expressing how good an action's result was
policy    = the agent's rule for choosing actions from a state
goal      = MAXIMIZE total reward over time (not the immediate reward)
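
The loop above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `CorridorEnv`, `random_policy`, and `run_episode` are hypothetical names invented here, and the five-position corridor with a single reward at the far end is an assumed toy problem.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: positions 0..4, reward 1.0 on reaching position 4."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); the position is clipped to the corridor
        self.state = min(max(self.state + action, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def random_policy(state, rng):
    # A policy maps a state to an action; this one ignores the state entirely.
    return rng.choice([-1, 1])


def run_episode(env, policy, rng, max_steps=100):
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state, rng)             # agent -> environment: a_t
        state, reward, done = env.step(action)  # environment -> agent: s_(t+1), r_(t+1)
        total_reward += reward                  # the goal is the total, not one step
        if done:
            break
    return total_reward


rng = random.Random(0)
total = run_episode(CorridorEnv(), random_policy, rng)
print(total)
```

Note that the environment never tells the agent which action it should have taken; it only returns the next state and a number.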

What makes RL harder than supervised learning

Difficulty	Why it bites
No oracle action	The environment returns a reward, not the action you should have taken
Delayed reward (credit assignment)	Important rewards arrive many steps after the actions that caused them
Distribution shift from the policy	The states the agent visits depend on its policy; training data drifts as the policy changes
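
The last row can be made concrete with a small sketch. Assuming a hypothetical five-state corridor (states 0 to 4, start at state 0) and two hand-written policies, the code below counts which states each policy visits. The function and policy names are illustrative only.

```python
from collections import Counter


def visit_counts(policy, n_states=5, steps=20):
    """Count visited states on a hypothetical corridor, starting at state 0."""
    state = 0
    counts = Counter()
    for _ in range(steps):
        counts[state] += 1
        move = policy(state)  # -1 = left, +1 = right
        state = min(max(state + move, 0), n_states - 1)
    return counts


def always_left(state):
    return -1


def always_right(state):
    return +1


left_counts = visit_counts(always_left)
right_counts = visit_counts(always_right)
print(dict(left_counts))   # {0: 20}
print(dict(right_counts))  # {0: 1, 1: 1, 2: 1, 3: 1, 4: 16}
```

The same environment yields two very different datasets. As a learning agent's policy changes, the states it collects data from change with it.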

Exploration vs exploitation (the through-line)

Exploit  = pick the action that looks best on current evidence (use what you've learned)
Explore  = pick an action you don't yet understand well (gather information)
Pure exploit -> locks in on whatever looks best under noisy early estimates, even if it is suboptimal
Pure explore -> learns but never collects on what was learned
Every method in this track is a specific, deliberate MIX of the two.
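
One simple way to mix the two is epsilon-greedy: with probability epsilon pick a random action, otherwise pick the action with the best current estimate. The sketch below is an illustration chosen here, not necessarily the first method this track covers. The three-action problem, its Gaussian rewards, and the name `epsilon_greedy_bandit` are assumptions made for the example.

```python
import random


def epsilon_greedy_bandit(true_means, epsilon, steps, seed=0):
    """Run epsilon-greedy on a hypothetical problem with noisy rewards per action."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    estimates = [0.0] * n_arms  # running mean reward per action
    pulls = [0] * n_arms        # how often each action was chosen
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                           # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        reward = rng.gauss(true_means[arm], 1.0)                  # noisy reward
        pulls[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / pulls[arm]  # incremental mean
        total_reward += reward
    return total_reward, pulls


# epsilon = 0.0 is pure exploit, epsilon = 1.0 is pure explore.
for eps in (0.0, 0.1, 1.0):
    total, pulls = epsilon_greedy_bandit([0.2, 0.5, 0.8], eps, steps=2000)
    print(eps, round(total / 2000, 3), pulls)
```

Comparing the pull counts across the three settings shows the trade-off: with epsilon at 0.0 the agent can stay on an early favorite, and with epsilon at 1.0 it spreads its choices evenly no matter what it has learned.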

Where RL shows up

Domain	Example
Board / video games	AlphaGo, AlphaZero; DQN on Atari
Robotics and control	Walking, manipulation, flight
Recommendation / personalization	Long-running engagement loops
Resource allocation	Routing, scheduling, power balancing
LLM alignment	RLHF (Track 5’s `rlhf-and-dpo` lesson); this track teaches the RL mechanics underneath

Pitfalls to dodge

Treating RL as supervised learning with hidden labels.
Thinking RL is only for games.
Mistaking the reward for an objective property of the world (a designer chooses it; a badly designed reward produces agents that game it).
Equating exploration with pure randomness (principled methods direct exploration toward actions whose value is still uncertain).
Underestimating distribution shift from the changing policy.

Words to use precisely

Agent / environment / state / action / reward: the five pieces of the loop.
Policy: the agent’s rule for choosing actions given a state.
Total / cumulative reward (return): what the agent maximizes over time (formalized in the next lesson).
Credit assignment: figuring out which earlier actions deserve credit for a later reward.
Exploration vs exploitation: the central tension; every method’s design is an answer to it.