Cheatsheet: Introduction to deep reinforcement learning
The three ML regimes
| Regime | Data | Asked to | Feedback |
|---|---|---|---|
| Supervised | (input, correct output) pairs, fixed | predict the output | immediate, per-example label |
| Unsupervised | unlabeled data, fixed | find structure (clusters, embeddings) | none per-example |
| Reinforcement learning | agent’s own actions in an environment, generated as it learns | choose actions over time | reward, often delayed |
RL is not supervised-with-rewards: the data is generated by the policy (moving target), the reward is delayed (credit assignment), and the agent must act.
The agent-environment loop
    [agent: policy π(s)] --- action a_t --------------------> [environment]
    [agent: policy π(s)] <-- state s_(t+1), reward r_t ------ [environment]

At each timestep t: observe state, pick action via policy, receive reward and next state, repeat.
A task is either episodic (it terminates) or infinite-horizon (it runs forever).
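A minimal sketch of this loop, assuming a made-up five-cell corridor environment. `Corridor`, `always_right`, and `run_episode` are hypothetical names for illustration; the `reset`/`step` interface is an assumption and not any particular library's API.

```python
class Corridor:
    """Hypothetical toy environment: cells 0..4, reward 1.0 only on reaching cell 4."""

    def __init__(self, length=5):
        self.goal = length - 1
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position  # initial state

    def step(self, action):
        # action is -1 (left) or +1 (right); the agent cannot leave the corridor
        self.position = min(max(self.position + action, 0), self.goal)
        done = self.position == self.goal
        reward = 1.0 if done else 0.0
        return self.position, reward, done


def always_right(state):
    """A deterministic policy that ignores the state and always moves right."""
    return 1


def run_episode(env, policy, max_steps=100):
    state = env.reset()
    rewards = []
    for _ in range(max_steps):
        action = policy(state)                  # pick action via policy
        state, reward, done = env.step(action)  # receive reward and next state
        rewards.append(reward)
        if done:                                # episodic case: stop at termination
            break
    return rewards


rewards = run_episode(Corridor(), always_right)
print(rewards)  # [0.0, 0.0, 0.0, 1.0]
```

The only nonzero reward arrives on the last step, which is the delayed-reward situation described above.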
Vocabulary
- State s_t: what the agent sees at time t.
- Action a_t: what the agent does.
- Reward r_t: the number the environment hands back.
- Policy π(s): the function (often a neural net) that picks the action. Deterministic a = π(s) or stochastic π(a | s).
- Return G_t: accumulated reward from t onward.
- Discount γ: weight for future rewards, 0 ≤ γ ≤ 1.
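To make the deterministic versus stochastic distinction concrete, here is a small sketch. The two-action setup, the threshold at state 3, and the 0.9 probability are arbitrary values assumed for illustration.

```python
import random

ACTIONS = ["left", "right"]  # hypothetical action set


def deterministic_policy(state):
    """a = π(s): the same state always maps to the same action."""
    return "right" if state < 3 else "left"


def stochastic_policy_probs(state):
    """π(a | s): a probability for each action, given the state."""
    p_right = 0.9 if state < 3 else 0.1
    return {"left": 1.0 - p_right, "right": p_right}


def sample_action(state, rng):
    """Draw one action from π(· | s)."""
    probs = stochastic_policy_probs(state)
    return rng.choices(ACTIONS, weights=[probs[a] for a in ACTIONS], k=1)[0]


rng = random.Random(0)
samples = [sample_action(0, rng) for _ in range(1000)]
print(deterministic_policy(0))                # right
print(samples.count("right") / len(samples))  # close to 0.9
```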
The return formula
    G_t = r_t + γ·r_(t+1) + γ²·r_(t+2) + γ³·r_(t+3) + ...

With γ < 1 and bounded rewards, the discount keeps the sum finite on infinite horizons; it also weights nearer rewards more. Worked: r = (0, 0, 1), γ = 0.9 gives G_0 = 0 + 0.9·0 + 0.9²·1 = 0.81.
The agent’s goal: choose policy π to maximize the expected return (expectation over environment stochasticity and policy randomness).
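The return formula can be sketched for a finite reward sequence as follows. The function names are hypothetical. Accumulating from the last reward backwards is one common way to compute it, using G_t = r_t + γ·G_(t+1).

```python
def discounted_return(rewards, gamma):
    """G_0 = r_0 + γ·r_1 + γ²·r_2 + ... for a finite list of rewards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # G_t = r_t + γ·G_(t+1)
    return g


def returns_from_each_step(rewards, gamma):
    """The list [G_0, G_1, ..., G_(T-1)] for the same reward sequence."""
    out = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]


g0 = discounted_return([0, 0, 1], 0.9)
print(round(g0, 6))  # 0.81, matching the worked example
```

Maximizing the expected return means maximizing the average of this quantity over many episodes, since both the environment and the policy can be random.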
Why “deep”
Classical RL tabulates one value per state. An Atari frame has 210·160·3 = 100,800 components, so there are more possible frames than atoms in the observable universe, and a table with one row per frame is out of reach. The fix: replace the table with a function approximator (a neural network). Gain: high-dimensional states (pixels, boards, language). Cost: classical convergence guarantees break.
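A rough back-of-the-envelope check of the frame count, under two assumptions not stated above: each component takes one of 256 values (8-bit color), and the number of atoms in the observable universe is on the order of 10^80, a commonly quoted estimate. The comparison is done in log10 because the frame count is far too large to print.

```python
import math

height, width, channels = 210, 160, 3
values_per_component = 256  # assumption: 8-bit color channels
log10_atoms = 80            # assumption: order-of-magnitude estimate

components = height * width * channels
# number of possible frames = values_per_component ** components
log10_frames = components * math.log10(values_per_component)

print(components)  # 100800
print(f"possible frames: about 10^{log10_frames:.0f}")
print(log10_frames > log10_atoms)  # True
```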
What makes deep RL hard (the track’s agenda)
| Difficulty | What it is |
|---|---|
| Credit assignment | Reward arrives many steps after the action that caused it |
| Distribution shift | Policy changes during training, so the dataset is a moving target |
| Function approximation | NN approximation breaks tabular convergence proofs |
| Exploration vs exploitation | Use what works vs try what might be better |
| Sample efficiency | Acting costs real time, especially in robotics |
Every later lesson responds to one or more of these.
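As one illustration of the exploration vs exploitation row, here is a sketch of epsilon-greedy action selection. This cheatsheet does not name the technique; it is shown as a common example, and the value estimates and epsilon are made-up numbers.

```python
import random


def epsilon_greedy(value_estimates, epsilon, rng):
    """With probability epsilon explore (random action), otherwise exploit (best estimate)."""
    if rng.random() < epsilon:
        return rng.randrange(len(value_estimates))  # try what might be better
    # use what works: index of the highest current estimate
    return max(range(len(value_estimates)), key=lambda a: value_estimates[a])


rng = random.Random(0)
estimates = [0.2, 0.5, 0.1]  # hypothetical per-action value estimates
picks = [epsilon_greedy(estimates, 0.1, rng) for _ in range(1000)]
print(picks.count(1) / len(picks))  # mostly action 1, with occasional exploration
```

With epsilon = 0 the agent never explores, so it can stay stuck on an action whose estimate only looks best.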
Where deep RL has appeared
- Atari (DQN, 2013-15): pixels → joystick.
- AlphaGo / AlphaZero (2016-17): board → move, reward = win/loss.
- Robotics: simulated training, real-robot transfer.
- Preference-based post-training in LLMs: ChatGPT, Claude, Gemini are post-trained with RLHF or related methods. Reward model trained from human preferences in the canonical recipe (or AI-generated preferences in RLAIF / Constitutional AI; DPO-style direct-preference methods now compete). Lesson 13.
Pitfalls to dodge
- “RL is supervised with rewards.” No: data is policy-generated, reward is delayed, agent must act.
- Treating reward as a label. Reward says how good your action was, not what the right action was.
- “The agent learns the environment.” Standard RL learns a policy; the environment is fixed. Learning the environment is model-based RL (lessons 9-10).
- “Deep = solved.” Function approximation scales RL but breaks the classical proofs; engineering stabilizers (replay buffers, target networks, trust regions) exist because of this.
The one-line version
Reinforcement learning is the regime where an agent acts in an environment and learns a policy from delayed rewards; “deep” means a neural network replaces classical RL’s lookup table, gaining scale and losing the textbook convergence guarantees.