Deep reinforcement learning: cheatsheet

The three ML regimes

Regime	Data	Asked to	Feedback
Supervised	(input, correct output) pairs, fixed	predict the output	immediate, per-example label
Unsupervised	unlabeled data, fixed	find structure (clusters, embeddings)	none per-example
Reinforcement learning	agent’s own actions in an environment, generated as it learns	choose actions over time	reward, often delayed

RL is not supervised-with-rewards: the data is generated by the policy (moving target), the reward is delayed (credit assignment), and the agent must act.

The agent-environment loop

                 action a_t
                 -----------> [environment]
[agent: policy π(s)]
                 <-----------
                 state s_(t+1), reward r_t

At each timestep t: observe state s_t, pick action a_t via the policy, receive reward r_t and next state s_(t+1), repeat. A task is either episodic (the interaction terminates) or infinite-horizon (it runs forever).
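
A minimal sketch of the loop above, assuming a made-up environment. CorridorEnv, random_policy, and run_episode are hypothetical names for illustration, not part of any library.

```python
import random

class CorridorEnv:
    """Hypothetical toy environment: start at cell 0, episode ends at cell `length`."""

    def __init__(self, length=3):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos  # initial state s_0

    def step(self, action):
        # action: 0 = stay, 1 = move one cell to the right
        self.pos += action
        done = self.pos >= self.length
        reward = 1.0 if done else 0.0  # reward only arrives at the goal (delayed)
        return self.pos, reward, done


def random_policy(state, rng):
    # Stochastic policy that ignores the state: stay or move with equal probability.
    return rng.choice([0, 1])


def run_episode(env, policy, rng, max_steps=100):
    state = env.reset()
    rewards = []
    for _ in range(max_steps):
        action = policy(state, rng)             # pick a_t
        state, reward, done = env.step(action)  # receive s_(t+1) and r_t
        rewards.append(reward)
        if done:                                # episodic task: stop at termination
            break
    return rewards


rewards = run_episode(CorridorEnv(), random_policy, random.Random(0))
```

The max_steps cap is an assumption of this sketch; it keeps the loop finite even if the policy never reaches the goal.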

Vocabulary

State s_t: what the agent sees at time t.
Action a_t: what the agent does.
Reward r_t: the number the environment hands back.
Policy π(s): the function (often a neural net) that picks the action. Deterministic a = π(s) or stochastic π(a | s).
Return G_t: discounted sum of rewards from t onward.
Discount γ: weight for future rewards, 0 ≤ γ ≤ 1.
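
The deterministic vs stochastic policy distinction can be sketched as follows. The action names, the sign-based rule, and the 0.8 / 0.2 probabilities are arbitrary assumptions chosen for illustration.

```python
import random

ACTIONS = ["left", "right"]  # hypothetical action set


def deterministic_policy(state):
    # a = π(s): the same state always maps to the same action.
    return "right" if state >= 0 else "left"


def stochastic_policy_probs(state):
    # π(a | s): a probability for every action, given the state.
    p_right = 0.8 if state >= 0 else 0.2
    return {"left": 1.0 - p_right, "right": p_right}


def sample_action(state, rng):
    # Draw one action from π(· | s).
    probs = stochastic_policy_probs(state)
    weights = [probs[a] for a in ACTIONS]
    return rng.choices(ACTIONS, weights=weights, k=1)[0]


rng = random.Random(0)
action = sample_action(state=1, rng=rng)
```

In deep RL both functions would typically be a neural network; the hand-written rules here only stand in for one.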

The return formula

G_t = r_t + γ·r_(t+1) + γ²·r_(t+2) + γ³·r_(t+3) + ...

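With γ < 1 and bounded rewards, the discount keeps the sum finite on infinite horizons; it also weights nearer rewards more. With γ = 1 the return is a plain sum, which is only guaranteed finite for episodic tasks. Worked: r = (0, 0, 1), γ = 0.9 gives G_0 = 0 + 0.9·0 + 0.9²·1 = 0.81.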
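
The return formula can be computed for a finite reward list with the recursion G_t = r_t + γ·G_(t+1). A minimal sketch; the function names are hypothetical.

```python
def discounted_return(rewards, gamma):
    """G_0 for a finite list of rewards r_0, r_1, ..."""
    g = 0.0
    # Work backwards: G_t = r_t + gamma * G_(t+1), with G = 0 past the last step.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def returns_to_go(rewards, gamma):
    """List of G_t for every timestep t, using the same backward recursion."""
    out = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    out.reverse()
    return out


g0 = discounted_return([0.0, 0.0, 1.0], gamma=0.9)  # the worked example, about 0.81
```

For r = (0, 0, 1) and γ = 0.9, returns_to_go gives approximately (0.81, 0.9, 1.0) up to floating-point rounding: the closer a step is to the reward, the larger its return.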

The agent’s goal: choose policy π to maximize the expected return (expectation over environment stochasticity and policy randomness).
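
The expected return is rarely available in closed form. One common way to estimate it is to average sampled returns over many episodes (a Monte Carlo estimate). The sketch below assumes a hypothetical corridor task with reward 1 on reaching the last cell and a policy that moves right with probability p_right. In this toy, all randomness comes from the policy; the environment itself is deterministic.

```python
import random


def sample_return(p_right, gamma, rng, length=3, max_steps=50):
    """Discounted return of one episode in a hypothetical corridor task."""
    pos, g, discount = 0, 0.0, 1.0
    for _ in range(max_steps):
        if rng.random() < p_right:  # policy randomness: move right or stay
            pos += 1
        reward = 1.0 if pos >= length else 0.0
        g += discount * reward      # add gamma^t * r_t
        discount *= gamma
        if pos >= length:           # goal reached, episode ends
            break
    return g


def estimate_expected_return(p_right, gamma=0.9, episodes=2000, seed=0):
    # Average of sampled returns approximates the expectation.
    rng = random.Random(seed)
    total = sum(sample_return(p_right, gamma, rng) for _ in range(episodes))
    return total / episodes


fast = estimate_expected_return(p_right=0.9)
slow = estimate_expected_return(p_right=0.3)
```

Under these assumptions the policy that reaches the goal sooner gets the higher estimate, because its reward is discounted over fewer steps.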

Why “deep”

Classical RL tabulates one value per state (or per state-action pair). An Atari frame has 210·160·3 = 100,800 components; assuming 8-bit values, that is 256^100,800 (roughly 10^242,750) possible frames, far more than the roughly 10^80 atoms in the observable universe, so a table cannot hold them. The fix: replace the table with a function approximator (a neural network). Gain: high-dimensional states (pixels, boards, language). Cost: classical convergence guarantees break.
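
The size comparison can be checked with a few lines of arithmetic. This sketch assumes each component takes one of 256 values (8-bit color) and uses the commonly cited order-of-magnitude estimate of 10^80 atoms; both are assumptions, not figures from a specific source.

```python
import math

HEIGHT, WIDTH, CHANNELS = 210, 160, 3
VALUES_PER_COMPONENT = 256   # assumption: 8-bit values
LOG10_ATOMS = 80             # assumption: ~10^80 atoms, order of magnitude only

components = HEIGHT * WIDTH * CHANNELS  # numbers per frame

# The count of distinct frames is 256 ** components; compare exponents
# (log10) rather than building the huge integer itself.
log10_frames = components * math.log10(VALUES_PER_COMPONENT)

table_is_hopeless = log10_frames > LOG10_ATOMS
```

Most of those frames never occur in a real game, but the reachable set is still far too large to enumerate, which is the case for generalizing with a network instead.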

What makes deep RL hard (the track’s agenda)

Difficulty	What it is
Credit assignment	Reward arrives many steps after the action that caused it
Distribution shift	Policy changes during training, so the dataset is a moving target
Function approximation	NN approximation breaks tabular convergence proofs
Exploration vs exploitation	Use what works vs try what might be better
Sample efficiency	Acting costs real time, especially in robotics

Every later lesson responds to one or more of these.

Where deep RL has appeared

Atari (DQN, 2013-15): pixels → joystick.
AlphaGo / AlphaZero (2016-17): board → move, reward = win/loss.
Robotics: simulated training, real-robot transfer.
Preference-based post-training in LLMs: ChatGPT, Claude, Gemini are post-trained with RLHF or related methods. In the canonical recipe, a reward model is trained from human preferences and the LLM is then optimized against it. Variants: RLAIF / Constitutional AI use AI-generated preferences; DPO-style direct-preference methods skip the separate reward model and now compete. Covered in lesson 13.

Pitfalls to dodge

“RL is supervised with rewards.” No: data is policy-generated, reward is delayed, agent must act.
Treating reward as a label. Reward says how good your action was, not what the right action was.
“The agent learns the environment.” Standard (model-free) RL learns a policy, not a model of the environment; the environment’s dynamics are fixed and unknown to the agent. Learning a model of the environment is model-based RL (lessons 9-10).
“Deep = solved.” Function approximation scales RL but breaks the classical proofs; engineering stabilizers (replay buffers, target networks, trust regions) exist because of this.

The one-line version

Reinforcement learning is the regime where an agent acts in an environment and learns a policy from delayed rewards; “deep” means a neural network replaces classical RL’s lookup table, gaining scale and losing the textbook convergence guarantees.