Cheatsheet: Introduction to deep reinforcement learning

Regime | Data | Asked to | Feedback
--- | --- | --- | ---
Supervised | (input, correct output) pairs, fixed | predict the output | immediate, per-example label
Unsupervised | unlabeled data, fixed | find structure (clusters, embeddings) | none per-example
Reinforcement learning | agent’s own actions in an environment, generated as it learns | choose actions over time | reward, often delayed

RL is not supervised-with-rewards: the data is generated by the policy (moving target), the reward is delayed (credit assignment), and the agent must act.

                         action a_t
[agent: policy π(s)]  ------------------>  [environment]
                      <------------------
                   state s_(t+1), reward r_t

At each timestep t: observe state, pick action via policy, receive reward and next state, repeat. Episodic (terminates) or infinite-horizon (runs forever).
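The loop above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `CorridorEnv`, `random_policy`, and `run_episode` are hypothetical names invented here, and the `reset` / `step` interface is only an assumption modelled loosely on common RL toolkits.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: start in cell 0, reach the goal cell."""

    def __init__(self, length=4):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position  # initial state s_0

    def step(self, action):
        # action: 0 = stay, 1 = move one cell to the right
        self.position = min(self.position + action, self.length)
        done = self.position == self.length
        reward = 1.0 if done else 0.0  # reward arrives only at the goal
        return self.position, reward, done


def random_policy(state, rng):
    """Stochastic policy π(a | s): ignores the state, picks uniformly."""
    return rng.choice([0, 1])


def run_episode(env, policy, rng, max_steps=100):
    """Observe state, pick action, receive reward and next state, repeat."""
    state = env.reset()
    rewards = []
    for _ in range(max_steps):
        action = policy(state, rng)             # pick a_t via the policy
        state, reward, done = env.step(action)  # receive s_(t+1) and r_t
        rewards.append(reward)
        if done:                                # episodic: stop at termination
            break
    return rewards


rewards = run_episode(CorridorEnv(), random_policy, random.Random(0))
```

The `max_steps` cap is a practical stand-in for the infinite-horizon case: a task that never terminates still has to be truncated somewhere when it is simulated.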

  • State s_t: what the agent sees at time t.
  • Action a_t: what the agent does.
  • Reward r_t: the number the environment hands back.
  • Policy π(s): the function (often a neural net) that picks the action. Deterministic a = π(s) or stochastic π(a | s).
  • Return G_t: accumulated reward from t onward.
  • Discount γ: weight for future rewards, 0 ≤ γ ≤ 1.
G_t = r_t + γ·r_(t+1) + γ²·r_(t+2) + γ³·r_(t+3) + ...

With γ < 1 and bounded rewards, discounting keeps the sum finite on infinite horizons; it also weights nearer rewards more heavily. Worked example: r = (0, 0, 1), γ = 0.9 gives G_0 = 0 + 0.9·0 + 0.9²·1 = 0.81.
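The worked example can be checked with a short helper. This is a sketch: `discounted_return` is a name chosen here for illustration, and it relies on the recursion G_t = r_t + γ·G_(t+1), which follows directly from the definition of the return.

```python
def discounted_return(rewards, gamma):
    """Compute G_0 = r_0 + γ·r_1 + γ²·r_2 + ... for a finite reward list."""
    g = 0.0
    # Walk backwards so each step applies G_t = r_t + γ·G_(t+1).
    for r in reversed(rewards):
        g = r + gamma * g
    return g


g0 = discounted_return([0, 0, 1], gamma=0.9)  # approximately 0.81
```

Working backwards avoids computing powers of γ explicitly, and the same pass can record every intermediate G_t if per-step returns are needed.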

The agent’s goal: choose policy π to maximize the expected return (expectation over environment stochasticity and policy randomness).
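One way to make "expected return" concrete is a Monte Carlo estimate: run many episodes under a policy and average their returns. The sketch below assumes a hypothetical three-cell corridor with a reward of 1 at the goal, and a policy that moves right with probability `p_right`. The environment here is deterministic, so the only randomness being averaged over is the policy's.

```python
import random

GOAL = 3  # hypothetical corridor: reach cell 3 starting from cell 0


def sample_return(p_right, gamma, rng, max_steps=50):
    """Run one episode and return its discounted return G_0."""
    position, g, discount = 0, 0.0, 1.0
    for _ in range(max_steps):
        # Stochastic policy: move right with probability p_right, else stay.
        if rng.random() < p_right:
            position += 1
        done = position == GOAL
        reward = 1.0 if done else 0.0
        g += discount * reward  # add γ^t · r_t
        discount *= gamma
        if done:
            break
    return g


def expected_return(p_right, gamma=0.9, episodes=2000, seed=0):
    """Monte Carlo estimate of E[G_0]: average over sampled episodes."""
    rng = random.Random(seed)
    total = sum(sample_return(p_right, gamma, rng) for _ in range(episodes))
    return total / episodes


always_right = expected_return(p_right=1.0)  # every episode: r = (0, 0, 1)
coin_flip = expected_return(p_right=0.5)     # reaches the goal later on average
```

Comparing the two estimates is a miniature version of the agent's goal: the always-right policy reaches the reward sooner, so its expected return is higher.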

Classical RL stores one value per state (or per state-action pair) in a table. An Atari frame has 210·160·3 = 100,800 components, so there are more possible frames than atoms in the observable universe, and no table can hold them. The fix: replace the table with a function approximator (a neural network). Gain: high-dimensional states (pixels, boards, language). Cost: classical convergence guarantees break.
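A back-of-the-envelope check of the frame count, as a sketch. Two figures are assumptions not stated above: each component is taken to hold one of 256 values (8-bit color), and the number of atoms in the observable universe is taken as the commonly quoted order of magnitude, about 10^80.

```python
import math

HEIGHT, WIDTH, CHANNELS = 210, 160, 3
VALUES_PER_COMPONENT = 256  # assumption: 8-bit color channels
LOG10_ATOMS = 80            # assumption: ~10^80 atoms, order-of-magnitude estimate

components = HEIGHT * WIDTH * CHANNELS  # 100,800 numbers per frame

# Number of distinct frames is 256 ** components; compare in log10 space
# because the number itself has hundreds of thousands of digits.
log10_frames = components * math.log10(VALUES_PER_COMPONENT)

table_is_hopeless = log10_frames > LOG10_ATOMS
```

The gap is so large that the conclusion does not depend on the 8-bit assumption: even with two values per component, the number of possible frames still dwarfs 10^80.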

What makes deep RL hard (the track’s agenda)

Difficulty | What it is
--- | ---
Credit assignment | Reward arrives many steps after the action that caused it
Distribution shift | Policy changes during training, so the dataset is a moving target
Function approximation | NN approximation breaks tabular convergence proofs
Exploration vs exploitation | Use what works vs try what might be better
Sample efficiency | Acting costs real time, especially in robotics

Every later lesson responds to one or more of these.
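To make the exploration vs exploitation trade-off concrete, here is a sketch of epsilon-greedy action selection. This rule is brought in here purely as an illustration and is not named in the text above; the function name and the value estimates are hypothetical.

```python
import random


def epsilon_greedy(value_estimates, epsilon, rng):
    """Pick an action index from a list of estimated action values."""
    if rng.random() < epsilon:
        # Explore: try an action uniformly at random.
        return rng.randrange(len(value_estimates))
    # Exploit: take the action that currently looks best.
    return max(range(len(value_estimates)), key=value_estimates.__getitem__)


rng = random.Random(0)
estimates = [0.1, 0.7, 0.3]  # hypothetical current value estimates
greedy_choice = epsilon_greedy(estimates, epsilon=0.0, rng=rng)   # always exploits
explored_choice = epsilon_greedy(estimates, epsilon=1.0, rng=rng)  # always explores
```

Setting `epsilon` between the two extremes trades off the two rows' worth of behavior: mostly use what works, occasionally try what might be better.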

Where deep RL is used:

  • Atari (DQN, 2013-15): pixels → joystick.
  • AlphaGo / AlphaZero (2016-17): board → move, reward = win/loss.
  • Robotics: simulated training, real-robot transfer.
  • Preference-based post-training in LLMs: ChatGPT, Claude, and Gemini are post-trained with RLHF or related methods. In the canonical recipe, a reward model is trained from human preferences (or from AI-generated preferences in RLAIF / Constitutional AI); DPO-style direct-preference methods now compete with it. See lesson 13.

Common misconceptions:

  • “RL is supervised with rewards.” No: data is policy-generated, reward is delayed, agent must act.
  • Treating reward as a label. Reward says how good your action was, not what the right action was.
  • “The agent learns the environment.” Standard RL learns a policy; the environment is fixed. Learning the environment is model-based RL (lessons 9-10).
  • “Deep = solved.” Function approximation scales RL but breaks the classical proofs; engineering stabilizers (replay buffers, target networks, trust regions) exist because of this.

Reinforcement learning is the regime where an agent acts in an environment and learns a policy from delayed rewards; “deep” means a neural network replaces classical RL’s lookup table, gaining scale and losing the textbook convergence guarantees.