Introduction to deep reinforcement learning

This is the opener of Track 18. The single capability it builds: situate deep reinforcement learning among the major regimes of machine learning, and name what makes the deep variant hard, so the rest of the track lands as responses to specific difficulties rather than a parade of acronyms.

You will distinguish RL from supervised and unsupervised learning at the level of what the data looks like and what the model is asked to do. You will meet the agent-environment loop (state → action → reward → next state, repeated) and its core vocabulary (state, action, reward, policy, return), then compute a discounted return by hand (r = (0, 0, 1), γ = 0.9 gives G_0 = 0.81) and see what changing the discount does. You will see why the “deep” in deep RL means a neural-network function approximator replacing classical RL’s lookup tables, what that gains (scale to pixels, board positions, language tokens), and what it costs (classical tabular convergence guarantees no longer apply). Finally, you will meet the difficulties that make deep RL its own field: credit assignment for delayed rewards, distribution shift as the policy changes during training, function approximation breaking the textbook proofs, exploration vs exploitation, and sample efficiency. You will leave with a frame for reading the algorithms in later lessons as targeted responses to these difficulties.
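The hand computation quoted above can be written out term by term. This is a sketch that assumes the indexing the lesson's numbers imply: the reward at step t is r_t, counting from t = 0, and each reward is weighted by γ^t. Some textbooks index rewards from t + 1 instead. The second line uses γ = 0.5, an illustrative value chosen here for contrast rather than one taken from the lesson.

```latex
\begin{align*}
G_0 &= \sum_{t=0}^{2} \gamma^{t} r_t
     = r_0 + \gamma\, r_1 + \gamma^{2} r_2 \\
\gamma = 0.9:\quad G_0 &= 0 + (0.9)(0) + (0.9)^{2}(1) = 0.81 \\
\gamma = 0.5:\quad G_0 &= 0 + (0.5)(0) + (0.5)^{2}(1) = 0.25
\end{align*}
```

The same reward, arriving two steps in the future, is worth less to the agent as γ shrinks: a smaller discount factor makes the agent more short-sighted.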

This is lesson 1 of Phase 1 (RL foundations), the track opener. The next four lessons in Phase 1 build the algorithmic toolkit: lesson 2 covers imitation learning (the simplest approach, which ignores the reward); lesson 3 makes the RL problem precise (MDPs, returns, value, policy); lesson 4 derives policy gradients (REINFORCE); lesson 5 introduces actor-critic. Phases 2 and 3 then expand into core deep-RL algorithms (Q-learning, advanced policy gradients, model-based RL, variational inference for RL, control as inference) and frontiers (RLHF for LLMs, offline RL, exploration, multi-task and meta-RL, open problems).

Prerequisites: none within this track (it is the opener). Background expected from earlier tracks: T11 (Neural Network Intuition), T12 (Intro to Deep Learning), or T13 (Build Neural Networks from Scratch), so “a neural network is a function with thousands of knobs you tune by gradient descent” is already familiar; and ideally T17 (RL Foundations, in parallel) for the MDP and dynamic-programming background that this track’s mathematics will lean on from lesson 3 onward. If you have not seen MDPs, lesson 3 will introduce them, but a prior pass via T17 or Sutton and Barto Chapter 3 will make the going faster. No coding, nothing installed; the practice is pen and paper with a calculator.

  • Place reinforcement learning alongside supervised and unsupervised learning, and explain why RL is not just “supervised with a reward in place of a label”
  • Draw the agent-environment loop and define its core vocabulary (state, action, reward, policy, return, discount)
  • Compute a discounted return given a reward sequence and a discount factor, and explain what changing the discount does
  • Explain why “deep” RL means a neural-network function approximator, and what that gains and costs versus classical tabular RL
  • Name the difficulties that make deep RL hard (credit assignment, distribution shift, function-approximation guarantees, exploration vs exploitation, sample efficiency) and recognize them as the track’s later-lesson agenda
  • Read time: about 12 minutes
  • Practice time: about 13 minutes (computing discounted returns at three different γ, a regime-classification drill, and flashcards)
  • Difficulty: standard (the opener of a math-heavy track; later lessons step up)
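
The lesson itself needs no code. For readers who find a loop easier to read as a program, here is an optional minimal sketch of the agent-environment loop. Everything in it is a hypothetical illustration rather than part of the track: the four-position corridor, the `step`, `policy`, and `run_episode` names, and the "always step right" rule. The corridor was chosen so that the rewards come out as (0, 0, 1), and the return uses the convention that the reward at step t is weighted by γ^t.

```python
# Hypothetical toy environment: a corridor of positions 0..3 with the goal at 3.
GOAL = 3


def step(state, action):
    """Environment: apply an action (-1 or +1); return (next_state, reward, done)."""
    next_state = max(0, min(GOAL, state + action))  # stay inside the corridor
    reward = 1.0 if next_state == GOAL else 0.0     # reward only on reaching the goal
    done = next_state == GOAL
    return next_state, reward, done


def policy(state):
    """Policy: a rule mapping a state to an action (here, always step right)."""
    return +1


def run_episode(gamma):
    """Run state -> action -> reward -> next state until done; return rewards and G_0."""
    state, done, rewards = 0, False, []
    while not done:
        action = policy(state)                      # agent picks an action
        state, reward, done = step(state, action)   # environment responds
        rewards.append(reward)
    # Discounted return from the start: G_0 = sum of gamma**t * r_t
    g0 = sum(gamma ** t * r for t, r in enumerate(rewards))
    return rewards, g0


rewards, g0 = run_episode(gamma=0.9)
print(rewards, round(g0, 4))  # [0.0, 0.0, 1.0] 0.81
```

Calling `run_episode` with a different `gamma` leaves the reward sequence unchanged and changes only how much the delayed reward counts toward the return.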