Deep reinforcement learning, in brief

What you’ll learn

This is the opener of Track 18. It builds a single capability: situating deep reinforcement learning among the major regimes of machine learning and naming what makes the deep variant hard, so that the rest of the track reads as a series of responses to specific difficulties rather than a parade of acronyms.

You will distinguish RL from supervised and unsupervised learning at the level of what the data looks like and what the model is asked to do. You will meet the agent-environment loop (state → action → reward → next state, repeated) and its core vocabulary (state, action, reward, policy, return). You will compute a discounted return by hand (r = (0, 0, 1) with γ = 0.9 gives G_0 = 0 + 0.9 · 0 + 0.9² · 1 = 0.81) and see what changing the discount does. You will understand why the “deep” in deep RL means a neural-network function approximator replacing classical RL’s lookup tables, what that gains (scaling to pixels, board positions, language tokens), and what it costs (classical tabular convergence guarantees no longer apply). Finally, you will meet the difficulties that make deep RL its own field: credit assignment for delayed rewards, distribution shift as the policy changes during training, function approximation breaking the textbook proofs, exploration vs exploitation, and sample efficiency. You will leave with a frame for reading the algorithms in later lessons as targeted responses to these difficulties.
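
Nothing in this lesson requires running code. For readers who want to check their hand calculations, though, the discounted return can be sketched in a few lines of Python. This is a minimal sketch, not part of the lesson's required practice: the function name `discounted_return` is hypothetical, and the three discount values are arbitrary choices for illustration. It assumes the convention used in the example above, where the first reward is undiscounted.

```python
def discounted_return(rewards, gamma):
    """Return G_0 = sum over t of gamma**t * rewards[t] (first reward undiscounted)."""
    g = 0.0
    # Work backwards through the rewards, applying G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [0, 0, 1]  # the reward sequence from the worked example
for gamma in (0.5, 0.9, 1.0):  # illustrative discount values
    print(gamma, round(discounted_return(rewards, gamma), 4))
```

With this reward sequence the only nonzero reward arrives two steps late, so the return is simply γ²: 0.25 at γ = 0.5, 0.81 at γ = 0.9, and 1.0 at γ = 1.0. A smaller discount makes the same delayed reward worth less.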

Where this fits

This is lesson 1 of Phase 1 (RL foundations), the track opener. The next four lessons in Phase 1 build the algorithmic toolkit: lesson 2 covers imitation learning (the simplest approach, which ignores the reward); lesson 3 makes the RL problem precise (MDPs, returns, value, policy); lesson 4 derives policy gradients (REINFORCE); lesson 5 introduces actor-critic. Phases 2 and 3 then expand into core deep-RL algorithms (Q-learning, advanced policy gradients, model-based RL, variational inference for RL, control as inference) and frontiers (RLHF for LLMs, offline RL, exploration, multi-task and meta-RL, open problems).

Before you start

Prerequisites: none within this track (it is the opener). Background expected from earlier tracks: T11 (Neural Network Intuition), T12 (Intro to Deep Learning), or T13 (Build Neural Networks from Scratch), so “a neural network is a function with thousands of knobs you tune by gradient descent” is already familiar; and ideally T17 (RL Foundations, in parallel) for the MDP and dynamic-programming background that this track’s mathematics will lean on from lesson 3 onward. If you have not seen MDPs, lesson 3 will introduce them, but a prior pass via T17 or Sutton and Barto Chapter 3 will make the going faster. No coding, nothing installed; the practice is pen and paper with a calculator.

By the end, you’ll be able to

Place reinforcement learning alongside supervised and unsupervised learning, and explain why RL is not just “supervised with a reward in place of a label”
Draw the agent-environment loop and define its core vocabulary (state, action, reward, policy, return, discount)
Compute a discounted return given a reward sequence and a discount factor, and explain what changing the discount does
Explain why “deep” RL means a neural-network function approximator, and what that gains and costs versus classical tabular RL
Name the difficulties that make deep RL hard (credit assignment, distribution shift, function-approximation guarantees, exploration vs exploitation, sample efficiency) and recognize them as the track’s later-lesson agenda
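
As an optional illustration of the loop vocabulary in the outcomes above, here is a minimal sketch of one agent-environment interaction in Python. Everything in it is a hypothetical toy invented for this example: the `Corridor` environment (four cells, reward 1 on reaching the last), the always-move-right `policy`, and the `run_episode` helper are assumptions, not anything the track defines.

```python
class Corridor:
    """Hypothetical toy environment: cells 0..3, reward 1.0 on reaching cell 3."""

    def __init__(self):
        self.state = 0  # the agent starts in the leftmost cell

    def step(self, action):
        # action is +1 (right) or -1 (left); the position is clipped to [0, 3].
        self.state = max(0, min(3, self.state + action))
        reward = 1.0 if self.state == 3 else 0.0
        done = self.state == 3  # the episode ends at the rightmost cell
        return self.state, reward, done


def policy(state):
    """A fixed policy (state -> action): always move right."""
    return +1


def run_episode(env, policy, gamma):
    """Run state -> action -> reward -> next state until done; return the trajectory and G_0."""
    state, ret, discount, done = env.state, 0.0, 1.0, False
    trajectory = []
    while not done:
        action = policy(state)                       # the agent picks an action
        next_state, reward, done = env.step(action)  # the environment responds
        trajectory.append((state, action, reward, next_state))
        ret += discount * reward                     # accumulate the discounted return
        discount *= gamma
        state = next_state
    return trajectory, ret


trajectory, ret = run_episode(Corridor(), policy, gamma=0.9)
print(trajectory)
print(round(ret, 4))
```

Under these assumptions the episode takes three steps and produces the reward sequence (0, 0, 1), so with γ = 0.9 the return is 0.81, the same figure as the hand-computed example.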

Time and difficulty

Read time: about 12 minutes
Practice time: about 13 minutes (computing discounted returns at three different γ, a regime-classification drill, and flashcards)
Difficulty: standard (the opener of a math-heavy track; later lessons step up)