Imitation learning and behavioral cloning
What you’ll learn
The last lesson framed reinforcement learning. Before we build any RL algorithm proper, this lesson asks a quieter question: do we even need to? If you have an expert to copy (a human driver, a chess grandmaster, a skilled teleoperator), you can collect their demonstrations as (state, expert action) pairs and train a policy by supervised learning to predict the expert’s action. The single capability this lesson builds: state the behavioral-cloning algorithm, explain why distribution shift makes its worst case scale as O(εT²) in episode length, and recognize the settings where it is adequate anyway versus the settings where genuine RL is needed.
You will see behavioral cloning in one equation (θ* = argmin_θ Σ L(π_θ(s), a*)), name the four structural advantages that make it tempting (no reward, no environment interaction, supervised tooling carries over, scales with the dataset), and understand the failure mode: at deployment the policy’s small errors put it in states the expert never visited, where it was never trained, so it makes bigger errors and drifts further off-distribution. The training distribution p_expert(s) diverges from the test distribution p_policy(s), and the gap grows with episode length. You will work the quantitative bound (Ross and Bagnell, 2010): with per-step error rate ε, BC’s expected mistakes scale as O(εT²), while an on-policy alternative like DAgger (Ross, Gordon, Bagnell, 2011) scales as O(εT). Plug in numbers (ε = 0.01, T = 200: BC bound 400, DAgger bound 2) and the warning is structural: the BC bound exceeds the 200 steps in the episode, so it no longer guarantees anything. You will then see where BC is enough anyway: short horizons, error-tolerant tasks, self-correcting demonstration noise (the NVIDIA PilotNet trick), and single-step LLM supervised fine-tuning (T = 1). You will also see where it is not: long-horizon driving, multi-step agentic LLM behavior, and surgical robotics.
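Written out, the pieces above look roughly as follows. This is a sketch under stated assumptions, not the lesson's own derivation: the dataset symbol \mathcal{D} is notation introduced here for the set of demonstration pairs, the middle line is one common informal argument for the quadratic term (the first mistake happens at step t with probability at most ε and, in the worst case, the policy never recovers), and all constants are dropped.

```latex
% Behavioral-cloning objective; \mathcal{D} (the demonstration set) is assumed notation.
\[
\theta^{*} = \arg\min_{\theta} \sum_{(s,\,a^{*}) \in \mathcal{D}} L\big(\pi_{\theta}(s),\, a^{*}\big)
\]

% Informal worst case: first mistake at step t, no recovery for the remaining steps.
\[
\sum_{t=1}^{T} \varepsilon\,(T - t + 1) \;=\; \varepsilon\,\frac{T(T+1)}{2} \;=\; O(\varepsilon T^{2})
\]

% Worked numbers from the text: \varepsilon = 0.01, T = 200.
\[
\varepsilon T^{2} = 0.01 \times 200^{2} = 0.01 \times 40000 = 400,
\qquad
\varepsilon T = 0.01 \times 200 = 2
\]
```

Since an episode of T steps cannot contain more than T mistakes, a bound of order εT² stops being informative once εT² ≥ T, that is, roughly when T ≥ 1/ε (ignoring constants).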
Where this fits
This is lesson 2 of Phase 1 (RL foundations). It covers the simplest approach to producing a policy, and the one that motivates everything else. Lesson 3 will lay down the formal language of Markov decision processes, returns, and value functions; lessons 4 and 5 build the first algorithmic moves (policy gradients, actor-critic) on that formalism. The distribution-shift problem identified here echoes at LLM scale in lesson 13 (RLHF), where the same εT² phenomenon limits supervised fine-tuning and motivates the policy-gradient post-training step.
Before you start
Prerequisite (within this track): lesson 1, Introduction to deep reinforcement learning, for the agent-environment loop and the vocabulary (state, action, policy). Background from earlier tracks: comfort with supervised learning and gradient descent (T11/T12/T13) since BC is supervised learning applied to (state, action) pairs. No coding, nothing installed; the practice is pen and paper. (T17 RL Foundations is a parallel prerequisite for later T18 lessons but is not load-bearing here; this lesson stays in the supervised-learning frame the BC algorithm itself uses.)
By the end, you’ll be able to
- State the behavioral-cloning algorithm and its training objective as supervised learning on (state, expert action) pairs
- Explain why BC is structurally tempting (turns RL into supervised learning, no environment interaction needed) and what its limits are
- Explain distribution shift in BC: the policy creates its own test distribution, which diverges from the expert’s training distribution as episode length grows
- Compute the O(εT²) versus O(εT) bounds across a range of episode lengths and read off the qualitative warning that BC’s worst case becomes uninformative on long horizons
- Describe DAgger as the standard fix and recognize where BC is adequate anyway (short horizons, error-tolerant tasks, self-correcting demonstration noise, single-step LLM fine-tuning)
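The bound computation in the list above is meant to be done by hand, but a few lines of Python can serve as an optional check. This is a minimal sketch, not part of the lesson: the function names are hypothetical, the hidden constants in O(εT²) and O(εT) are assumed to be 1, the horizons 1 and 10 are illustrative choices (only T = 200 and ε = 0.01 come from the text), and the "vacuous" flag encodes the observation that an episode of T steps cannot contain more than T mistakes.

```python
def bc_bound(eps: float, horizon: int) -> float:
    """Behavioral-cloning worst case, taking O(eps * T^2) with constant 1 (assumption)."""
    return eps * horizon ** 2


def dagger_bound(eps: float, horizon: int) -> float:
    """DAgger-style worst case, taking O(eps * T) with constant 1 (assumption)."""
    return eps * horizon


def bound_table(eps: float, horizons: list[int]) -> list[dict]:
    """Return one row per episode length with both bounds and a vacuity flag."""
    rows = []
    for horizon in horizons:
        bc = bc_bound(eps, horizon)
        rows.append(
            {
                "T": horizon,
                "bc": bc,
                "dagger": dagger_bound(eps, horizon),
                # A bound at or above T says nothing: T steps hold at most T mistakes.
                "bc_vacuous": bc >= horizon,
            }
        )
    return rows


# Illustrative horizons; only T = 200 with eps = 0.01 appears in the lesson text.
for row in bound_table(0.01, [1, 10, 200]):
    print(row)
```

For the values quoted in the text (ε = 0.01, T = 200), the row reproduces the BC bound of 400 and the DAgger bound of 2, and flags the BC bound as vacuous because it exceeds the episode length.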
Time and difficulty
- Read time: about 13 minutes
- Practice time: about 13 minutes (computing the BC vs DAgger bound across three episode lengths, a “where does BC break?” scenario drill, and flashcards)
- Difficulty: standard