Imitation learning, in brief

What you’ll learn

The last lesson framed reinforcement learning. Before we build any RL algorithm proper, this lesson asks a quieter question: do we even need to? If you have an expert to copy (a human driver, a chess grandmaster, a skilled teleoperator), you can collect their demonstrations as (state, expert action) pairs and train a policy by supervised learning to predict the expert’s action. The single capability this lesson builds: state the behavioral-cloning algorithm, explain why distribution shift makes its worst case scale as O(εT²) in episode length, and recognize the settings where it is adequate anyway versus the settings where genuine RL is needed.

You will see behavioral cloning in one equation (θ* = argmin_θ Σ L(π_θ(s), a*), with the sum taken over the demonstration pairs (s, a*)), name the four structural advantages that make it tempting (no reward, no environment interaction, supervised tooling carries over, scales with the dataset), and understand the failure mode: at deployment, the policy’s small errors put it in states the expert never visited and on which it was never trained, so it makes bigger errors and drifts further off-distribution. The training distribution p_expert(s) diverges from the test distribution p_policy(s), and the gap grows with episode length.

You will work the quantitative bound (Ross and Bagnell, 2010): with per-step error rate ε over an episode of T steps, BC’s expected mistakes scale as O(εT²) in the worst case, while an on-policy alternative like DAgger (Ross, Gordon, Bagnell, 2011) scales as O(εT). Plug in numbers (ε = 0.01, T = 200: BC bound 400, DAgger bound 2) and the warning becomes concrete: the BC bound is larger than the 200 steps in the episode, so it guarantees nothing, and the cause is structural (the extra factor of T), not a quirk of any one task. You will then see where BC is enough anyway (short horizons, error-tolerant tasks, self-correcting demonstration noise as in the NVIDIA PilotNet trick, and single-step LLM supervised fine-tuning, where T = 1) and where it is not (long-horizon driving, multi-step agentic LLM behavior, surgical robotics).
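The bound arithmetic above is meant for pen and paper, but for readers who like to check their numbers, here is an optional minimal sketch in Python. It assumes the constants hidden by the O(·) notation are 1, as the lesson’s own numbers do. The function names and the list of horizons are hypothetical choices made for illustration; only ε = 0.01 and T = 200 come from the text.

```python
# Optional check of the bound arithmetic. Assumption: the constants hidden
# by the O(.) notation are taken to be 1, matching the lesson's own numbers.


def bc_bound(eps: float, horizon: int) -> float:
    """Worst-case expected mistakes for behavioral cloning: eps * T^2."""
    return eps * horizon ** 2


def dagger_bound(eps: float, horizon: int) -> float:
    """Expected mistakes for an on-policy method such as DAgger: eps * T."""
    return eps * horizon


def bound_table(eps: float, horizons: list[int]) -> list[tuple[int, float, float, bool]]:
    """Return (T, BC bound, DAgger bound, BC bound exceeds T?) per horizon."""
    rows = []
    for horizon in horizons:
        bc = bc_bound(eps, horizon)
        dagger = dagger_bound(eps, horizon)
        # An episode has only T steps, so a mistake bound above T says nothing.
        rows.append((horizon, bc, dagger, bc > horizon))
    return rows


if __name__ == "__main__":
    # Illustrative horizons (hypothetical); eps = 0.01 is the lesson's value.
    for horizon, bc, dagger, uninformative in bound_table(0.01, [1, 50, 200, 1000]):
        note = "  (exceeds T, uninformative)" if uninformative else ""
        print(f"T={horizon:5d}  BC <= {bc:8.2f}  DAgger <= {dagger:6.2f}{note}")
```

With ε = 0.01 and T = 200, this reproduces the 400 versus 2 figures quoted above. The extra flag marks horizons where the BC bound is larger than the episode itself, which under these assumptions happens once εT exceeds 1.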

Where this fits

This is lesson 2 of Phase 1 (RL foundations). It covers the simplest approach to producing a policy, and the one that motivates everything else. Lesson 3 will lay down the formal language of Markov decision processes, returns, and value functions; lessons 4 and 5 build the first algorithmic moves (policy gradients, actor-critic) on that formalism. The distribution-shift problem identified here echoes at LLM scale in lesson 13 (RLHF), where the same εT² phenomenon limits supervised fine-tuning and motivates the policy-gradient post-training step.

Before you start

Prerequisite (within this track): lesson 1, Introduction to deep reinforcement learning, for the agent-environment loop and the vocabulary (state, action, policy). Background from earlier tracks: comfort with supervised learning and gradient descent (T11/T12/T13), since BC is supervised learning applied to (state, action) pairs. No coding, nothing installed; the practice is pen and paper. (T17 RL Foundations is a parallel prerequisite for later T18 lessons, but nothing here depends on it; this lesson stays in the supervised-learning frame that the BC algorithm itself uses.)

By the end, you’ll be able to

State the behavioral-cloning algorithm and its training objective as supervised learning on (state, expert action) pairs
Explain why BC is structurally tempting (turns RL into supervised learning, no environment interaction needed) and what its limits are
Explain distribution shift in BC: the policy creates its own test distribution, which diverges from the expert’s training distribution as episode length grows
Compute the O(εT²) versus O(εT) bounds across a range of episode lengths and read off the qualitative warning that BC’s worst case becomes uninformative on long horizons
Describe DAgger as the standard fix and recognize where BC is adequate anyway (short horizons, error-tolerant tasks, self-correcting demonstration noise, single-step LLM fine-tuning)

Time and difficulty

Read time: about 13 minutes
Practice time: about 13 minutes (computing the BC vs DAgger bound across three episode lengths, a “where does BC break?” scenario drill, and flashcards)
Difficulty: standard