Summary: Imitation learning and behavioral cloning
The simplest approach to producing a policy is to ignore the reward entirely. Collect a dataset of (state, expert action) pairs from demonstrations, and train a network by supervised learning to predict the expert’s action given the state. That is behavioral cloning. It is appealing because it turns reinforcement learning into supervised learning; it breaks because small errors compound over long trajectories, and the way it breaks is the reason genuine RL exists. This is the scan-it-in-five-minutes version.
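As a minimal sketch of that recipe, the snippet below clones a toy expert with plain supervised regression. Everything in it is an assumption made for illustration: the one-dimensional state, the proportional-controller `expert_action`, the linear policy `theta * s`, and the hyperparameters. Note that no reward and no environment step appear anywhere in training.

```python
import random

def expert_action(state):
    """Hypothetical expert: a proportional controller steering the state toward 0."""
    return -0.5 * state

def collect_demonstrations(n, rng):
    """Build D = {(s, a*)}: states paired with the expert's action. No reward is recorded."""
    states = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(s, expert_action(s)) for s in states]

def train_bc(dataset, lr=0.1, epochs=200):
    """Fit the linear policy pi_theta(s) = theta * s by gradient descent on mean squared error."""
    theta = 0.0
    for _ in range(epochs):
        # Gradient of mean((theta * s - a*)^2) with respect to theta.
        grad = sum(2.0 * (theta * s - a) * s for s, a in dataset) / len(dataset)
        theta -= lr * grad
    return theta

rng = random.Random(0)
demos = collect_demonstrations(100, rng)
theta = train_bc(demos)
print(f"learned theta = {theta:.4f} (expert gain is -0.5)")
```

On states drawn from the same range as the demonstrations, the learned policy tracks the expert closely. Nothing in this training loop says what happens on states the demonstrations never covered, which is where the trouble starts.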
Core ideas
- The BC algorithm. Collect a dataset D = { (s_t, a_t*) } of (state, expert action) pairs. Train π_θ(s) by minimizing a supervised loss: θ* = argmin_θ Σ L(π_θ(s), a*). No reward, no environment interaction, no exploration during training.
- Why it appeals. RL becomes supervised learning, so all the supervised tooling (batched training, mature optimizers, scaling laws) carries over. No environment interaction means no risk during training, useful when acting is expensive (robotics) or unsafe (driving). It often appears to work for the first few steps.
- Why it fails: distribution shift. Training data comes from the expert’s state distribution p_expert(s). The policy makes small errors, putting it in states the expert never visited, where it was never trained, so it makes bigger errors and drifts further off-distribution. The test distribution p_policy(s) diverges from the training distribution, and the gap grows with episode length.
- The bound: O(εT²) vs O(εT). With per-step error rate ε on the expert’s distribution, BC’s expected mistakes over a T-step rollout scale as O(εT²) (quadratic in horizon, the compounding-error phenomenon). An on-policy alternative scales as O(εT) (linear). At ε = 0.01, T = 200: the BC bound is 400 mistakes (more than the 200 steps in the rollout, so in effect every step can be wrong); the DAgger bound is 2.
- The fix: DAgger. Roll out the current policy, ask the expert what they would do at each visited state, add those (state, expert action) pairs to the dataset, retrain, loop. The dataset eventually contains states from p_policy(s), so the policy learns to recover from its own mistakes. Cost: the expert must be queryable on demand.
- Where BC works anyway. Short horizons (T small enough that εT² ≈ εT); tasks with self-correcting noise injected during demonstration (the NVIDIA PilotNet view-perturbation trick); abundant data with a genuinely tiny ε on an error-tolerant task. LLM supervised fine-tuning on single-completion responses is effectively T = 1, where BC and DAgger coincide.
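The distribution-shift failure and the DAgger loop can be sketched on a toy problem. The snippet below is an illustration under stated assumptions, not a reference implementation: the one-dimensional "lane" environment, the `expert_action` rule, the tabular `fit`, the "do nothing" default for unseen states, and the noise level are all hypothetical. It also skips DAgger's optional mixing of expert and learner actions during rollouts and always rolls out the learner alone.

```python
import random

def expert_action(pos):
    """Hypothetical queryable expert: step back toward the lane centre at 0."""
    if pos > 0:
        return -1
    if pos < 0:
        return 1
    return 0

def fit(dataset):
    """Tabular stand-in for supervised training: memorise the expert label per state."""
    return {state: action for state, action in dataset}

def rollout(policy, horizon, rng, noise=0.2):
    """Run the learned policy; return visited states and the number of
    steps where its action disagrees with the expert's."""
    pos, visited, mistakes = 0, [], 0
    for _ in range(horizon):
        visited.append(pos)
        action = policy.get(pos, 0)   # unseen state: assumed default of "do nothing"
        mistakes += action != expert_action(pos)
        pos += action
        if rng.random() < noise:      # disturbance that pushes the agent off the expert's path
            pos += rng.choice([-1, 1])
    return visited, mistakes

def dagger(horizon, n_iters, rng):
    # Start from plain BC: the expert's clean demonstration never leaves state 0.
    dataset = [(0, expert_action(0))] * horizon
    policy = fit(dataset)
    for _iteration in range(n_iters):
        visited, _mistakes = rollout(policy, horizon, rng)    # roll out the current policy
        dataset += [(s, expert_action(s)) for s in visited]   # expert relabels; aggregate
        policy = fit(dataset)                                 # retrain on the whole dataset
    return policy

rng = random.Random(0)
T = 200
bc_policy = fit([(0, expert_action(0))] * T)   # trained on expert states only
_, bc_mistakes = rollout(bc_policy, T, rng)
dagger_policy = dagger(T, n_iters=5, rng=rng)
_, dagger_mistakes = rollout(dagger_policy, T, rng)
print(f"BC mistakes: {bc_mistakes}, DAgger mistakes: {dagger_mistakes}")
```

The BC policy has a label only for state 0, so once a disturbance pushes it off centre it has nothing to fall back on and every off-centre step counts as a mistake. The DAgger policy has expert labels for the states its own rollouts reached. In this toy, once states -1, 0, and 1 are covered, it always steps back to centre.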
What changes for you
You now have the precise reason RL exists as a separate field. If you could get away with copying an expert, you would; BC is what “copying an expert” looks like at scale, and its O(εT²) failure mode is what limits it. Every algorithm in the rest of this track is, in some sense, a response to that failure mode. The same lesson explains why supervised fine-tuning of an LLM works well for short responses and breaks down for long-horizon agentic behavior (multi-step coding agents, multi-turn tool use): SFT is BC at scale, and the T² problem reappears unchanged. RLHF (lesson 13) is, in part, the field’s answer: get training signal from the model’s own state distribution, not just the labeler’s. The next lesson goes back to first principles to give the formal language (Markov decision processes, returns, and value functions) that lets us state precisely what an RL agent is trying to do, and lets the rest of the track’s algorithms be derived rather than asserted.