Cheatsheet: Imitation learning and behavioral cloning
Behavioral cloning, in one equation
D = { (s_t, a_t*) }    (expert demonstrations: state, expert action)

θ* = argmin_θ Σ_{(s, a*) ∈ D} L( π_θ(s), a* )

Supervised learning on (state, expert action) pairs. No reward, no environment interaction, no exploration during training.
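As a minimal sketch of the objective above, the snippet below fits a linear policy by least squares, which is the argmin when L is squared error. The linear policy class, the squared-error loss, and the synthetic expert `K_true` are all assumptions made for illustration; none of them come from the cheatsheet.

```python
import numpy as np

def fit_bc_policy(states, expert_actions):
    """Behavioral cloning with a linear policy and squared-error loss:
    W = argmin_W sum over (s, a*) of ||s @ W - a*||^2."""
    W, *_ = np.linalg.lstsq(states, expert_actions, rcond=None)
    return W

def policy(W, state):
    """Linear policy pi_W(s) = s @ W."""
    return state @ W

rng = np.random.default_rng(0)

# Hypothetical expert: a fixed linear controller a* = s @ K_true.
K_true = np.array([[-1.0], [0.5]])

# D = {(s, a*)}: states the expert visited and the actions it took there.
states = rng.normal(size=(500, 2))
expert_actions = states @ K_true

# Pure supervised learning: no reward, no environment, no exploration.
W = fit_bc_policy(states, expert_actions)
print(W.ravel())
```

Because the demonstrations here are noise-free and the policy class contains the expert, the fit recovers `K_true` up to numerical precision. Real demonstrations rarely offer either luxury.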
Why it appeals
- Turns RL into supervised learning (all the tooling carries over).
- Needs no environment interaction during training.
- Scales with the dataset like any supervised model.
- Often looks fine on the first few steps of a rollout; failures only show up later in the episode.
Why it fails: distribution shift
The expert dataset comes from p_expert(s). The trained policy makes small errors. Each error puts the agent in a state the expert never visited, where the policy was never trained, so it makes a bigger error, and the agent drifts further off-distribution. The test distribution p_policy(s) diverges from the training distribution p_expert(s), and the gap grows with episode length.
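The drift described above can be illustrated with a toy error model. This is a sketch under strong assumptions, not a real environment: on-distribution, the policy errs with probability `eps` per step; a non-recovering policy stays off-distribution after its first error, so every later step counts as a mistake (the worst case); a recovering policy returns to the expert's distribution after each error. The function name and parameters are hypothetical.

```python
import numpy as np

def count_mistakes(T, eps, recovers, rng):
    """Count mistakes in one episode of the toy error model.

    On-distribution, the policy errs with probability eps per step.
    If `recovers` is False, the first error leaves the policy
    off-distribution and every later step is also a mistake.
    """
    mistakes = 0
    off_distribution = False
    for _ in range(T):
        if off_distribution or rng.random() < eps:
            mistakes += 1
            # A recovering policy gets back on-distribution immediately.
            off_distribution = not recovers
    return mistakes

rng = np.random.default_rng(0)
T, eps, episodes = 200, 0.01, 1000

compounding = np.mean([count_mistakes(T, eps, False, rng) for _ in range(episodes)])
recovering = np.mean([count_mistakes(T, eps, True, rng) for _ in range(episodes)])

print(f"no recovery:   {compounding:.1f} mistakes per episode")
print(f"with recovery: {recovering:.1f} mistakes per episode")
```

The recovering policy averages about εT mistakes per episode, while the non-recovering one loses a large fraction of every episode in which it slips even once.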
The quantitative bound (Ross and Bagnell 2010; DAgger from Ross, Gordon, and Bagnell 2011)
| Algorithm | Expected mistakes over T steps |
|---|---|
| Behavioral cloning | O(ε · T²) (quadratic in horizon) |
| DAgger (on-policy correction) | O(ε · T) (linear in horizon) |
ε = per-step error rate on the expert’s distribution.
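A rough sketch of where the quadratic term comes from, assuming the worst case in which a single mistake leaves the policy off-distribution for the rest of the episode. This is a simplified version of the intuition, not the full argument from Ross and Bagnell (2010). The first error occurs at step t with probability at most ε, and it costs at most the T − t + 1 steps that remain:

$$
\mathbb{E}[\text{mistakes}] \;\le\; \sum_{t=1}^{T} \epsilon \,(T - t + 1) \;=\; \epsilon \,\frac{T(T+1)}{2} \;=\; O(\epsilon T^{2})
$$

If the policy instead recovers after each error, every step contributes at most ε on its own:

$$
\mathbb{E}[\text{mistakes}] \;\le\; \sum_{t=1}^{T} \epsilon \;=\; \epsilon T \;=\; O(\epsilon T)
$$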
Plug in numbers (ε = 0.01, per-step error 1%):
| T | BC bound | DAgger bound |
|---|---|---|
| 200 | 400 | 2 |
| 1000 | 10,000 | 10 |
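These are worst-case upper bounds, not predictions: the BC figures exceed T itself, the most mistakes an episode can contain. The qualitative warning is what matters: BC’s worst-case mistake count scales quadratically with episode length.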
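The arithmetic behind the table can be reproduced with a few lines. The helper names `bc_bound` and `dagger_bound` are hypothetical, and constant factors hidden by the O(·) notation are taken to be 1, as in the table.

```python
def bc_bound(eps, T):
    """Worst-case behavioral cloning bound, eps * T^2 (constants dropped)."""
    return eps * T**2

def dagger_bound(eps, T):
    """DAgger bound, eps * T (constants dropped)."""
    return eps * T

eps = 0.01  # 1% per-step error on the expert's distribution
for T in (1, 200, 1000):
    bc, dg = bc_bound(eps, T), dagger_bound(eps, T)
    print(f"T={T:>5}  BC={bc:g}  DAgger={dg:g}  ratio={bc / dg:g}")
```

The ratio of the two bounds is exactly T, so at T = 1 they coincide and the gap widens linearly as episodes get longer.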
DAgger (Dataset Aggregation)
1. Train π_1 by BC on the expert dataset D_1.
2. Roll out π_t in the environment; collect the states it visits.
3. Ask the expert what they would do at each visited state.
4. Add those (state, expert action) pairs: D_(t+1) = D_t ∪ { new pairs }.
5. Retrain on D_(t+1) → π_(t+1). Loop back to step 2.

Crucial difference: the dataset eventually contains states from p_policy(s), so the policy learns to recover from its own mistakes.

Cost: the expert must be queryable on demand (fine if the expert is a planner; awkward if the expert is a human).
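The five steps above can be sketched as a short loop. Everything concrete here is an assumption made for illustration: the 1-D environment, the saturating `expert`, the cubic-feature least-squares policy, and all function names. It follows the simplified loop as listed, with the policy alone driving each rollout.

```python
import numpy as np

def expert(s):
    # Hypothetical queryable expert: push the state back toward 0, saturating at +/-1.
    return -np.clip(s, -1.0, 1.0)

def features(s):
    # Features [s, s^3] for a scalar state or an array of states.
    s = np.asarray(s, dtype=float)
    return np.stack([s, s**3], axis=-1)

def fit(states, actions):
    # Supervised regression on (state, expert action) pairs.
    w, *_ = np.linalg.lstsq(features(states), actions, rcond=None)
    return w

def rollout(act, T, rng):
    # Run a controller for T steps in a toy 1-D environment; return visited states.
    s, visited = rng.normal(), []
    for _ in range(T):
        visited.append(s)
        # Noisy additive dynamics, clipped so the toy state stays bounded.
        s = float(np.clip(s + act(s) + rng.normal(scale=0.1), -3.0, 3.0))
    return np.array(visited)

def dagger(n_iters=5, T=50, seed=0):
    rng = np.random.default_rng(seed)
    states = rollout(expert, T, rng)           # D_1: expert demonstrations
    actions = expert(states)
    w = fit(states, actions)                   # step 1: pi_1 by BC
    for _ in range(n_iters):
        visited = rollout(lambda s: features(s) @ w, T, rng)  # step 2: roll out pi_t
        labels = expert(visited)                              # step 3: query the expert
        states = np.concatenate([states, visited])            # step 4: aggregate
        actions = np.concatenate([actions, labels])
        w = fit(states, actions)                              # step 5: retrain
    return w, states, actions

w, states, actions = dagger()
print(len(states), w)
```

Note that the expert is called on every state the policy visits in every iteration. That is the query cost the paragraph above warns about: cheap for a function like this one, expensive for a human.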
Where BC works anyway
- Short episodes (T small). εT² and εT differ only by a factor of T, so the gap is small when T is small. Single-step supervised tasks are T = 1 (BC and DAgger coincide).
- Self-correcting noise during demonstration. NVIDIA PilotNet trick: perturb the demonstrator’s view and capture the expert’s corrective response, populating the training set with off-distribution states and their recoveries.
- Abundant expert data + error-tolerant task. When ε is genuinely tiny and a few mistakes do not cascade, BC is good enough.
Where it shows up in modern AI
- LLM supervised fine-tuning (SFT) is BC over instruction-response pairs. Works for short responses, breaks for long-horizon agentic behavior.
- RLHF (lesson 13) is, in part, the answer to SFT’s εT² problem: get training signal from the model’s state distribution, not the labeler’s.
- Robotics manipulation demos: the unspoken T determines whether BC is enough.
Pitfalls to dodge
- “BC is supervised learning, end of story.” It is supervised at training; at deployment the policy creates the test distribution, which disagrees with the training distribution by an amount growing in T.
- Low training loss as evidence the policy works. Low loss on p_expert(s) says nothing about p_policy(s). Evaluate on rollouts.
- Confusing imitation with reinforcement. BC is bounded by the expert; RL can in principle exceed any demonstrator (given an honest reward).
- Underestimating DAgger’s expert-query cost. Linear O(εT) comes with a per-iteration human cost that can dwarf the original demonstration cost.
The one-line version
Behavioral cloning is supervised learning on expert demonstrations whose worst-case mistake count scales as O(εT²) in episode length, because the policy creates its own test distribution; DAgger’s on-policy correction brings that down to O(εT), which is why long-horizon imitation needs either DAgger or genuine RL.