Lesson: Imitation learning and behavioral cloning
The last lesson introduced reinforcement learning as the third major regime of machine learning. Before we build any RL algorithm proper, it is worth asking a quieter question: do we even need to? If you have access to an expert (a human driver, a chess grandmaster, a skilled robot operator), can you simply copy what they do by treating their demonstrations as labels? That is imitation learning, and its most direct form is behavioral cloning (BC). It is the simplest approach to producing a policy, and the way it fails is what motivates the rest of the track.
This lesson walks through BC, the reason it is structurally tempting, the quantitative reason it fails (errors compound across long trajectories, an order epsilon T squared blow-up), and the standard fix called DAgger that brings it back down to order epsilon T. By the end you will see why imitation learning is both the right place to start a deep-RL course and the right place to leave.
The setup: an expert and a dataset
Start with a concrete picture. A human driver demonstrates safe driving for ten hours. Sensors on the car record, at each timestep, the state (camera frames, lidar, speed, GPS) and the expert action the human took (steering angle, throttle, brake). After ten hours at 30 frames per second you have a dataset:
D = { (s_1, a_1*), (s_2, a_2*), ..., (s_N, a_N*) } with N ≈ 1,080,000
This is the kind of dataset supervised learning was designed for. Train a neural network, the policy π_θ with parameters θ, to predict the action given the state by minimizing a standard supervised loss (cross-entropy for discrete actions, mean squared error for continuous ones). That trained network is your policy: feed it the current state, get back an action.
θ* = argmin_θ Σ_{(s, a*) ∈ D} L( π_θ(s), a* )
That is behavioral cloning in one equation. Notice what is absent: no reward, no environment interaction during training, no exploration, no return computation. The agent never acts during learning; it just imitates. From the algorithm’s perspective, RL has been smuggled inside a supervised-learning training loop.
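As a rough illustration of the equation above, here is a minimal sketch of behavioral cloning on a toy one-dimensional problem. Everything in it is an assumption made for illustration: the state is a scalar lane offset, the hypothetical expert steers with `a* = -0.5 * s`, and the policy is a single-parameter linear model fit by gradient descent on mean squared error. A real driving policy would be a deep network over camera frames, but the training loop has the same shape.

```python
import random

def expert_action(state):
    """Hypothetical expert: steer back toward lane center (state 0)."""
    return -0.5 * state

def collect_demonstrations(n, seed=0):
    """Build D = [(s, a*), ...] from states the expert plausibly visits."""
    rng = random.Random(seed)
    states = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(s, expert_action(s)) for s in states]

def mse(theta, dataset):
    """Supervised loss L for the linear policy pi_theta(s) = theta * s."""
    return sum((theta * s - a) ** 2 for s, a in dataset) / len(dataset)

def behavioral_cloning(dataset, lr=0.5, epochs=100):
    """Minimize the MSE over the logged pairs by plain gradient descent."""
    theta = 0.0
    for _ in range(epochs):
        grad = sum(2 * (theta * s - a) * s for s, a in dataset) / len(dataset)
        theta -= lr * grad
    return theta

dataset = collect_demonstrations(1000)
theta = behavioral_cloning(dataset)
print(f"learned theta = {theta:.4f}, training MSE = {mse(theta, dataset):.2e}")
```

Note that the loop never calls an environment and never sees a reward: it only touches the logged (state, expert action) pairs.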
Why behavioral cloning is structurally tempting
Several real virtues:
- It turns RL into supervised learning. Every supervised tool you have (efficient batched training, well-understood loss functions, mature optimizers like Adam, scaling laws) carries over without modification. The hard parts of RL (delayed rewards, credit assignment, exploration) all disappear.
- It needs no environment interaction during training. Acting in the real world is expensive (a robot wears out, a car crashes), and sometimes illegal (you cannot let an untrained policy drive on public roads). BC trains entirely from logged data.
- It scales with the dataset. More expert hours, better policy. The relationship is the familiar one from supervised learning: doubling the data improves the model, with predictable diminishing returns.
- It often works at first. Train a network on enough driving data and it will, in the simulator, drive convincingly for short stretches. That short success is exactly what makes the failure mode below so educational, because it is not visible until the trajectory is long enough.
These are not small advantages. For tasks with abundant expert demonstrations and short, forgiving trajectories, BC is genuinely the right choice. The trouble is that most interesting tasks are not like that.
Why it fails: distribution shift
Here is the failure mode, and it is the central reason genuine RL exists.
The expert demonstrations come from one distribution of states: the states the expert visits while driving well. Call this distribution p-expert. The learned policy, no matter how well trained, will make occasional small errors: it steers a touch too left, brakes a touch too late. Each such error puts the car in a slightly different state than the one the expert would have been in. Over time, those small errors compound:
- Step 1: the policy makes a small steering error. The car is slightly off the expert’s lane center.
- Step 2: that off-center state is not in the training distribution (the expert never drifted out of lane), so the policy is now making decisions in states it was never trained on. It makes a bigger error.
- Step 3: the bigger error puts the car further off-distribution. The policy’s behavior gets worse.
- … and so on.
The training distribution is p-expert; the test distribution (the states the policy actually visits) is p-policy. These two distributions disagree, and they disagree more the longer the trajectory runs. This is called distribution shift (or covariate shift). In supervised learning you assume training and test data come from the same distribution; in behavioral cloning, the policy you trained creates the distribution it is then tested on, and they drift apart.
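The gap between p-expert and p-policy can be seen in a deliberately tiny simulation. The sketch below is a toy under stated assumptions, not a driving model: the state is an integer lane offset, a hypothetical gust pushes the car one unit sideways on every fourth step, the hypothetical expert steers one unit back toward center, and the "cloned" policy is an imperfect imitation that only ever learned to go straight.

```python
from collections import Counter

def sign(x):
    return (x > 0) - (x < 0)

def expert(state):
    """Hypothetical expert: steer one unit back toward lane center (0)."""
    return -sign(state)

def cloned_policy(state):
    """Hypothetical imperfect clone: it only ever learned 'go straight'."""
    return 0

def step(state, action, t):
    """Toy dynamics: a gust pushes the car one unit right on every 4th step."""
    gust = 1 if t % 4 == 3 else 0
    return state + action + gust

def visited_states(policy, horizon):
    """Roll the policy out from lane center and record the states it visits."""
    state, visited = 0, []
    for t in range(horizon):
        visited.append(state)
        state = step(state, policy(state), t)
    return visited

def error_rate(states):
    """Fraction of the given states where the clone disagrees with the expert."""
    return sum(cloned_policy(s) != expert(s) for s in states) / len(states)

for horizon in (8, 20, 40):
    p_expert = visited_states(expert, horizon)
    p_policy = visited_states(cloned_policy, horizon)
    print(horizon, dict(Counter(p_expert)), dict(Counter(p_policy)),
          error_rate(p_expert), error_rate(p_policy))
```

Over 20 steps the expert only ever sees offsets 0 and 1, and the clone disagrees with it on 20% of those states. Measured on the states the clone itself visits (offsets 0 through 4), the disagreement is 80%, and it keeps rising as the horizon grows.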
The quantitative bound: order epsilon T squared versus order epsilon T
The intuition above can be made precise. Suppose your trained policy disagrees with the expert on a fraction epsilon of the training-distribution states (its per-step error rate on the expert’s distribution). What is the expected number of mistakes the policy makes when actually rolled out for T steps?
The classical result (Ross and Bagnell, 2010): in the worst case, the expected number of mistakes for behavioral cloning scales as
total mistakes = O( ε · T² )
That T squared, the quadratic dependence on trajectory length, is the failure mode written down. Each small per-step error has a chance of putting the agent in an off-distribution state, where it will likely make another mistake, putting it further off-distribution, and so on. The number of mistakes grows faster than linearly in the episode length.
Plug in numbers. Suppose epsilon = 0.01 (a 1% per-step error rate on the expert’s distribution, quite good for a deep model) and T = 200 (a short episode). The worst-case BC bound is 0.01 times 200 squared, which is 400. That exceeds the 200 steps in the episode, so the bound is vacuous: once a mistake pushes the agent into the off-distribution regime where errors compound, it cannot rule out a mistake at every step. If T = 1000, the bound balloons to 10,000. The lesson is qualitative: BC’s worst-case performance gets dramatically worse the longer you ask it to act, even with a very accurate per-step model.
Compare with an algorithm that has access to corrective data from the policy’s own state distribution. The bound becomes
total mistakes = O( ε · T )
Linear in T. With the same epsilon = 0.01 and T = 200, that is a bound of 2 mistakes. The same per-step error rate, the same episode length, and a bound smaller by a factor of T, because the algorithm gets training signal in the states it actually visits.
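The arithmetic behind the two bounds is easy to reproduce. The sketch below evaluates both (with constants dropped) and adds a toy worst-case model to show where the quadratic growth comes from. The toy model is an assumption made for illustration, not the construction used in the cited result: the policy errs with probability epsilon per step while it is still on expert states, and after its first error it never recovers and errs at every remaining step.

```python
def bc_bound(eps, T):
    """Worst-case behavioral cloning bound, eps * T^2 (constants dropped)."""
    return eps * T ** 2

def on_policy_bound(eps, T):
    """Bound with corrective on-policy data, eps * T (constants dropped)."""
    return eps * T

def expected_mistakes_toy(eps, T):
    """Expected mistakes in the toy model: no recovery after the first error."""
    p_clean = 1.0   # probability that no error has happened yet
    total = 0.0
    for _ in range(T):
        p_clean *= 1.0 - eps       # survive this step without erring
        total += 1.0 - p_clean     # otherwise this step counts as a mistake
    return total

eps = 0.01
for T in (10, 50, 200, 1000):
    print(T, bc_bound(eps, T), on_policy_bound(eps, T),
          round(expected_mistakes_toy(eps, T), 1))
```

In the toy model, the probability of a mistake at step t is 1 - (1 - ε)^t, which is at most ε·t, so summing over t gives at most ε·T·(T+1)/2. That is the quadratic growth in T. For short horizons the toy count tracks that sum closely, and for long horizons it saturates near T, since a rollout cannot contain more mistakes than steps.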
DAgger: the standard fix
The simplest algorithm that achieves the order epsilon T bound is DAgger (Dataset Aggregation; Ross, Gordon, and Bagnell, 2011). The recipe is iterative and intuitive:
1. Train an initial policy π_1 by behavioral cloning on the expert’s dataset D_1.
2. Run the current policy in the environment to collect a trajectory of states the policy actually visits, including the off-distribution ones it drifts into.
3. Ask the expert what they would do at each of those new states. (This is the demanding part: DAgger requires the expert to label the states the policy visits, mistakes included, on demand.)
4. Add those new (state, expert action) pairs to the dataset, forming D_2 as the union of D_1 with the new pairs.
5. Retrain on D_2 to get the second policy π_2. Repeat from step 2 with the new policy.
The crucial difference from BC: the dataset eventually contains states from p-policy, not just p-expert. The policy learns to recover from its own mistakes, because it now has expert labels at the states it tends to drift into. The compounding stops.
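The loop can be sketched in a few lines on a toy problem. This is a simplified illustration under stated assumptions, not the published algorithm in full: it omits refinements such as mixing the expert’s actions into early rollouts, and the environment, the expert, and the tabular `fit` function are all hypothetical. The state is an integer lane offset, a gust pushes the car sideways on every fourth step, and the initial demonstration is deliberately too short to include a gust, so the first cloned policy has a gap.

```python
def sign(x):
    return (x > 0) - (x < 0)

def expert(state):
    """Hypothetical expert: steer one unit back toward lane center (0)."""
    return -sign(state)

def step(state, action, t):
    """Toy dynamics: a gust pushes the car one unit right on every 4th step."""
    gust = 1 if t % 4 == 3 else 0
    return state + action + gust

def rollout(policy, horizon):
    """Run the policy from lane center and return the states it visits."""
    state, visited = 0, []
    for t in range(horizon):
        visited.append(state)
        state = step(state, policy(state), t)
    return visited

def fit(dataset):
    """'Train' a tabular policy: remember the expert label for each seen state."""
    table = {s: a for s, a in dataset}
    return lambda s: table.get(s, 0)   # unseen state: keep going straight

def count_mistakes(policy, horizon):
    return sum(policy(s) != expert(s) for s in rollout(policy, horizon))

def dagger(initial_dataset, iterations, horizon):
    dataset = list(initial_dataset)
    policy = fit(dataset)                                  # behavioral cloning on D_1
    history = [count_mistakes(policy, horizon)]
    for _ in range(iterations):
        visited = rollout(policy, horizon)                 # states the policy visits
        dataset += [(s, expert(s)) for s in visited]       # expert labels, aggregated
        policy = fit(dataset)                              # retrain on the union
        history.append(count_mistakes(policy, horizon))
    return policy, history

d1 = [(s, expert(s)) for s in rollout(expert, 3)]   # short demo, never sees a gust
policy, history = dagger(d1, iterations=2, horizon=20)
print(history)   # prints [16, 0, 0]
```

The first policy drifts further out with every gust and errs on 16 of 20 steps. After one round of expert labels on the states it actually visited, the retrained policy recovers from the gust and makes no mistakes in this toy.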
DAgger’s cost is access to the expert: the expert must be available to label every batch of new policy-visited states, which is fine when the expert is a planner running in simulation, awkward when the expert is a human driver. The practical workarounds (have the human disengage and correct only when the policy errs, use a privileged simulator-only policy as the expert) are an active area of research, but the core idea is what matters here: the only general fix for BC’s distribution shift is to get training signal from the policy’s own state distribution.
Where BC works anyway
Two genuine cases where BC is good enough that the bound is moot.
Short episodes. When T is small, the difference between epsilon T squared and epsilon T is small. A single-step prediction task (predict the next token given a context) is T = 1, where the two bounds are equal. This is much of the supervised fine-tuning step of an LLM: an instruction-response dataset is BC over single completions. The famous distribution-shift problem of multi-turn agents is exactly this lesson’s problem, returning at a different scale.
Self-correcting noise injected during demonstration. NVIDIA’s PilotNet self-driving system trained its BC policy on data where the driver’s view was sometimes synthetically perturbed (the camera image was shifted as if the car were off-center) along with the correction the driver would make in response. This artificially populates the training set with off-distribution states and the actions that recover from them, a kind of “hand-rolled DAgger” that does not require online expert queries. It is hacky and task-specific, but it works for the use case.
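The idea of populating the training set with off-center states and recovery actions can be sketched on scalar data. This is a toy illustration only, not PilotNet’s actual pipeline: the state is a scalar lane offset rather than an image, and the `shifts` and `gain` parameters and the linear correction rule are hypothetical choices made for the example.

```python
def augment_with_perturbations(dataset, shifts=(-0.2, 0.2), gain=0.5):
    """Add synthetically shifted copies of each demonstration.

    For every (state, action) pair, pretend the car was `shift` units off
    from where the expert really was, and label that state with the expert's
    action plus a correction that steers against the shift.
    """
    augmented = list(dataset)
    for state, action in dataset:
        for shift in shifts:
            corrected_action = action - gain * shift   # steer back toward the expert's path
            augmented.append((state + shift, corrected_action))
    return augmented

demos = [(0.0, 0.0), (0.1, -0.05)]
augmented = augment_with_perturbations(demos)
for state, action in augmented:
    print(f"state={state:+.2f}  action={action:+.3f}")
```

The augmented set is three times the size of the original and contains states the expert never visited, each paired with an action that pushes back toward the demonstrated path. No online expert queries are needed, but the correction rule has to be designed by hand for the task.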
The general rule: BC alone is for short, error-tolerant trajectories with abundant expert data. Long-horizon, error-sensitive tasks (driving, surgery, multi-step robot manipulation) need either DAgger or genuine RL.
Why this matters when you use AI
Behavioral cloning is everywhere in modern AI, often under different names. Supervised fine-tuning (SFT) of a language model on instruction-response pairs is behavioral cloning: the “expert” is the dataset’s author, the “state” is the prompt, the “action” is the desired response. SFT works well for short responses and short multi-turn conversations and breaks down on long-horizon agentic behavior (write code, run it, debug, repeat over many steps) for exactly the distribution-shift reason in this lesson.
This is also the structural reason the field needs RLHF (lesson 13 of this track) for long-horizon language tasks: once a model is acting over many steps, supervised imitation of curated transcripts is not enough; you need a training signal that reaches the model on the distribution of states it visits, not the distribution the labeler visited. RLHF is, in part, an answer to the same epsilon T squared problem.
And in robotics, every paper that claims a robot has “learned” a manipulation skill from human demonstrations will have an unspoken T somewhere, and a story about why their epsilon T squared is small enough not to matter (short tasks, lots of data, self-correcting perturbations, online correction). When that story is honest, BC works. When it is not, the system fails in deployment in exactly the way this lesson predicts.
Common pitfalls
Thinking BC is “supervised learning, end of story.” It is supervised learning during training, but at deployment the policy itself creates the test distribution, and that distribution disagrees with the training distribution by an amount that grows with episode length. The supervised assumption (train and test from one distribution) is broken at exactly the place that matters.
Mistaking a low training loss for a working policy. A BC policy can achieve near-zero loss on the expert’s data and still fail catastrophically in deployment, because near-zero loss on p-expert says nothing about behavior on p-policy. Evaluate on rollouts, not on held-out expert-distribution samples.
Confusing imitation with reinforcement. Imitation copies an expert; it does not optimize a reward. A behavioral-cloned agent is bounded by the expert’s performance (it cannot, in general, exceed the demonstrator) and inherits any of the expert’s biases. RL can in principle do better than any demonstrator, given a reward that is honestly specified.
Underestimating DAgger’s cost. DAgger gives order epsilon T, but requires the expert to label every batch of policy-visited states. In domains where the expert is a human, that labeling cost can dwarf the original demonstration cost. The practical algorithms most labs use are compromises (interactive correction, mixed BC+RL, learned reward models that approximate the expert’s preferences); they all derive from the same recognition that BC’s distribution shift needs an answer.
What you should remember
- Behavioral cloning is supervised learning on (state, expert action) pairs. Train a policy π_θ to predict the expert’s action by minimizing a standard supervised loss on the expert dataset. No reward, no environment interaction, no exploration during training.
- BC fails because of distribution shift. Small per-step errors put the policy in states the expert never visited; the policy was never trained there; it makes bigger errors; it drifts further. The dataset is from p-expert; the test distribution is p-policy; they disagree, and they disagree more the longer the trajectory runs.
- The quantitative bound: BC’s expected mistakes scale as order epsilon T squared in trajectory length T (quadratic), where epsilon is the per-step error rate. DAgger (which trains on states the policy actually visits, with expert labels) brings this down to order epsilon T (linear). At epsilon = 0.01 and T = 200, the worst-case BC bound is 400 mistakes; DAgger’s is 2.
- BC is the right starting point, and the wrong final answer for long horizons. It works for short, error-tolerant tasks with abundant expert data (and is most of LLM supervised fine-tuning). It fails for long-horizon, error-sensitive tasks (driving, surgery, multi-step agentic behavior), and the fix is either DAgger-style on-policy correction or genuine reinforcement learning, which is the rest of this track.
The next lesson goes back to first principles: the formal language of states, actions, rewards, and policies that lets us state precisely what the agent is trying to do. It covers Markov decision processes, returns, value functions, and the Bellman idea that organizes all of classical and deep RL.