References: Imitation learning and behavioral cloning

Source material

Source curriculum (structural mirror, cited as further study):
• Berkeley CS285 (CS185), Deep Reinforcement Learning,
  Lecture 2: Behavioral Cloning + Lecture 3: Behavioral Cloning Part 2
  (collapsed into one Clawdemy lesson, per phase-0 §5)
  Instructor: Sergey Levine
  Course page: http://rail.eecs.berkeley.edu/deeprlcourse/
  Lecture videos (Fall 2023, most recent recording at time of authoring):
    https://www.youtube.com/playlist?list=PL_iWQOsE6TfVYGEGiAOMaOzzv41Jfm_Ps
  License: YouTube standard (link-out only, no embed, no transcript republication)

This Clawdemy lesson is an original walkthrough of behavioral cloning, the
quantitative reason it fails on long-horizon tasks (the O(εT²) bound from
Ross and Bagnell), and DAgger as the standard fix, following the pedagogical
arc of CS285 lectures 2 and 3. We cite the lectures as the recommended
full-depth companion; we do not reproduce or transcribe the videos. All
rights to the original lectures remain with the creator.

Watch this next

CS285 Lectures 2 and 3, Behavioral Cloning (Sergey Levine, Berkeley). These are the two lectures this Clawdemy lesson collapses into one. Levine illustrates the distribution-shift failure mode visually (a car that drifts off-lane and then keeps drifting further), derives the O(εT²) bound, and presents DAgger with its policy-rollout-then-expert-query loop. Watching the bound derived live is the cleanest way to lock in why the T² dependence is structural and not an artifact of any one analysis.
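
For readers who want the shape of the argument before watching, here is a sketch of the standard derivation. It assumes the learned policy disagrees with the expert with probability at most ε on states drawn from the training distribution, and that the per-step cost c_t lies in [0, 1] (1 for a mistake). The names p_train, p_mistake, and p_θ are notation chosen for this sketch, not necessarily the lecture's.

```latex
% Assumption: \pi_\theta(a \neq \pi^\star(s) \mid s) \le \epsilon for s \sim p_{\text{train}}.
% With probability (1-\epsilon)^t the learner has made no mistake by step t,
% so its state distribution is a mixture:
\[
p_\theta(s_t) = (1-\epsilon)^t \, p_{\text{train}}(s_t)
              + \bigl(1 - (1-\epsilon)^t\bigr) \, p_{\text{mistake}}(s_t)
\]

% Distance between the learner's and the training state distributions,
% using (1-\epsilon)^t \ge 1 - \epsilon t:
\[
\sum_{s_t} \bigl| p_\theta(s_t) - p_{\text{train}}(s_t) \bigr|
  \le 2\bigl(1 - (1-\epsilon)^t\bigr) \le 2 \epsilon t
\]

% Expected total cost over a horizon of T steps:
\[
\sum_{t=1}^{T} \mathbb{E}_{p_\theta(s_t)}[c_t]
  \le \sum_{t=1}^{T} \Bigl( \mathbb{E}_{p_{\text{train}}(s_t)}[c_t] + 2 \epsilon t \Bigr)
  \le \sum_{t=1}^{T} (\epsilon + 2 \epsilon t)
  = O(\epsilon T^2)
\]
```

The quadratic term comes from summing the linearly growing distribution gap 2εt over all T steps, which is why it does not disappear under a different analysis.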

Going deeper (the foundational papers)

Efficient Reductions for Imitation Learning (Ross and Bagnell, AISTATS 2010). The paper that proved the O(εT²) bound for behavioral cloning and identified compounding error as the failure mechanism. Short, readable, and the source you cite when someone asks where the quadratic dependence comes from.
A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning (Ross, Gordon, and Bagnell, AISTATS 2011). The DAgger paper. Introduces the dataset-aggregation algorithm, proves the O(εT) bound, and connects imitation learning to the online-learning / no-regret literature. This is the paper that taught the field “the only general fix is on-policy data.”
End to End Learning for Self-Driving Cars (Bojarski et al., NVIDIA, 2016). The paper behind the network NVIDIA later named PilotNet. Notable here for its hand-rolled workaround to BC’s distribution-shift problem: record additional off-center camera views, synthetically shift and rotate them, and label each perturbed image with the steering command that would bring the car back to the lane center, populating the training set with off-distribution states and recovery actions. The technique is one of the cleanest practical patches for BC that does not require online expert queries.
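
The DAgger loop from the Ross, Gordon, and Bagnell entry above can be sketched in a few lines. This is a minimal sketch on a toy problem, not the paper's implementation: the 1-D lane environment, the expert, the tabular `fit`, and every function name here are hypothetical stand-ins chosen for illustration. It uses the simplest mixing schedule (expert rollouts in the first round, learner-only rollouts afterwards).

```python
import random

LIMIT = 5  # toy assumption: states are integer lane offsets in [-LIMIT, LIMIT]


def expert(state):
    """Hypothetical expert: always step toward the lane center at 0."""
    return (state < 0) - (state > 0)


def step(state, action, rng):
    """Toy dynamics: apply the action plus a random disturbance, then clamp."""
    nxt = state + action + rng.choice((-2, -1, 0, 1, 2))
    return max(-LIMIT, min(LIMIT, nxt))


def fit(dataset):
    """Stand-in for supervised learning: majority-vote action per state."""
    votes = {}
    for state, action in dataset:
        votes.setdefault(state, []).append(action)
    return {s: max(set(acts), key=acts.count) for s, acts in votes.items()}


def act(policy, state):
    """Learned policy; states never seen in training fall back to 'do nothing'."""
    return policy.get(state, 0)


def rollout(choose_action, horizon, rng):
    """Run one episode from the center and return the states visited."""
    state, visited = 0, []
    for _ in range(horizon):
        visited.append(state)
        state = step(state, choose_action(state), rng)
    return visited


def dagger(iterations=5, rollouts=3, horizon=20, seed=0):
    rng = random.Random(seed)
    # Round 0 is plain behavioral cloning: states come from expert rollouts.
    dataset = [(s, expert(s))
               for _ in range(rollouts)
               for s in rollout(expert, horizon, rng)]
    policy = fit(dataset)
    for _ in range(iterations):
        for _ in range(rollouts):
            # States now come from the learner's own rollouts...
            states = rollout(lambda s: act(policy, s), horizon, rng)
            # ...but every visited state is still labeled by the expert.
            dataset += [(s, expert(s)) for s in states]
        policy = fit(dataset)  # retrain on the aggregated dataset
    return policy, dataset
```

Calling `dagger()` returns the final lookup-table policy and the aggregated dataset. In a real system `fit` would be a supervised training run and `expert` a human or planner queried on the states the learner actually visited.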

Going deeper (textbooks and tutorials)

Reinforcement Learning: An Introduction (Sutton and Barto, 2nd edition). Chapter 1’s framing of why pure imitation is not enough motivates the full RL agenda; Chapter 17 (Frontiers) touches on imitation learning and inverse reinforcement learning in its discussion of designing reward signals.
Spinning Up in Deep RL (Joshua Achiam, OpenAI). The introductory survey treats imitation learning as a complement to RL, with practical guidance on when to reach for which. A useful counterweight to a pure-RL framing of the field.

Adjacent topics

Where this sits in the wider curriculum.

RL fundamentals (next lesson). Lesson 3 introduces the Markov decision process formalism, returns, and value functions, the language that makes the rest of the track precise. With BC and its failure mode in hand, the motivation for the apparatus of MDPs is clearer.
RL for large language models (lesson 13, RLHF). The supervised fine-tuning step of an LLM (instruction-response pairs) is behavioral cloning; the reinforcement-learning-from-human-feedback step exists in part to address its long-horizon distribution-shift problem with a learned reward model and PPO fine-tuning. This lesson supplies the failure-mode analysis behind that pipeline.
T17 (RL Foundations) and T5 (AI Foundations, the LLM track). T17 covers classical RL foundations, including imitation learning at the same orientation level. T5 covers the LLM side of RLHF (preference data, reward models) that this lesson’s failure mode motivates.
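
To make the "supervised fine-tuning is behavioral cloning" identification in the RLHF entry above concrete, the two objectives can be written side by side. The notation (dataset D, prompt x, response tokens y_t) is chosen for this sketch and is not taken from the lectures.

```latex
% Behavioral cloning: maximize the log-likelihood of expert actions.
\[
\max_\theta \; \mathbb{E}_{(s, a) \sim \mathcal{D}}
  \bigl[ \log \pi_\theta(a \mid s) \bigr]
\]

% Supervised fine-tuning: the same objective with
% state s_t = (x, y_{<t}) and action a_t = y_t.
\[
\max_\theta \; \mathbb{E}_{(x, y) \sim \mathcal{D}}
  \Bigl[ \sum_{t=1}^{|y|} \log \pi_\theta(y_t \mid x, y_{<t}) \Bigr]
\]
```

At generation time the model conditions on its own sampled tokens rather than the demonstrator's, which is the same mismatch between training states and deployment states that this lesson analyzes.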