References: Imitation learning and behavioral cloning

Source curriculum (structural mirror, cited as further study):
• Berkeley CS285 (CS185), Deep Reinforcement Learning,
Lecture 2: Behavioral Cloning + Lecture 3: Behavioral Cloning Part 2
(collapsed into one Clawdemy lesson, per phase-0 §5)
Instructor: Sergey Levine
Course page: http://rail.eecs.berkeley.edu/deeprlcourse/
Lecture videos (Fall 2023, most recent recording at time of authoring):
https://www.youtube.com/playlist?list=PL_iWQOsE6TfVYGEGiAOMaOzzv41Jfm_Ps
License: YouTube standard (link-out only, no embed, no transcript republication)
This Clawdemy lesson is an original walkthrough of behavioral cloning, the
quantitative reason it fails on long-horizon tasks (the O(εT²) bound from
Ross and Bagnell), and DAgger as the standard fix, following the pedagogical
arc of CS285 lectures 2 and 3. We cite the lectures as the recommended
full-depth companion; we do not reproduce or transcribe the videos. All
rights to the original lectures remain with the creator.
  • CS285 Lectures 2 and 3, Behavioral Cloning (Sergey Levine, Berkeley). The two lectures this Clawdemy lesson collapses into one. Levine illustrates the distribution-shift failure mode visually (a car that leaves its lane and drifts further off it), derives the O(εT²) bound, and presents DAgger as a loop: roll out the learned policy, then query the expert for labels on the states it visited. Watching the bound derived live is the cleanest way to see why the quadratic dependence on the horizon T is structural and not an artifact of any one analysis.
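
The rollout-then-expert-query loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the lecture's implementation: the one-dimensional "lane offset" state, the linear expert in expert_action, the noisy dynamics in rollout, and the least-squares learner in fit are all hypothetical stand-ins chosen to keep the example small.

```python
import numpy as np

def expert_action(state):
    # Hypothetical expert: steer back toward the lane centre at state 0.
    return -0.5 * state

def rollout(weight, horizon, rng):
    # Run the linear policy a = weight * s and record the states it visits.
    state = rng.normal()
    states = []
    for _ in range(horizon):
        states.append(state)
        # Assumed toy dynamics: next state = state + action + small noise.
        state = state + weight * state + 0.1 * rng.normal()
    return np.array(states)

def fit(states, actions):
    # Supervised step: least-squares fit of a = weight * s.
    return float(states @ actions / (states @ states))

def dagger(iterations=5, horizon=20, seed=0):
    rng = np.random.default_rng(seed)
    # Iteration 0 is plain behavioral cloning on the expert's own rollout.
    states = rollout(-0.5, horizon, rng)
    actions = expert_action(states)
    weight = fit(states, actions)
    for _ in range(iterations):
        new_states = rollout(weight, horizon, rng)     # learner visits its own states
        new_actions = expert_action(new_states)        # expert relabels those states
        states = np.concatenate([states, new_states])  # aggregate the dataset
        actions = np.concatenate([actions, new_actions])
        weight = fit(states, actions)                  # retrain on everything so far
    return weight, len(states)

weight, n_samples = dagger()
```

Because the toy expert is exactly linear, plain behavioral cloning already recovers it here; the sketch shows only the shape of the loop. The training states come from the learner's own rollouts, so the dataset covers the states the learner actually reaches.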

Where this sits in the wider curriculum.

  • RL fundamentals (next lesson). Lesson 3 introduces the Markov decision process formalism, returns, and value functions, the language that makes the rest of the track precise. With BC and its failure mode in hand, the motivation for the apparatus of MDPs is clearer.

  • RL for large language models (lesson 13, RLHF). The supervised fine-tuning step of an LLM (training on instruction-response pairs) is behavioral cloning. The reinforcement-learning-from-human-feedback step, which uses a learned reward model and PPO fine-tuning, exists in part to address the long-horizon distribution-shift problem that supervised fine-tuning inherits. This lesson supplies the quantitative account of the failure mode behind that pipeline.

  • T17 (RL Foundations) and T5 (AI Foundations, the LLM track). T17 covers classical RL foundations, including imitation learning at the same orientation level. T5 covers the LLM side of RLHF (preference data, reward models) that this lesson’s failure mode motivates.
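
The statement in the list above that supervised fine-tuning is behavioral cloning can be made concrete by writing the two objectives side by side. The notation here is an assumption introduced for illustration: \pi_\theta is the learned policy or language model, \mathcal{D} is the demonstration dataset, x is a prompt, and y_{1:T} is the reference response.

```latex
% Behavioral cloning: maximize the likelihood of expert actions on expert states.
\min_\theta \; -\,\mathbb{E}_{(s,a)\sim\mathcal{D}}\bigl[\log \pi_\theta(a \mid s)\bigr]

% Supervised fine-tuning: maximize the likelihood of reference tokens.
\min_\theta \; -\,\mathbb{E}_{(x,\,y_{1:T})\sim\mathcal{D}}
  \Bigl[\sum_{t=1}^{T} \log \pi_\theta\bigl(y_t \mid x,\, y_{<t}\bigr)\Bigr]

% Identification: state s_t = (x, y_{<t}), action a_t = y_t.
```

Under that identification the two losses coincide. During training the prefix y_{<t} comes from the reference response, while at generation time it comes from the model's own samples; this is the same mismatch between training states and visited states that the O(εT²) analysis addresses.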