Deep RL open problems: brief

What you will learn

You will identify the four open frontiers of deep RL (sample efficiency, safety and alignment, generalization, real-world deployment), explain how each connects back to the algorithms covered earlier in T18, recognize the tensions across frontiers (sample efficiency versus safety; generalization versus verifiability; capability versus failure stakes), and read claims about modern AI systems with the structural literacy that the T18 vocabulary provides. You will leave with a working map of where the field is moving, how the T18 algorithms prepare you to read the frontier literature, and where the open problems sit relative to current engineering practice. This lesson closes Phase 3 and the entire 18-lesson T18 syllabus.

Where this fits

This is lesson 18 of Track 18 (Deep Reinforcement Learning), lesson 6 of Phase 3 (rl-frontiers), and the final lesson of the track. It synthesizes the algorithmic content covered across Phase 1, Phase 2, and Phase 3 into a frontiers-and-open-problems map; it closes the track with a syllabus recap and a cross-track-coherence summary.

Source

Berkeley CS285 (Sergey Levine, Fall 2023), lecture on Challenges and open problems. Canonical URL http://rail.eecs.berkeley.edu/deeprlcourse/. The lesson cites 20+ primary papers across the four frontiers plus surveys and benchmarks; the references file enumerates them.

Phase advance

Phase 3 lesson 6 (phase_order: 6). FINAL LESSON of Track 18. Closes Phase 3 (rl-frontiers) and the 18-lesson syllabus.

Lesson body (lesson.mdx)

Hook: you have the toolkit; this lesson steps back to survey the frontier.
Four open frontiers introduced.
Sample efficiency: gap to biological learners; world models, demonstrations, meta-RL, exploration, offline RL all attack it; gap remains.
Safety and alignment: reward hacking, distributional shift, sequence-level safety in agentic systems; RLHF as one engineering practice; alignment remains open as a structural problem.
Generalization: domain randomization, self-supervised pretraining, causal representations, test-time adaptation; tension with specialization and sample efficiency.
Real-world deployment: sim-to-real gap; long-horizon tasks; online learning under distribution shift; safety-under-deployment.
Track 18 recap: phase-by-phase summary.
Where T18 fits in the curriculum: T17 (RL Foundations), T11/T12/T13 (NN stack), and T4/T8 (math stack) as prereqs; T19 as a parallel track; T20 and T23 as successor tracks.
Why this matters: production AI failures often have structural origins in the open frontiers.
Common pitfalls (5): single-algorithm-as-answer; sim-to-real underestimation; engineering-vs-structural progress conflation; safety-as-orthogonal-to-capability; “RLHF solved alignment.”
5 remember-bullets.
Closing remark.

Practice (practice.mdx)

Two exercises plus five flashcards.

Place the paper (6 summaries): each paper summary is mapped onto one of the four open-problem categories. Tests pattern recognition across the frontier literature.
Trace the failure (5 hypothetical failures): each deployment failure traced to its structural origin in an open frontier and a T18 algorithm. Tests structural literacy.

Five flashcards: four frontiers and what each addresses; reward hacking origin; sample efficiency vs safety tension; sample efficiency methods recap; “RLHF did not solve alignment” structural meaning.

Cheatsheet (cheatsheet.mdx)

Tables. Four open frontiers with T18 connections. Sample-efficiency methods. Safety sub-problems. Tensions across frontiers. T18 syllabus recap. Where T18 fits in the curriculum. Pitfalls. Remember-bullets.

References (references.mdx)

CS285 primary. Sample efficiency: Hafner DreamerV3, Ye EfficientZero, Wu DayDreamer. Safety: Hendrycks catastrophic risks, Ngo alignment problem, Bai Constitutional AI, Irving debate, Christiano amplification, Olah circuits, Templeton scaling monosemanticity. Generalization: Schölkopf causal reps, Sun test-time training, Tobin domain randomization, Cobbe Procgen. Real-world: Andrychowicz dexterous, Kalashnikov QT-Opt, Akkaya Rubik’s, Smith walk-in-park. Surveys: Henderson DRL that matters, Kirk generalization, Amodei concrete AI safety.

Editorial discipline

Stage 2 sweep: em/en dashes (0), inline math backticks in lesson.mdx outside fenced blocks (0), Greek letters in prose spelled out where they appear (none expected for this conceptual lesson), placeholder comments present in the brief.
§6 watch-zone: this lesson sits closest to AI safety territory of any in the track. The framing is strictly technical: the four open frontiers are described as research areas with active engineering, not as policy positions. “RLHF did not solve alignment” is presented as a structural-engineering observation, not a critique of any specific deployment or vendor. Constitutional AI, debate, amplification, mechanistic interpretability, scalable oversight are all cited as research directions; no endorsement or contestation of any specific approach.
Vendor naming: Anthropic, OpenAI, DeepMind, Google named as paper-author affiliations across the references (Templeton 2024 Anthropic, Amodei 2016 OpenAI, Andrychowicz 2020 OpenAI, Hafner 2023 Google, etc.); positive citations as research-paper authors; no anonymization triggers.
A1 verbatim discipline: no vendor quotations.

Word counts

Lesson 2105
Practice 1380
Summary 670
Cheatsheet 605
References 700
Brief 1015

Total ≈ 6475 words across 6 artifacts.

Notes for promotion

Component placeholders (�J0�, �J1�) as MDX comments. �J2� for CS285 “Challenges and open problems”.
Practice uses real �J0� + �J1� component imports.
Every prior T18 lesson is cross-linked in this lesson body, summary, and cheatsheet; the prereq-path forms are all lessons/deep-reinforcement-learning/�J0�.
Lesson body has no fenced display blocks because the content is conceptual rather than algorithmic. The math-gloss convention is trivially satisfied (no inline math notation to handle).
This lesson closes Track 18 (18 lessons total): with Phase 1 (L1-L5), Phase 2 (L6-L12), and Phase 3 (L13-L18) done, the syllabus is complete. Promoting this lesson completes the deep-RL track in the curriculum.
Cross-track coherence summary in the closing names T17 (RL Foundations prereq), T11/T12/T13 (NN stack), T4/T8 (math stack), T20 (Agents and Tool Use successor), T19 (Diffusion parallel), T23 (AI Safety successor). Lead may want to wire these as inter-track navigation when promoting.