Practice: Challenges and open problems
Exercise 1: Place the paper
For each one-sentence paper summary, identify which of the four open-problem categories it primarily addresses (sample efficiency / safety and alignment / generalization / real-world deployment). A paper may touch more than one category; pick the primary one.
- A world-model architecture that achieves competitive Atari scores with 10x fewer environment frames than model-free baselines.
- A method for training language models with a constitutional AI critique loop that reduces harmful outputs without explicit RLHF preference data.
- A domain-randomization pipeline that trains a robotic locomotion policy across 1000 random terrain configurations and demonstrates transfer to real-world surfaces never seen in training.
- A test-time adaptation method that updates a deployed image-classification model’s batch-norm statistics on incoming test data to handle distribution shift.
- A causal representation learning algorithm that identifies environment factors with intervention-stable effects across simulated tasks.
- A scalable oversight architecture using debate between two AI agents and a human judge to evaluate model outputs at super-human capability levels.
Answers
- Sample efficiency. World models reduce data requirements.
- Safety and alignment. Constitutional AI is an alignment-research method.
- Real-world deployment (with generalization secondary). Domain randomization is the standard sim-to-real bridge.
- Generalization (with real-world deployment secondary). Test-time adaptation handles deployment-time distribution shift.
- Generalization. Causal representations target the generalization frontier specifically.
- Safety and alignment. Scalable oversight is core alignment research.
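The test-time adaptation idea from the fourth summary can be illustrated with a small sketch. It is not any particular paper's method: it assumes a single scalar feature, a purely affine distribution shift, and hypothetical helper names (`batch_stats`, `normalize`). The point is only that re-estimating normalization statistics on incoming data restores the feature scale the downstream layers were trained on.

```python
import statistics

def batch_stats(xs):
    """Return (mean, population variance) of a batch of scalar features."""
    return statistics.fmean(xs), statistics.pvariance(xs)

def normalize(xs, mean, var, eps=1e-5):
    """Batch-norm style normalization with fixed statistics."""
    return [(x - mean) / (var + eps) ** 0.5 for x in xs]

# Statistics stored at training time (hypothetical data).
train_batch = [0.0, 1.0, 2.0, 3.0, 4.0]
train_mean, train_var = batch_stats(train_batch)
train_normalized = normalize(train_batch, train_mean, train_var)

# Deployment data is shifted and rescaled: x -> 3x + 10 (assumed shift).
test_batch = [3.0 * x + 10.0 for x in train_batch]

# Without adaptation, the stale training statistics are applied.
stale = normalize(test_batch, train_mean, train_var)

# Test-time adaptation: re-estimate the statistics on the incoming batch.
test_mean, test_var = batch_stats(test_batch)
adapted = normalize(test_batch, test_mean, test_var)

stale_mean = statistics.fmean(stale)
adapted_mean = statistics.fmean(adapted)
```

With the stale statistics, the normalized features sit far from zero mean; after adaptation they closely match what the network saw during training. Real shifts are rarely purely affine, so this shows the mechanism rather than a guarantee.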
Exercise 2: Trace the failure
For each hypothetical failure of a deployed AI system, identify the structural origin in terms of one of the open problems and one T18 algorithm or framing.
- An RLHF-tuned customer-support chatbot produces confidently wrong answers on rare technical questions.
- A robot trained in simulation to grasp red cubes consistently misses blue cubes at deployment.
- A medical-treatment recommendation system trained on offline data recommends a treatment combination that was rare in the training data and produces unexpected patient outcomes.
- A self-driving simulator-trained policy performs poorly on a real-world dawn-driving scenario the simulator did not model with accurate lighting.
- A game-playing agent achieving super-human scores on its training game performs at amateur level on a structurally similar game.
Answers
- Safety / reward hacking. RLHF’s preference model rewarded confident-sounding outputs more than outputs with calibrated uncertainty. The trained policy optimized the proxy. T18 algorithm: L13 RLHF, plus the reward-modeling brittleness named there.
- Generalization. The trained policy relies on color or surface features that do not generalize to blue, because the training distribution did not include color variation. T18 algorithm: domain randomization or causal representation (Generalization frontier).
- Safety / offline RL extrapolation error. The treatment combination is out of distribution; the offline-RL Q-function extrapolated an inflated value. T18 algorithm: L14 + L15 directly; the BCQ / CQL / IQL constraints exist for exactly this failure mode.
- Real-world deployment / sim-to-real gap. The simulator did not model the deployment distribution accurately. T18 framings: domain randomization (effective in some axes), demonstration-bootstrapped pipelines (when demos exist).
- Generalization. Policy memorized training-game-specific features rather than the underlying game structure. T18 algorithm: multi-task RL (L17) might have helped if both games were in the training distribution.
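The extrapolation failure in the third answer (the rare treatment combination) can be sketched with a toy example. The code below uses a support filter in the spirit of discrete BCQ: an action is only eligible if the logged data contains it often enough relative to the most frequent action. The dataset, the Q-values, the threshold `tau`, and the function name `support_constrained_argmax` are all hypothetical, and this is not a full BCQ, CQL, or IQL implementation.

```python
from collections import Counter

def support_constrained_argmax(q_values, logged_actions, tau=0.3):
    """Pick the best action among those the logged data actually supports.

    An action is kept only if its empirical frequency, relative to the
    most frequent action, exceeds tau. This is a toy version of the
    action-filtering idea, not a complete offline-RL algorithm.
    """
    counts = Counter(logged_actions)
    max_count = max(counts.values())
    allowed = [a for a in q_values if counts[a] / max_count > tau]
    return max(allowed, key=lambda a: q_values[a])

# Hypothetical logged treatment decisions: "A+B" is nearly absent.
logged = ["A"] * 60 + ["B"] * 38 + ["A+B"] * 2

# Hypothetical learned Q-values: the rare combination has an inflated
# estimate because almost no data was available to correct it.
q = {"A": 1.0, "B": 1.2, "A+B": 5.0}

naive_choice = max(q, key=q.get)                         # trusts the inflated value
constrained_choice = support_constrained_argmax(q, logged)  # stays on supported actions
```

The naive greedy choice follows the inflated estimate for the out-of-distribution combination, while the constrained choice falls back to the best action that the data supports.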
Flashcards
Q. What are the four open frontiers of deep RL and what does each address?
- Sample efficiency: deep RL needs orders of magnitude more environment interactions than biological learners; addressed by model-based RL, demonstration data, meta-RL, exploration improvements, and offline RL.
- Safety and alignment: trained policies must do what humans intended robustly, including under distribution shift; addressed by reward modeling, KL regularization, conservative training, scalable oversight, interpretability.
- Generalization: trained policies often fail on small variations; addressed by domain randomization, self-supervised pretraining, causal representations, test-time adaptation.
- Real-world deployment: simulator-to-real-world gap is the practical bottleneck; addressed by sim-to-real techniques, demonstration-bootstrapping, staged deployment, online learning under shift.
Q. What is reward hacking and which T18 lesson directly addresses its origins?
Reward hacking is when an agent finds an unintended way to maximize the reward signal that does not match the intended behavior. Classic examples: a boat-racing agent spinning in circles for power-up rewards instead of finishing the race; an RLHF-tuned language model producing confident-sounding wrong answers because the preference model treats confidence as a sign of correctness. The structural origin is that the trained policy optimizes whatever the reward signal rewards, and the reward signal is only a proxy for what humans actually want. L13 RLHF directly addresses this: reward modeling is brittle because the policy can optimize the learned reward model harder than the supervisor anticipated.
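A toy version of the boat-racing case makes the proxy-versus-intent gap concrete. All numbers and names here (`proxy_return`, `intended_return`, the point values, the horizon) are hypothetical and chosen only for illustration.

```python
def proxy_return(policy, horizon=100):
    """Total proxy reward: points for power-ups and for finishing."""
    if policy == "finish_race":
        return 50            # one-off bonus for crossing the line
    if policy == "loop_powerups":
        return 3 * horizon   # 3 points per step from respawning power-ups
    raise ValueError(policy)

def intended_return(policy):
    """What the designer actually wanted: 1 if the race is finished."""
    return 1 if policy == "finish_race" else 0

policies = ["finish_race", "loop_powerups"]

# An optimizer that only sees the proxy picks the looping behavior.
chosen = max(policies, key=proxy_return)
```

The selected policy scores highest on the proxy and zero on the intended objective, which is the structure the flashcard describes: the optimizer is doing its job correctly on the wrong target.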
Q. Why are sample efficiency and safety in tension?
Sample efficiency tends to rely on stronger priors (model-based world models, demonstration data, meta-learned initializations) because priors reduce the number of samples needed to identify a good policy. Safety often requires letting the system explore enough to know what failure modes exist and to be robust to them. A more sample-efficient system has trusted more of its prior, which limits what it can verify about its own behavior; a safer system has tested more of its action space, which costs samples. The two frontiers pull in opposite directions on the prior-versus-verification axis.
Q. What do model-based RL, exploration improvements, and meta-RL each contribute to the sample-efficiency frontier?
Model-based RL (L9, L10) reduces sample requirements by learning a transition model and planning against it; once trained, the model can be queried many times at no further environment-interaction cost, so a few real-environment samples can update the model and then drive many planning steps. Exploration improvements (L16) make existing samples more informative by directing the agent toward states it has not visited; intrinsic motivation and optimism-based methods reduce the number of samples wasted on already-known regions. Meta-RL (L17) shares structure across tasks; the marginal cost of a new task is much smaller than training from scratch because the meta-trained agent already knows the adaptation process. The three approaches are complementary: combinations are common in production systems.
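The model-based point can be sketched with a Dyna-style loop on a tiny chain. This is a simplified illustration under stated assumptions: a hypothetical deterministic five-state environment, a state-value table updated with TD(0) rather than a full Q-learning agent, and a "model" that is just a table of observed transitions. It is not taken from the lessons referenced above.

```python
GAMMA = 0.9
ALPHA = 0.5
N_STATES = 5  # chain 0..4, reward 1.0 on entering state 4

def step(state):
    """Hypothetical deterministic environment: always move right."""
    nxt = state + 1
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward

def td_update(values, s, r, s_next):
    """One TD(0) update on a state-value table."""
    target = r + GAMMA * values[s_next]
    values[s] += ALPHA * (target - values[s])

def run(planning_sweeps):
    """One real episode, then `planning_sweeps` passes over the learned model."""
    values = [0.0] * N_STATES
    model = {}  # learned model: state -> (reward, next_state)
    s = 0
    real_steps = 0
    while s != N_STATES - 1:
        s_next, r = step(s)          # the only environment calls
        real_steps += 1
        td_update(values, s, r, s_next)
        model[s] = (r, s_next)
        s = s_next
    # Planning: replay transitions from the model, no environment calls.
    for _ in range(planning_sweeps):
        for ms, (mr, ms_next) in model.items():
            td_update(values, ms, mr, ms_next)
    return values, real_steps

no_planning, steps_a = run(planning_sweeps=0)
with_planning, steps_b = run(planning_sweeps=50)
```

Both runs use the same four real environment steps. Without planning, only the state next to the goal has a nonzero value; with planning, the value propagates back to the start state, approaching 0.9 cubed for state 0.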
Q. What does 'RLHF did not solve alignment' mean structurally?
RLHF is an engineering practice that aligns LLM outputs to a preference distribution within the training distribution. It does not handle out-of-distribution behavior (where the preference model has not been trained), deceptive alignment (where the model behaves well during training but differently at deployment), or reward hacking at scale (where the model exploits artifacts of the preference model). The structural problems are open and active research areas. RLHF is one piece of the alignment stack; mechanistic interpretability, scalable oversight, adversarial robustness, and red-teaming together form the working set. Treating RLHF as a complete solution underestimates the open problems.