Summary: Introduction to deep reinforcement learning

Reinforcement learning is the third major regime of machine learning, alongside supervised and unsupervised learning. Where supervised learning learns from labeled examples, RL learns from rewards collected by an agent acting in an environment over time. The “deep” variant uses neural networks where classical RL used tables, and that one substitution is what gives the field its reach (Atari, AlphaGo, robotics, RLHF) and its difficulty (delayed rewards, distribution shift, broken convergence proofs). This is the scan-it-in-five-minutes version.

  • The three ML regimes. Supervised: fixed (input, label) pairs, predict the output. Unsupervised: fixed unlabeled data, find structure. Reinforcement learning: an agent acts in an environment, receives rewards (often delayed), and its own actions generate the data as it learns. RL is not “supervised with rewards”; the data is policy-generated, the reward is delayed, and the agent must choose actions.
  • The agent-environment loop. At each timestep t, the agent observes state s_t, picks action a_t according to its policy π, and receives reward r_t and next state s_(t+1). Tasks are either episodic (they terminate) or continuing (they run on an infinite horizon). The policy is the function the agent is learning; in deep RL, the policy (or a value function from which it is derived) is represented by a neural network.
  • Return and discount. The agent maximizes the discounted return G_t = r_t + γ·r_(t+1) + γ²·r_(t+2) + ..., with discount factor γ between 0 and 1 (strictly below 1 on an infinite horizon, so the sum stays finite). Worked: r = (0, 0, 1), γ = 0.9 gives G_0 = 0 + 0.9·0 + 0.9²·1 = 0.81. Smaller γ is short-sighted; larger γ is far-sighted. The expected return (over environment and policy randomness) is what training optimizes.
  • Why “deep.” Classical RL stores one value per state in a lookup table. Atari has more possible frames than atoms in the observable universe; the table cannot hold them. The fix is a neural-network function approximator for the policy or value function, which generalizes from states already seen. The gain is scale; the cost is that classical tabular convergence guarantees no longer apply.
  • What makes deep RL hard (the whole track’s agenda): delayed reward and credit assignment (the action that caused the payoff was many steps ago); distribution shift during training (the policy changes, so the data changes); function approximation breaks classical proofs (requiring engineered stabilizers like replay buffers and target networks); exploration vs exploitation (use what works vs try what might); sample efficiency (acting is expensive, especially in robotics).
  • Where it has shown up. Atari (DQN, 2013-15); AlphaGo and AlphaZero (2016-17, board → move → win/loss); robotics (simulated training, real-robot transfer); preference-based post-training of LLMs (ChatGPT, Claude, Gemini): the canonical RLHF recipe trains a reward model from human preferences and post-trains via PPO, with RLAIF / Constitutional-AI variants (AI-generated preferences) and DPO-style direct-preference methods now competing.
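
The agent-environment loop and the discounted return from the bullets above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: CorridorEnv, always_right, random_policy, and run_episode are hypothetical names made up for this example, and the environment is a toy three-step corridor chosen so that one policy reproduces the worked rewards (0, 0, 1).

```python
import random


def discounted_return(rewards, gamma):
    """Compute G_0 = r_0 + gamma*r_1 + gamma^2*r_2 + ... for one episode."""
    g = 0.0
    # Work backwards so each step applies G_t = r_t + gamma * G_(t+1).
    for r in reversed(rewards):
        g = r + gamma * g
    return g


class CorridorEnv:
    """Hypothetical toy environment: start at cell 0, goal at cell `length`."""

    def __init__(self, length=3):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action 0 = stay, action 1 = move one cell to the right
        self.state = min(self.state + action, self.length)
        done = self.state == self.length
        reward = 1.0 if done else 0.0  # reward arrives only at the goal (delayed)
        return self.state, reward, done


def always_right(state):
    """Hand-written policy: always move right."""
    return 1


def random_policy(state):
    """Untrained policy: pick an action uniformly at random."""
    return random.choice([0, 1])


def run_episode(env, policy, max_steps=100):
    """Run the loop: observe s_t, pick a_t, receive r_t and s_(t+1)."""
    state = env.reset()
    rewards = []
    for _ in range(max_steps):
        action = policy(state)
        state, reward, done = env.step(action)
        rewards.append(reward)
        if done:
            break
    return rewards


if __name__ == "__main__":
    rewards = run_episode(CorridorEnv(length=3), always_right)
    # rewards is [0.0, 0.0, 1.0], so the return with gamma = 0.9 is about 0.81
    print(rewards, discounted_return(rewards, gamma=0.9))
```

Swapping always_right for random_policy makes the episode length, and therefore the return, vary from run to run, which is the "policy randomness" that the expected return averages over. Nothing here learns; a deep-RL method would replace the hand-written policy with a neural network and adjust it to raise that expected return.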

You now have the frame to read deep-RL claims with calibrated skepticism. An RL-trained system has learned to maximize a number you chose, on a distribution it generated by acting, with all the credit-assignment and distribution-shift caveats that come with that. RLHF-trained chatbots in particular are not “trained to be helpful in the abstract”; they are optimizing a learned proxy for human preference, which is useful but not the same thing. What the reward signal rewards is the behavior you get, no more and no less. The next lesson takes the simplest possible approach to RL: ignore the reward entirely and just imitate an expert, with behavioral cloning. It works, partly, and the way it fails reveals exactly why genuine reinforcement learning is needed for the rest of the track.