Learning by trial and reward

This is lesson 8 of Track 12 (Introduction to Deep Learning) and the opener of Phase 3. Every model in the track so far learned from a fixed pile of examples with known answers: labeled images, text with a known next word, real examples to imitate. This lesson is about learning when there is no answer key at all.

In reinforcement learning (RL), nobody tells the system the correct action. It acts in an environment, receives rewards or penalties, and discovers which behaviors pay off, the way you learned to ride a bike: not from labeled handlebar angles but by wobbling, adjusting, and feeling out what keeps you upright. The lesson builds the agent-environment-reward loop, contrasts it sharply with supervised learning, and names the two difficulties that define the field.
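The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `CorridorEnv`, `random_policy`, and `run_episode` are hypothetical names invented for this sketch, not part of any library, and the policy here acts at random and does no learning. The point is the shape of the interaction: observe a state, choose an action, receive a reward, repeat.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: cells 0..length-1; reaching the last cell pays +1."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        # Start every episode at the left end of the corridor.
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (move left) or +1 (move right); walls clamp the position.
        self.state = min(max(self.state + action, 0), self.length - 1)
        done = self.state == self.length - 1
        # The reward only evaluates the outcome; it never says which action was correct.
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def random_policy(state, rng):
    # A policy maps a state to an action. This one ignores the state entirely.
    return rng.choice([-1, 1])


def run_episode(env, policy, rng, max_steps=100):
    # The agent-environment loop: state -> action -> reward -> next state.
    state = env.reset()
    total_reward = 0.0
    steps = 0
    for _ in range(max_steps):
        action = policy(state, rng)
        state, reward, done = env.step(action)
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps


rng = random.Random(0)
total, steps = run_episode(CorridorEnv(), random_policy, rng)
print(f"total reward: {total}, steps taken: {steps}")
```

A learning agent would replace `random_policy` with something that changes in response to the rewards it has seen. Note that the reward arrives only at the final cell, so every earlier move goes unscored until the episode ends.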

Phase 3 (decisions, limits, and the frontier) begins here, and this lesson is the paradigm turn of the track: from learning out of a dataset to learning from consequences. It completes the tour of what deep learning can do (see, read, generate, and now decide) before the final pair of lessons steps back to the harder questions. The next lesson, Where deep learning breaks, picks up directly on RL’s limits (sample inefficiency, brittleness) and broadens them into the limitations of the field as a whole.

Prerequisites: the earlier Track 12 lessons that established supervised learning, where a model learns from a dataset of correct answers. This lesson defines itself by contrast with that paradigm, so you need to be comfortable with the idea of learning from labeled examples. Everything specific to reinforcement learning is built from scratch here; no prior RL exposure is assumed.

  • Describe the reinforcement learning loop (state, action, reward, repeat) and name the agent, environment, and policy
  • Explain how reinforcement learning differs from supervised learning (reward as evaluation, not instruction)
  • Explain the credit-assignment problem and why delayed rewards make RL hard
  • Explain the explore-versus-exploit tradeoff
  • Recognize where RL shines (games, simulations) and where it strains (sample inefficiency, brittleness, real-world deployment)
  • Read time: about 8 minutes
  • Practice time: about 15 minutes (a scenario-modeling exercise, a sort, and flashcards)
  • Difficulty: intro