Summary: Learning by trial and reward
Every model in this track so far learned from a fixed pile of examples with known answers. This lesson opens the final phase with a different kind of learning entirely: reinforcement learning (RL), where there is no answer key. An agent acts in an environment, receives rewards or penalties, and learns which actions pay off, the way you learned to ride a bike: by trying and feeling the consequences. That shift, from learning out of a dataset to learning from consequences, is the central change of this phase.
Core ideas
- RL has no answer key. Nobody tells the system the correct action. It discovers what works by acting and observing what happens, learning from consequences rather than from a dataset of right answers.
- The loop is state, action, reward, repeat. The agent observes a state, takes an action, and the environment returns a reward (positive for good outcomes, negative for bad) and a new state. The cast is small: an agent (the learner) and an environment (everything it acts in).
- The goal is total reward over time, not single-step correctness. What the agent learns is a policy: a strategy that, for any state, tells it which action to take. Learning means improving that policy through experience. (A mouse in a maze, penalized per step and rewarded at the cheese, learns the route nobody taught it.)
- It differs from supervised learning at the root. A supervised label says what the correct answer was; a reward says only how good an outcome was, not what to do instead. So the agent cannot copy answers; it must explore, try actions, and infer which behaviors pay off. Evaluation, not instruction.
- Two difficulties define RL. Credit assignment: when a reward is delayed (a chess win after forty moves), which earlier action earned it? And explore versus exploit: stick with the action you know pays off, or try a new one that might be better or worse? All-exploit never improves; all-explore never settles.
- A note on the word “agent.” In RL, an agent is a reward-learning decision-maker. In AI-assistant and tool-use contexts, “agent” usually means a language model wired to take actions with tools. Same word, different idea.
- It shines in games and simulations but strains in the real world. AlphaGo, which beat world-champion Go player Lee Sedol in 2016, leaned heavily on RL (alongside search and training on human games), and RL drives impressive results in games and simulated control. But it is often sample-inefficient (it can need millions of trials) and brittle, so real-world deployment is genuinely hard.
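To make the loop concrete, here is a minimal Python sketch of the mouse-and-cheese idea, shrunk to a five-cell corridor. Treat it as an illustration under stated assumptions, not a reference implementation: the corridor, the reward values (-1 per step, +10 at the cheese), the function names, and the choice of tabular Q-learning with an epsilon-greedy rule are all picked for this example and are not part of the lesson. It shows the state, action, reward cycle, one simple way to balance exploring and exploiting, and how a discount factor passes a delayed reward back to earlier moves.

```python
import random

# Hypothetical toy environment: a corridor of 5 cells. The mouse starts in
# cell 0 and the cheese sits in cell 4. Every step costs -1; reaching the
# cheese pays +10 and ends the episode.
N_STATES = 5
GOAL = N_STATES - 1
ACTIONS = (-1, +1)  # move left, move right


def step(state, action):
    """Environment: return (next_state, reward, done) for one move."""
    next_state = min(max(state + action, 0), GOAL)  # walls at both ends
    if next_state == GOAL:
        return next_state, 10.0, True
    return next_state, -1.0, False


def choose_action(q, state, epsilon, rng):
    """Epsilon-greedy: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(q[state].values())
    # Break ties at random so the untrained agent does not favor one side.
    return rng.choice([a for a in ACTIONS if q[state][a] == best])


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning; returns the learned action-value table."""
    rng = random.Random(seed)
    q = {s: {a: 0.0 for a in ACTIONS} for s in range(N_STATES)}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = choose_action(q, state, epsilon, rng)
            next_state, reward, done = step(state, action)
            # Credit assignment: the discounted value of the next state
            # passes the delayed cheese reward back to earlier moves.
            future = 0.0 if done else gamma * max(q[next_state].values())
            q[state][action] += alpha * (reward + future - q[state][action])
            state = next_state
    return q


def greedy_policy(q):
    """Policy: for each non-goal state, the action with the highest value."""
    return {s: max(ACTIONS, key=lambda a: q[s][a]) for s in range(GOAL)}


if __name__ == "__main__":
    print(greedy_policy(train()))  # expected: {0: 1, 1: 1, 2: 1, 3: 1}
```

In this sketch, `epsilon` is the explore-versus-exploit dial and `gamma` controls how much a delayed reward counts toward the moves that led to it. The learned policy points right (+1) from every cell, a route nobody told the agent in advance.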
What changes for you
Reinforcement learning shows up in the AI you meet in two ways. Directly, it powers game-playing systems and robotics and control research. Indirectly, and closer to the tools you use, a form of it tunes AI assistants: after a language model is trained on text, it is often refined using human feedback as a reward signal, learning to prefer responses people rate as helpful. That is the RL loop (an action, a reward, an improved policy) applied to behavior rather than a game. Knowing the loop helps you read both the flashy game-playing headlines and the quieter shaping of the assistants you actually use. Next, the track turns honest: the following lesson is about where deep learning breaks, the limitations that the confident demos tend to leave out.