Learning by trial and reward

Think back over everything we have built. The digit recognizer learned from images labeled with their correct digits. The sequence models learned from text where the next word was known. Even the generative models learned from a fixed pile of real examples to imitate. In every case, learning meant studying a dataset of right answers. Pull that dataset away and none of these networks have anything to learn from.

This lesson is about learning when there is no answer key at all. Nobody tells the system the correct action; it has to discover what works by trying things and seeing what happens. That is reinforcement learning, or RL, and it is how you learned to ride a bike: not from a labeled dataset of correct handlebar angles, but by wobbling, falling, adjusting, and gradually feeling out what keeps you upright. This shift, from learning from a dataset to learning from consequences, is the change of paradigm that opens this final phase.

The loop: agent, environment, reward

RL has a small cast and a simple loop. There is an agent (the decision-maker, the learner) and an environment (everything the agent acts in: the game, the road, the simulated world). They take turns.

The agent observes the state of the environment (the board position, the view from the car, the layout of the maze).
Based on that state, the agent takes an action (move a piece, turn the wheel, step left).
The environment responds: it moves to a new state, and it hands back a reward, a number that is positive for good outcomes, negative for bad ones, often zero in between.
The agent observes the new state, and the loop repeats.
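The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a standard API: the Corridor class, its reset and step methods, and the reward numbers are all hypothetical choices made for the example. The agent here does no learning yet; it only shows the shape of the loop.

```python
import random


class Corridor:
    """Hypothetical toy environment: positions 0..4, with the goal at position 4."""

    def __init__(self):
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position  # the initial state

    def step(self, action):
        # action is -1 (step left) or +1 (step right); the ends act as walls
        self.position = min(4, max(0, self.position + action))
        done = self.position == 4
        reward = 10 if done else -1  # assumed rewards, chosen for illustration
        return self.position, reward, done


def run_episode(env, choose_action, max_steps=1000):
    """Run the observe -> act -> reward -> new state loop until the goal or the step cap."""
    state = env.reset()                          # the agent observes the state
    total_reward = 0
    for _ in range(max_steps):
        action = choose_action(state)            # the agent takes an action
        state, reward, done = env.step(action)   # the environment responds
        total_reward += reward
        if done:
            break
    return total_reward


rng = random.Random(0)
random_agent = lambda state: rng.choice([-1, 1])  # ignores the state entirely
print(run_episode(Corridor(), random_agent))
```

Swapping random_agent for something that actually uses the state and the rewards it has seen is where the learning comes in.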

Make it concrete with a mouse learning a maze. The state is where the mouse currently stands. Its actions are move up, down, left, or right. The environment hands it a small penalty for each step taken (say, minus one, to discourage wandering) and a big reward for reaching the cheese (say, plus ten). At first the mouse blunders around almost at random. But over many runs, the paths that led to cheese get reinforced and the dead ends get discouraged, until the mouse heads straight for the cheese. Nobody ever told it the correct route; it learned the route from the rewards.
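One way to see the mouse learn is a sketch using tabular Q-learning, a common RL algorithm that this lesson does not name. Everything specific here is an assumption made for the example: the maze layout, the learning rate alpha, the discount gamma, the exploration rate epsilon, and the choice to give plus ten in place of the step penalty on the final move.

```python
import random

# Hypothetical maze: S = start, C = cheese, # = wall, . = open floor
MAZE = [
    "S.#.",
    ".#..",
    "...#",
    "#..C",
]
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
START, CHEESE = (0, 0), (3, 3)


def step(state, action):
    """Move unless blocked; -1 per step, +10 on reaching the cheese."""
    dr, dc = ACTIONS[action]
    r, c = state[0] + dr, state[1] + dc
    in_bounds = 0 <= r < len(MAZE) and 0 <= c < len(MAZE[0])
    if not in_bounds or MAZE[r][c] == "#":
        r, c = state  # bumped into a wall: stay put
    if (r, c) == CHEESE:
        return (r, c), 10, True
    return (r, c), -1, False


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = {}  # q[(state, action)] -> current estimate of long-run reward
    for _ in range(episodes):
        state, done = START, False
        while not done:
            if rng.random() < epsilon:
                action = rng.choice(list(ACTIONS))  # occasionally try something else
            else:
                action = max(ACTIONS, key=lambda a: q.get((state, a), 0.0))
            next_state, reward, done = step(state, action)
            best_next = 0.0 if done else max(q.get((next_state, a), 0.0) for a in ACTIONS)
            old = q.get((state, action), 0.0)
            # Nudge the estimate toward "reward now plus the best estimate from the next state"
            q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
            state = next_state
    return q


def greedy_path(q, max_steps=50):
    """Follow the highest-valued action from the start, with no exploration."""
    state, path = START, [START]
    for _ in range(max_steps):
        action = max(ACTIONS, key=lambda a: q.get((state, a), 0.0))
        state, _, done = step(state, action)
        path.append(state)
        if done:
            break
    return path


q = train()
path = greedy_path(q)
print(path)
```

Notice that nothing in the code stores the correct route. The table of estimates is shaped only by the minus ones and the plus ten.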

The agent’s goal is not to be right about any single step. It is to collect the most total reward over time. The thing it is actually learning is a policy: a strategy that, for any state it finds itself in, tells it which action to take. A good policy is one that racks up reward over the long run. Learning, in RL, means improving that policy through experience.
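A policy can be as plain as a lookup table from state to action. The sketch below is a hypothetical four-position line with the cheese at the far end, reusing the maze example's rewards (minus one per step, with plus ten replacing the penalty on the step that reaches the cheese, which is an assumption). It scores two hand-written policies by the total reward each collects.

```python
GOAL = 3  # positions 0..3 on a line; the cheese sits at position 3


def total_reward(policy, start=0, max_steps=20):
    """Follow a state -> action lookup table and add up the rewards it collects."""
    state, total = start, 0
    for _ in range(max_steps):
        move = 1 if policy[state] == "right" else -1
        state = min(GOAL, max(0, state + move))
        if state == GOAL:
            return total + 10  # reached the cheese
        total -= 1             # ordinary step penalty
    return total


direct = {0: "right", 1: "right", 2: "right"}
dithering = {0: "right", 1: "left", 2: "right"}  # turns back at position 1

print(total_reward(direct))     # 8
print(total_reward(dithering))  # -20
```

Both policies make a sensible first move. What separates them is the total they collect over the whole run, which is the quantity the agent is trying to improve.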

How this differs from supervised learning

It is worth pinning down the contrast directly, because it is the heart of the lesson.

In supervised learning (everything before this phase), each training example came with the correct answer attached. The network’s job was to match those answers, and it got told, precisely, how wrong it was on each one.

In reinforcement learning, there is no correct-action label. The agent is never told “the right move here was left.” It only ever gets a reward signal, which says how good the outcome was, not what it should have done instead. So the agent cannot simply copy answers; it has to explore, try actions, observe rewards, and infer for itself which behaviors pay off. It learns from evaluation, not from instruction.
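The difference between the two signals can be shown side by side. This is a toy sketch with hypothetical function names and made-up payoffs, not a real training setup: one function returns what a supervised learner is told, the other returns what an RL agent is told.

```python
def supervised_feedback(prediction, correct_label):
    """Supervised signal: the answer key reveals what the output should have been."""
    return {"correct": prediction == correct_label, "right_answer": correct_label}


def reward_feedback(action, hidden_rewards):
    """RL signal: a score for the action actually tried, and nothing else."""
    return hidden_rewards[action]


# Hypothetical payoffs, known to the environment but never shown to the agent
hidden_rewards = {"up": -1, "down": -1, "left": -1, "right": 10}

print(supervised_feedback("up", "right"))     # names the right answer
print(reward_feedback("up", hidden_rewards))  # just a number

# With only scores to go on, the agent has to try each action to find the best one
scores = {action: reward_feedback(action, hidden_rewards) for action in hidden_rewards}
best_action = max(scores, key=scores.get)
print(best_action)
```

The supervised reply hands over the answer in one shot. The reward reply says only that "up" scored badly, and the agent finds "right" by trying the alternatives.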

The hard part: which action gets the credit?

That reward signal hides a genuine difficulty. Imagine an agent learning chess. It plays forty moves and finally wins, earning a reward of +1 at the very end. Which of those forty moves deserves the credit? The brilliant sacrifice on move twelve? The quiet defensive move on move thirty? The reward arrived all at once, at the end, with no breakdown.

This is the credit-assignment problem, and rewards that arrive long after the actions that earned them (called delayed rewards) are what make RL hard. The agent has to work backward from sparse, delayed signals to figure out which earlier decisions actually mattered. A lot of the machinery of RL exists to solve exactly this, spreading credit from a final reward back across the chain of actions that led to it.
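One common mechanism for spreading credit backward is discounting: each move is credited with the rewards that followed it, shrunk by a factor for every step of delay. The sketch below applies it to the forty-move chess game. The discount factor gamma of 0.9 is an assumed value, and discounting is only one of several techniques, not necessarily the one any particular system uses.

```python
def discounted_returns(rewards, gamma=0.9):
    """Work backward from the end, crediting each step with the rewards that came after it."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


rewards = [0.0] * 39 + [1.0]  # forty moves, with +1 arriving only on the last one
credit = discounted_returns(rewards)

# Credit assigned to move forty, move thirty, and move twelve (lists are zero-indexed)
print(round(credit[39], 4), round(credit[29], 4), round(credit[11], 4))
```

Discounting on its own hands out credit by how recent a move was, not by how much it mattered. The sacrifice on move twelve gets less than a forgettable move thirty-nine, which is why the problem stays hard and why practical methods average over many games to separate the moves that help from the ones that merely happened.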

Explore or exploit?

One more tension, and it is one you feel in everyday life. Suppose the agent has found an action that reliably earns a modest reward. Should it keep doing the safe thing (exploit what it knows), or try something new that might be better or might be worse (explore)?

It is the restaurant problem. Do you return to the place you know is good, or try the new spot that might be great or might be a letdown? Always exploiting means you never discover anything better; always exploring means you never settle on what works. Every RL agent has to balance the two, and getting that balance right is part of what makes it learn well.
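A simple way to strike that balance is the epsilon-greedy rule: most of the time go to the best place you know, and with a small probability epsilon pick one at random. The sketch below is an illustration under assumptions, since the lesson does not prescribe this rule: the three restaurants, their chances of a good meal, and the epsilon of 0.1 are all made up.

```python
import random


def run_diner(true_quality, epsilon, visits=5000, seed=0):
    """Visit restaurants, keeping a running average rating for each one."""
    rng = random.Random(seed)
    n = len(true_quality)
    estimates = [0.0] * n
    counts = [0] * n
    for _ in range(visits):
        if rng.random() < epsilon:
            choice = rng.randrange(n)                           # explore: pick at random
        else:
            choice = max(range(n), key=lambda i: estimates[i])  # exploit: best so far
        reward = 1.0 if rng.random() < true_quality[choice] else 0.0
        counts[choice] += 1
        # Incremental update of the average rating for the chosen restaurant
        estimates[choice] += (reward - estimates[choice]) / counts[choice]
    return estimates, counts


quality = [0.2, 0.5, 0.9]  # hypothetical chance of a good meal at each place

_, greedy_counts = run_diner(quality, epsilon=0.0)
estimates, mixed_counts = run_diner(quality, epsilon=0.1)

print(greedy_counts)  # the pure exploiter never leaves the first restaurant
print(mixed_counts)
```

With epsilon at zero the diner keeps returning to the first restaurant it tried, because nothing ever gives it a reason to look elsewhere. A little exploration is enough to find the third restaurant and then spend most visits there.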

A quick word on the word “agent”

Because you may be moving between tracks, one clarification. In reinforcement learning, an “agent” means this: a decision-maker that learns from environment rewards. Elsewhere, especially in the world of AI assistants and tool use (Track 20), “agent” usually means a language model wired up to take actions with tools. Same word, genuinely different idea. When this track says agent, it means the RL sense, the reward-learning decision-maker described here.

Where RL shines, and where it strains

Reinforcement learning has produced some of the most striking results in AI. It was central to how AlphaGo learned to play Go well enough to beat a world champion, and it is behind impressive results in other games and in simulated control problems like balancing, walking, and steering. When the environment is a game or a clean simulation, where trials are cheap and the rules are crisp, RL can reach superhuman skill.

But it is only honest to name the other side. RL is typically sample-inefficient: it can take millions of trials to learn what a human picks up in a handful, which is fine in a fast simulator and painful in the real world, where each trial costs time, money, or a crashed robot. It can also be brittle, learning a policy that works narrowly and fails when the situation shifts. Real-world RL deployment, outside games and simulators, is genuinely hard, and that gap between the demos and the dependable systems is part of what the next lesson, on the limits of deep learning, takes up directly.

Why this matters when you use AI

Reinforcement learning shows up in AI you encounter in two ways worth recognizing. Directly, it powers game-playing systems and is used in robotics and control research. Indirectly, and more relevant to everyday tools, a form of it is used to tune AI assistants: after a language model is trained on text, it is often refined using human feedback as a reward signal, learning to prefer responses people rate as helpful. That is reinforcement learning’s loop (an action, a reward, an improved policy) applied to an assistant’s behavior rather than to a game. Knowing the loop helps you understand both the flashy game-playing headlines and the quieter shaping of the assistants you actually use.

Common pitfalls

Thinking RL needs labeled data like the earlier models. It does not, and that is the whole point. There is no answer key, only a reward signal that scores outcomes. The agent learns from consequences, not corrections.

Thinking the reward tells the agent what to do. It only tells the agent how good an outcome was, not which action was correct. The agent must explore and infer the rest. Evaluation, not instruction.

Underestimating the credit-assignment problem. When the reward comes at the end, figuring out which earlier actions earned it is a real and central difficulty, not a footnote.

Generalizing the game-playing wins to everything. RL’s superhuman game results are real but selective. Outside cheap simulations it is often sample-inefficient and brittle, and deploying it in the messy real world remains hard.

What you should remember

Reinforcement learning has no answer key. An agent acts in an environment and learns from a reward signal, the way you learn from consequences, rather than from a dataset of correct answers.
The loop is state, action, reward, repeat. The agent observes a state, takes an action, receives a reward and a new state, and aims to maximize total reward over time by improving its policy (its state-to-action strategy).
Two difficulties define RL: credit assignment (which earlier action earned a delayed reward?) and the explore-versus-exploit balance (try something new or stick with what works?).
It shines in games and simulations (AlphaGo beat a Go world champion with its help) but is often sample-inefficient and brittle, so real-world deployment is genuinely hard.

Supervised learning studies an answer key; reinforcement learning has none. It learns the way living things do, by acting, feeling the consequences, and slowly getting better at choosing what to do next.

Next: we have now toured how deep learning can see, read, generate, and decide. The final pair of lessons steps back to ask the harder questions. The next one is the honest one: where deep learning breaks, the limitations that every confident demo tends to leave out.