Practice: Learning by trial and reward

Self-check

Six short questions. Try to answer each one in your head (or on paper) before opening the collapsible. Active retrieval is where the learning sticks; rereading feels productive but does much less.

1. What makes reinforcement learning different from every model earlier in this track?

Show answer

There is no answer key. Earlier models learned from a fixed dataset of correct answers; an RL agent has none. It acts in an environment, receives rewards or penalties, and learns which behaviors pay off from the consequences, the way you learned to ride a bike.

2. Name the steps of the RL loop, and say what the agent is ultimately trying to maximize.

Show answer

The agent observes a state, takes an action, and the environment returns a reward and a new state; then the loop repeats. The agent is not trying to be right on any single step; it is trying to collect the most total reward over time. What it learns is a policy: a strategy that tells it which action to take in any state.
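If it helps to see the loop written down, here is a minimal sketch in Python. Everything in it is a made-up illustration and not part of the lesson: the `Corridor` environment, its reward values (+1 at the goal, -0.1 per other step), and the two hand-written policies are all assumptions chosen to keep the example tiny. Nothing here learns yet; it only shows the observe, act, reward, new-state cycle and the running total the agent cares about.

```python
import random


class Corridor:
    """Hypothetical toy environment: positions 0..4, with the goal at 4."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position  # the initial state

    def step(self, action):
        # action is -1 (left) or +1 (right); the walls clamp the position
        self.position = max(0, min(self.length - 1, self.position + action))
        done = self.position == self.length - 1
        reward = 1.0 if done else -0.1  # assumed values: step cost, goal bonus
        return self.position, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one pass of the RL loop and return the total reward collected."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                  # the agent chooses an action
        state, reward, done = env.step(action)  # the environment responds
        total_reward += reward                  # total over time is what counts
        if done:
            break
    return total_reward


def always_right(state):
    """A fixed policy: the same action in every state."""
    return +1


def random_policy(state):
    """A policy that ignores the state and acts at random."""
    return random.choice([-1, +1])


print(round(run_episode(Corridor(), always_right), 2))  # 0.7
print(run_episode(Corridor(), random_policy))           # varies, never above 0.7
```

A policy here is just a function from state to action. Learning would mean replacing `random_policy` with something that moves toward `always_right` as rewards come in.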

3. In one sentence, how does the reward signal differ from a supervised label?

Show answer

A supervised label says what the correct answer was; a reward says only how good an outcome was, not which action was correct. The agent learns from evaluation, not instruction, so it must explore and infer the rest.

4. What is the credit-assignment problem?

Show answer

When a reward arrives long after the actions that earned it (a chess win after forty moves, +1 only at the end), the agent has to work out which earlier decisions actually deserved the credit. Spreading a delayed reward back across the chain of actions that led to it is a central difficulty in RL.
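One common way to spread a delayed reward backward is to give each step the reward that follows it, shrunk a little for every step of delay. The sketch below assumes that approach; the discount factor `gamma` and the function name `discounted_returns` are illustrative choices, not something this lesson defines.

```python
def discounted_returns(rewards, gamma=0.9):
    """Credit each step with its own reward plus the discounted future reward."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backward so each step can reuse the credit computed for the next one.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Four moves, with +1 arriving only at the very end.
rewards = [0.0, 0.0, 0.0, 1.0]
print([round(g, 3) for g in discounted_returns(rewards)])
# [0.729, 0.81, 0.9, 1.0]
```

Every earlier move gets some credit for the final win, and moves closer to the reward get more. This is a bookkeeping rule, not a full answer to which move truly mattered, which is why credit assignment stays hard.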

5. Explore versus exploit: what is the tension, and why does going all the way to either extreme fail?

Show answer

Exploit means keep doing the action you know pays off; explore means try something new that might be better or worse. Always exploiting means you never discover anything better; always exploring means you never settle on what works. A good agent balances the two. (The restaurant problem: the reliable favorite versus the promising new place.)
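To make the two extremes concrete, here is a small sketch of the restaurant problem using epsilon-greedy, one common balancing rule that this lesson does not name: with probability `epsilon` pick at random (explore), otherwise pick the best-looking option so far (exploit). The satisfaction scores are invented, and each restaurant is assumed to give the same score every visit so the outcome is easy to follow.

```python
import random


def choose(estimates, epsilon, rng):
    """Epsilon-greedy choice: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))  # explore: any restaurant
    # exploit: the restaurant with the highest estimate (first one on ties)
    return max(range(len(estimates)), key=lambda i: estimates[i])


def simulate(scores, epsilon, visits=2000, seed=0):
    """Return (average score per visit, visit counts per restaurant)."""
    rng = random.Random(seed)
    estimates = [0.0] * len(scores)  # what the diner currently believes
    counts = [0] * len(scores)
    total = 0.0
    for _ in range(visits):
        i = choose(estimates, epsilon, rng)
        reward = scores[i]           # assumed: no noise, same score each time
        counts[i] += 1
        estimates[i] += (reward - estimates[i]) / counts[i]  # running average
        total += reward
    return total / visits, counts


scores = [0.6, 0.5, 0.9]  # hypothetical: the third restaurant is the best

for epsilon in (0.0, 0.1, 1.0):
    average, counts = simulate(scores, epsilon)
    print(epsilon, round(average, 3), counts)
```

With `epsilon=0.0` the diner tries the first restaurant, likes it well enough, and never leaves, so the best one is never found. With `epsilon=1.0` every restaurant gets visited but the choice never settles. A small value in between typically finds the best restaurant and then spends most visits there.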

6. The word “agent” means two different things across these tracks. What are they?

Show answer

In reinforcement learning, an agent is a decision-maker that learns from environment rewards (this lesson). In the world of AI assistants and tool use, an “agent” usually means a language model wired up to take actions with tools. Same word, genuinely different idea.

Try it yourself: model a scenario, then sort the setups

No math here. About 15 minutes of reasoning and writing.

Side effects: none. This is a thinking-and-writing exercise. No tools, no API calls, no costs.

Part A: map a scenario onto the RL loop.

A robot vacuum is learning to clean a room as fast as possible without bumping furniture. Identify each piece of the RL loop for this scenario, then propose a reward signal.

What is the agent?
What is the environment?
What might the state include?
What are some actions?
Propose a reward that would encourage fast, gentle cleaning.

Show a model answer

Agent: the vacuum’s control system (the decision-maker).
Environment: the room, including the floor, walls, and furniture.
State: its position, what its sensors see nearby, and perhaps how much of the floor is still dirty.
Actions: move forward, turn left, turn right, maybe a suction toggle.
Reward: a small positive reward for each patch of floor newly cleaned, a small penalty per unit of time (to encourage speed), and a larger penalty for bumping furniture. Maximizing total reward over time then means cleaning thoroughly, quickly, and gently.

Any reasonable scheme counts as correct if it rewards the goal, penalizes time or collisions, and leaves the agent to discover the route. Notice that you never told it the path; the reward shapes it.
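Optional, for readers who like to see it in code: the reward in the model answer could be written as a small function like the sketch below. The function name, its parameters, and all three constants are hypothetical; only their relative sizes matter (a small bonus for cleaning, a smaller penalty for time, a much larger penalty for a bump).

```python
# Assumed, illustrative values; tune them and the vacuum's behavior changes.
CLEAN_BONUS = 1.0    # per newly cleaned patch of floor
TIME_PENALTY = 0.05  # per second, to encourage speed
BUMP_PENALTY = 5.0   # per collision with furniture


def step_reward(newly_cleaned_patches, bumped, seconds_elapsed=1.0):
    """Reward for one time step of the hypothetical robot vacuum."""
    reward = CLEAN_BONUS * newly_cleaned_patches
    reward -= TIME_PENALTY * seconds_elapsed
    if bumped:
        reward -= BUMP_PENALTY
    return reward


print(step_reward(1, bumped=False))  # cleaned a patch: 0.95
print(step_reward(0, bumped=False))  # wandered over clean floor: -0.05
print(step_reward(0, bumped=True))   # hit a chair: -5.05
```

The function scores outcomes only. Nothing in it says where to drive, which is the point of the exercise.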

Part B: supervised or reinforcement?

For each setup, decide whether it is supervised learning or reinforcement learning, and give a one-phrase reason.

1. A model is trained on 50,000 emails, each labeled spam or not spam.
2. A program learns to balance a simulated pole on a cart by trying moves and scoring how long the pole stays up.
3. A network learns to caption photos from a dataset of photos with human-written captions.
4. A game-playing system improves by playing millions of matches and tracking which lines of play led to wins.

Show answer

1. Supervised. Every example carries the correct label.
2. Reinforcement. No labels, just a reward (time upright) earned by trying actions.
3. Supervised. Each photo has a target caption to match.
4. Reinforcement. It learns from outcomes (wins) across many trials, with no per-move answer key.

The tell: if every training example has a correct answer attached, it is supervised. If the system learns from rewards earned by acting, it is reinforcement learning.

Flashcards

Ten cards. Click any card to reveal the answer. Use the Print flashcards button to lay out the full set as one card per page, ready to print or save as a PDF for offline review.

Q. What is reinforcement learning, in one sentence?

Learning with no answer key: an agent acts in an environment, receives rewards or penalties, and learns which behaviors pay off from the consequences.

Q. What are the steps of the RL loop?

The agent observes a state, takes an action, and the environment returns a reward and a new state; then the loop repeats.

Q. What is an agent, and what is an environment, in RL?

The agent is the decision-maker that learns; the environment is everything it acts in (the game, the road, the maze) and the source of states and rewards.

Q. What is a policy?

A strategy that, for any state the agent finds itself in, tells it which action to take. Learning in RL means improving the policy through experience.

Q. What is the agent actually trying to maximize?

Total reward over time, not correctness on any single step.

Q. How does a reward signal differ from a supervised label?

A label says what the correct answer was; a reward says only how good an outcome was. RL learns from evaluation, not instruction.

Q. What is the credit-assignment problem?

When a reward is delayed (a win after many moves), figuring out which earlier actions deserved the credit. Spreading a final reward back across the chain of actions is a central RL difficulty.

Q. What is the explore-versus-exploit tradeoff?

Exploit the action you know pays off, or explore a new one that might be better or worse. All-exploit never improves; all-explore never settles. Good agents balance the two.

Q. Where does RL shine, and where does it strain?

It shines in games and clean simulations (AlphaGo, trained in part with RL, beat Go world champion Lee Sedol in 2016). It strains in the real world: it is often sample-inefficient (needing millions of trials) and brittle, so deployment outside simulators is hard.

Q. Why does “agent” mean two different things across tracks?

In RL, an agent is a reward-learning decision-maker. In AI-assistant and tool-use contexts, an agent is usually a language model wired to take actions with tools. Same word, different idea.