Practice: What reinforcement learning actually is

The goal of this orientation practice is to make the framework concrete enough to use: recognizing when a problem is an RL problem, holding the agent-environment-reward loop in your head, and feeling the exploration-versus-exploitation tension on a tiny worked scenario. No formulas yet; those start in the next lesson.

Self-check

Six short questions. Answer each in your head before opening the collapsible.

1. In one line each, distinguish supervised, unsupervised, and reinforcement learning.

Show answer

Supervised: labeled examples; learn the input-to-label mapping (right answer given for each input). Unsupervised: no labels; find structure in the data (cluster, compress, factor). Reinforcement: no labels, only rewards from acting in an environment; learn a policy that maximizes total reward over time.

2. Name the three things that pass between the agent and the environment at each step, and the name for the agent’s plan for choosing actions.

Show answer

The environment provides a state (information about the situation), accepts an action, and returns a reward (a number) and the next state. The agent’s plan for choosing actions, given a state, is called its policy.
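To make the loop concrete, here is a minimal sketch on a made-up environment. The `Corridor` class, the `always_right` policy, and the `done` flag that ends an episode are all assumptions for illustration; they are not part of this lesson's material or of any library.

```python
class Corridor:
    """Hypothetical toy environment: cells 0..4, goal at cell 4."""
    GOAL = 4

    def __init__(self):
        self.state = 0  # the agent starts in the leftmost cell

    def step(self, action):
        """Accept an action; return (next_state, reward, done)."""
        move = 1 if action == "right" else -1
        self.state = min(max(self.state + move, 0), self.GOAL)
        done = self.state == self.GOAL
        reward = 1.0 if done else 0.0  # reward only on reaching the goal
        return self.state, reward, done


def always_right(state):
    """A policy maps a state to an action; this one ignores the state."""
    return "right"


def run_episode(env, policy, max_steps=20):
    """Run the loop: observe state, act, receive reward and next state."""
    state, total_reward, steps = env.state, 0.0, 0
    for _ in range(max_steps):
        action = policy(state)
        state, reward, done = env.step(action)
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps
```

Calling `run_episode(Corridor(), always_right)` returns `(1.0, 4)`: four steps to the right, with the single reward arriving on the last one.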

3. What three things make RL genuinely harder than supervised learning?

Show answer

(1) No oracle action. The environment returns a reward for the action taken, not the action you should have taken instead. (2) Delayed reward / credit assignment. Important rewards arrive many steps after the actions that caused them. (3) The data distribution depends on the policy. As the agent’s policy changes, the states it visits change, so the training distribution shifts during learning.
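The third point can be seen in a few lines. The sketch below counts which cells two fixed policies visit in a hypothetical five-cell corridor; the corridor, the start cell, and the function names are assumptions made up for this illustration.

```python
from collections import Counter


def visited_states(policy, start=2, steps=6, n_cells=5):
    """Count the cells a policy visits in a hypothetical 5-cell corridor."""
    state, counts = start, Counter()
    for _ in range(steps):
        counts[state] += 1
        move = 1 if policy(state) == "right" else -1
        state = min(max(state + move, 0), n_cells - 1)  # walls at both ends
    return counts


def go_left(state):
    return "left"


def go_right(state):
    return "right"


left_data = visited_states(go_left)    # cells 2, 1, then 0 repeatedly
right_data = visited_states(go_right)  # cells 2, 3, then 4 repeatedly
```

The two policies share only the start cell. If learning moves the agent from one policy toward the other, the states in its training data move with it.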

4. Define exploration and exploitation in your own words, and say why neither alone is enough.

Show answer

Exploit: pick the action your current estimates say is best (use what you have learned). Explore: pick an action whose value you do not know well enough yet (gather information). Pure exploitation locks in on a possibly-suboptimal action because you stop learning; pure exploration never acts on what it has learned, so it earns only an average payoff. Practical RL methods are a principled mix of the two.

5. Why is the reward called “a signal you design” rather than a property of the world?

Show answer

Because there is no objective “right reward” sitting in the environment; the engineer picks a function that expresses what they actually want. A robot rewarded only for “moving fast” will game that signal (move fast and crash). Reward design (and reward shaping) is a real engineering problem, distinct from the algorithms that optimize a given reward.
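A small sketch of the "moving fast" example, assuming two hypothetical behaviours with made-up speeds and a made-up crash penalty. The function names and numbers are illustrative only.

```python
def reward_speed_only(speed, crashed):
    """Rewards speed and ignores crashes entirely."""
    return speed


def reward_speed_with_crash_penalty(speed, crashed, crash_penalty=10.0):
    """Rewards speed but subtracts an assumed penalty for crashing."""
    return speed - (crash_penalty if crashed else 0.0)


# Hypothetical outcomes for two behaviours: (speed, crashed)
behaviours = {"careful": (1.0, False), "reckless": (3.0, True)}


def best_behaviour(reward_fn):
    """Return the behaviour a reward-maximizing agent would prefer."""
    return max(behaviours, key=lambda name: reward_fn(*behaviours[name]))
```

Under `reward_speed_only` the preferred behaviour is `"reckless"`; under `reward_speed_with_crash_penalty` it is `"careful"`. The optimizer is the same in both cases; only the designed signal changed.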

6. Name two real systems built on reinforcement learning.

Show answer

Any two of: AlphaGo / AlphaZero (board games), DQN on Atari (video games), robotics control (walking, manipulation), recommendation / personalization systems, scheduling and resource allocation, RLHF for large language models (the alignment step covered in Track 5’s rlhf-and-dpo lesson).

Try it yourself: which paradigm?

For each scenario, name the paradigm (supervised, unsupervised, or reinforcement) and say why in one line.

A. A medical-imaging model trained on 100,000 chest X-rays, each labeled
   "pneumonia present" or "absent."
B. A robot that learns to walk by trying motions and receiving a reward
   proportional to distance traveled without falling.
C. A retailer's system that groups its million customers into segments
   based on purchase patterns, without any pre-defined segment labels.
D. A chatbot fine-tuned by collecting human preference ratings on its
   responses and updating it to maximize predicted human preference.
E. A spam filter trained on 10,000 emails each pre-labeled "spam" or
   "not spam."

Show answer

A: supervised. Labeled examples (image to diagnosis); the model learns the input-to-label mapping.
B: reinforcement. No labeled “correct motion”; only a reward signal from the environment for what the agent does.
C: unsupervised. No labels; the system finds structure (clusters) on its own.
D: reinforcement (RLHF specifically). The chatbot acts (produces a response), receives a reward (predicted human preference), and updates its policy to maximize it. The alignment side is Track 5; the RL mechanics underneath are this track.
E: supervised. Pre-labeled examples, learn the mapping. (Even though “filtering” sounds active, training is plain classification.)

The tell: labels per example -> supervised; no labels, no rewards, find structure -> unsupervised; act in an environment, get reward back -> reinforcement.
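The tell can be written as a rough decision rule. This is only a sketch of the heuristic above, with hypothetical argument names; real problems can mix paradigms and will not always reduce to two yes/no questions.

```python
def paradigm(has_labels, has_reward_from_acting):
    """Rough heuristic: classify a learning problem by its feedback."""
    if has_labels:
        return "supervised"       # a right answer is given per example
    if has_reward_from_acting:
        return "reinforcement"    # feedback is a reward for actions taken
    return "unsupervised"         # no labels, no rewards: find structure


# Scenarios A-E from the exercise: (has_labels, has_reward_from_acting)
scenarios = {
    "A": (True, False),
    "B": (False, True),
    "C": (False, False),
    "D": (False, True),
    "E": (True, False),
}
answers = {name: paradigm(*flags) for name, flags in scenarios.items()}
```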

Try it yourself: a three-arm bandit (feel the tension)

You face three slot machines (arms). You have pulled each one exactly once. The results were:

Arm 1: +0.6     Arm 2: +0.4     Arm 3: +0.7

You do not know the true average payoff of any arm. You have one hundred more pulls. Reason through each strategy below, then check.

1. Pure EXPLOITATION: always pull the arm with the best result so far.
   What does this strategy do for the next 100 pulls, and what could go wrong?

2. Pure EXPLORATION: pick uniformly at random for the next 100 pulls.
   What is your expected average return (in terms of the unknown true
   averages of the three arms), and what is wrong with this strategy?

3. A MIX: mostly exploit, occasionally explore. Why is this likely better
   than either extreme, on this problem?

Show answer

1: Pure exploitation. You pull Arm 3 for all 100 remaining pulls because its single observed payoff (0.7) was highest. The problem: one pull is one sample. Arm 1’s true average might really be 0.9 and Arm 3’s might really be 0.5, and you would never find out, because you never pull Arm 1 again to update your belief about it. Pure exploitation locks in on noisy evidence.
2: Pure exploration. Uniformly random over three arms gives each arm 1/3 of the pulls in expectation, so your expected average return is (true mean of Arm 1 + true mean of Arm 2 + true mean of Arm 3) / 3, the plain mean of the true means. The data this collects is great for learning which arm is best, but you never act on what you learned, so a clearly-best arm earns no extra pulls. Exploration alone wastes the information.
3: A mix. Mostly exploiting the best-so-far while occasionally trying others (a common pattern is epsilon-greedy: pull the best-looking arm with probability 1 minus epsilon, a random arm with probability epsilon) earns close to the best arm’s true mean once your estimates settle, while still updating your beliefs about the others. Concretely on this problem: even a small amount of exploration would let you discover whether Arm 1’s true mean is actually higher than Arm 3’s, while still spending most pulls on whichever arm currently looks best.
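The three answers can be checked with a small simulation. This is a sketch under stated assumptions, not a definitive experiment: the true means reuse the "might really be" values from answer 1 (0.9 for Arm 1, 0.5 for Arm 3) plus an assumed 0.4 for Arm 2, payoffs are assumed Gaussian with standard deviation 0.2, and epsilon is fixed at 0.1. "Exploit" here means committing to the arm that looked best after the first three pulls, as in answer 1.

```python
import random

TRUE_MEANS = [0.9, 0.4, 0.5]   # hypothetical; the agent never sees these
FIRST_PULLS = [0.6, 0.4, 0.7]  # the one observed payoff per arm
NOISE_SD = 0.2                 # assumed payoff noise


def pull(arm, rng):
    """Draw one noisy payoff from the chosen arm."""
    return rng.gauss(TRUE_MEANS[arm], NOISE_SD)


def run(strategy, rng, pulls=100, epsilon=0.1):
    """Return the average payoff over the remaining pulls."""
    estimates = list(FIRST_PULLS)  # running average payoff per arm
    counts = [1, 1, 1]             # each arm has been pulled once
    committed = estimates.index(max(estimates))  # Arm 3 (index 2)
    total = 0.0
    for _ in range(pulls):
        if strategy == "exploit":
            arm = committed                 # never reconsider
        elif strategy == "explore":
            arm = rng.randrange(3)          # uniform at random
        elif rng.random() < epsilon:        # epsilon-greedy: explore
            arm = rng.randrange(3)
        else:                               # epsilon-greedy: exploit
            arm = estimates.index(max(estimates))
        reward = pull(arm, rng)
        total += reward
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return total / pulls


def compare(runs=1000, seed=0):
    """Average each strategy's per-pull payoff over many simulated runs."""
    rng = random.Random(seed)
    return {s: sum(run(s, rng) for _ in range(runs)) / runs
            for s in ("exploit", "explore", "epsilon_greedy")}
```

Under these assumptions, `compare()` should put pure exploitation near Arm 3's true mean (0.5), pure exploration near the mean of the true means (0.6), and epsilon-greedy clearly above both, because it finds Arm 1 and then spends most pulls there. Different true means or noise levels would change the numbers, not the pattern.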

The point: the same dilemma scales to every problem in the track. Phase 3 will give Q-learning a built-in exploration strategy for exactly this reason, and standard strategies such as epsilon-greedy, UCB, and Thompson sampling are all principled answers to “how should the mix change as I learn?”

Flashcards

Eight cards.

Q. Distinguish the three paradigms of learning.

Supervised: labeled examples, learn the mapping. Unsupervised: no labels, find structure. Reinforcement: no labels, only rewards from acting in an environment; learn a policy that maximizes total reward.

Q. What does the agent observe, do, and receive at each step? What is its plan called?

Observe a STATE, take an ACTION, receive a REWARD (a number) and the next state. Its plan for choosing actions is its POLICY.

Q. Three things that make RL harder than supervised learning?

(1) No oracle action (only reward feedback, not the right answer). (2) Delayed reward / credit assignment (important rewards arrive much later). (3) Data distribution depends on the policy (it shifts as the agent learns).

Q. What is the credit-assignment problem?

Figuring out which earlier actions deserve credit (or blame) for a later reward. In chess, a move’s value may only be clear ten moves later; assigning that outcome to the right earlier choice is a core RL problem.

Q. Define exploration and exploitation, and why a mix is needed.

Exploit = pick the action your estimates say is best. Explore = pick an action whose value is uncertain to learn more. Pure exploit locks in on possibly-suboptimal actions; pure explore never uses what was learned. Methods need a principled mix.

Q. Why is the reward “a signal you design”?

Because the engineer picks the reward function to express the desired goal; it is not an objective property of the world. Bad reward design (rewarding fast motion without penalizing crashes) produces agents that game the signal.

Q. Why does the data distribution shift in RL but not in supervised learning?

Because in RL the states the agent visits depend on its policy. As the policy changes during learning, the distribution of training data changes too. Supervised learning assumes a fixed distribution; RL must cope with the drift.

Q. Name a real system built with RL.

Any of: AlphaGo / AlphaZero (board games), DQN on Atari (video games), robotics control (walking / manipulation), recommendation systems, scheduling / resource allocation, RLHF for large language models (Track 5’s rlhf-and-dpo lesson covers the alignment side; this track covers the RL mechanics).