Practice: Introduction to deep reinforcement learning

Self-check

Six short questions. Answer each one in your head (or on paper) before opening the collapsible. Trying to retrieve the answer is where the learning sticks; rereading feels productive but does much less.

1. Distinguish the three machine-learning regimes by what their training data looks like.

Show answer

Supervised learning is given fixed (input, correct-output) pairs and learns to predict the output. Unsupervised learning is given fixed unlabeled data and learns its structure (clusters, embeddings). Reinforcement learning is given an environment the agent can act in and a reward signal; the agent’s own actions generate the data as it learns, and the reward (not a label) tells it how good those actions were.

2. Sketch the agent-environment loop and name the five core terms it uses.

Show answer

                     action a_t
                  --------------->
agent (policy π)                    environment
                  <---------------
             state s_(t+1), reward r_t

The five terms: state (what the agent sees), action (what it does), reward (the number the environment hands back), policy (the function π mapping state to action), and return (the accumulated reward over time the agent is trying to maximize).
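The loop and its five terms can be sketched in a few lines of Python. This is a minimal illustration, not a standard API: the `CoinFlipEnv` class, its `reset`/`step` methods, the 0.7 coin bias, and the always-guess-heads policy are all hypothetical names and values chosen for the example.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess the outcome of a biased coin.

    The state is the previous coin outcome (0 or 1); an episode lasts
    `horizon` steps.
    """

    def __init__(self, horizon=5, seed=0):
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.t = 0
        self.state = 0

    def reset(self):
        self.t = 0
        self.state = 0
        return self.state

    def step(self, action):
        # The environment flips the coin, then scores the agent's guess.
        coin = 1 if self.rng.random() < 0.7 else 0
        reward = 1.0 if action == coin else 0.0
        self.t += 1
        self.state = coin
        done = self.t >= self.horizon
        return self.state, reward, done


def policy(state):
    """A fixed deterministic policy pi(s): always guess heads (1)."""
    return 1


def run_episode(env, policy, gamma=0.9):
    """Run one episode and return the rewards and the discounted return."""
    state = env.reset()
    rewards = []
    done = False
    while not done:
        action = policy(state)                   # agent: state -> action
        state, reward, done = env.step(action)   # environment: next state, reward
        rewards.append(reward)
    ret = sum(gamma ** k * r for k, r in enumerate(rewards))
    return rewards, ret


rewards, ret = run_episode(CoinFlipEnv(), policy)
print(rewards, ret)
```

Nothing is learned here; the policy is fixed. The point is only to show where state, action, reward, policy, and return each appear in the loop.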

3. Why is the discount factor γ typically less than 1?

Show answer

Two reasons. (1) For an infinite-horizon problem (a task that never terminates), an undiscounted sum of rewards can be infinite; a discount with γ < 1 keeps the return finite as long as each individual reward is bounded. (2) It encodes the common-sense intuition that a reward right now is worth more than the same reward many steps later; a smaller γ makes the agent more short-sighted, a larger γ more far-sighted. At γ = 0 only the immediate reward counts; as γ → 1 the far future matters as much as the present.
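The finiteness claim in (1) follows from the geometric series. As a short worked bound, assuming every reward satisfies |r_t| ≤ r_max (the symbol r_max is introduced here for illustration):

```latex
|G_t| = \left|\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k}\right|
\;\le\; \sum_{k=0}^{\infty} \gamma^{k}\, r_{\max}
\;=\; \frac{r_{\max}}{1-\gamma},
\qquad 0 \le \gamma < 1 .
```

For example, with r_max = 1 and γ = 0.9 the return can never exceed 10, however long the task runs. At γ = 1 the denominator is zero and the bound disappears.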

4. What does “deep” add to reinforcement learning, and what does it cost?

Show answer

It replaces the classical lookup-table value/policy with a neural network function approximator. Gain: the agent can handle high-dimensional states like Atari pixels or board positions that are too large to tabulate. Cost: the classical convergence guarantees of tabular RL no longer hold, so deep RL relies on engineering stabilizers (replay buffers, target networks, trust regions, etc.) that exist precisely because the theory has more holes than the supervised case.
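To get a feel for "too large to tabulate", the sketch below counts how many distinct images a lookup table would need one row for. The 84×84 frame with 256 intensity levels is an illustrative assumption, not a figure from this lesson, and `table_entries_digits` is a hypothetical helper.

```python
import math


def table_entries_digits(num_pixels, levels):
    """Decimal digits in levels**num_pixels, the number of distinct images.

    Uses digits(N) = floor(log10(N)) + 1 so the huge integer is never built.
    """
    return math.floor(num_pixels * math.log10(levels)) + 1


# A 2x2 binary image has only 2**4 = 16 possible states: easy to tabulate.
small = 2 ** (2 * 2)

# An assumed 84x84 grayscale frame with 256 levels per pixel.
digits = table_entries_digits(84 * 84, 256)

print(small, digits)
```

The state count for the larger frame has more than ten thousand decimal digits, so a table with one entry per state is out of the question; a neural network instead shares parameters across states that look alike.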

5. Name three things that make deep RL genuinely hard.

Show answer

Any three of: credit assignment (rewards arrive long after the action that caused them), distribution shift during training (the policy changes, so the data it generates changes too), function approximation breaking classical guarantees (no tabular convergence proof), exploration vs exploitation (use what works vs try what might be better), sample efficiency (acting to generate data is expensive, especially in robotics).

6. Why is RL not just “supervised learning with a reward in place of a label”?

Show answer

Three independent reasons. (a) The data is generated by the agent’s own policy and changes as the policy changes; a supervised model trains on a fixed dataset. (b) The reward is often delayed, the consequence of many earlier actions; a label is immediate and per-example. (c) The agent must choose actions, not just predict, so it has to balance exploring new behavior against exploiting what already works. Each of these breaks an assumption supervised learning quietly depends on.

Try it yourself, part 1: compute discounted returns

Pen and paper, about 5 minutes. An agent collects rewards r_0 = 1, r_1 = 0, r_2 = 0, r_3 = 2 over four steps. Compute the discounted return from t = 0 for each of three discount factors and explain what the contrast tells you.

(a) γ = 0 (myopic). (b) γ = 0.5. (c) γ = 1.0 (no discount).

Show answer

Apply G_0 = r_0 + γ·r_1 + γ²·r_2 + γ³·r_3:

(a) γ = 0: G_0 = 1 + 0·0 + 0·0 + 0·2 = 1. Only the immediate reward counts; the late reward of 2 is invisible.

(b) γ = 0.5: G_0 = 1 + 0.5·0 + 0.25·0 + 0.125·2 = 1 + 0 + 0 + 0.25 = 1.25. The late reward of 2 contributes its discounted value 0.125 · 2 = 0.25, bringing the return slightly above the immediate-only case.

(c) γ = 1.0: G_0 = 1 + 0 + 0 + 2 = 3. All rewards count fully. This sum is well-defined because the episode is finite (four steps); on an infinite horizon γ = 1 could give an infinite return.

The contrast: lower γ makes the agent short-sighted (only the immediate reward of 1 matters at γ = 0); higher γ gives future rewards more weight. The reward of 2 at step 3 contributes 0, 0.25, and 2 to the three returns, an 8× swing between the last two (0.25 vs 2) driven purely by how much the agent discounts the future. Picking γ is part of the problem setup.
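The three hand computations can be checked with a few lines of Python. This is a minimal sketch; the function name `discounted_return` is chosen for illustration.

```python
def discounted_return(rewards, gamma):
    """G_0 = sum over k of gamma**k * rewards[k]."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


rewards = [1, 0, 0, 2]
for gamma in (0.0, 0.5, 1.0):
    print(gamma, discounted_return(rewards, gamma))
# 0.0 1.0
# 0.5 1.25
# 1.0 3.0
```

Note that Python evaluates `0.0 ** 0` as 1.0, so the immediate reward is counted in full even at γ = 0.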

Try it yourself, part 2: which regime?

About 4 minutes. For each scenario, identify the most natural machine-learning regime (supervised, unsupervised, or reinforcement learning) and give a one-line reason.

1. Train a model to predict whether an email is spam, given a labeled dataset of past emails.
2. Train a robot to walk on uneven terrain, where each step that stays upright gets +1 and a fall gets -10.
3. Embed product descriptions into a vector space so that similar products end up near each other.
4. Fine-tune a language model so that its responses are rated more highly by human reviewers.
5. Predict the next character in a corpus of text, given the previous characters.

Show answer

1. Supervised. Labeled (email, spam-or-not) pairs are the signature of supervised learning.
2. Reinforcement learning. An agent (the robot) acts in an environment (the uneven terrain), receives a reward after each step (+1 for staying upright, -10 for a fall), and must choose its own actions. Classic deep-RL setup.
3. Unsupervised. No labels, just data; the model learns a structure (the embedding) in which similar items are nearby.
4. Reinforcement learning (specifically RLHF). The reward signal is “human-preferred over alternatives,” and the model (policy) generates responses (actions) in response to prompts (states). Lesson 13 of this track works through this pipeline in detail.
5. Supervised (technically self-supervised, which is supervised learning with labels derived automatically from the data; the “label” for each character is the next character in the corpus). Pre-training a language model is supervised, not RL. The post-training of that model with RLHF is then RL, which is the precise sense in which “ChatGPT” combines both regimes.
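The phrase "labels derived automatically from the data" in the next-character answer can be made concrete. A minimal sketch, assuming a fixed context window of three characters (the window length and the helper name `next_char_pairs` are illustrative choices):

```python
def next_char_pairs(text, context_len=3):
    """Return (context, next_char) training pairs cut from `text`."""
    return [
        (text[i:i + context_len], text[i + context_len])
        for i in range(len(text) - context_len)
    ]


pairs = next_char_pairs("hello")
print(pairs)  # [('hel', 'l'), ('ell', 'o')]
```

Every pair is an ordinary (input, correct-output) example, and no human labeled anything: the corpus supplies its own targets.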

Flashcards

Nine cards.

Q. What are the three major machine-learning regimes?

Supervised (labeled (input, output) pairs, fixed data, predict the output). Unsupervised (unlabeled data, find structure). Reinforcement learning (agent acts in an environment, receives rewards, generates its own data, must choose actions over time).

Q. Draw the agent-environment loop.

At each step t: the agent observes state s_t, picks action a_t according to its policy π, and sends a_t to the environment. The environment returns reward r_t and next state s_(t+1). Repeat. An episode either terminates or, on an infinite horizon, runs forever.

Q. What is the policy π, and where does the neural network sit in deep RL?

The policy π is the function (deterministic a = π(s) or stochastic π(a|s)) that picks the action given the state. In deep RL the policy (and/or the value function) is a neural network: the network’s output is the agent’s choice or its score for each choice.
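The two policy forms can be sketched on top of a vector of per-action scores. The scores below are hypothetical stand-ins for a network's output, and turning scores into probabilities with a softmax is one common choice assumed here, not something this lesson prescribes.

```python
import math
import random


def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_action(scores):
    """Deterministic policy a = pi(s): pick the highest-scoring action."""
    return max(range(len(scores)), key=lambda a: scores[a])


def sample_action(scores, rng):
    """Stochastic policy pi(a|s): sample an action from the softmax probabilities."""
    probs = softmax(scores)
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]


scores = [1.0, 3.0, 0.5]   # hypothetical network outputs for three actions
rng = random.Random(0)

print(greedy_action(scores))       # index of the largest score
print(softmax(scores))             # the distribution pi(a|s)
print(sample_action(scores, rng))  # one action drawn from that distribution
```

The stochastic form occasionally picks lower-scoring actions, which gives the agent a built-in way to try behavior it would otherwise never see.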

Q. What is the discounted return formula?

G_t = r_t + γ·r_(t+1) + γ²·r_(t+2) + γ³·r_(t+3) + ..., with discount γ between 0 and 1. Lower γ: short-sighted; higher γ: far-sighted. With γ < 1 and bounded rewards, the sum stays finite even on infinite horizons.
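Factoring γ out of every term after the first gives the recursion G_t = r_t + γ·G_(t+1), which lets you compute the return from every step in a single backward pass over a finite episode. A minimal sketch, with the function name `returns_to_go` chosen for illustration:

```python
def returns_to_go(rewards, gamma):
    """Compute G_t for every t using G_t = r_t + gamma * G_(t+1)."""
    returns = [0.0] * len(rewards)
    running = 0.0                      # the return after the final step is 0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Reward sequence (0, 0, 1) with gamma = 0.9: the single late reward is
# discounted once per step as it is propagated back toward t = 0.
print(returns_to_go([0, 0, 1], 0.9))
```

The first entry of the result is G_0, and each later entry is the return the agent would see if it started counting from that step.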

Q. Compute G_0 for r = (0, 0, 1) with γ = 0.9.

G_0 = 0 + 0.9·0 + 0.9²·1 = 0 + 0 + 0.81 = 0.81. The reward of 1 at step 2 is still counted, at 81% of face value once discounted back to step 0.

Q. Why “deep” RL?

Classical RL tabulates one value per state, which fails on high-dimensional states (Atari frames, board positions, language). The fix is to replace the table with a neural-network function approximator. Gain: scale. Cost: classical convergence guarantees break, requiring engineering stabilizers.

Q. What is the credit-assignment problem?

The reward typically arrives long after the action that caused it. A chess engine wins at move 60; the decisive move was move 23. The agent has to figure out which past decision deserves credit, across a long chain of intermediate actions.

Q. What is distribution shift in RL, and why does it matter?

The data the agent learns from is generated by its own policy, which changes during training. So the dataset is a moving target, unlike supervised learning where the dataset is fixed. This is a central source of instability in deep RL, and many algorithms exist to manage it.

Q. Why isn’t reward the same as a label?

A label tells you the right answer for a specific input. A reward tells you how good the action you took was, often only after many further steps, and gives no information about what you should have done instead. Designing a reward that produces the behavior you want is its own craft (reward shaping, RLHF preference modeling).