Introduction to deep reinforcement learning

You did not learn to balance on a bicycle by reading a manual. You wobbled, fell, tried again with a small adjustment, fell less, and after enough attempts your body found a setting that stayed upright. Nobody handed you a labeled dataset of (situation → correct steering input). You generated the data yourself by acting, and the only feedback you got was the blunt signal of staying up versus hitting the ground. That is the shape of reinforcement learning. The agent acts, the world responds, a reward arrives, and somewhere in that loop the agent is supposed to improve.

This track is about doing that with neural networks, which is the field called deep reinforcement learning (deep RL). It is the regime behind Atari-playing agents, AlphaGo beating Lee Sedol, robots learning manipulation, and the post-training step (RLHF) that turns a raw language model into an assistant like ChatGPT. This first lesson situates the field. By the end you will be able to place deep RL alongside the supervised and unsupervised regimes you already know, name the agent-environment loop and its core vocabulary, and answer the question that will shadow the rest of the track: what exactly is hard about it?

Three regimes of machine learning

Supervised learning, unsupervised learning, and reinforcement learning are the three big ways a model can be trained. The cleanest way to distinguish them is to ask what the data looks like and what the model is asked to produce.

Supervised learning is given pairs of (input, correct output) and learns a function that maps one to the other. Classify this image as cat or dog; predict tomorrow’s temperature; translate this sentence. The label is the ground truth, and training proceeds by minimizing the gap between the prediction and the label.
Unsupervised learning is given unlabeled data and learns its structure. Cluster these documents; compress these images; train an embedding that places similar items near each other. There is no “right answer” per item; the model finds patterns that turn out to be useful.
Reinforcement learning is given neither labels nor a static dataset. It is given an environment the agent can act in, and a reward signal that arrives in response to actions. The agent’s job is to choose actions over time such that the accumulated reward is large.

The differences look superficial when listed but cut deep in practice. Supervised learning’s data is fixed before training; RL’s data is generated by the agent’s own policy, which changes during training, so the dataset is a moving target. Supervised learning’s label is immediate; RL’s reward is often delayed, the consequence of an action many steps earlier, so the agent has to figure out which past decision caused the eventual payoff (the credit-assignment problem). And where supervised learning is content with predicting accurately, RL has to choose what to do, which means it must balance trying things it already knows are good against trying new things that might be better. That tension has a name we will return to: exploration versus exploitation.

If you internalize one thing about the three regimes, internalize this: RL is not “supervised learning with a fancier label.” It is a different shape of problem.

The agent-environment loop

Here is the picture every RL paper draws on the first page.

       action a_t
       --------->
 agent             environment
       <---------
       state s_(t+1), reward r_t

At each timestep, the agent observes a state (a description of the world it sees), picks an action from the set of things it can do, and sends it to the environment. The environment responds with a reward (a number telling the agent how good or bad that step was) and a next state. The agent uses the new state to pick the next action, and so on. The loop runs either forever (an infinite-horizon problem, like keeping a pole balanced indefinitely) or until some terminal condition fires (an episodic problem, like a game of Go that ends in a win, loss, or draw).
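
The loop can be sketched in a few lines of Python. This is a minimal illustration, not an API used elsewhere in this track: the five-cell corridor environment and the names `CorridorEnv`, `reset`, `step`, `random_policy`, and `run_episode` are all invented here. The agent starts in cell 0 and receives reward 1 on reaching the last cell, which ends the episode.

```python
import random

class CorridorEnv:
    """Hypothetical toy environment: cells 0..4, episode ends at cell 4."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); the walls clamp the position.
        self.state = min(max(self.state + action, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done

def random_policy(state, rng):
    # Ignores the state entirely; a placeholder for a learned policy.
    return rng.choice([-1, 1])

def run_episode(env, policy, rng, max_steps=1000):
    """Run one episode and return the list of rewards r_0, r_1, ..."""
    state = env.reset()
    rewards = []
    for _ in range(max_steps):
        action = policy(state, rng)             # agent picks a_t from s_t
        state, reward, done = env.step(action)  # environment returns s_(t+1), r_t
        rewards.append(reward)
        if done:
            break
    return rewards

rewards = run_episode(CorridorEnv(), random_policy, random.Random(0))
print(len(rewards), sum(rewards))
```

Everything the agent ever learns from passes through the two lines inside the `for` loop: one call to the policy, one call to the environment.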

The agent’s behavior is summarized by a policy, a function (deterministic or stochastic) that says which action to take in each state. The standard notation is the Greek letter pi (π); for a stochastic policy, π(a | s) is the probability of taking action a in state s. In deep RL the policy is a neural network: feed in the state, get out an action (or a probability distribution over actions). The agent’s goal is to choose its policy so that the rewards over time, added up, are large.
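
To make the deterministic/stochastic distinction concrete, here is a small sketch. The function `action_scores` is a made-up stand-in for a neural network (its numbers mean nothing), and turning scores into probabilities with a softmax is one common choice assumed here, not something the lesson prescribes.

```python
import math
import random

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def action_scores(state):
    # Hypothetical stand-in for a neural network: one score per action.
    # Actions are indexed 0 (left) and 1 (right); the scores are made up.
    return [-0.5 * state, 0.5 * state + 1.0]

def stochastic_policy(state, rng):
    """Sample an action index from the distribution over actions for this state."""
    probs = softmax(action_scores(state))
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def deterministic_policy(state):
    """Pick the highest-scoring action every time."""
    scores = action_scores(state)
    return max(range(len(scores)), key=lambda a: scores[a])

probs = softmax(action_scores(2.0))
print(probs)
```

The stochastic version can return different actions for the same state on different calls; the deterministic version cannot.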

Returns, discount, and what the agent maximizes

The reward is the agent’s feedback for a single step. What the agent actually cares about is the total reward over the whole trajectory, called the return. There are two ways to add it up.

Undiscounted return is the straightforward sum: the reward at step zero plus the reward at step one plus the reward at step two, and so on. It works fine for short episodes that end. For ongoing tasks it can be infinite, which is awkward.

Discounted return weights nearer rewards more than farther ones, with a discount factor (the Greek letter gamma) between 0 and 1:

G_t  =  r_t + γ·r_(t+1) + γ²·r_(t+2) + γ³·r_(t+3) + ...

At gamma equal to 0 only the immediate reward counts; at gamma close to 1 future rewards count almost as much as present ones. With gamma strictly less than 1 and bounded rewards, the discount keeps the sum finite even on infinite horizons, and it reflects a common-sense idea: a reward right now beats the same reward a hundred steps from now.

A small numerical example to ground it. Suppose an agent gets the reward sequence (0, 0, 1) (a sparse reward at the end of three steps), with gamma equal to 0.9. The discounted return from time 0 is:

G_0 = 0 + 0.9·0 + 0.9²·1 = 0 + 0 + 0.81 = 0.81

The reward at step 2 still counts, just at 81% of its face value once you discount back to step 0. From step 1’s perspective the return is zero plus 0.9 times 1, which is 0.9, closer to the payoff. The discount shrinks the influence of far-off rewards but does not erase them, and it gives a clean mathematical handle on “accumulated reward over time.”
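
The same arithmetic as a short Python sketch. The function names are illustrative, not from any library. Working backward through the rewards uses the recursion G_t = r_t + γ·G_(t+1), which follows directly from the definition of the discounted return.

```python
def discounted_return(rewards, gamma):
    """G_0 = r_0 + gamma*r_1 + gamma^2*r_2 + ... for a finite reward list."""
    g = 0.0
    # Work backward so each step applies G_t = r_t + gamma * G_(t+1).
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def returns_per_step(rewards, gamma):
    """Return [G_0, G_1, ..., G_(T-1)] for a finite episode."""
    out = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]

print(discounted_return([0, 0, 1], 0.9))
print(returns_per_step([0, 0, 1], 0.9))
```

Up to floating-point rounding, the first call reproduces the 0.81 above, and the per-step list shows the return growing toward 1 as the payoff gets closer.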

The agent’s job, stated precisely, is to choose its policy so that the expected return is large. The expectation is over the randomness in the environment (transitions can be stochastic) and any randomness in the policy itself.
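
In symbols, one common way to write this objective is shown below. The name J and the trajectory symbol τ are notation introduced here for illustration; the return G_0 and the discount γ are as defined above.

```latex
J(\pi) \;=\; \mathbb{E}_{\tau \sim \pi}\!\left[ G_0 \right]
       \;=\; \mathbb{E}_{\tau \sim \pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} \, r_t \right]
```

Here τ ∼ π means a trajectory of states, actions, and rewards produced by running the policy π in the environment.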

Why “deep”?

Classical RL stores the value of each state in a table: one number per state. That works when the state space is small and discrete, a grid of a few hundred squares, a few tens of thousands of board positions. It fails the moment the state is high-dimensional. The raw pixels of an Atari screen are a state with 210 by 160 by 3 components in a discretized colour space; there are more possible Atari frames than atoms in the observable universe. You cannot tabulate a value for each.
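
A back-of-the-envelope check of that claim, as a sketch. Two numbers here are assumptions for illustration: 256 possible values per component, and the commonly quoted rough figure of 10^80 atoms in the observable universe. The count is an upper bound on distinct pixel arrays, not on frames a game can actually produce.

```python
import math

HEIGHT, WIDTH, CHANNELS = 210, 160, 3
# Assumption for illustration: each component takes one of 256 values.
LEVELS_PER_COMPONENT = 256

components = HEIGHT * WIDTH * CHANNELS
# The number of distinct arrays is LEVELS ** components; work in log10 to keep it readable.
log10_frames = components * math.log10(LEVELS_PER_COMPONENT)

# Commonly quoted rough estimate for atoms in the observable universe: 10**80.
LOG10_ATOMS = 80

print(components)                  # 100800
print(round(log10_frames))         # number of decimal digits, roughly
print(log10_frames > LOG10_ATOMS)  # True
```

Under these assumptions the count has hundreds of thousands of decimal digits, against 81 for the atom estimate, so the conclusion does not depend on the exact number of colour levels.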

The fix is to replace the table with a function approximator, and in deep RL that approximator is a neural network. The network takes the state (or the state-action pair) as input and outputs a value or a policy probability. Now the agent does not need to have seen this exact state before; the network generalizes from similar states it has encountered. This is the same use of neural networks you have seen across T11-T13, applied to a different ingredient. The substitution is mechanically simple. Its consequences are not.
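
A toy contrast between the two, as a sketch under loud simplifications: the three states and their values are made up, and a one-feature linear model fitted by gradient descent stands in for the neural network. The point is only that a table has nothing to say about a state it never stored, while a parametric function produces an estimate anyway.

```python
# Tabular: one stored number per state, nothing for states never visited.
value_table = {0: 0.0, 1: 0.5, 2: 1.0}

def table_value(state):
    return value_table.get(state)  # None for an unseen state

# Function approximation: a parametric function of the state.
# A one-feature linear model stands in for the neural network here.
def fit_linear(samples, lr=0.05, epochs=2000):
    """Fit v(s) = w*s + b to (state, target value) pairs by gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for s, target in samples:
            error = (w * s + b) - target
            w -= lr * error * s
            b -= lr * error
    return w, b

w, b = fit_linear(list(value_table.items()))

def approx_value(state):
    return w * state + b

print(table_value(3))    # None: the table has no entry for state 3
print(approx_value(3))   # an extrapolated estimate close to 1.5
```

Whether the extrapolated estimate is any good is a separate question, and that is where the consequences mentioned above begin.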

What makes deep RL hard

A short list, the one that explains why the field has its own track.

Delayed reward and credit assignment. When the reward arrives, it is the consequence of a long chain of past actions. The agent has to figure out which of those actions deserves credit. A chess engine wins a game on move 60; the decisive move may well have been move 23. Teasing apart which decision mattered is genuinely hard.

Distribution shift during training. The agent’s policy is changing as it learns. The data it collects (state-action visits) is generated by the current policy. So the dataset the agent learns from is a function of the very thing being trained. In supervised learning this never happens; the training set is fixed. In RL it is the central source of instability.
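
The effect is easy to see in a toy setting. In the sketch below, a hypothetical five-cell corridor is walked by two fixed policies that differ only in how often they move right; they stand in for the same agent early and later in training. The function name and the probabilities are invented for illustration.

```python
import random
from collections import Counter

def collect_states(p_right, steps=10_000, length=5, seed=0):
    """Count state visits in a hypothetical corridor under a policy that
    moves right with probability p_right and left otherwise."""
    rng = random.Random(seed)
    state = 0
    visits = Counter()
    for _ in range(steps):
        visits[state] += 1
        move = 1 if rng.random() < p_right else -1
        state = min(max(state + move, 0), length - 1)  # walls clamp the position
    return visits

early_policy_data = collect_states(p_right=0.2)  # stand-in for an untrained policy
later_policy_data = collect_states(p_right=0.8)  # stand-in for a policy after some learning

for name, visits in [("early", early_policy_data), ("later", later_policy_data)]:
    total = sum(visits.values())
    print(name, [round(visits[s] / total, 2) for s in range(5)])
```

The environment is identical in both runs, yet the two datasets concentrate on opposite ends of the corridor. Anything fitted to the first dataset is being asked about different states once the policy has moved on.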

Function approximation breaks classical guarantees. Tabular Q-learning has a clean convergence proof. Replace the table with a neural network and that proof falls over. In practice deep RL works, often spectacularly, but it works through engineering stabilizers (replay buffers, target networks, trust regions) that exist precisely because the theory has more holes than the supervised case.

Exploration versus exploitation. The agent has two competing pressures. Exploit what it knows already works, and the rewards are decent but bounded by what was discovered. Explore new actions, and most attempts will be worse but a few might unlock much higher rewards. Balancing the two, especially when reward is sparse and exploration is expensive, is a problem with no general solution.
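
One simple heuristic for this trade-off is epsilon-greedy: with a small probability epsilon take a random action, otherwise take the action that currently looks best. The lesson text above does not name this method; it is brought in here only as an illustration, on a hypothetical three-armed bandit with made-up payoff means and Gaussian noise.

```python
import random

def epsilon_greedy_bandit(true_means, epsilon, steps=5000, seed=0):
    """Play a hypothetical multi-armed bandit with epsilon-greedy action selection.

    Returns (estimated mean reward per arm, pull count per arm)."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    estimates = [0.0] * n_arms
    counts = [0] * n_arms
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                           # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        reward = rng.gauss(true_means[arm], 1.0)                  # noisy payoff
        counts[arm] += 1
        # Incremental running average of the rewards seen for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts

estimates, counts = epsilon_greedy_bandit([0.2, 0.5, 1.0], epsilon=0.1)
print(counts)
```

Setting `epsilon=0.0` gives a purely greedy agent that can lock onto whichever arm happened to look good first; raising epsilon buys information at the cost of deliberately taking worse-looking actions.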

Sample efficiency. A supervised model trained on ImageNet works from a fixed dataset that it can revisit epoch after epoch at no extra collection cost. A deep-RL agent has to act to generate data, and acting is expensive (especially in robotics, where a step costs real time and real wear). Many deep-RL methods need tens or hundreds of millions of environment interactions, which is fine in a simulator and awful in the real world.

Every algorithm in this track is, in some sense, a response to one or more of these difficulties.

Where you have already seen deep RL

The headline cases are worth naming so the abstract loop above acquires concrete examples in your head.

Atari. DeepMind’s DQN, introduced in a 2013 preprint and a 2015 Nature paper, played Atari 2600 games from raw pixels and reached human-level scores on many of them. State is the screen; action is a joystick direction or button press; reward is the score change.
AlphaGo and successors. Beat Lee Sedol in 2016; later AlphaZero learned chess, shogi, and Go from self-play alone. State is the board; action is a move; reward is +1 for a win, -1 for a loss.
Robotics. Policies trained in simulation, transferred (with care) to real robots, for manipulation, locomotion, and dexterous in-hand tasks.
Preference-based post-training in language models. ChatGPT, Claude, and Gemini are language models first pre-trained on text and then post-trained with RLHF or related preference-based methods. In the original RLHF recipe the reward signal is a learned model of pairwise human preferences; in Constitutional AI / RLAIF variants (used in Claude’s published alignment work) the preference labels are AI-generated. Direct preference optimization (DPO) methods, which skip the explicit reward model, now compete with the reward-model + PPO pipeline. The policy is the language model itself. Lesson 13 of this track is about the canonical pipeline.

Each of these is the same loop. Different state, different action, different reward, same agent-environment shape, and the same family of algorithms underneath.

Why this matters when you use AI

Even if you never train a deep-RL agent, the framing changes how you read what these systems can and cannot do. RL learns from interaction with an environment under a chosen reward, so the behavior an RL-trained system exhibits is shaped by what the reward actually incentivized, which is not always what its designers intended. A model whose reward favors “agreeable answers” will tend to be agreeable, even when the right answer is uncomfortable. An RLHF-trained chatbot is, in a precise sense, optimizing the proxy for human preference that its reward model captures, which is a useful approximation of truth and helpfulness but not the same thing as either. Knowing this lets you read claims about “what an RL-trained system has learned” with calibrated skepticism: it has learned to maximize a number someone chose, on a distribution it generated by acting, with all the credit-assignment and distribution-shift caveats that come with that.

Common pitfalls

Calling RL “supervised learning with rewards.” It is not. The data is generated by the agent’s policy, the reward is delayed, and the agent has to do (not just predict). Conflating the two regimes misses everything that makes RL hard.

Treating the reward as a label. A label tells you the right answer for an input; a reward tells you how good the action you took was, often only after many further steps, and offers no information about what you should have done. Designing a reward that produces the behavior you want is a separate craft (reward shaping, RLHF preference modeling).

Thinking the agent learns the environment. In standard (model-free) RL the environment is treated as fixed (and possibly stochastic), and the agent learns a policy without building an explicit model of how the environment works. Learning a model of the environment is its own subfield (model-based RL, lessons 9-10 of this track), useful but optional.

Equating “deep” with “RL solved.” Function approximation lets RL scale to high-dimensional states, which is the only reason deep RL can play Atari or train a robot. It also breaks classical convergence guarantees and introduces a new class of failure modes. The engineering tricks of the next several lessons exist precisely to manage this trade.

What you should remember

RL is the third major ML regime. Supervised learning has labels; unsupervised learning has unlabeled data; RL has an agent acting in an environment and collecting rewards over time. The data is generated by the agent’s own policy, the reward is often delayed, and the agent has to choose actions, not just predict.
The agent-environment loop is the central object. At each step the agent sees the current state, chooses an action per its policy (denoted by pi), and receives a reward and the next state. The agent maximizes the expected return, often discounted (the return at a step is the reward now plus gamma times the next reward plus gamma-squared times the reward after that, and so on). Worked: with rewards (0, 0, 1) and gamma equal to 0.9, the return from time 0 is 0.81.
“Deep” means function approximation. Replace classical RL’s lookup table with a neural network, gaining the ability to handle high-dimensional states (pixels, board positions, language tokens) at the cost of classical convergence guarantees.
The track’s whole agenda is what makes deep RL hard: delayed rewards (credit assignment), distribution shift as the policy changes, function approximation breaking textbook proofs, exploration-vs-exploitation, and sample efficiency. Every later lesson is, in some sense, a response to one or more of these.

The next lesson takes the simplest possible approach: ignore the reward entirely and just imitate an expert. Imitation learning turns the RL problem into supervised learning by handing the agent a dataset of (state, expert action) pairs. It will work, partly, and the way it fails reveals exactly why genuine RL is needed.