RL fundamentals: MDPs, value, and Bellman

The last two lessons set the stage: lesson 1 framed reinforcement learning as the third major ML regime with its own agent-environment loop, and lesson 2 walked through behavioral cloning and the way its long-horizon failure mode (the O(epsilon T squared) blow-up) makes the case for genuine RL. This lesson makes the loop precise. Without the formal language, every algorithm in the rest of the track would have to be asserted; with it, they can be derived.

The formal language is the Markov decision process (MDP), and the objects it gives you, value functions and the Bellman equation, are what policy gradients, Q-learning, actor-critic, trust-region methods, model-based RL, offline RL, and RLHF are all defined against. By the end of this lesson you will be able to state an RL problem as an MDP, write down the value and Q-functions, recognize the Bellman recursion, and solve a small MDP by hand.

The MDP tuple

A Markov decision process is the tuple (S, A, P, R, gamma):

S is the set of states. A state is whatever information the agent observes that lets it decide; for an Atari agent it might be the last few frames of the screen, for a chess agent the current board, for a robot a vector of joint angles and velocities. The state can be discrete (chess) or continuous (robot).
A is the set of actions. Whatever the agent can do: joystick directions, board moves, joint torques, tokens to emit. Also discrete or continuous.
P(s’ | s, a) is the transition function: the probability of landing in state s’ after taking action a in state s. In a deterministic environment, P(s’ | s, a) = 1 for exactly one s’ and zero for the rest. In a stochastic environment (a robot whose actuators are noisy, a card game with a shuffled deck), it spreads probability across several outcomes.
R(s, a) is the reward function: the number the environment hands back when action a is taken in state s. (Sometimes written R(s, a, s’) if the reward also depends on the next state.) Designing this is its own craft.
gamma is the discount factor, gamma in [0, 1), weighting future rewards relative to immediate ones, exactly as in lesson 1.

That five-tuple is the entire definition of the RL problem. Everything else in the track, every algorithm, every value function, every policy update, is a way of acting in or learning about an MDP.

The Markov property

The “Markov” in MDP refers to one assumption: the next state and reward depend only on the current state and action, not on any earlier history.

P(s_(t+1), r_t | s_t, a_t, s_(t-1), a_(t-1), ...) = P(s_(t+1), r_t | s_t, a_t)

Once you know the current state and action, the past is irrelevant. This is what makes the formalism tractable. Without it the agent would need infinite memory of the full trajectory to make decisions, and dynamic programming and value functions could not be defined recursively.

The Markov property is not a property of the world; it is a property of the state representation. If the agent’s “state” omits enough relevant context, the past matters and the assumption is violated. The classic example: a single Atari frame is not a Markov state (you cannot tell whether a ball is moving left or right from one frame); four stacked consecutive frames is, which is exactly why the original DQN paper stacked four frames. When you design a state representation, you are choosing whether the Markov assumption holds.

The policy

The agent’s behavior is summarized by a policy pi. Two flavors:

Deterministic: a = pi(s). Map each state to a single action.
Stochastic: pi(a | s) is a probability distribution over actions. The agent samples from it.

In deep RL the policy is a neural network with parameters theta, written the policy parameterized by theta. Stochastic policies are the default in policy-gradient methods (the gradient of an action probability is well-defined; the gradient of “pick this discrete action” is not); deterministic policies appear in value-based methods and in some actor-critic variants (DDPG, TD3). SAC is stochastic, with max-entropy training and a reparameterized squashed-Gaussian actor; covered in L12.

Returns, recap

The agent’s objective is to maximize the expected discounted return. From timestep t:

G_t = r_t + γ·r_(t+1) + γ²·r_(t+2) + γ³·r_(t+3) + ...

Identical to lesson 1. What is new here is that we are now ready to ask: given a policy pi, what is the expected return from each state? That is what value functions answer.

Value functions: V, Q, and the advantage

Three objects, all defined as expectations of the return under a policy.

The state-value function V-pi(s) is the expected return starting in state s and following policy π:

V^π(s)  =  E [ G_t  |  s_t = s,  follow π afterward ]

“If I am here and I follow pi from now on, how much reward do I expect?”

The action-value function Q-pi(s, a) is the expected return starting in state s, taking action a, and following policy pi afterward:

Q^π(s, a)  =  E [ G_t  |  s_t = s, a_t = a,  follow π afterward ]

“If I am here, take this action first, and then follow pi, how much reward do I expect?” Useful because it scores each first-action choice separately, which is exactly what you need to act greedily.

The advantage function A-pi(s, a) = Q-pi(s, a) - V-pi(s) is how much better action a is than the policy’s average from state s. Positive advantage: action a beats what pi would do on average. Negative: it is worse. The advantage shows up everywhere in advanced policy-gradient methods (lesson 8’s TRPO and PPO) because it has lower variance than the bare return.

The Bellman equation

The Bellman equation is the recursion that ties value functions to themselves through the dynamics. It is the single most important equation in classical RL.

Start from Q-pi. By definition, the return is the immediate reward plus the discounted return from the next state:

G_t = r_t + γ·G_(t+1)

Take the expectation under the policy and the dynamics:

Q^π(s, a)  =  R(s, a)  +  γ · E_(s' ~ P(·|s,a)) [ V^π(s') ]
           =  R(s, a)  +  γ · Σ over s'  P(s' | s, a) · V^π(s')

And V-pi is the expectation of Q-pi over the policy’s action distribution:

V^π(s)  =  E_(a ~ π(·|s)) [ Q^π(s, a) ]
       =  Σ over a  π(a | s) · Q^π(s, a)

Substitute one into the other to get the Bellman equation for V-pi:

V^π(s)  =  Σ over a  π(a | s) · [ R(s, a)  +  γ · Σ over s'  P(s' | s, a) · V^π(s') ]

For a finite-state, finite-action MDP and a fixed policy pi, this is a system of linear equations in the unknowns V-pi(s). As many equations as states, as many unknowns. You can solve it directly. That is what makes the formalism so powerful: questions about an agent’s long-run performance reduce to linear algebra.

Worked example: a 2-state loop MDP

The smallest MDP that shows the Bellman equation actually doing something.

states:   { s0, s1 }
action:   "go" (only choice in each state)
dynamics: s0 --go--> s1   with reward r = 1
          s1 --go--> s0   with reward r = 0
discount: γ = 0.9

The policy is trivial (one action), so write the Bellman equations for V-pi:

V(s0)  =  1  +  γ · V(s1)
V(s1)  =  0  +  γ · V(s0)

Substitute the second into the first:

V(s0)  =  1 + γ · ( γ · V(s0) )  =  1 + γ² · V(s0)
V(s0) · (1 - γ²)  =  1
V(s0)  =  1 / (1 - γ²)  =  1 / (1 - 0.81)  =  1 / 0.19  ≈  5.263
V(s1)  =  γ · V(s0)  =  0.9 · 5.263  ≈  4.737

Sanity-check the geometric way. Starting at s0, the reward sequence is 1, 0, 1, 0, 1, 0, … (the agent collects 1 every other step). The discounted return is

G(s0)  =  1 + γ²·1 + γ⁴·1 + γ⁶·1 + ...  =  Σ over k of γ^(2k)  =  1 / (1 - γ²)

which matches V(s0) = 1 / (1 - gamma squared) = 5.263 exactly. The Bellman equation and the direct geometric sum are saying the same thing.

This is the smallest non-trivial example, and it is worth dwelling on. The Bellman equation is not a heuristic; it is the precise statement of “value at this state equals immediate reward plus discounted value at the next state, in expectation.” Every value-based RL method (Q-learning, DQN, every actor-critic) is, at heart, trying to satisfy this equation.

The optimal value and the Bellman optimality equation

So far we have evaluated a given policy pi. The point of RL is to find the best policy. The optimal value functions are

V*(s)  =  max over π of V^π(s)
Q*(s, a)  =  max over π of Q^π(s, a)

The best you can do from each state (or state-action pair) over all possible policies. The Bellman optimality equations are the recursion they satisfy:

V*(s)  =  max over a of [ R(s, a)  +  γ · Σ over s' P(s' | s, a) · V*(s') ]

Q*(s, a)  =  R(s, a)  +  γ · Σ over s'  P(s' | s, a) · max over a' of Q*(s', a')

The max makes this a non-linear system, so unlike the policy-evaluation Bellman equation above, it does not yield to direct linear-algebraic solution. Classical RL solves it iteratively, with value iteration or policy iteration, which Track 17 (RL Foundations) covers in depth and which Sutton and Barto Chapter 4 derives.

The payoff: once you have Q-star, the optimal policy is just the greedy policy:

π*(s)  =  argmax over a of Q*(s, a)

Take the action with the highest optimal Q-value. So in principle, all of RL reduces to: estimate Q-star, then act greedily. Q-learning and DQN (lessons 6-7) are exactly that recipe with neural-network function approximation.

Why all of this matters for deep RL

Every algorithm later in this track is estimating one of these objects, from data, with a neural network, in a particular way:

Policy-gradient methods (lessons 4, 5, 8) parameterize the policy parameterized by theta directly and improve its parameters in the direction that increases expected return.
Value-based methods (lessons 6, 7) parameterize Q-theta (or V-theta) and improve it toward satisfying the Bellman optimality equation, then act greedily.
Actor-critic methods (lesson 5) keep both: a parameterized actor (the policy parameterized by theta) and a critic (a learned value or action-value function). The critic provides lower-variance learning signal for the actor.
Advanced PG methods (lesson 8, TRPO and PPO) use the advantage A-pi = Q-pi - V-pi as a variance-reduced training signal for the policy update.
Model-based methods (lessons 9, 10) try to learn the dynamics P(s’ | s, a) itself, then plan using the learned model.

That is the entire dispatch table. When you read a deep-RL paper from the rest of this track, the first question to ask is: what is this method estimating, and how is it improving the estimate? The answer is always one of pi, V, Q, A, or P, and the “how” is always either gradient ascent on return (policy-gradient family), Bellman-error minimization (value family), or supervised regression on observed transitions (model-based family). The MDP gives you the objects; the rest of the track gives you the algorithms.

Why this matters when you use AI

The MDP framing changes how you read claims about RL-trained systems. When a paper says a model achieves some reward, what it means precisely is that, on a particular (S, A, P, R, gamma) (a particular state representation, action space, environment dynamics, reward function, and discount), the model has found a policy with high expected return. Every word of that is a choice. The state representation determines what is Markov; the action space determines what the model can do; the reward function determines what counts as “good.” The much-discussed “reward hacking” problem in RL is precisely this: the agent optimizes the reward you wrote, not the one you meant, because the MDP it was trained on is the one with your reward, not your intent. Reading RL claims through the MDP lens lets you ask the right question: not “did the agent learn the task?” but “did the reward function the agent was given encode the task you cared about?”

Common pitfalls

Confusing V and Q. V-pi(s) is keyed by state alone; Q-pi(s, a) is keyed by state and action together. V answers “from this state, how much reward do I expect?” Q answers “from this state, if I take this specific action first, how much reward do I expect?” Confusing them mangles every algorithm that uses them.

Forgetting the Markov assumption is a choice about the state. If a single Atari frame omits motion information, the state is not Markov. Stack four frames and it is. The Markov property is a property of the representation, not of the world.

Treating V-star or Q-star as “what the agent currently knows.” They are mathematical objects defined by the MDP and the agent’s choice of policy class; they do not change as the agent learns. The agent’s estimates V-theta or Q-theta change; the true V-star is the fixed target those estimates are trying to reach.

Reading the Bellman equation as a definition of value rather than as a constraint value functions must satisfy. Value is defined as the expected return; Bellman is the recursion that value automatically satisfies, given the MDP. Every value-based algorithm exploits this by training estimates to minimize Bellman error.

What you should remember

An MDP is the tuple (S, A, P, R, gamma): states, actions, transition probabilities, rewards, and a discount. The Markov property (next state depends only on current state and action) is what makes the formalism tractable, and it is a property of how you choose the state, not of the world.
The policy pi is what the agent learns. Stochastic pi(a|s) or deterministic a = pi(s); in deep RL it is a neural network, the policy parameterized by theta. The objective is the expected discounted return (the expected sum of discounted rewards over time).
Three value functions: V-pi(s) (expected return from state s under π), Q-pi(s, a) (expected return from s after action a, then π), and A-pi(s, a) = Q-pi(s, a) - V-pi(s) (the advantage). Optimal versions V-star, Q-star maximize over all policies; the optimal policy is greedy with respect to Q-star.
The Bellman equation is the recursion that ties value functions to themselves: V-pi(s) = Σ_a pi(a|s) [R(s,a) + gamma Σ_(s’) P(s’|s,a) V-pi(s’)]. For a fixed pi it is a system of linear equations. On the 2-state loop with gamma = 0.9, it gives V(s0) = 1/(1-gamma squared) = 5.263 and V(s1) = 4.737, matching the geometric-series sum. Every algorithm in the rest of this track is estimating pi, V, Q, A, or P and improving the estimate.

The next lesson takes the most direct route to improving a policy: policy gradients. Parameterize the policy parameterized by theta as a neural network and follow the gradient of the expected return with respect to theta. The derivation uses everything in this lesson, and the algorithm (REINFORCE) is short enough to fit on one page.