Skip to content

Cheatsheet: Markov Decision Processes

A Markov Decision Process (S, A, P, R, gamma) is the formal contract of RL. The Markov property is a feature of the state representation you choose, not the world; the discounted return is the geometrically weighted sum of future rewards.

MDP = ( S , A , P , R , gamma )
S = set of states (must encode enough to be Markov)
A = set of actions (sometimes A(s))
P = P(s' | s, a) transition probabilities (the dynamics)
R = R(s, a) reward function (DESIGNED, not given)
gamma = discount in [0, 1) (myopia vs far-sightedness; required for continuing tasks)
P( s_(t+1), r_(t+1) | s_t, a_t, s_(t-1), a_(t-1), ..., s_0, a_0 )
=
P( s_(t+1), r_(t+1) | s_t, a_t )

A property of the state representation, not the world. Atari fix: stack 4 frames so velocity is in the state.

trajectory: s_0 --a_0--> s_1 (reward r_1) --a_1--> s_2 (reward r_2) ...
return: G_t = r_(t+1) + gamma * r_(t+2) + gamma^2 * r_(t+3) + ...
= sum over k >= 0 of gamma^k * r_(t+k+1)
Convention: action a_t taken in state s_t produces r_(t+1) and s_(t+1).
The first reward in a trajectory is r_1, not r_0.

The discount factor (one trajectory, three gammas)

Section titled “The discount factor (one trajectory, three gammas)”
Trajectory rewards: r_1=+1, r_2=+1, r_3=-1, r_4=+1
gamma = 0 -> G_0 = 1 (myopic; only r_1 matters)
gamma = 0.5 -> G_0 = 1 + 0.5 - 0.25 + 0.125 = 1.375
gamma = 0.9 -> G_0 = 1 + 0.9 - 0.81 + 0.729 = 1.819 (far-sighted)

Standard practice: gamma = 0.9-0.99. gamma < 1 mathematically required for continuing tasks.

Knows P, R?Method
Planning (Phase 2)YESSolve MDP directly (policy / value iteration, lessons 4-5)
Model-free learning (Phase 3)NO — only samplesLearn from interaction (MC, TD, Q-learning, lessons 6-8)
  • Confusing “Markov” with “history is irrelevant” (it means the state ALREADY summarizes the relevant history).
  • Off-by-one on reward indexing (a_t in s_t produces r_(t+1), not r_t).
  • gamma = 1 on a non-terminating task (sum can diverge).
  • Treating P, R as available in real problems (usually they are not — Phase 3).
  • Designing a state too small for the problem to be Markov (silently breaks algorithms).
  • State / action / reward / transition: the four moving parts of the MDP.
  • Markov property: next state and reward depend only on the current state and action.
  • Policy pi(a | s): the agent’s rule for choosing actions from a state.
  • Trajectory: the sequence of states/actions/rewards under a policy.
  • Return G_t: the discounted total future reward from time t.
  • Discount factor gamma: in [0, 1); controls myopia and ensures convergence.