Markov Decision Processes: cheatsheet

The one idea

A Markov Decision Process (S, A, P, R, gamma) is the formal contract of RL. The Markov property is a feature of the state representation you choose, not the world; the discounted return is the geometrically weighted sum of future rewards.

The MDP tuple

MDP = ( S , A , P , R , gamma )

S      = set of states (must encode enough to be Markov)
A      = set of actions (sometimes A(s))
P      = P(s' | s, a)   transition probabilities (the dynamics)
R      = R(s, a)        reward function (DESIGNED, not given)
gamma  = discount in [0, 1) (myopia vs far-sightedness; required for continuing tasks)

The Markov property

P( s_(t+1), r_(t+1)  |  s_t, a_t, s_(t-1), a_(t-1), ..., s_0, a_0 )
                       =
P( s_(t+1), r_(t+1)  |  s_t, a_t )

A property of the state representation, not the world. Atari fix: stack 4 frames so velocity is in the state.

Trajectory and return

trajectory: s_0  --a_0-->  s_1 (reward r_1)  --a_1-->  s_2 (reward r_2)  ...
return:     G_t = r_(t+1) + gamma * r_(t+2) + gamma^2 * r_(t+3) + ...
              = sum over k >= 0 of  gamma^k * r_(t+k+1)

Convention: action a_t taken in state s_t produces r_(t+1) and s_(t+1).
The first reward in a trajectory is r_1, not r_0.

The discount factor (one trajectory, three gammas)

Trajectory rewards: r_1=+1, r_2=+1, r_3=-1, r_4=+1
  gamma = 0    -> G_0 = 1           (myopic; only r_1 matters)
  gamma = 0.5  -> G_0 = 1 + 0.5 - 0.25 + 0.125 = 1.375
  gamma = 0.9  -> G_0 = 1 + 0.9 - 0.81 + 0.729 = 1.819   (far-sighted)

Standard practice: gamma = 0.9-0.99. gamma < 1 mathematically required for continuing tasks.

Planning vs learning (Phase 2 vs Phase 3)

	Knows P, R?	Method
Planning (Phase 2)	YES	Solve MDP directly (policy / value iteration, lessons 4-5)
Model-free learning (Phase 3)	NO — only samples	Learn from interaction (MC, TD, Q-learning, lessons 6-8)

Pitfalls to dodge

Confusing “Markov” with “history is irrelevant” (it means the state ALREADY summarizes the relevant history).
Off-by-one on reward indexing (a_t in s_t produces r_(t+1), not r_t).
gamma = 1 on a non-terminating task (sum can diverge).
Treating P, R as available in real problems (usually they are not — Phase 3).
Designing a state too small for the problem to be Markov (silently breaks algorithms).

Words to use precisely

State / action / reward / transition: the four moving parts of the MDP.
Markov property: next state and reward depend only on the current state and action.
Policy pi(a | s): the agent’s rule for choosing actions from a state.
Trajectory: the sequence of states/actions/rewards under a policy.
Return G_t: the discounted total future reward from time t.
Discount factor gamma: in [0, 1); controls myopia and ensures convergence.