Cheatsheet: Markov Decision Processes
The one idea
Section titled “The one idea”A Markov Decision Process (S, A, P, R, gamma) is the formal contract of RL. The Markov property is a feature of the state representation you choose, not the world; the discounted return is the geometrically weighted sum of future rewards.
The MDP tuple
Section titled “The MDP tuple”MDP = ( S , A , P , R , gamma )
S = set of states (must encode enough to be Markov)A = set of actions (sometimes A(s))P = P(s' | s, a) transition probabilities (the dynamics)R = R(s, a) reward function (DESIGNED, not given)gamma = discount in [0, 1) (myopia vs far-sightedness; required for continuing tasks)The Markov property
Section titled “The Markov property”P( s_(t+1), r_(t+1) | s_t, a_t, s_(t-1), a_(t-1), ..., s_0, a_0 ) =P( s_(t+1), r_(t+1) | s_t, a_t )A property of the state representation, not the world. Atari fix: stack 4 frames so velocity is in the state.
Trajectory and return
Section titled “Trajectory and return”trajectory: s_0 --a_0--> s_1 (reward r_1) --a_1--> s_2 (reward r_2) ...return: G_t = r_(t+1) + gamma * r_(t+2) + gamma^2 * r_(t+3) + ... = sum over k >= 0 of gamma^k * r_(t+k+1)
Convention: action a_t taken in state s_t produces r_(t+1) and s_(t+1).The first reward in a trajectory is r_1, not r_0.The discount factor (one trajectory, three gammas)
Section titled “The discount factor (one trajectory, three gammas)”Trajectory rewards: r_1=+1, r_2=+1, r_3=-1, r_4=+1 gamma = 0 -> G_0 = 1 (myopic; only r_1 matters) gamma = 0.5 -> G_0 = 1 + 0.5 - 0.25 + 0.125 = 1.375 gamma = 0.9 -> G_0 = 1 + 0.9 - 0.81 + 0.729 = 1.819 (far-sighted)Standard practice: gamma = 0.9-0.99. gamma < 1 mathematically required for continuing tasks.
Planning vs learning (Phase 2 vs Phase 3)
Section titled “Planning vs learning (Phase 2 vs Phase 3)”| Knows P, R? | Method | |
|---|---|---|
| Planning (Phase 2) | YES | Solve MDP directly (policy / value iteration, lessons 4-5) |
| Model-free learning (Phase 3) | NO — only samples | Learn from interaction (MC, TD, Q-learning, lessons 6-8) |
Pitfalls to dodge
Section titled “Pitfalls to dodge”- Confusing “Markov” with “history is irrelevant” (it means the state ALREADY summarizes the relevant history).
- Off-by-one on reward indexing (a_t in s_t produces r_(t+1), not r_t).
- gamma = 1 on a non-terminating task (sum can diverge).
- Treating P, R as available in real problems (usually they are not — Phase 3).
- Designing a state too small for the problem to be Markov (silently breaks algorithms).
Words to use precisely
Section titled “Words to use precisely”- State / action / reward / transition: the four moving parts of the MDP.
- Markov property: next state and reward depend only on the current state and action.
- Policy pi(a | s): the agent’s rule for choosing actions from a state.
- Trajectory: the sequence of states/actions/rewards under a policy.
- Return G_t: the discounted total future reward from time t.
- Discount factor gamma: in [0, 1); controls myopia and ensures convergence.