Skip to content

Summary: Markov Decision Processes

An MDP is the universal contract of reinforcement learning: a tuple (S, A, P, R, gamma) plus the Markov property, on which every algorithm in the track operates. The first lesson drew the agent-environment loop informally; this one nails it down precisely so the rest of the track has something formal to compute on. This summary is the scan-in-five-minutes version of the full lesson.

  • The MDP tuple. (S, A, P, R, gamma): states S, actions A, transition probabilities P(s’ | s, a), reward function R(s, a), and discount factor gamma in [0, 1). Together these specify the dynamics and what the agent values.
  • The Markov property. The next state and reward depend only on the current state and action, not on prior history. Crucially this is a property of the state representation you choose, not of the world: a single Atari frame is not Markov (you cannot see velocity); four stacked frames are. DQN’s frame-stacking is the canonical fix.
  • Trajectory and return. A policy plus the environment generate s_0, a_0, r_1, s_1, a_1, r_2, … The discounted return from time t is G_t = r_(t+1) + gamma * r_(t+2) + gamma^2 * r_(t+3) + …, the geometrically weighted total of future rewards.
  • The discount factor gamma. Near 0 makes the agent myopic (cares only about immediate reward); near 1 makes it far-sighted. Standard practice is gamma = 0.9-0.99. gamma < 1 is mathematically required for continuing (non-terminating) tasks so the infinite sum is well-defined.
  • Worked at three gammas. A four-step trajectory with rewards 1, 1, -1, 1 gives G_0 = 1 at gamma = 0, 1.375 at gamma = 0.5, and 1.819 at gamma = 0.9. Same trajectory, three different returns; gamma is the only knob that changed.
  • Phase 2 vs Phase 3 of the track. Phase 2 (planning) assumes you know P and R; you solve the MDP directly. Phase 3 (model-free learning) assumes only samples; you learn from interaction without ever writing P down. The MDP is the contract both phases assume.

You now have the language the rest of the track speaks. When you read about a new RL method, you can ask the right questions immediately: what are the states (and are they actually Markov)? what is the reward function (and is it well-designed)? what is gamma? does the algorithm assume the model is known or only samples? The Atari frame-stacking story sticks because it shows that the Markov assumption is something you engineer for, not something a problem comes with. And the discount-factor walk-through makes gamma a knob with felt consequences, not a hyperparameter to copy. The next lesson defines the value functions V and Q on this MDP and writes the Bellman equations that tie them to themselves, the recursion the rest of the track will solve in different ways.