Summary: Markov Decision Processes
An MDP is the universal contract of reinforcement learning: a tuple (S, A, P, R, gamma) plus the Markov property, on which every algorithm in the track operates. The first lesson drew the agent-environment loop informally; this one nails it down precisely so the rest of the track has something formal to compute on. This summary is the scan-in-five-minutes version of the full lesson.
Core ideas
Section titled “Core ideas”- The MDP tuple. (S, A, P, R, gamma): states S, actions A, transition probabilities P(s’ | s, a), reward function R(s, a), and discount factor gamma in [0, 1). Together these specify the dynamics and what the agent values.
- The Markov property. The next state and reward depend only on the current state and action, not on prior history. Crucially this is a property of the state representation you choose, not of the world: a single Atari frame is not Markov (you cannot see velocity); four stacked frames are. DQN’s frame-stacking is the canonical fix.
- Trajectory and return. A policy plus the environment generate s_0, a_0, r_1, s_1, a_1, r_2, … The discounted return from time t is G_t = r_(t+1) + gamma * r_(t+2) + gamma^2 * r_(t+3) + …, the geometrically weighted total of future rewards.
- The discount factor gamma. Near 0 makes the agent myopic (cares only about immediate reward); near 1 makes it far-sighted. Standard practice is gamma = 0.9-0.99. gamma < 1 is mathematically required for continuing (non-terminating) tasks so the infinite sum is well-defined.
- Worked at three gammas. A four-step trajectory with rewards 1, 1, -1, 1 gives G_0 = 1 at gamma = 0, 1.375 at gamma = 0.5, and 1.819 at gamma = 0.9. Same trajectory, three different returns; gamma is the only knob that changed.
- Phase 2 vs Phase 3 of the track. Phase 2 (planning) assumes you know P and R; you solve the MDP directly. Phase 3 (model-free learning) assumes only samples; you learn from interaction without ever writing P down. The MDP is the contract both phases assume.
What changes for you
Section titled “What changes for you”You now have the language the rest of the track speaks. When you read about a new RL method, you can ask the right questions immediately: what are the states (and are they actually Markov)? what is the reward function (and is it well-designed)? what is gamma? does the algorithm assume the model is known or only samples? The Atari frame-stacking story sticks because it shows that the Markov assumption is something you engineer for, not something a problem comes with. And the discount-factor walk-through makes gamma a knob with felt consequences, not a hyperparameter to copy. The next lesson defines the value functions V and Q on this MDP and writes the Bellman equations that tie them to themselves, the recursion the rest of the track will solve in different ways.