Markov Decision Processes: brief

What you’ll learn

This is lesson 2 of Track 17 (Reinforcement Learning Foundations) and the second lesson of Phase 1 (The RL setup). The first lesson drew the agent-environment-reward loop informally; this one is the formal version. You will learn the Markov Decision Process, the universal contract of reinforcement learning: a five-piece tuple plus the Markov property, on which every algorithm in the rest of the track operates. The source curriculum is David Silver’s UCL RL course (CC BY-NC 4.0), freely available and cited per lesson as further study.

The lesson lays out the MDP tuple (S, A, P, R, gamma), names each piece, states and unpacks the Markov property (and shows it is a property of the state representation, not the world, with the Atari frame-stacking story as the canonical fix), defines a trajectory and the discounted return, walks the return at three values of gamma on a small two-state MDP so the discount becomes a felt knob rather than a hyperparameter, and draws the planning-versus-learning boundary that organizes the rest of the track (Phase 2 assumes P and R are known; Phase 3 only assumes samples).

Where this fits

This is lesson 2 of 10, in Phase 1. The previous lesson, What reinforcement learning actually is, set up the loop informally. The next lesson, Value functions and the Bellman equations, defines V and Q on the MDP introduced here and writes the recursive Bellman equations that the rest of the track will solve in different ways. From this lesson on, the math-rigorous register the track promised is in force.

Before you start

Prerequisites: the previous lesson (What reinforcement learning actually is) for the agent-environment loop. Comfort with basic probability and expectation (Track 9’s random-variable material is sufficient), and ease reading notation like P(s’ | s, a) and a sum of terms like gamma^k * r_(t+k+1). No calculus required for this lesson.

About the math

This is the first lesson with real notation. You will read the MDP tuple, a conditional probability for the dynamics, a summation for the discounted return, and a small numeric calculation of G_0 at three values of gamma. Everything is anchored to a worked two-state MDP so the symbols correspond to numbers. No proofs; the precision is for use, not derivation.

By the end, you’ll be able to

Write down the MDP tuple (S, A, P, R, gamma) and explain what each piece encodes
State the Markov property and explain why it is what makes MDP algorithms tractable
Define a trajectory under a policy and compute the discounted return on a small example
Explain how the discount factor gamma controls myopic versus far-sighted behavior, and why an infinite-horizon problem needs gamma less than 1
Recognize the engineering job of designing states that make a real problem actually Markov (the Atari-frame-stacking example)

Time and difficulty

Read time: about 13 minutes
Practice time: about 16 minutes (a self-check, an identify-the-pieces exercise formalizing a recommender as an MDP, a compute-the-return exercise at three discount values, and flashcards)
Difficulty: standard (the first lesson with real notation, but the arithmetic is small and every symbol is grounded in the H-S worked example)