References: Markov Decision Processes

Source material

Source curriculum (structural mirror, cited as further study):
• David Silver, "Reinforcement Learning" (UCL course), Lecture 2:
  Markov Decision Processes
  Author: David Silver
  Course page: https://davidstarsilver.wordpress.com/teaching/
  License: CC BY-NC 4.0
Clawdemy's lessons are original prose that follows the pedagogical arc of this
course. We do not embed, reproduce, or transcribe Silver's slides or video
lectures; we link out to the relevant lecture as recommended further study.
The non-commercial clause is consistent with Clawdemy's own CC BY-NC-SA 4.0
license: both forbid commercial use without permission. Commercial use is
licensed separately at [/legal/licensing](/legal/licensing/).
All rights to the original materials remain with the author and UCL.

Source-scope note: this lesson mirrors Silver's Lecture 2 (Markov processes
through Markov Decision Processes) and restates it in Clawdemy's voice. The
two-state H/S worked example, the explicit "Markov is a property of the state
representation" framing with the Atari frame-stacking story, the three-gamma
return walk-through, and the Phase 2 vs Phase 3 planning-versus-learning
boundary are Clawdemy framing. Silver's Lecture 2 also introduces value
functions and the Bellman expectation equation in its later half; Clawdemy
splits those into the next lesson rather than crowding them in here. Exact
per-lecture URLs are verified at promotion.

Read this next

David Silver, UCL RL course, Lecture 2: Markov Decision Processes. The lecture this lesson mirrors, with the same formalism developed alongside Markov reward processes and the first sketch of value functions. CC BY-NC 4.0, freely available. Watch it alongside this lesson for the longer development and additional intuition on the Markov property.

Going deeper

A short, durable list. Both are free.

David Silver, UCL RL course, Lecture 3: Planning by Dynamic Programming (within the course above). Where the MDP becomes computable: policy iteration and value iteration, which Track 17 develops as Phase 2 (lessons 4-5).
Sutton and Barto, “Reinforcement Learning: An Introduction” (2nd edition), Chapter 3 (Finite Markov Decision Processes). The textbook treatment of the same material, with worked examples and a careful discussion of the reward hypothesis and gamma. The standard reference for the formalism.

Adjacent topics

Where this leads inside this track.

What reinforcement learning actually is. The previous lesson. It drew the agent-environment-reward loop informally; this lesson is the formal version.
Value functions and the Bellman equations. The next lesson. With the MDP in hand, you can define V(s) and Q(s, a) and write the recursive Bellman equations that link them. Those equations are the mathematical heart of the whole field.
Planning with a known model (Phase 2). Lessons 4-5. Once an MDP is specified and known, you can solve it directly with policy iteration and value iteration, which are repeated applications of the Bellman equations.