References: What reinforcement learning actually is
Source material
Source curriculum (structural mirror, cited as further study):
• David Silver, "Reinforcement Learning" (UCL course), Lecture 1: Introduction to Reinforcement Learning
  Author: David Silver
  Course page: https://davidstarsilver.wordpress.com/teaching/
  License: CC BY-NC 4.0
Clawdemy's lessons are original prose that follows the pedagogical arc of this course. We do not embed, reproduce, or transcribe Silver's slides or video lectures; we link out to the relevant lecture as recommended further study. The non-commercial clause aligns with Clawdemy's free, zero-revenue posture. All rights to the original materials remain with the author and UCL.
Source-scope note: this lesson mirrors Silver's Lecture 1 (the RL paradigm and the agent-environment setup) and restates it in Clawdemy's voice with original framing. The three-paradigm split (supervised / unsupervised / reinforcement), the explicit "what makes RL harder than supervised" list, the designed-reward caveat, and the three-arm bandit walk-through used to make exploration-vs-exploitation concrete are Clawdemy framing. The lesson does not yet introduce MDPs or value functions; those are the next two lessons. Exact per-lecture URLs are verified at promotion.
Read this next
- David Silver, UCL RL course, Lecture 1: Introduction to Reinforcement Learning. The lecture this lesson mirrors, with the canonical introduction to the RL framework, the multi-armed bandit motivation, and the historical context (psychology, control, operations research). CC BY-NC 4.0, freely available. Watch it alongside this lesson for the longer development.
Going deeper
A short, durable list. Both are free.
- David Silver, UCL RL course, Lecture 2: Markov Decision Processes (within the course above). The direct continuation: formalizing the loop this lesson sketched into a Markov Decision Process, the setup for the rest of the track. This is Track 17 lesson 2.
- Richard Sutton and Andrew Barto, “Reinforcement Learning: An Introduction” (2nd edition, available freely at the authors’ page). The standard textbook the whole field references. Chapter 1 covers the same ground as this lesson at book length, with the multi-armed bandit fully worked in Chapter 2.
Adjacent topics
Where this leads inside this track and beyond.
- Markov Decision Processes. The next lesson. It turns the loop here into a formal object (states, actions, transitions, rewards, discount), which the rest of the track relies on.
- Value functions and the Bellman equations. Lesson 3. The mathematical heart of how RL reasons about long-run reward.
- RLHF and DPO (AI Foundations, Track 5). A separate, more applied track. T5’s rlhf-and-dpo lesson covers the alignment side of using RL on large language models; this track teaches the RL mechanics that RLHF assumes, and lesson 10 closes the loop with an explicit bridge back to T5.