References: What reinforcement learning actually is

Source material

Source curriculum (structural mirror, cited as further study):
• David Silver, "Reinforcement Learning" (UCL course), Lecture 1:
  Introduction to Reinforcement Learning
  Author: David Silver
  Course page: https://davidstarsilver.wordpress.com/teaching/
  License: CC BY-NC 4.0
Clawdemy's lessons are original prose that follows the pedagogical arc of this
course. We do not embed, reproduce, or transcribe Silver's slides or video
lectures; we link out to the relevant lecture as recommended further study.
The non-commercial clause is consistent with Clawdemy's own CC BY-NC-SA 4.0 license; both forbid commercial use without permission. Commercial use is licensed separately at [/legal/licensing](/legal/licensing/).
All rights to the original materials remain with the author and UCL.

Source-scope note: this lesson mirrors Silver's Lecture 1 (the RL paradigm
and the agent-environment setup) and restates it in Clawdemy's voice with
original framing. The three-paradigm split (supervised / unsupervised /
reinforcement), the explicit "what makes RL harder than supervised" list, the
designed-reward caveat, and the three-arm bandit walk-through used to make
exploration-vs-exploitation concrete are Clawdemy framing. The lesson does
not yet introduce MDPs or value functions; those are the next two lessons.
Exact per-lecture URLs are verified at promotion.

Read this next

David Silver, UCL RL course, Lecture 1: Introduction to Reinforcement Learning. The lecture this lesson mirrors, with the canonical introduction to the RL framework, the multi-armed bandit motivation, and the historical context (psychology, control, operations research). CC BY-NC 4.0, freely available. Watch it alongside this lesson for the longer development.

Going deeper

A short, durable list. Both are free.

David Silver, UCL RL course, Lecture 2: Markov Decision Processes (within the course above). The direct continuation: it formalizes the loop this lesson sketched into a Markov Decision Process, the setup for the rest of the track. Clawdemy covers this material in Track 17, lesson 2.
Richard Sutton and Andrew Barto, “Reinforcement Learning: An Introduction” (2nd edition, freely available on the authors’ page). The standard textbook the whole field references. Chapter 1 covers the same ground as this lesson at book length, and Chapter 2 works through the multi-armed bandit in full.

Adjacent topics

Where this leads inside this track and beyond.

Markov Decision Processes. The next lesson. It turns the loop here into a formal object (states, actions, transitions, rewards, discount), which the rest of the track relies on.
Value functions and the Bellman equations. Lesson 3. The mathematical heart of how RL reasons about long-run reward.
RLHF and DPO (AI Foundations, Track 5). A separate, more applied track. T5’s rlhf-and-dpo lesson covers the alignment side of using RL on large language models; this track teaches the RL mechanics that RLHF assumes, and lesson 10 closes the loop with an explicit bridge back to T5.