Summary: What reinforcement learning actually is
Reinforcement learning is a third paradigm beside supervised and unsupervised learning, the one where an agent learns from interaction with consequences. You learned to ride a bike by trying; AlphaGo learned to play Go by self-play; modern chatbots are fine-tuned with reward signals. The thread is the same: an agent, an environment, and a reward. This opener sets up the loop, names what makes RL harder than supervised learning, introduces the through-line of the whole track (exploration vs exploitation), and tours where RL appears. This summary is the scan-in-five-minutes version of the full lesson.
Core ideas
- Three paradigms. Supervised learns input-to-label mappings from labeled examples. Unsupervised finds structure with no labels. Reinforcement has no labels and no oracle, only rewards from acting in an environment; the agent learns a policy that maximizes total reward over time.
- The agent-environment-reward loop. At each step the agent observes a state, picks an action, and receives a reward and the next state. Repeat. The policy is the agent’s rule for choosing an action in each state, and the goal is to maximize total reward over time, not just the immediate reward.
- What makes RL harder than supervised. (1) No oracle action: you only get feedback on the action you took. (2) Delayed reward: important outcomes arrive many steps later (the credit-assignment problem). (3) The data distribution depends on the policy, so as the agent learns, the training distribution shifts.
- Exploration vs exploitation, the through-line. Exploit what looks best now and you stop learning; explore at random and you never act on what you learned. Every algorithm in the track is, underneath, a precise mix of the two. The three-arm bandit makes the tension visible: pulling only the best-seen-so-far arm locks in noise; pulling uniformly never cashes in on what you found.
- Reward is designed, not given. The reward function is an engineering choice that expresses what you want; bad reward design produces agents that game the signal (the robot that “moves fast” by crashing). Reward shaping is its own concern.
- Where it shows up. Games (AlphaGo, DQN on Atari), robotics, recommendation, scheduling, and RLHF behind modern LLMs. Track 5’s rlhf-and-dpo covers the alignment side; this track teaches the RL mechanics that underlie it (lesson 10 closes that loop).
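The agent-environment-reward loop from the list above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lesson: CorridorEnv, its reward values, and the two policies are hypothetical names and numbers chosen for the example. The reward only becomes positive at the far end of the corridor, so it also shows delayed reward in miniature.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: positions 0..4, reaching position 4 pays +1."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # action is -1 (left) or +1 (right); the position is clamped to the corridor.
        self.position = max(0, min(self.length - 1, self.position + action))
        done = self.position == self.length - 1
        # Assumed reward design: small cost per step, payoff only at the goal.
        reward = 1.0 if done else -0.01
        return self.position, reward, done


def random_policy(state, rng):
    """Ignores the state and moves left or right at random."""
    return rng.choice([-1, 1])


def always_right(state, rng):
    """Ignores the state and always moves right."""
    return 1


def run_episode(env, policy, rng, max_steps=200):
    """Run one episode of the observe-act-reward loop and return the total reward."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state, rng)             # the agent picks an action
        state, reward, done = env.step(action)  # the environment responds
        total_reward += reward                  # the goal is the sum, not one step
        if done:
            break
    return total_reward


if __name__ == "__main__":
    rng = random.Random(0)
    print("random policy:", run_episode(CorridorEnv(), random_policy, rng))
    print("always right: ", run_episode(CorridorEnv(), always_right, rng))
```

Nothing here learns yet; the two fixed policies only show that the same loop yields different total reward depending on how actions are chosen, which is the quantity a learning algorithm would try to improve.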
What changes for you
You have a clean mental model for a whole class of AI systems you have been reading about. When you see “AlphaGo learned by self-play” or “the chatbot was tuned with reinforcement learning from human feedback,” the framework now slots in: there is an agent, an environment, a reward signal, a policy, and a learning algorithm that updates the policy from interaction. You also see up front the headline tension, exploration vs exploitation, that every method in the track is built to manage. As the algorithms accumulate, you can read each one as a different precise answer to one question: how much of the next action should be based on what we already think we know, and how much on what we still need to learn?
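One simple way to make that question concrete is an epsilon-greedy rule on a three-arm bandit. The summary does not name a specific method, so treat this as an assumed illustration: with probability epsilon the agent pulls a random arm (explore), otherwise it pulls the arm with the best estimate so far (exploit). The function name run_bandit, the Gaussian noise, and the arm means in the usage snippet are all hypothetical.

```python
import random


def run_bandit(true_means, epsilon, steps, rng):
    """Play a noisy bandit with epsilon-greedy; return (total reward, pulls per arm)."""
    n_arms = len(true_means)
    counts = [0] * n_arms        # how often each arm was pulled
    estimates = [0.0] * n_arms   # running average reward per arm
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                           # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        # Assumed reward model: the arm's true mean plus unit Gaussian noise.
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        # Incremental update of the running average for the pulled arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total += reward
    return total, counts


if __name__ == "__main__":
    means = [0.2, 0.5, 0.8]  # hypothetical arm means, unknown to the agent
    for eps in (0.0, 0.1, 1.0):
        total, counts = run_bandit(means, eps, 2000, random.Random(0))
        print(f"epsilon={eps}: average reward {total / 2000:.3f}, pulls {counts}")
```

Setting epsilon to 0.0 gives pure exploitation and 1.0 gives uniform pulling, the two extremes described in the core ideas; values in between are one crude way to mix them, and comparing the pull counts across runs shows where each setting spends its actions.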