What reinforcement learning actually is

This is the opening lesson of Track 17 (Reinforcement Learning Foundations). The track teaches the mathematically rigorous foundations of reinforcement learning, the family of techniques behind systems that learn by trial and reward, from AlphaGo to robotic locomotion to the RLHF step inside modern large language models. This first lesson does not teach a technique; it gives you the map: what RL is and is not, the agent-environment-reward loop the whole track is built on, what makes RL harder than the supervised learning you may have met, and the single tension that every later method in the track is, at bottom, a precise answer to. The source curriculum is David Silver’s UCL Reinforcement Learning course (CC BY-NC 4.0), which is freely available and cited in each lesson as further study.

The lesson places RL beside supervised and unsupervised learning as a third paradigm, draws the agent-environment-reward loop, and names the three things that make RL its own discipline: no oracle action, delayed reward, and distribution shift from the changing policy. It then introduces the exploration-versus-exploitation dilemma with a three-arm bandit walkthrough that makes both extremes visibly fail, tours real systems built on RL, and closes with a named cross-reference to Track 5’s rlhf-and-dpo lesson for the LLM-alignment side.
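For readers who prefer to see the loop run, here is a minimal Python sketch of a three-arm bandit. It is an illustration only, not the lesson's own walkthrough: the payout probabilities in `ARM_PROBS`, the tie-breaking rule (lowest-numbered arm wins), the step count, and the function names `pull` and `run` are all assumptions made for this example. The agent picks an arm (action), the environment returns 0 or 1 (reward), and the agent updates its running estimate of each arm.

```python
import random

# Hypothetical payout probabilities for the three arms (assumed for this
# sketch). The agent never reads these directly; it only sees rewards.
ARM_PROBS = [0.2, 0.5, 0.8]


def pull(arm, rng):
    """Environment step: pay 1 with the arm's hidden probability, else 0."""
    return 1 if rng.random() < ARM_PROBS[arm] else 0


def run(epsilon, steps=5000, seed=0):
    """Run one agent for `steps` pulls and return its average reward per pull.

    epsilon=0.0 is pure exploitation (always take the best-looking arm);
    epsilon=1.0 is pure exploration (always pick an arm at random).
    """
    rng = random.Random(seed)
    n_arms = len(ARM_PROBS)
    counts = [0] * n_arms      # how often each arm has been pulled
    values = [0.0] * n_arms    # running mean reward observed per arm
    total = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                       # explore
        else:
            # Exploit: best current estimate; ties go to the lowest index.
            arm = max(range(n_arms), key=lambda a: values[a])
        reward = pull(arm, rng)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]   # update mean
        total += reward
    return total / steps


if __name__ == "__main__":
    for name, eps in [("pure exploit", 0.0), ("pure explore", 1.0), ("mixed", 0.1)]:
        print(f"{name:12s} epsilon={eps:.1f} average reward={run(eps):.3f}")
```

Under these assumptions the pure-exploit agent never leaves the first arm: all estimates start at zero, ties go to arm 0, and once arm 0 pays out even once its estimate stays above the untried arms. The pure-explore agent learns which arm is best but never acts on it, so it earns roughly the average of the three arms. The mixed setting, which explores on about one pull in ten, should end up ahead of both.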

This is lesson 1 of 10, the entry point of the track. There is no previous lesson here; the prerequisite is comfort with basic probability and expectation (Track 9 material is sufficient). The next lesson, Markov Decision Processes, turns this lesson’s informal loop into a formal MDP, which the rest of the track relies on. Lesson 10 closes the arc with a bridge to modern RL, including RLHF.

Prerequisites: none beyond comfort with basic probability and expectation; Track 9 (Statistics & Probability for AI) is the natural foundation, particularly its random-variables and expected-value material. Some exposure to machine learning vocabulary (training, test data, models, gradient descent) helps but is not required for this opener.

This orientation lesson has essentially no math. It is conceptual: the three-paradigm split, the agent-environment-reward loop drawn out, and a three-arm bandit walked through in words. Real notation begins in the next lesson (the MDP tuple) and intensifies through the track; Bellman equations arrive in lesson 3 and stay through the rest of the foundations. Treat this lesson as calibration: if the conceptual picture here feels right, the math-rigorous later lessons will rest on a clear conceptual base.

  • Distinguish reinforcement learning from supervised and unsupervised learning as a third paradigm
  • Describe the agent-environment-reward loop and the role of states, actions, and rewards
  • Explain what makes RL harder than supervised learning (no oracle, delayed reward, distribution shift from the policy)
  • State the exploration-versus-exploitation dilemma and recognize it as the through-line of the track
  • Recognize where RL shows up in real systems (games, robotics, recommendation, RLHF behind modern LLMs)
  • Read time: about 12 minutes
  • Practice time: about 14 minutes (a self-check, a classify-the-paradigm exercise across supervised / unsupervised / RL scenarios, a three-arm bandit reasoning exercise that makes pure-exploit and pure-explore visibly fail, and flashcards)
  • Difficulty: standard (a conceptual orientation lesson; no math beyond counting)