What reinforcement learning actually is
You learned to ride a bike by trying, not by reading a textbook that told you the exact correct adjustment for every angle and speed. You wobbled, fell, adjusted, fell less, and at some point rode. AlphaGo learned to play Go better than any human by playing itself millions of times and keeping what worked. ChatGPT was fine-tuned with reward signals that nudged it toward responses humans preferred. The same idea, learning by trial and feedback, is behind all three, and it is not supervised learning and not unsupervised learning. It is the third paradigm, reinforcement learning, the one where an agent learns from interaction with consequences.
This opening lesson sets up the whole track. You will see what makes RL its own thing, the basic loop everything in the track is built on, what makes it harder than the supervised setting you may have met, and the single tension, exploration against exploitation, that every algorithm in the next nine lessons answers in its own precise way.
Three paradigms of learning
Most machine learning falls into one of three buckets, and naming the buckets is the cleanest way to say what RL is.
- Supervised learning gives the model labeled examples: here is an image, here is the correct label “cat”; here is a sentence, here is the correct translation. The model learns the mapping from input to label. The teacher tells the learner the right answer for every example.
- Unsupervised learning gives the model no labels and asks it to find structure on its own: cluster these users, compress these images, learn which words go together. The teacher tells the learner nothing; the learner finds patterns.
- Reinforcement learning gives the model neither labels nor pure structure. It gives a goal expressed as reward and lets the agent act in an environment to maximize that reward over time. There is no “correct action” stamped onto each situation; there is only feedback about how well things went.
The split matters because the algorithms that work in each setting are different. RL is not a special case of supervised learning with the labels hidden; it is its own paradigm with its own complications, and the rest of the track is what those complications force you to do.
The agent-environment-reward loop
Underneath every RL system is one picture, and it does not change for the whole track. An agent interacts with an environment in discrete time. At each step the agent observes the current state, picks an action, the environment responds with a reward and a next state, and the loop continues.
               +---------------+
               |  ENVIRONMENT  |
               +---------------+
                 ^   |       |
       action a_t|   |state  |reward
                 |   |s_(t+1)|r_(t+1)
                 |   v       v
               +---------------+
               |     AGENT     |
               +---------------+

Three pieces, named precisely:
- State (s): the agent’s information about the situation. In a chess game, the board position. For a robot, the joint angles and sensor readings. In a recommendation system, a user profile and recent activity.
- Action (a): a choice the agent can make at this state. In chess, a legal move. For the robot, a torque on each joint. For the recommender, which item to show next.
- Reward (r): a number the environment hands back that tells the agent how good the result of an action was, right now. Winning the game: +1 at the end. Falling over: -10. A click on the recommendation: +1.
The agent’s job is to choose actions over time to maximize the total reward, not the immediate one. The plan it uses to choose actions is called a policy. That is the whole vocabulary for now; the next lesson formalizes it as a Markov Decision Process, and the rest of the track is methods for finding good policies.
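To make the loop concrete, here is a minimal sketch in Python. Everything in it is invented for illustration: the `Corridor` environment, its `reset` and `step` methods, and the `always_right` policy are assumptions, not the interface of any particular library or of later lessons.

```python
class Corridor:
    """Hypothetical toy environment: cells 0..4 in a row, the goal is cell 4."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        """Put the agent back at cell 0 and return that state."""
        self.state = 0
        return self.state

    def step(self, action):
        """Apply an action (-1 = left, +1 = right); walls clamp the position."""
        self.state = min(max(self.state + action, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0  # reward only when the goal is reached
        return self.state, reward, done


def always_right(state):
    """A policy maps a state to an action; this one ignores the state."""
    return +1


def run_episode(env, policy, max_steps=100):
    """Run the observe -> act -> reward -> next-state loop for one episode."""
    state = env.reset()
    total_reward = 0.0
    trajectory = []
    for _ in range(max_steps):
        action = policy(state)                       # agent picks an action
        next_state, reward, done = env.step(action)  # environment responds
        trajectory.append((state, action, reward, next_state))
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward, trajectory


total, trajectory = run_episode(Corridor(), always_right)
print(total, len(trajectory))
```

In this sketch the episode takes four steps and the only nonzero reward arrives on the last one, so the quantity the agent cares about is the total over the episode, not the reward at any single step.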
What makes RL harder than supervised learning
If you are coming from supervised learning, three things are genuinely new and harder in RL.
No oracle telling you the right action. In supervised learning, the dataset says “for this input, this label is correct.” In RL, when you take an action, the environment hands back a reward, not the action you should have taken instead. You learn what was good or bad, but not what was best. The learner has to figure out the best action from feedback about the actions it actually tried.
Delayed reward (credit assignment). Many of the rewards that matter come late. In chess you may not learn whether a move was good until fifteen moves later; in language modeling, whether a sentence “worked” might only be clear after a paragraph. The agent has to figure out which of its earlier actions deserve credit for a later outcome. This is the credit-assignment problem, and a surprising amount of RL machinery exists to handle it.
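One way to see the problem is to compute, for each step of an episode, the total reward that arrived from that step onward. The sketch below assumes a plain undiscounted sum and a made-up reward sequence; `rewards_to_go` is a hypothetical helper name, not something defined elsewhere in the track.

```python
def rewards_to_go(rewards):
    """For each step, sum the reward at that step and every later step."""
    totals = []
    running = 0.0
    for reward in reversed(rewards):  # walk backward so each sum is built once
        running += reward
        totals.append(running)
    totals.reverse()
    return totals


# Made-up episode: four moves with no feedback, then a win worth +1.
episode_rewards = [0.0, 0.0, 0.0, 0.0, 1.0]
episode_totals = rewards_to_go(episode_rewards)
print(episode_totals)
```

Every step ends up with the same total, 1.0. That is exactly the difficulty: the final reward alone does not say which of the earlier actions earned it.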
The data distribution depends on the policy. In supervised learning the dataset is fixed: train on it, test on more like it. In RL the data the agent sees, the states it visits, depends entirely on the actions it has been taking. A different policy walks a different path through the environment and sees different states. So as the policy changes during learning, the distribution of the data the agent is learning from also changes, breaking the static-distribution assumption supervised learning relies on. Algorithms in this track have to cope with that drift.
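A small sketch makes the dependence visible. It assumes a hypothetical five-cell corridor with the agent starting in the middle, and two fixed policies invented for the example; the only thing it measures is which states each policy ends up visiting.

```python
from collections import Counter


def visited_states(policy, start=2, length=5, steps=6):
    """Count how often each cell is visited when following a policy."""
    state = start
    visits = Counter([state])
    for _ in range(steps):
        # Move by the policy's action (-1 or +1); walls clamp the position.
        state = min(max(state + policy(state), 0), length - 1)
        visits[state] += 1
    return visits


def go_left(state):
    return -1


def go_right(state):
    return +1


left_visits = visited_states(go_left)
right_visits = visited_states(go_right)
print(sorted(left_visits.items()))
print(sorted(right_visits.items()))
```

Under these assumptions the two policies share only the starting cell: one collects its data from cells 0 to 2, the other from cells 2 to 4. An agent whose policy shifts from one to the other during learning is, in effect, training on a different dataset.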
These are not minor inconveniences. They are the reason RL needs its own framework, the reason Bellman equations show up in the next lesson rather than gradient descent on a labeled loss, and the reason the simplest correct-looking ideas in RL often diverge in practice.
Exploration versus exploitation: the through-line
Here is the bet at the heart of every RL algorithm. An agent does not know how the world works, only what it just tried and what it got back. So it has to do two things at once that pull against each other.
- It must exploit what it already believes is good: pick the action that, on the evidence so far, seems to pay best.
- It must explore what it has not tried enough to know: pick an action whose value is uncertain, because the action it has never taken might be far better than the one it keeps picking.
Pure exploitation locks in too early. Imagine an agent facing three slot machines and pulling each one once. Machine 1 paid 0.6 on its single try, machine 2 paid 0.4, and machine 3 paid 0.7. A pure exploiter pulls machine 3 next, because it has the best score so far, and keeps pulling it for as long as that score stays on top. But “so far” is one pull each; the true averages might be entirely different. If machine 1’s true average is 0.9 and the agent never pulls it again, it never learns that and never collects the higher payout.
Pure exploration is also bad. An agent that picks completely at random gets diverse experience but never uses what it learns; over time its average reward stays at the average of the machines, ignoring everything the data taught it about which was best.
The right strategy is a mix, and the mix has to be precise. Every algorithm in this track (value iteration, Q-learning, policy gradient) resolves the tension in its own way, and a large fraction of the field’s research is about doing it well. The single most important thing to carry out of this lesson is the recognition that exploration versus exploitation is not a side issue; it is the central tension RL exists to navigate.
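The three-machine story can be played out in a few lines. This is a sketch under loud assumptions: the lesson only gives machine 1's true average (0.9), so the true averages for machines 2 and 3 are invented; payouts after the first pull are noise-free to keep the run deterministic; and "explore on every k-th pull, choosing the least-tried machine" is a deliberately crude stand-in for the principled methods later in the track.

```python
def simulate(true_means, first_payouts, steps, explore_every=None):
    """Play a slot-machine agent that tracks a running average per machine.

    explore_every=None -> pure exploitation (always the best estimate).
    explore_every=k    -> on every k-th pull, try the least-tried machine.
    explore_every=1    -> pure exploration (never uses the estimates).
    """
    counts = [1] * len(true_means)   # each machine has been pulled once
    estimates = list(first_payouts)  # average payout observed so far
    total = 0.0
    for step in range(1, steps + 1):
        if explore_every and step % explore_every == 0:
            arm = counts.index(min(counts))        # explore
        else:
            arm = estimates.index(max(estimates))  # exploit
        reward = true_means[arm]  # assumption: payout equals the true average
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total += reward
    return counts, total


TRUE_MEANS = [0.9, 0.4, 0.7]     # machine 1 from the lesson; the rest assumed
FIRST_PAYOUTS = [0.6, 0.4, 0.7]  # the single pulls described in the lesson

exploiter = simulate(TRUE_MEANS, FIRST_PAYOUTS, steps=100)
explorer = simulate(TRUE_MEANS, FIRST_PAYOUTS, steps=100, explore_every=1)
mixed = simulate(TRUE_MEANS, FIRST_PAYOUTS, steps=100, explore_every=10)

for name, (counts, total) in [("exploit", exploiter),
                              ("explore", explorer),
                              ("mixed", mixed)]:
    print(name, counts, round(total, 1))
```

Under these assumptions the pure exploiter collects about 70 over 100 pulls, the pure explorer about 66.9, and the mixed schedule about 83.7. The mixed schedule is still wasteful: once it has found machine 1, it goes on spending every tenth pull on the worst machine, which is the sense in which the mix has to be chosen carefully.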
Where RL shows up
The breadth of applications shows why the foundations are worth learning carefully.
- Games. AlphaGo and AlphaZero (board games), DQN on Atari (video games), modern systems on chess, shogi, StarCraft, poker. Games are an ideal RL testbed because the reward is clean and the environment is cheap.
- Robotics and control. Learning to walk, manipulate objects, fly. The reward combines task success with the cost of movement; the state is sensor readings; the action is a motor command. Real-world friction makes this much harder than games, which is its own area of research.
- Recommendation and personalization. Long-running systems that learn what to show by interacting with users; the reward is engagement or another business metric.
- Resource allocation and scheduling. Routing traffic, scheduling jobs, balancing power loads, anything where you make sequential decisions under uncertainty with a measurable outcome.
- Language models, via RLHF. Modern large language models are pretrained on text and then fine-tuned with reinforcement learning from human feedback (RLHF): humans rate responses, a reward model learns to predict those ratings, and the language model is updated to maximize predicted reward. The alignment-side treatment of RLHF lives in Track 5’s RLHF-and-DPO lesson; this track teaches the RL mechanics underneath (reward, expected return, policy gradient), which lesson 10 will tie back to RLHF explicitly.
The thread tying these together is the same setup: an agent making decisions over time in an environment that gives back rewards. Once you have the framework, you have a vocabulary for thinking about all of them.
Common pitfalls
- Treating RL as supervised learning with hidden labels. It is not. There is no “right answer” per state, only feedback on the actions tried. The algorithms that work assume this from the start.
- Thinking RL is only for games. Games are the cleanest demo. The framework is general; recommendation, robotics, control, and language alignment are all real applications.
- Mistaking the reward for an objective property of the world. Reward is a signal you design to express what you want. Bad reward design (for example, rewarding a robot for “moving fast” without penalizing crashes) produces agents that game the reward. Reward shaping is a serious engineering concern; the track will return to it.
- Equating exploration with randomness. Random action is one form of exploration, but the principled methods (covered later in the track) target uncertainty, picking actions whose values you most need to learn, not actions that are merely unpredictable.
- Underestimating the data-distribution shift. Because the data depends on the policy, supervised-learning intuitions about stable training datasets do not transfer. Algorithms in the track explicitly handle this.
What you should remember
- Reinforcement learning is a third paradigm, sitting beside supervised and unsupervised learning, the one where an agent learns from interaction with consequences (no labels, only reward).
- The whole field is built on the agent-environment-reward loop: at each step the agent observes a state, takes an action, and receives a reward and a next state, choosing actions according to a policy to maximize total reward over time.
- RL is genuinely harder than supervised learning because there is no oracle action, rewards can be delayed (the credit-assignment problem), and the data distribution depends on the agent’s policy (it shifts as the agent learns).
- The central tension every RL method addresses is exploration vs exploitation: exploit what looks good now, explore what you have not tried enough to know, in some principled mix. This is the through-line of the track.
- RL underlies real systems from board games and Atari (AlphaGo, DQN) to robotics to recommendation to RLHF behind modern LLMs (lesson 10 will close that loop). The foundations here are what all of them assume.