Cheatsheet: Learning by trial and reward

supervised learning: study a dataset of correct answers
reinforcement learning: NO answer key. act in an environment, get rewards,
learn which actions pay off (like learning to ride a bike)
agent observes STATE → takes an ACTION → environment returns REWARD + new state → repeat
goal: maximize TOTAL reward over time

The agent learns a policy: for any state, which action to take.
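The state → action → reward loop can be sketched in a few lines. This is a minimal illustration, not a standard API: the `Corridor` environment, the `always_right` policy, and the reward numbers (−1 per step, +10 at the goal) are all assumptions made up for the example.

```python
# Hypothetical toy environment: a corridor of 5 cells, goal at the right end.
class Corridor:
    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position  # the state is just the cell index

    def step(self, action):
        # action is -1 (left) or +1 (right); walls clamp the position
        self.position = min(max(self.position + action, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = 10 if done else -1  # assumed rewards: +10 at the goal, -1 otherwise
        return self.position, reward, done


def always_right(state):
    """A fixed policy: it maps every state to the same action."""
    return +1


def run_episode(env, policy, max_steps=100):
    state = env.reset()
    total_reward = 0
    for _ in range(max_steps):
        action = policy(state)                  # agent picks an action
        state, reward, done = env.step(action)  # environment returns reward + new state
        total_reward += reward                  # the total is what the agent tries to maximize
        if done:
            break
    return total_reward


print(run_episode(Corridor(), always_right))  # 7: three steps at -1, then +10
```

Here the policy is hand-written. In RL the policy is the part that gets learned from the rewards.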

Maze example: state = mouse’s position; actions = up/down/left/right; reward = −1 per step, +10 for cheese. Over many runs, cheese-reaching paths get reinforced until the mouse goes straight there. Nobody gave it the route.
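One way to make the maze concrete is tabular Q-learning, a common RL method. The cheatsheet does not say which algorithm the mouse uses, so treat this as one possible sketch. The 3×4 open grid, the start and cheese cells, and the settings `alpha`, `gamma`, `epsilon`, and `episodes` are assumptions. Only the rewards (−1 per step, +10 for cheese) come from the example above.

```python
import random

ROWS, COLS = 3, 4                      # assumed grid size, no inner walls
START, CHEESE = (0, 0), (2, 3)         # assumed start and goal cells
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def step(state, action):
    """Move the mouse; bumping into the outer wall leaves it in place."""
    dr, dc = ACTIONS[action]
    row = min(max(state[0] + dr, 0), ROWS - 1)
    col = min(max(state[1] + dc, 0), COLS - 1)
    next_state = (row, col)
    if next_state == CHEESE:
        return next_state, 10, True    # +10 for cheese, episode ends
    return next_state, -1, False       # -1 for every other step


def greedy_action(q, state):
    """The action with the highest learned value in this state."""
    return max(ACTIONS, key=lambda a: q[(state, a)])


def train(episodes=2000, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = {((r, c), a): 0.0 for r in range(ROWS) for c in range(COLS) for a in ACTIONS}
    for _ in range(episodes):
        state, done, steps = START, False, 0
        while not done and steps < 200:
            # Mostly follow the current best guess, sometimes try a random move.
            if rng.random() < epsilon:
                action = rng.choice(list(ACTIONS))
            else:
                action = greedy_action(q, state)
            next_state, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            # Nudge the estimate toward reward + discounted value of what follows.
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
            steps += 1
    return q


def greedy_path(q, max_steps=50):
    """Follow the learned policy from START without exploring."""
    state, path, done = START, [START], False
    while not done and len(path) <= max_steps:
        state, _, done = step(state, greedy_action(q, state))
        path.append(state)
    return path


q_table = train()
path = greedy_path(q_table)
print(len(path) - 1, "steps:", path)
```

The shortest possible route on this grid is five moves, which gives a target to compare the printed path against. No route is ever supplied; the table of values is built only from the rewards the mouse collects.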

               | Supervised                 | Reinforcement
Signal         | correct answer per example | a reward (how good the outcome was)
Tells you      | what you should have done  | only how good it was, not what to do
Learning style | match the answers          | explore, observe rewards, infer what pays off

RL learns from evaluation, not instruction.

  • Credit assignment: a reward at the end (win the game after 40 moves) must be traced back to the actions that earned it. Delayed, sparse rewards make this hard.
  • Explore vs exploit: stick with a known-good action (exploit) or try something new that might be better (explore)? The restaurant problem. Every agent must balance the two.
  • “Agent” here (RL): a decision-maker that learns from environment rewards.
  • “Agent” in Track 20: a language model wired to tools.
  • Same word, different paradigm. This track means the RL sense.
  • Where RL shines: games (a system beat the Go world champion) and clean simulations (balancing, walking, steering) where trials are cheap.
  • Where RL strains: it is sample-inefficient (can need millions of trials), brittle (policies are narrow), and hard to deploy in the messy real world. (More in the next lesson.)
  • Everyday tie-in: AI assistants are often refined using human feedback as the reward signal. That is RL’s loop applied to behavior.
  • “RL needs labeled data.” No. No answer key, just a reward signal.
  • “The reward says what to do.” No. It says how good the outcome was; the agent infers the rest.
  • “Credit assignment is a footnote.” No. Tracing a delayed reward to the right actions is central and hard.
  • “Game wins generalize everywhere.” No. Selective successes; real-world RL is sample-inefficient and brittle.
  • Agent / environment: the learner / the world it acts in.
  • State, action, reward: what it sees, what it does, the score it gets back.
  • Policy: the agent’s state-to-action strategy (what it is learning).
  • Credit assignment: figuring out which earlier actions earned a delayed reward.
  • Explore vs exploit: trying new actions vs repeating known-good ones.
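
The last two glossary terms can each be shown in a few lines. Both functions are illustrative sketches under assumptions, not the only approach: `discounted_returns` uses discounting (a factor `gamma`, not mentioned above) to spread a late reward back over earlier steps, and `epsilon_greedy` is one simple rule for balancing exploring and exploiting.

```python
import random


def discounted_returns(rewards, gamma=0.9):
    """Return-to-go for each step: G_t = r_t + gamma * G_{t+1}."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def epsilon_greedy(estimates, epsilon, rng):
    """Explore with probability epsilon, otherwise exploit the best estimate."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))  # explore: any option
    return max(range(len(estimates)), key=lambda i: estimates[i])  # exploit


# Sparse, delayed reward: nothing for three moves, then +10 at the end.
print(discounted_returns([0, 0, 0, 10], gamma=0.5))  # [1.25, 2.5, 5.0, 10.0]

# Restaurant problem: three restaurants with estimated ratings.
rng = random.Random(0)
print(epsilon_greedy([3.5, 4.2, 4.0], epsilon=0.0, rng=rng))  # 1, always the known favourite
```

Earlier moves receive a smaller share of the final reward, which is one crude answer to credit assignment. Raising `epsilon` above 0 makes the agent occasionally try a restaurant other than its current favourite.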

Supervised learning studies an answer key; reinforcement learning has none, and learns the way living things do, by acting and feeling the consequences.