Lesson: Reasoning models and the road ahead

You started this track not knowing what a transformer was. You can now run one, fine-tune it, share it, wrangle and curate its data, ship it as a demo, and instruction-tune it into an assistant. This final lesson looks at the current frontier, reasoning models, explains what they add and how they are trained at a working level, and then steps back to place the whole Hugging Face ecosystem, and your new skills, in the landscape.

A notebook is optional here; this lesson is more about the shape of the frontier than a hands-on build.

An ordinary language model is a strong pattern matcher, but it struggles with problems that need several steps: multi-step math, logic puzzles, anything where the answer depends on a chain of intermediate results. Asked directly, it tends to produce a plausible-looking answer in one shot, and a plausible-looking answer to a multi-step problem is often wrong.

Reasoning models change the shape of the output. Instead of jumping straight to an answer, the model first generates an explicit chain of thinking, then the answer, with the two kept structurally separate. Take a simple problem:

Problem: "I have 3 apples and 2 oranges. How many pieces of fruit in total?"
Thought: "I need to add the apples and the oranges: 3 + 2."
Answer: "5"

The model produces the thought and the answer in a structured format (a thinking section, then the final answer) so a program can pull them apart: show the user the answer, keep or inspect the reasoning. The point is not that the thinking is shown; it is that generating the steps explicitly improves the answer on problems that need them. Working through “3 + 2” before committing makes the model more likely to land on 5 than it would be if it guessed in one shot.
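To make "a program can pull them apart" concrete, here is a minimal sketch of that separation. It assumes the thinking section is wrapped in `<think>...</think>` tags with the final answer after the closing tag; that delimiter is an assumption for illustration, and the exact format varies by model, so check the model card before relying on it. The function name `split_reasoning` is hypothetical.

```python
import re

# Assumed format: reasoning inside <think>...</think>, final answer after it.
# Real models may use different delimiters; this is only an illustration.
THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(output: str) -> tuple[str, str]:
    """Return (thought, answer) from a raw model output string."""
    match = THINK_PATTERN.search(output)
    if match is None:
        # No thinking section found: treat the whole output as the answer.
        return "", output.strip()
    thought = match.group(1).strip()
    # Everything after the closing tag is treated as the final answer.
    answer = output[match.end():].strip()
    return thought, answer


raw = "<think>I need to add the apples and the oranges: 3 + 2.</think>5"
thought, answer = split_reasoning(raw)
print("Thought:", thought)
print("Answer:", answer)
```

An application would typically display only `answer` and log or inspect `thought` separately.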

Here is the new idea, and it connects directly to lesson 10. SFT teaches a model by imitation: show it good responses, it learns to copy the pattern. Reasoning training uses reinforcement learning (RL) instead: rather than copying example answers, the model is rewarded when its generated reasoning leads to a correct result, so over many attempts it learns to produce the kind of step-by-step thinking that gets there. Imitation teaches “respond like these examples”; RL teaches “reason in whatever way reaches the right answer.”
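The "rewarded when its generated reasoning leads to a correct result" idea can be sketched as a scoring function. This is a toy illustration, not the reward used by any particular model: it assumes the final answer follows a closing `</think>` tag and that correctness is an exact string match against a known reference. The names `extract_answer` and `correctness_reward` are hypothetical.

```python
def extract_answer(completion: str) -> str:
    """Return the text after the last </think> tag (assumed format)."""
    # If the tag is absent, rpartition leaves the whole string in the tail.
    _, _, tail = completion.rpartition("</think>")
    return tail.strip()


def correctness_reward(completion: str, reference: str) -> float:
    """Score 1.0 when the final answer matches the reference, else 0.0."""
    return 1.0 if extract_answer(completion) == reference.strip() else 0.0


# Three sampled attempts at the fruit problem; the reference answer is "5".
attempts = [
    "<think>Add the apples and oranges: 3 + 2 = 5.</think>5",
    "<think>Multiply the apples and oranges: 3 * 2 = 6.</think>6",
    "5",
]
rewards = [correctness_reward(a, "5") for a in attempts]
print(rewards)
```

Note that this reward never inspects the reasoning itself. It scores only the outcome, which is what lets the model find its own way of reasoning toward the right answer instead of copying a worked example.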

A few anchors for the landscape, kept factual:

  • DeepSeek R1 is the model that demonstrated this reasoning-via-RL approach working at scale and made the technique widely known.
  • Open R1 is a Hugging Face community project that reproduces that approach in the open, so the method is not locked inside one lab.
  • GRPO (Group Relative Policy Optimization) is the RL training method used to train DeepSeek R1, and it is available in TRL, the same library whose SFT trainer you met in lesson 10. So the pipeline you already know extends cleanly: pretrain (learn language), SFT (learn to follow instructions), and now RL (learn to reason), all reachable through the same ecosystem.
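The "group relative" part of GRPO can be sketched in a few lines. The model samples several answers to the same prompt, each gets a reward, and each answer is then scored against the rest of its group: above-average answers get a positive advantage, below-average ones a negative advantage. The sketch below shows only that normalization step, in plain Python; it is not TRL's implementation and it leaves out the policy update itself. The function name, the `eps` value, and the use of the population standard deviation are illustrative assumptions, and real implementations differ in such details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against its own group's mean and spread."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps avoids division by zero when every answer earns the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt: two correct (1.0), two wrong (0.0).
group_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(group_rewards)
print([round(a, 3) for a in advantages])
```

One consequence is visible in the edge case: if every answer in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal for the step.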

This lesson stays at the level of what these methods do and how they fit together. The deeper questions about reasoning models, what they mean for the trajectory of AI, are beyond this track’s scope; the goal here is a working map of the frontier, not a position on where it leads.

It is worth noticing what Open R1 represents. When a capability first appears in a closed model, or in a release that leaves out its training code and data, the open-source community often works to reproduce it in the open, with public code, public weights, and public datasets. That is the through-line of this entire track: the Hugging Face ecosystem is the open counterweight to closed models. You have used every piece of it: the Hub for models and datasets, transformers to run them, datasets and tokenizers to feed them, the Trainer and TRL to train them, PEFT/LoRA to do it affordably, Gradio and Spaces to ship them, and Argilla to curate the data behind them. Reasoning models are simply the newest capability that ecosystem is bringing into the open.

Step back and look at what you can do. You can take a problem, name the task and the model shape it needs (lesson 7), pick or fine-tune a model for it (lessons 2, 3, 10), feed it well-curated data (lessons 5, 11), debug it when it breaks (lesson 8), and put it in front of people (lessons 4, 9). That is the full applied loop, and it is the same loop whether the model is a 2018 BERT classifier or a 2025 reasoning model.

Which points at the most durable lesson of all. The specific frontier will keep moving: reasoning models are the headline now, something else will be next, and particular model names go stale within months (as lesson 1 warned). But the method does not move. Choose the right model shape, use the ecosystem instead of reinventing it, curate your data, evaluate honestly on held-out data, debug calmly, and ship something people can try. Those habits outlast every model on today’s leaderboard, and they are what this track was really teaching, with transformers as the worked example.

Reasoning models are where capability is being pushed right now, so understanding them, even at this level, is what keeps you current rather than describing the field as it was a year ago. But the larger point is the one the whole track has been building toward: AI is not a fixed thing you either know or do not; it is a fast-moving ecosystem, and the way to stay useful in it is to hold the working method while the specifics churn. You now have that method, and you have the open tools to apply it, which means the next capability, whatever it is, is something you can pick up and use rather than watch from outside. That is the real graduation from this track: not that you have memorized today’s frontier, but that you can keep up with tomorrow’s.

  • Reasoning models generate explicit step-by-step thinking before the answer, structured so the thought and the answer are separable. The steps are not decoration; they improve the answer on multi-step problems.
  • They are trained with reinforcement learning, not just imitation. SFT copies good responses (lesson 10); RL rewards reasoning that reaches correct results, so the model learns to think its way there.
  • The landscape anchors: DeepSeek R1 showed reasoning-via-RL at scale, Open R1 reproduces it openly, and GRPO (the RL method) lives in TRL, the same library as the SFT trainer. The pipeline extends: pretrain, SFT, then RL.
  • The open ecosystem is the through-line. You have used all of it: Hub, transformers, datasets, tokenizers, Trainer/TRL, PEFT, Gradio/Spaces, and Argilla. Open R1 is the open community bringing the newest capability into it.
  • The applied loop is model-agnostic: name the task and shape, pick or fine-tune a model, curate the data, debug, and ship. The same loop works on a 2018 classifier and a 2025 reasoning model.
  • The method outlasts the frontier. Specific models and techniques go stale; choosing the right shape, using the ecosystem, curating data, evaluating honestly, and shipping do not. That method is what the track was really teaching.

Tokens in, tokens out, attention in the middle: that was lesson 1, and it still holds for the reasoning models at today’s frontier. You started this track unable to run a model and end it able to use, adapt, curate for, ship, and reason about the whole ecosystem. The frontier will keep moving; you now have the method to move with it.