Practice: Reasoning models and the road ahead

Self-check

Seven short questions. Answer each one yourself before reading the answer beneath it.

1. What do reasoning models change about the output, and why does it help?

Show answer

Instead of jumping straight to an answer, a reasoning model first generates an explicit chain of thinking, then the answer, kept structurally separate. It helps because generating the intermediate steps explicitly improves the answer on problems that need several steps (multi-step math, logic), where a one-shot plausible answer is often wrong.
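To make "structurally separate" concrete, here is a minimal sketch of splitting a raw completion into its reasoning and its answer. It assumes the reasoning is wrapped in <think>...</think> tags, the convention DeepSeek R1-style models use; other models may use different delimiters. The helper name split_reasoning and the example string are hypothetical, made up for illustration.

```python
import re

# Assumption: reasoning is wrapped in <think>...</think> tags.
# Other model families may delimit their reasoning differently.
THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(raw_output: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a raw completion string."""
    match = THINK_PATTERN.search(raw_output)
    if match is None:
        # No thinking block found: treat the whole output as the answer.
        return "", raw_output.strip()
    reasoning = match.group(1).strip()
    # Everything after the closing tag is the user-facing answer.
    answer = raw_output[match.end():].strip()
    return reasoning, answer


# Hypothetical completion, written by hand for this example.
example = "<think>17 * 3 = 51, then 51 + 9 = 60.</think>The result is 60."
reasoning, answer = split_reasoning(example)
print("Reasoning:", reasoning)
print("Answer:", answer)
```

An application would typically show only the answer part to the user and keep the reasoning for logging or inspection.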

2. How does training a reasoning model with RL differ from supervised fine-tuning?

Show answer

SFT teaches by imitation: show the model good responses and it learns to copy the pattern. RL (reinforcement learning) instead rewards the model when its generated reasoning leads to a correct result, so over many attempts it learns to produce step-by-step thinking that reaches the right answer. Imitation says “respond like these examples”; RL says “reason in whatever way reaches the correct result.”
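The "reward the correct result" idea can be sketched as a plain Python reward function. This is an illustration under stated assumptions, not the reward used by any particular model: the function name correctness_reward, the "Answer:" marker, and the sample completions are all hypothetical. Notice that the function never looks at how the model reasoned, only at whether the final answer matches.

```python
def correctness_reward(completions, answers, **kwargs):
    """Return 1.0 for each completion whose final answer matches, else 0.0."""
    rewards = []
    for completion, expected in zip(completions, answers):
        # Hypothetical convention: the final answer is whatever follows
        # the last "Answer:" marker in the completion.
        final = completion.rsplit("Answer:", 1)[-1].strip()
        rewards.append(1.0 if final == expected else 0.0)
    return rewards


# Two hand-written attempts at the same question ("What is 2 + 2?").
completions = [
    "2 + 2 means adding two twice, which gives 4. Answer: 4",
    "Two and two look like 22, minus 17. Answer: 5",
]
answers = ["4", "4"]
rewards = correctness_reward(completions, answers)
print(rewards)
```

SFT would instead need a reference response to imitate for every prompt; here only a checkable final answer is required.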

3. What are DeepSeek R1, Open R1, and GRPO?

Show answer

DeepSeek R1 is the model that demonstrated reasoning-via-RL working at scale and made the technique widely known. Open R1 is a Hugging Face community project that reproduces that approach in the open. GRPO (Group Relative Policy Optimization) is the RL training method used, available in TRL, the same library as the SFTTrainer from lesson 10.
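The "group relative" part of GRPO can be sketched in a few lines: sample several completions for the same prompt, score each one, and judge each score against the group's average rather than against a separate value model. The sketch below shows only that normalization step in plain Python, not the full training method and not TRL's implementation. The function name, the eps value, and the choice of population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions for a single prompt."""
    mu = mean(rewards)
    # Assumption: population standard deviation; real implementations
    # may use the sample standard deviation instead.
    sigma = pstdev(rewards)
    # eps avoids division by zero when every completion scores the same.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions sampled from the same prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Completions that beat the group average get a positive advantage and are reinforced; those below it get a negative one. If every completion in the group scores the same, all advantages are zero and that group contributes no learning signal.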

4. How does the training pipeline extend once you add reasoning?

Show answer

Pretrain (learn language), then SFT (learn to follow instructions), then RL (learn to reason). The reasoning stage applies reinforcement learning on top of an instruction-tuned model, and it is reachable through the same ecosystem (TRL) you already used for SFT.

5. What does Open R1 illustrate about the open ecosystem?

Show answer

That when a capability appears first in a closed frontier model, the open-source community often reproduces it in the open with public code, weights, and data. The Hugging Face ecosystem is the open counterweight to closed models, and reasoning is just the newest capability being brought into it.

6. Why is the applied loop described as “model-agnostic”?

Show answer

Because the same loop (name the task and model shape, pick or fine-tune a model, curate the data, debug, and ship) works regardless of which model you use. A 2018 BERT classifier and a 2025 reasoning model are handled with the same method; only the specific model and technique change.

7. What is the “most durable lesson” of the track, and why?

Show answer

That the working method outlasts the frontier. Specific models and techniques go stale within months, but choosing the right model shape, using the ecosystem instead of reinventing it, curating data, evaluating honestly on held-out data, debugging calmly, and shipping do not. Those habits are what let you keep up with whatever comes next.

Try it yourself: synthesize the track

About 10 minutes, no code. This capstone exercise checks that the whole arc connects.

Part A: place each tool in the pipeline. For each ecosystem piece, name what stage or job it serves.

a. transformers (AutoModel / pipeline)
b. Trainer / TRL (SFTTrainer, GRPO)
c. datasets + Argilla
d. PEFT / LoRA
e. Gradio + Spaces

What you’ll get

a. transformers: run and load models (inference, the base of everything).
b. Trainer / TRL: train models. The Trainer for task fine-tuning, SFTTrainer for instruction tuning, GRPO for reasoning (RL).
c. datasets + Argilla: feed and curate the data. Mechanical wrangling (datasets) and human-in-the-loop curation (Argilla).
d. PEFT / LoRA: make training large models affordable (train a few added parameters).
e. Gradio + Spaces: ship. Wrap a model in a demo and host it.

If you can place all five, you have the whole ecosystem mapped.

Part B: walk a problem through the loop. A startup wants a support assistant fluent in their product’s terminology, deployable for users to try. Sketch the path using this track’s tools (you do not need code, just the steps and the tools).

What a good answer looks like

Try prompting an instruction-tuned model first (lesson 2/10). If that is not enough, curate a high-quality dataset of their support conversations and terminology (datasets + Argilla, lessons 5/11), then fine-tune an open model on it with SFTTrainer, using LoRA to keep it affordable (lesson 10). Evaluate on held-out data (lessons 3/7) and debug the pipeline when it breaks (lesson 8). Finally, ship it as a Gradio demo on a Space for users to try (lesson 9) and push the model to the Hub (lesson 4). That path uses essentially the whole track, which is the point.

Part C (reasoning). A new model architecture trends next year and the specific APIs in this track are renamed. How much of what you learned still applies?

What you should notice

Almost all of it. The specific class names and arguments may change, but the method does not: name the task and the model shape, use the ecosystem rather than reinventing it, curate your data, evaluate honestly, debug calmly, and ship. New techniques (like reasoning today) slot into that same loop. The track taught a durable working method with transformers as the worked example, so a new frontier is something you pick up, not something that forces you to start over.

Flashcards

Nine cards.

Q. What do reasoning models add over ordinary LLMs?

They generate an explicit chain of thinking before the answer, kept structurally separate from it. The steps improve the answer on multi-step problems (math, logic), where a one-shot answer is often wrong.

Q. How does RL training differ from SFT?

SFT teaches by imitation (copy good responses). RL rewards the model when its reasoning reaches a correct result, so it learns to think its way there rather than copy. Imitation: ‘respond like these’; RL: ‘reason to the right answer’.

Q. What are DeepSeek R1, Open R1, and GRPO?

DeepSeek R1 demonstrated reasoning-via-RL at scale. Open R1 is the Hugging Face community open reproduction. GRPO is the RL training method, available in TRL (the same library as SFTTrainer).

Q. How does the training pipeline extend with reasoning?

Pretrain (learn language), then SFT (follow instructions), then RL (learn to reason). The reasoning stage runs on top of an instruction-tuned model, via the same TRL ecosystem used for SFT.

Q. What does Open R1 show about the open ecosystem?

When a capability appears first in a closed frontier model, the open community often reproduces it in the open (public code, weights, data). The HF ecosystem is the open counterweight; reasoning is the newest capability entering it.

Q. Why is the applied loop model-agnostic?

The same loop (name the task and shape, pick or fine-tune a model, curate data, debug, ship) works on a 2018 BERT classifier and a 2025 reasoning model. Only the specific model and technique change.

Q. What is the track's most durable lesson?

The working method outlasts the frontier. Specific models and techniques go stale; choosing the right shape, using the ecosystem, curating data, evaluating honestly, debugging, and shipping do not.

Q. Name the ecosystem pieces this track covered.

Hub (models/datasets), transformers (run), datasets + tokenizers (feed), Trainer/TRL (train), PEFT/LoRA (affordably), Gradio/Spaces (ship), Argilla (curate data).

Q. What is the one-line through-line of the whole track?

Tokens in, tokens out, attention in the middle (lesson 1) still holds at the reasoning frontier; you can now use, adapt, curate for, ship, and reason about the whole open ecosystem.