Practice: What's next

Self-check

Seven short questions. Answer each before opening the collapsible.

1. Does longer context eliminate the need for retrieval? Why or why not?

Show answer

No. Long context gives retrieval more room to work with and enables some moves that were not feasible before (longer few-shot, multi-document analysis, code-base understanding), but the lesson 2 productive limits still push back: input tokens still cost real money per call, latency rises with input length (time to first token, TTFT, scales with prefill), and the model’s attention to specific facts in a huge context is uneven. Selective retrieval still wins on cost, latency, and often quality at scale. “Stuff the prompt” is a temptation, not a strategy.
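The cost half of that argument is plain arithmetic. The sketch below compares the input cost of one call with the whole corpus pasted in against one call with a few retrieved chunks. The price and both token counts are hypothetical, chosen only to make the numbers easy to follow; they are not any provider's real rates.

```python
def input_cost_usd(input_tokens: int, price_per_million_usd: float) -> float:
    """Cost of the input side of one call, at a flat per-token price."""
    return input_tokens / 1_000_000 * price_per_million_usd


# Hypothetical numbers (assumptions, not real provider pricing).
PRICE_PER_MILLION_INPUT_USD = 3.0
STUFFED_PROMPT_TOKENS = 400_000    # whole corpus pasted into the prompt
RETRIEVED_PROMPT_TOKENS = 4_000    # a handful of retrieved chunks

stuffed = input_cost_usd(STUFFED_PROMPT_TOKENS, PRICE_PER_MILLION_INPUT_USD)
retrieved = input_cost_usd(RETRIEVED_PROMPT_TOKENS, PRICE_PER_MILLION_INPUT_USD)
ratio = stuffed / retrieved

print(f"stuffed:   ${stuffed:.4f} per call")
print(f"retrieved: ${retrieved:.4f} per call")
print(f"ratio:     {ratio:.0f}x")
```

Under these assumed numbers the stuffed prompt costs about 100 times more per call, before counting the extra prefill latency. The ratio is just the ratio of input tokens, so it holds at any flat per-token price.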

2. What changes about an LLM application when the underlying model becomes multimodal?

Show answer

The application patterns from lessons 1-7 generalize: you still ship a minimal app, prompt-engineer, augment with retrieval (now over a multimodal corpus), and instrument with LLMOps. What expands is the set of input and output components, which now includes images, audio, and sometimes video. Entirely new application categories open up: vision input for documents/charts, voice in/out without separate STT/TTS, document understanding without text extraction, screenshot-based UI agents.

3. When is a smaller specialized model the right choice over a frontier model?

Show answer

For narrow sub-tasks within a larger application (classification, summarization, extraction, routing), where the small model can match or beat the frontier model at a fraction of the cost and latency. The right architecture is often a mix: small specialized models for the inner sub-tasks, one frontier call for the synthesis. For the user-facing main response, a frontier model is usually still right.

4. Walk the build-vs-buy spectrum.

Show answer

Hosted API (almost always the right starting point; lowest setup cost; strongest models; provider handles scale and updates). Fine-tune an open model (when prompting consistently fails on a specific recurring task at scale; lesson 9 is the next step). Train from scratch (rare; almost never right for an application team; usually a research or structural-data-advantage decision; Track 15 territory). The bar for “should we train our own?” keeps rising as hosted gets better and cheaper; the bar for “should we fine-tune a small model?” keeps falling as small open models and tooling improve.

5. What four things change for builders when an application becomes agentic?

Show answer

(1) Latency budgets stretch: multi-step agents take longer; lesson 6’s streaming progress and latency masking become more important. (2) Cost compounds: each agent step is a model call; a ten-step agent costs roughly ten times a single-turn response. (3) Evaluation gets harder: “did the agent accomplish the task?” is harder to score than “is this response correct?”; lesson 7 metrics shift toward task completion rate. (4) Failure modes are different: agents can loop, take wrong paths, or get confused several steps in; recoverable-failure UX and observability both get harder. Lesson 10 is the deep dive.
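A rough sketch of how the cost in point (2) compounds. It assumes each agent step re-sends the original prompt plus whatever earlier steps added (tool results, intermediate outputs), and it ignores prompt caching. The function name and all token counts are hypothetical.

```python
def agent_input_tokens(steps: int, prompt_tokens: int,
                       tokens_added_per_step: int) -> int:
    """Total input tokens over an agent run in which every step re-sends
    the original prompt plus the history accumulated by earlier steps."""
    total = 0
    for step in range(steps):
        total += prompt_tokens + step * tokens_added_per_step
    return total


# Hypothetical sizes: a 2,000-token prompt, 500 tokens of history per step.
single_turn = agent_input_tokens(1, 2_000, 500)
ten_steps_no_history = agent_input_tokens(10, 2_000, 0)
ten_steps_with_history = agent_input_tokens(10, 2_000, 500)

print(ten_steps_no_history / single_turn)    # ten calls, same prompt each time
print(ten_steps_with_history / single_turn)  # history makes later steps longer
```

With no carried history, ten steps cost ten times the single turn on the input side. Once history accumulates, the multiple grows past ten, which is why "roughly ten times" is better read as a floor than a ceiling for agents that carry context forward.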

6. When should you reach for a reasoning model, and when should you not?

Show answer

Use a reasoning model when the task actually needs reasoning (multi-step math, code with constraints, logic puzzles, anything where intermediate steps matter), because the cost is justified by the quality gain. Do not use one “to be safe” on every task; that pays the higher per-task cost (reasoning models often generate 5-20x the visible response in thinking tokens) on the many tasks that did not need it. Reasoning models are a deliberate choice, not a default.
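To put a number on that multiplier, here is a minimal sketch. It assumes thinking tokens are billed at the same rate as visible output tokens, which you should check against your provider's pricing. The price and the response length are hypothetical.

```python
def output_cost_usd(visible_tokens: int, thinking_multiplier: int,
                    price_per_million_usd: float) -> float:
    """Output-side cost of one answer. Assumes thinking tokens are billed
    like visible output tokens (an assumption, not a universal rule)."""
    thinking_tokens = visible_tokens * thinking_multiplier
    total_tokens = visible_tokens + thinking_tokens
    return total_tokens / 1_000_000 * price_per_million_usd


PRICE_PER_MILLION_OUTPUT_USD = 15.0  # hypothetical price
VISIBLE_TOKENS = 300                 # hypothetical answer length

no_thinking = output_cost_usd(VISIBLE_TOKENS, 0, PRICE_PER_MILLION_OUTPUT_USD)
low = output_cost_usd(VISIBLE_TOKENS, 5, PRICE_PER_MILLION_OUTPUT_USD)
high = output_cost_usd(VISIBLE_TOKENS, 20, PRICE_PER_MILLION_OUTPUT_USD)

print(f"no thinking:  ${no_thinking:.4f}")
print(f"5x thinking:  ${low:.4f}")
print(f"20x thinking: ${high:.4f}")
```

Under these assumptions the same 300-token answer costs 6 to 21 times more on the output side. That is fine for the tasks that need it and wasteful for the ones that do not.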

7. State the builder’s instinct for reading each new capability.

Show answer

Ask what each change does to the three productive limits from lesson 2 (context, cost, latency), where it fits on the build-vs-buy spectrum, and which of the patterns from lessons 1-7 generalize vs need new techniques. That discipline outlasts any specific model or technique: the field’s specific capabilities will keep moving, but the way you read them shouldn’t have to.

Try it yourself: read a release

About 10 minutes, no code. Apply the builder’s-eye discipline.

Part A: a frontier model releases. A provider announces a new flagship model with 4x the context length of its predecessor, full multimodal input (text + image + audio), and a “reasoning” mode that produces explicit thinking. List at least four implications for an existing LLM application that currently runs on the predecessor.

What you’ll get

(1) Long-context temptation: the team will be tempted to drop retrieval and “stuff the prompt.” Lesson 2’s productive limits still apply; per-request cost rises with longer inputs, latency rises with prefill, and attention to specific facts in a huge context is uneven. The right move is usually still targeted retrieval, with the extra context budget going to longer few-shot or richer system prompts. (2) Multimodal input opens new use cases: if any user request currently produces “I cannot process this image” or “send me a text description of the document,” those failures may now have a direct fix. Worth surveying user feedback (lesson 7’s logged signal) for which multimodal gaps to close first. (3) Reasoning mode is a deliberate per-task choice, not a default: reserve it for tasks that actually need multi-step reasoning, since it produces many thinking tokens per answer. The A/B testing discipline from lesson 7 is exactly how you’d compare reasoning-on vs reasoning-off on real traffic. (4) Adopt under regression testing: the lesson 7 suite, run on the new model first, confirms whether quality moved before switching. A model upgrade is exactly what regression testing makes safe.
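Point (4) can be sketched as a simple gate. The lesson 7 regression suite is not shown in this section, so everything below is a hypothetical stand-in: the suite is a list of (prompt, check) pairs, and the "models" are plain functions that map a prompt to a string.

```python
from typing import Callable, List, Tuple

Model = Callable[[str], str]
Case = Tuple[str, Callable[[str], bool]]  # (prompt, check on the output)


def pass_rate(model: Model, suite: List[Case]) -> float:
    """Fraction of suite cases whose check accepts the model's output."""
    passed = sum(1 for prompt, check in suite if check(model(prompt)))
    return passed / len(suite)


def safe_to_switch(old: Model, new: Model, suite: List[Case],
                   tolerance: float = 0.0) -> bool:
    """True if the new model's pass rate is within `tolerance` of the old one's."""
    return pass_rate(new, suite) >= pass_rate(old, suite) - tolerance


# Toy suite and toy models, for illustration only.
suite: List[Case] = [
    ("2+2", lambda out: "4" in out),
    ("capital of France", lambda out: "Paris" in out),
]


def old_model(prompt: str) -> str:
    return {"2+2": "4", "capital of France": "Paris"}.get(prompt, "")


def new_model(prompt: str) -> str:
    return {"2+2": "4", "capital of France": "Lyon"}.get(prompt, "")


print(pass_rate(old_model, suite), pass_rate(new_model, suite))
print(safe_to_switch(old_model, new_model, suite))
```

Here the toy new model regresses on one case, so the gate says not to switch yet. A real suite would be larger and would usually score with graded checks rather than substring matches.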

Part B (reasoning). Your team currently uses a frontier model for every step of an internal RAG pipeline (router, retriever-rewriter, answer generator). Walk through how the “smaller specialized models” direction might change the architecture.

What you should notice

The router and retriever-rewriter are inner sub-tasks with narrow, predictable inputs and outputs; they are excellent candidates for smaller specialized models (cheaper, faster, often equal quality after a small fine-tune). The answer-generator stays on the frontier model because the user-facing response benefits most from the frontier capability. Result: a mixed architecture where most of the cost-per-request comes down (small inner models) and the user-facing answer quality stays the same (frontier outer model). The A/B testing discipline from lesson 7 is how you prove the mixed architecture matches quality before fully switching.
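The cost side of that walk-through, as a back-of-the-envelope sketch. The per-call costs are hypothetical, and assigning one flat cost per tier is a simplification, since in practice the answer generator usually handles more tokens than the router.

```python
from typing import Dict

# Hypothetical per-call costs for each model tier (assumptions).
COST_PER_CALL_USD: Dict[str, float] = {"frontier": 0.010, "small": 0.001}


def pipeline_cost(assignment: Dict[str, str]) -> float:
    """Cost of one request, given which model tier serves each pipeline step."""
    return sum(COST_PER_CALL_USD[tier] for tier in assignment.values())


all_frontier = pipeline_cost({
    "router": "frontier",
    "retriever-rewriter": "frontier",
    "answer generator": "frontier",
})
mixed = pipeline_cost({
    "router": "small",
    "retriever-rewriter": "small",
    "answer generator": "frontier",  # user-facing step stays on the frontier model
})
saving = 1 - mixed / all_frontier

print(f"all frontier: ${all_frontier:.3f} per request")
print(f"mixed:        ${mixed:.3f} per request")
print(f"saving:       {saving:.0%}")
```

With these assumed costs the mixed pipeline is 60% cheaper per request. Whether quality actually holds is an empirical question, which is what the lesson 7 A/B test is for.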

Part C (reasoning). Why is “the builder’s instinct outlasts specific models and techniques” the right framing for the close of this lesson?

What you should notice

Particular model names and capabilities go stale within months, the same point Track 14 lesson 1 and Track 15 lesson 14 made about transformers and reasoning models. What does not go stale: the three productive limits from lesson 2 (anything new is read against context, cost, latency), the build-vs-buy spectrum (every capability lands somewhere on it), the LLMOps discipline of lesson 7 (regression-test before adopting; instrument before shipping), and the application patterns of lessons 1-6 (which generalize even as their inputs and outputs do). A reader who internalizes the discipline reads every new release with the same eye; a reader who learned only the current models has to re-learn each one.

Flashcards

Nine cards.

Q. Does longer context eliminate the need for retrieval?

No. Long context still costs more per call (input tokens), raises latency (TTFT scales with prefill), and attention to specific facts in a huge context is uneven. Selective retrieval still wins at scale; “stuff the prompt” is a temptation, not a strategy.

Q. What does multimodality change for builders?

The lesson 1-7 patterns generalize with expanded input/output components (text + images + audio + video). Opens new categories: vision document parsing, voice in/out without STT/TTS, screenshot agents, multimodal RAG.

Q. When are smaller specialized models the right choice?

For narrow inner sub-tasks (classification, summarization, extraction, routing) where they match frontier quality at a fraction of the cost/latency. Often as part of a mix: small inner sub-tasks + one frontier call for the user-facing synthesis.

Q. The build-vs-buy spectrum?

Hosted API (almost always start here) -> fine-tune an open model (when prompting consistently fails on a specific task; lesson 9) -> train from scratch (rare; usually research or structural data advantage; Track 15). Bar for “train your own” keeps rising; bar for “fine-tune small” keeps falling.

Q. What four things change when an application becomes agentic?

(1) Latency budgets stretch (multi-step takes longer). (2) Cost compounds (each step = a model call). (3) Evaluation is harder (task completion vs answer quality). (4) New failure modes (loops, wrong paths, confusion). Lesson 10 is the deep dive.

Q. When to reach for a reasoning model?

When the task actually needs reasoning (multi-step math, code-with-constraints, logic). Don’t default to one “to be safe”; reasoning models generate 5-20x the visible response in thinking tokens, paying higher cost on tasks that didn’t need it.

Q. The builder’s instinct for reading a new capability?

What does it change about the three productive limits (context/cost/latency)? Where does it fit on the build-vs-buy spectrum? Which lesson 1-7 patterns generalize, and which need new techniques? The discipline outlasts specific models.

Q. Adopting a model upgrade safely?

Run the lesson-7 regression suite on the new model FIRST. Compare against the old. Use A/B testing on real traffic if quality looks comparable. Don’t switch silently; model upgrades are exactly what regression testing makes safe.

Q. The “mix” architecture this lesson hints at?

Small specialized models for inner sub-tasks (router, retriever-rewriter, classifier, extractor) + one frontier-model call for the user-facing synthesis. Lowers per-request cost without losing the user-facing quality.