Lesson: What's next, the LLM landscape in motion

You can now build, ship, and operate an LLM application. The field, of course, is not standing still. This lesson is a survey of the directions the LLM landscape is moving, lighter on mechanical depth than the rest of the track because the point is to map the territory, not to drill down. By the end you should be able to read a model release or a new platform announcement with a builder’s eye for “what does this change about an application I would ship?” and you should see the Phase 3 lessons that follow as concrete deep dives on three of these directions.

This lesson does not predict the future; it names the directions the field is already moving in. Specific model names go stale within months (Track 14 lesson 1 made the same point about transformer model versions); this lesson talks in families and trends.

Context lengths have grown from ~8K tokens (early frontier models) to ~128K (the next generation) to ~1M tokens (current frontier). Two practical shifts for an application builder:

  • The “stuff the prompt” temptation grows. With a million-token window, it is tempting to put the entire knowledge base in every call instead of building proper retrieval. The lesson-2 productive limits push back hard: input tokens still cost real money per call, latency rises with input length (TTFT scales with prefill), and the model’s attention to specific facts in a huge context is still uneven. Long context does not eliminate retrieval; it gives retrieval more room to work with and makes some moves (longer few-shot, multi-document analysis, code-base understanding) actually feasible.
  • Selective retrieval still wins at scale. As Track 14 lesson 11 and Track 15 lesson 12 noted on the data side, a smaller set of unique, clean data beats a larger set of duplicated data; the same principle applies here. The right move is usually still targeted retrieval (lesson 4), even with abundant context.
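
The input-cost side of this argument can be sketched with a few lines of arithmetic. This is a rough illustration only: the price, traffic volume, and token counts below are hypothetical assumptions chosen to make the ratio visible, not real provider figures, and the sketch ignores latency and prompt caching.

```python
def input_cost_usd(input_tokens: int, price_per_million_usd: float) -> float:
    """Input-side cost of one call at a flat per-token price."""
    return input_tokens / 1_000_000 * price_per_million_usd


# Illustrative assumptions, not real provider prices or traffic.
PRICE_PER_MILLION_USD = 3.0   # hypothetical input price per million tokens
CALLS_PER_DAY = 10_000        # hypothetical request volume

# "Stuff the prompt": most of the knowledge base goes into every call.
stuffed = input_cost_usd(800_000, PRICE_PER_MILLION_USD)
# Targeted retrieval: a few retrieved chunks plus the question.
retrieved = input_cost_usd(4_000, PRICE_PER_MILLION_USD)

daily_stuffed = stuffed * CALLS_PER_DAY
daily_retrieved = retrieved * CALLS_PER_DAY
ratio = stuffed / retrieved   # how many times more each stuffed call costs

print(f"per call: stuffed ${stuffed:.3f} vs retrieved ${retrieved:.3f}")
print(f"per day:  stuffed ${daily_stuffed:,.0f} vs retrieved ${daily_retrieved:,.0f}")
print(f"ratio:    about {ratio:.0f}x")
```

Under these assumed numbers the gap is two orders of magnitude per call, which is why abundant context changes what is feasible without changing what is economical.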

Modern frontier models are increasingly multimodal: they accept and produce text, images, audio, sometimes video. For application builders this opens entire categories of applications:

  • Vision input: parse documents, charts, photos, screenshots; build visual support assistants.
  • Audio input/output: voice assistants without a separate speech-to-text + text-to-speech stack.
  • Document understanding: PDFs as first-class inputs rather than text-extracted approximations.
  • UI automation and screenshot-based agents: a model can “see” what is on screen.

The application patterns from lessons 1-7 generalize: you still ship a minimal app, prompt-engineer, augment with retrieval over a multimodal corpus, and instrument with LLMOps, but the input and output components expand. Expect more of the next year’s applications to be multimodal by default.

Alongside frontier scaling, a parallel trend matters more for production economics: smaller models trained or distilled for specific capabilities. The textbook-style synthetic-data work (Track 15 lesson 12) plus distillation produces small models (a few billion parameters) that match or beat much larger general models on narrow tasks while costing a fraction per token to serve.

For an application builder this becomes a real choice:

  • For sub-tasks within a larger application (classification, summarization, extraction, routing), a small specialized model often beats a frontier model on cost and latency at equal quality.
  • For the user-facing main response, a frontier model is still usually right.
  • The right architecture is often a mix: small models for the inner sub-tasks; one frontier call for the synthesis.
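
A minimal sketch of that mixed architecture, assuming two tiers and a fixed set of narrow task kinds. The tier names, prices, and the `pick_tier` and `pipeline_cost_usd` helpers are hypothetical and exist only for illustration; a real router would also weigh measured quality and latency.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelTier:
    name: str
    price_per_million_usd: float  # hypothetical blended price per million tokens


# Hypothetical tiers; the prices are illustrative assumptions.
SMALL = ModelTier("small-specialized", 0.2)
FRONTIER = ModelTier("frontier", 10.0)

# Sub-task kinds the lesson names as good fits for a small model.
NARROW_TASKS = {"classification", "summarization", "extraction", "routing"}


def pick_tier(task_kind: str) -> ModelTier:
    """Send narrow sub-tasks to the small model, everything else to the frontier model."""
    return SMALL if task_kind in NARROW_TASKS else FRONTIER


def pipeline_cost_usd(steps: list[tuple[str, int]], force: Optional[ModelTier] = None) -> float:
    """Total cost of (task_kind, tokens) steps; `force` pins every step to one tier."""
    return sum(
        tokens / 1_000_000 * (force or pick_tier(kind)).price_per_million_usd
        for kind, tokens in steps
    )


steps = [("classification", 500), ("extraction", 2_000), ("synthesis", 3_000)]
mixed_cost = pipeline_cost_usd(steps)
all_frontier_cost = pipeline_cost_usd(steps, force=FRONTIER)
print(f"mixed: ${mixed_cost:.4f}  all-frontier: ${all_frontier_cost:.4f}")
```

The saving only counts if the small model holds quality on its sub-tasks, so the comparison belongs next to an evaluation, not in place of one.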

This connects to the build-vs-buy spectrum named next.

The honest map of “should I train my own model?” runs along a spectrum from “use a hosted API” to “fine-tune an open model” to “train from scratch”:

  • Hosted API (almost always the right starting point). Lowest setup cost; access to the strongest models; provider handles scale and updates. Most production LLM applications are this and stay this.
  • Fine-tune an open model (next step when prompting consistently fails on a specific recurring task at scale). Captures domain-specific behavior the prompt cannot reliably enforce; smaller and cheaper to serve than the frontier model it replaces; lesson 9 is the next step here.
  • Train from scratch (rare; Track 15’s territory). Almost never the right move for an application team; reasons are usually research or a structural data advantage that the hosted models cannot match.

The trend worth naming: the bar for “should we train our own?” keeps rising as hosted models get better and cheaper, while the bar for “should we fine-tune a small model?” keeps falling as small open models and tooling improve. Most applications now sit comfortably in “hosted API plus maybe fine-tune a small model for one or two narrow sub-tasks.”

Lesson 4’s tool-use loop (model decides, tool runs, model continues) is the seed of agents: chains of tool calls where the model plans, executes, observes, and revises across multiple steps. The application moves from “model answers the question” to “model accomplishes the task,” sometimes over minutes and dozens of tool calls.

What changes for builders:

  • Latency budgets stretch. Multi-step agents take longer than single-turn responses; UX (lesson 6’s streaming progress, latency masking) becomes more important.
  • Cost compounds. Each agent step is a model call; an agent that takes ten steps costs roughly ten times a single-turn response, and more if each step re-sends the accumulated context.
  • Evaluation gets harder. “Did the agent accomplish the task?” is harder to score than “is this response correct?” The LLMOps discipline of lesson 7 stretches: traces become more important; success metrics shift from “answer quality” to “task completion rate.”
  • Failure modes are different. Agents can loop, take wrong paths, or get confused several steps deep. Recoverable-failure UX (lesson 6) and observability (lesson 7) both get harder.
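
Two of these points, compounding cost and looping, can be sketched in a few lines. This is a toy illustration under stated assumptions: `agent_input_tokens` assumes every step re-sends the full accumulated history, and `run_agent` uses a hypothetical `decide` callback in place of a real model call, with a deliberately crude loop check (the same action twice in a row).

```python
from __future__ import annotations

from typing import Callable


def agent_input_tokens(steps: int, base_prompt: int, tokens_added_per_step: int) -> int:
    """Total input tokens over a run, assuming each step re-sends the whole history."""
    total = 0
    context = base_prompt
    for _ in range(steps):
        total += context                  # this step pays for everything so far
        context += tokens_added_per_step  # tool output and reply grow the history
    return total


def run_agent(decide: Callable[[list[str]], str], max_steps: int = 10) -> tuple[str, list[str]]:
    """Run a toy agent loop with a step budget and a naive repeated-action guard."""
    trace: list[str] = []
    for _ in range(max_steps):
        action = decide(trace)            # stand-in for a model call
        if action == "done":
            return "completed", trace
        if trace and action == trace[-1]:
            return "loop_detected", trace
        trace.append(action)
    return "step_budget_exhausted", trace


# Ten steps, 1,000-token starting prompt, 500 tokens added per step (all hypothetical).
total_tokens = agent_input_tokens(steps=10, base_prompt=1_000, tokens_added_per_step=500)
print(total_tokens)  # 32500, versus 1,000 for a single-turn call

status, trace = run_agent(lambda history: "search")  # an agent stuck repeating itself
print(status, trace)
```

The returned trace is the piece that feeds observability: a terminal status alone does not say where a run went wrong.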

Lesson 10 is the deep dive on agents. This survey points at why they matter.

A specific recent direction worth naming: reasoning models (DeepSeek R1, the Open R1 family, and others, covered fully in Track 15 lesson 14). The model produces explicit step-by-step thinking before its answer, trained via RL with verifiable rewards. Two effects for application builders:

  • They are markedly better on multi-step problems (math, code, logic puzzles, anything where intermediate reasoning matters).
  • They have a very different cost profile. Reasoning models generate many “thinking” tokens per answer, often 5-20x the visible response. Per-task cost is higher; lesson 2’s productive limits apply with extra force.

The build choice becomes: use a reasoning model when the task actually needs reasoning (and the per-task cost is justified by the quality gain), and a standard model otherwise. The mistake is using a reasoning model everywhere “to be safe”; that pays the higher cost on the many tasks that did not need it.
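
The cost of “reasoning everywhere” can be made concrete with a small calculation. The sketch assumes thinking tokens are billed like visible output tokens and uses a multiplier inside the 5-20x range mentioned above; the workload size, the share of tasks needing reasoning, and the function names are hypothetical.

```python
def billed_output_tokens(visible_tokens: int, thinking_multiplier: int) -> int:
    """Output tokens for one reasoning answer: visible tokens plus thinking tokens.

    Assumes thinking tokens are billed the same way as visible output tokens.
    """
    return visible_tokens * (1 + thinking_multiplier)


def workload_output_tokens(
    tasks: int,
    share_needing_reasoning: float,
    visible_tokens: int,
    thinking_multiplier: int,
    reasoning_everywhere: bool,
) -> int:
    """Total output tokens for a workload under one of the two policies."""
    reasoning_tasks = tasks if reasoning_everywhere else round(tasks * share_needing_reasoning)
    standard_tasks = tasks - reasoning_tasks
    return (
        reasoning_tasks * billed_output_tokens(visible_tokens, thinking_multiplier)
        + standard_tasks * visible_tokens
    )


# Hypothetical workload: 1,000 tasks, 10% need reasoning, 300 visible tokens each, 10x thinking.
everywhere = workload_output_tokens(1_000, 0.10, 300, 10, reasoning_everywhere=True)
selective = workload_output_tokens(1_000, 0.10, 300, 10, reasoning_everywhere=False)
print(everywhere, selective)  # 3300000 600000
```

Under these assumptions the blanket policy generates 5.5 times the output tokens of the selective one, with the extra spend going entirely to tasks that did not need reasoning.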

The pattern across these directions: each is a new capability the field has unlocked, and each changes the productive-limits math from lesson 2 in a different direction. Longer context expands what fits but raises cost and latency. Multimodality adds input/output types. Smaller specialized models trade frontier capability for cost. Build-vs-buy shifts as both hosted and fine-tune options improve. Agents expand what an application can accomplish but compound cost and latency. Reasoning models lift multi-step quality at higher per-task cost.

The builder’s instinct is to read each new capability the same way: what does this change about the three productive limits? Where does it fit on the build-vs-buy spectrum? Which of the patterns from lessons 1-7 generalize and which need new techniques? That discipline outlasts any specific model or technique, the same point Track 15’s capstone made about its own field. The next three lessons take three of these directions and go deeper: training your own model (lesson 9), agents (lesson 10), and an industry-perspective capstone (lesson 11).

  • This is a survey of where the field is moving, not a set of predictions. It covers six directions: longer context, multimodality, smaller specialized models, the build-vs-buy spectrum, agents, and reasoning models.
  • Longer context does not eliminate retrieval; selective retrieval still wins on cost and latency. “Stuff the prompt” is a temptation, not a strategy.
  • Multimodality generalizes the patterns from lessons 1-7 with expanded input/output components; expect more applications to be multimodal by default.
  • Smaller specialized models often beat frontier models on narrow sub-tasks for cost and latency at equal quality. The right architecture is often a mix: small for inner sub-tasks, frontier for the synthesis.
  • Build-vs-buy spectrum: hosted API is almost always the right starting point; fine-tune an open model when prompting consistently fails on a specific task (lesson 9); train from scratch is rare and almost never the right move for an application team.
  • Agents scale lesson 4’s tool-use loop into multi-step task accomplishment, with longer latency budgets, compounding cost, harder evaluation, and new failure modes (lesson 10).
  • Reasoning models lift quality on multi-step problems but have very different cost profiles (many “thinking” tokens per answer). Use when the task actually needs reasoning.
  • The builder’s instinct: read each new capability through lesson 2’s productive limits and lesson 7’s LLMOps discipline; the discipline outlasts specific models and techniques.

The field is moving in named directions; each changes the productive-limits math differently. Read new capabilities with a builder’s eye (what fits, what costs, what generalizes from what you already know); the rest of Phase 3 (lessons 9-11) takes three of these directions deeper.