How a transformer generates text: brief

What you’ll learn

This is lesson 1 of Phase 5 (How we steer models at inference) in Track 5 (AI Foundations). Phase 2 built the transformer architecture (tokens, embeddings, attention, the full block). This lesson picks up at what that trained architecture actually does at runtime. The full course materials are at cme295.stanford.edu.

The model does not produce text in one shot. It produces text one token at a time, in a loop: forward pass through every block, softmax over the vocabulary, sample one token, append it to the input, run the forward pass again. The lesson walks through that prediction loop and compares decoding strategies: greedy, pure sampling, top-k, and top-p, plus temperature as a separate dial that reshapes the distribution before any of those are applied. It then explains KV caching honestly (per the FC-2-001/002 audit fix): KV caching removes the recompute cost that would otherwise make naive generation grow quadratically with output length, which leaves per-token cost linear in cache length, not constant. The dominant constant per-token model cost is what makes streaming feel steady until contexts get long. The lesson closes on speculative decoding as the 2026 production speedup (TensorRT-LLM, vLLM, and SGLang ship it natively).
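The prediction loop described above can be sketched in a few lines. This is a minimal illustration, not a real model: `toy_logits` is a hypothetical stand-in for the forward pass, and the five-word `VOCAB` and the scoring rule are assumptions made up for the example. Only the loop structure (logits, softmax, pick a token, append, repeat) mirrors the lesson.

```python
import math
import random

# Hypothetical five-token vocabulary, invented for this sketch.
VOCAB = ["the", "cat", "sat", "mat", "<eos>"]


def toy_logits(tokens):
    """Stand-in for a forward pass: returns one score per vocabulary entry.

    Assumption for illustration only: the token that follows the last
    token in VOCAB order gets a high score, everything else gets zero.
    """
    last = tokens[-1]
    return [3.0 if i == (last + 1) % len(VOCAB) else 0.0
            for i in range(len(VOCAB))]


def softmax(logits):
    """Turn logits into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate(prompt, max_new_tokens=8, greedy=True, rng=None):
    """Autoregressive loop: score, normalize, pick one token, append, repeat."""
    rng = rng or random.Random(0)
    tokens = list(prompt)
    eos = VOCAB.index("<eos>")
    for _ in range(max_new_tokens):
        probs = softmax(toy_logits(tokens))
        if greedy:
            # Greedy decoding: always take the most probable token.
            next_id = max(range(len(probs)), key=probs.__getitem__)
        else:
            # Pure sampling: draw from the full distribution.
            next_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        tokens.append(next_id)
        if next_id == eos:
            break  # a stop token ends generation early
    return tokens


if __name__ == "__main__":
    ids = generate([VOCAB.index("the")])
    print(" ".join(VOCAB[i] for i in ids))
```

Switching `greedy=False` draws from the distribution instead of taking the top token, which is the only change between the two simplest decoding strategies in this sketch.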

Where this fits

This is lesson 1 of Phase 5, How we steer models at inference, and the phase opener. Phase 2 gave you the transformer block; this lesson shows that block in action at inference time. The next three Phase 5 lessons cover How prompting works (mechanics, system prompts, prompt injection), How few-shot examples teach in context, and How chain of thought makes models think out loud. Together, these four lessons cover the full inference-time steering toolbox.

Before you start

Prerequisites: the transformer block lesson is required. We assume you understand what a transformer block is and what it produces at the top of the stack. The tokens and embeddings lessons are useful supporting context but not required. If “logits” or “softmax” feel unfamiliar, the attention lesson covers softmax at the depth this lesson assumes.

By the end, you’ll be able to

Explain how a transformer produces text autoregressively (one token at a time) and trace the prediction loop end to end (forward pass, logits, softmax, sample, append)
Compare the most common decoding strategies (greedy, sampling, top-k, top-p) and name the trade-off each one makes
Explain what temperature does to the probability distribution and how it shapes the model’s output style
Explain KV caching honestly (it removes the recompute cost that would have made generation quadratic, so per-token cost is linear in cache length, not constant) and recognize speculative decoding as the production speedup layered on top
Predict how output cost scales with output length and which API parameters matter most for cost (max_tokens, stop sequences, prompting that produces shorter outputs)
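The scaling claims in the last two objectives can be checked with a simple counting model. The sketch below is an assumption-laden simplification: it counts token positions pushed through the model and cached entries read by attention, and ignores real-world constants such as batching, hardware, and memory bandwidth. All function names are hypothetical.

```python
def naive_positions(prompt_len, new_tokens):
    """Token positions processed when every step re-runs the whole sequence."""
    return sum(prompt_len + step for step in range(new_tokens))


def cached_positions(prompt_len, new_tokens):
    """Token positions processed with a KV cache.

    The prompt is processed once, then each later step processes only
    the single newest token.
    """
    if new_tokens == 0:
        return 0
    return prompt_len + (new_tokens - 1)


def cached_attention_reads(prompt_len, new_tokens):
    """Cached key/value entries the newest position attends over, per step.

    This grows by one every step, which is why per-token cost with a
    cache is linear in cache length rather than constant.
    """
    return [prompt_len + step for step in range(new_tokens)]


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, naive_positions(10, n), cached_positions(10, n),
              cached_attention_reads(10, n)[-1])
```

Under this model, doubling the output length roughly quadruples the naive count but only doubles the cached count, while the attention reads for the final token keep growing with the cache. That is also why capping output length (for example with max_tokens or stop sequences) bounds cost directly.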

Time and difficulty

Read time: about 22 minutes
Practice time: about 15 minutes (sampling-by-hand on a tiny logits vector, plus a quick decoding-strategy comparison on a hosted playground)
Difficulty: standard
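For the sampling-by-hand practice, a small script can be used to check the arithmetic. This is a sketch under stated assumptions: the four-entry logits vector is invented for illustration, and the filters follow the common definitions of top-k (keep the k most probable tokens) and top-p (keep the smallest set whose cumulative probability reaches p), with temperature applied to the logits before either filter.

```python
import math


def softmax(logits, temperature=1.0):
    """Divide logits by temperature, then normalize to probabilities."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def renormalize(probs, keep):
    """Zero out everything outside `keep` and rescale the rest to sum to 1."""
    keep_set = set(keep)
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep_set else 0.0
            for i in range(len(probs))]


def top_k_filter(probs, k):
    """Keep only the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, order[:k])


def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, keep)


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.5, -1.0]  # invented example values
    for t in (0.5, 1.0, 2.0):
        print("temperature", t, [round(x, 3) for x in softmax(logits, t)])
    probs = softmax(logits)
    print("top-k (k=2)", [round(x, 3) for x in top_k_filter(probs, 2)])
    print("top-p (p=0.8)", [round(x, 3) for x in top_p_filter(probs, 0.8)])
```

Lower temperatures concentrate probability on the top token and higher temperatures flatten the distribution; the filters then decide which tokens remain eligible before one is sampled.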