Token by token: how a transformer generates text

When you type a prompt into a chat interface and watch the response appear, you might assume the model “thought of the answer” and is now typing it out for you. That is not what is happening.

The model produces one token at a time. After the input passes through every block of the transformer, the top of the stack at the last position gives a probability distribution over the entire vocabulary. The model picks one token from that distribution. That token gets appended to the input. The model runs the whole architecture again on the new, longer input. Pick another token. Append. Run again. Stop when a stop condition fires.

This is autoregressive generation, and it is what every chat interface, code completion tool, and AI writing assistant you have ever used does under the hood. By the end of this lesson you will know exactly how the prediction loop works, what the decoding parameters in any AI API actually do, and why streaming responses appear one token at a time.

From transformer to text

The architecture from the previous five lessons produces, for every input position, a d_model-dim vector at the top of the stack. For most positions we ignore this output during generation (we already know the tokens that follow them; we don’t need to predict them). But the output at the last position, the position of the final input token, is what we use to predict what comes next.

The last bit of architecture is one more linear layer that projects the last position’s d_model-dim vector to vocab_size dimensions (typically tens of thousands to a few hundred thousand). The output is called logits: a vector of unnormalized scores, one per token in the vocabulary. The score at index N in this vector is the model’s preference for “the next token should be the Nth token in the vocabulary.”

Logits are not yet probabilities. To turn them into probabilities, apply softmax (the same operation from the attention lesson, applied here across the vocabulary instead of across keys). Now you have a probability distribution: vocab_size numbers, all between 0 and 1, summing to 1.0. Each one is the model’s predicted probability that this specific token comes next.
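The logits-to-probabilities step can be sketched in a few lines of plain Python. The four-token vocabulary and the logit values below are made up for illustration; a real model would have tens of thousands of entries.

```python
import math

def softmax(logits):
    """Turn unnormalized scores into probabilities that sum to 1."""
    # Subtracting the max leaves the result unchanged but avoids overflow in exp().
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and made-up logits for the last position.
vocab = ["cat", "dog", "the", "<eos>"]
logits = [2.0, 1.0, 0.5, -1.0]

probs = softmax(logits)
for token, p in zip(vocab, probs):
    print(f"{token:>6}: {p:.3f}")
```

The largest logit gets the largest probability, but every token keeps a nonzero chance, which is what the sampling strategies later in the lesson work with.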

The next step is to pick a token from that distribution. That picking step is where decoding strategies live, and it is where almost all of the API parameters you have ever set live.

The prediction loop

Figure: Autoregressive generation. Each pass produces one new token, which appends to the input and triggers the next pass. KV caching reuses keys and values from prior positions so subsequent passes only do new work for the one new token, not for the whole sequence.

Five steps, in order:

1. Forward pass. Run the input tokens through every block of the transformer.
2. Logits. The last position’s output goes through the final linear projection to a vocab_size-dim vector of unnormalized scores.
3. Softmax. Turn the scores into a probability distribution.
4. Sample. Pick one token from the distribution. (Decoding strategy lives here.)
5. Append. Add the new token to the input. Loop back to step 1.

Stop when a stop condition fires (covered below). Otherwise, this loop runs once per output token. A 500-token response runs the whole loop 500 times.
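The five steps can be sketched as a loop. This is a minimal illustration, not a real model: `toy_logits` is a hypothetical stand-in for the forward pass and final projection, and the five-token vocabulary is invented so the example runs on its own.

```python
import math
import random

VOCAB = ["<eos>", "the", "cat", "sat", "down"]
EOS_ID = 0

def toy_logits(token_ids):
    """Hypothetical stand-in for the forward pass plus final projection.
    It strongly prefers the vocabulary entry after the last token."""
    preferred = (token_ids[-1] + 1) % len(VOCAB)
    return [4.0 if i == preferred else 0.0 for i in range(len(VOCAB))]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(prompt_ids, max_tokens=10, rng=None):
    rng = rng or random.Random(0)
    tokens = list(prompt_ids)
    for _ in range(max_tokens):
        logits = toy_logits(tokens)   # forward pass and logits
        probs = softmax(logits)       # softmax
        next_id = rng.choices(range(len(VOCAB)), weights=probs, k=1)[0]  # sample
        if next_id == EOS_ID:         # stop condition
            break
        tokens.append(next_id)        # append, then loop again
    return tokens

out = generate([1])
print(" ".join(VOCAB[i] for i in out))
```

Note that `generate` calls the model once per output token, and each call sees everything generated so far.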

Decoding strategies

The sample step is where one model can produce wildly different outputs depending on what you tell it. Five common options, each useful in different situations: four ways of picking a token, plus temperature, which reshapes the distribution before the pick.

Greedy

Always pick the highest-probability token. Deterministic in principle: the same prompt produces the same output, although floating-point and batching effects in real serving systems can still cause occasional differences. Good for short structured outputs where predictable behavior matters (a single classification, a short JSON field). Bad for creative writing because the model can get stuck in repetitive loops; the highest-probability token at each step is often whatever it just said. Also weaker than light sampling for longer multi-step reasoning, where the slight stochasticity of low-temperature top-p tends to escape bad local choices.

Pure sampling

Sample from the distribution exactly as the model produces it. Each token’s chance of being chosen equals its predicted probability. Maximum variety, but the model can drift into nonsense: each low-probability token (sometimes wrong, sometimes random) is individually unlikely, yet the long tail contains so many of them that one gets picked fairly often, and a single bad pick conditions everything that follows.
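To make the contrast between greedy and pure sampling concrete, here is a small sketch over a made-up four-token distribution. The function names are illustrative, not taken from any library.

```python
import random

vocab = ["cat", "dog", "the", "<eos>"]
probs = [0.60, 0.25, 0.10, 0.05]  # made-up next-token distribution

def greedy(probs):
    """Index of the highest-probability token (argmax)."""
    return max(range(len(probs)), key=lambda i: probs[i])

def pure_sample(probs, rng):
    """Draw one index with chance equal to its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(0)
print("greedy always picks:", vocab[greedy(probs)])

# Sample many times to see the empirical frequencies approach the probabilities.
counts = [0] * len(vocab)
for _ in range(10_000):
    counts[pure_sample(probs, rng)] += 1
print("sampled frequencies:", [c / 10_000 for c in counts])
```

Greedy returns the same index every time; pure sampling picks even the 5% token about one time in twenty.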

Top-k sampling

Restrict to the top-k highest-probability tokens, then sample from those (after renormalizing). Typical k is 40 or 50. Cuts the long tail of unlikely tokens. A reasonable middle ground that avoids the worst failures of pure sampling.
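A minimal sketch of the top-k filter, assuming the probabilities are already computed. The distribution is made up, and k is set to 2 only to keep the example readable.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most likely tokens, zero out the rest, and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

probs = [0.50, 0.30, 0.10, 0.06, 0.04]  # made-up distribution
filtered = top_k_filter(probs, k=2)
print([round(p, 3) for p in filtered])

# Sampling from the filtered distribution can only return index 0 or 1.
rng = random.Random(0)
choice = rng.choices(range(len(probs)), weights=filtered, k=1)[0]
print("sampled index:", choice)
```

The two surviving tokens keep their relative odds (5 to 3); the tail is removed entirely.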

Top-p (nucleus) sampling

Restrict to the smallest set of tokens whose cumulative probability is at least p (typically 0.9 or 0.95), then sample from those. Adapts the candidate pool size to the distribution: when the model is confident (a peaked distribution), few candidates; when uncertain (a flat distribution), many candidates. It is the most common modern default.
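The adaptive pool size is easiest to see by running the same filter on a peaked and a flat distribution. This is a sketch with made-up numbers; `top_p_filter` is an illustrative name.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability
    reaches p, zero out the rest, and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    kept = set(keep)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

peaked = [0.92, 0.05, 0.02, 0.01]  # model is confident
flat = [0.30, 0.28, 0.22, 0.20]    # model is uncertain

for name, dist in (("peaked", peaked), ("flat", flat)):
    filtered = top_p_filter(dist, p=0.9)
    pool = sum(1 for x in filtered if x > 0)
    print(f"{name}: {pool} candidate(s) survive")
```

With p = 0.9, the peaked distribution leaves a single candidate, while the flat one needs all four tokens to reach the threshold.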

Temperature

Before applying softmax, divide all logits by a temperature value T. Then softmax. T = 1.0 is the default (no change). T < 1.0 sharpens the distribution (the high-probability tokens become more dominant). T > 1.0 flattens the distribution (low-probability tokens get a fairer shot). Most APIs treat T = 0 as a shortcut for greedy (the formula is undefined at exactly zero, so the convention is “as T approaches zero, sampling collapses to argmax”).
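A short sketch of temperature scaling on made-up logits. Treating T = 0 as a one-hot argmax follows the convention described above; it is a special case added by hand, not something the formula produces.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by T, then apply softmax. T == 0 is handled as greedy."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Convention: collapse to argmax instead of dividing by zero.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, -1.0]
for t in (0.0, 0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: " + ", ".join(f"{p:.3f}" for p in probs))
```

Reading the printed rows top to bottom, the top token's share shrinks as T rises and the tail tokens gain probability.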

Combine: top-p with temperature is the most common modern setup. Temperature shapes the distribution; top-p restricts the candidate pool. Both at the same time gives you a tunable balance between predictability and variety.

Stop conditions

Generation runs in a loop. Three things can stop it:

Max tokens. A hard limit. Common defaults are 2048 or 4096 tokens, but you can usually raise or lower this per call.
End-of-sequence (EOS) token. Chat models are trained to emit a special EOS token when the response is “done.” When the sample step picks EOS, generation stops cleanly.
Custom stop sequences. Many APIs let you specify strings ("User:", "</answer>", "\n\n") that, when generated, halt the loop.

In practice: max_tokens always applies; the EOS token or a custom stop sequence ends the loop earlier if either one fires.
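The interaction of the three stop conditions can be sketched as follows. Everything here is hypothetical: `next_token_fn` stands in for a full forward pass plus sampling, `scripted` replays a fixed list of tokens, and trimming the stop string from the returned text is an assumption (providers differ on whether it is included).

```python
def generate_with_stops(next_token_fn, prompt, max_tokens, eos="<eos>", stop_sequences=()):
    """Return (generated_text, reason_for_stopping)."""
    text = ""
    for _ in range(max_tokens):          # max_tokens always applies
        token = next_token_fn(prompt + text)
        if token == eos:                 # model signalled it is done
            return text, "eos"
        text += token
        for stop in stop_sequences:      # caller-supplied stop strings
            idx = text.find(stop)
            if idx != -1:
                return text[:idx], "stop_sequence"
    return text, "max_tokens"

def scripted(tokens):
    """Hypothetical model that replays a fixed token list, ignoring context."""
    it = iter(tokens)
    return lambda _context: next(it)

script = ["Paris", ".", "\n", "User:", " hi", "<eos>"]
prompt = "Q: What is the capital of France?\nA: "

by_eos = generate_with_stops(scripted(script), prompt, max_tokens=10)
by_stop = generate_with_stops(scripted(script), prompt, max_tokens=10, stop_sequences=["User:"])
by_limit = generate_with_stops(scripted(script), prompt, max_tokens=2)
print(by_eos, by_stop, by_limit, sep="\n")
```

The same scripted output ends in three different places depending on which condition fires first.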

KV caching

Here is the most practically important detail about generation speed.

The naive prediction loop runs the entire architecture from scratch on the full input every time. A 1000-token prompt that produces a 500-token response would do 500 forward passes, each one larger than the last. Because attention’s cost grows quadratically with sequence length, every one of those passes would pay a cost quadratic in everything seen so far, and the total work for the response would grow faster still. That is much too slow to be practical.

The optimization: KV caching. The K and V vectors that attention computes for each previous token do not change between steps. Each new step needs to compute K and V for one new token; everything before it is identical to last time. So the cache stores K and V for every position seen so far, and each new step only computes K and V for the one new token.

The first forward pass (over the full prompt) is the expensive one. Each subsequent pass is much cheaper because the per-position K and V for prior tokens are read from the cache instead of recomputed.

A subtlety worth getting right: KV caching does not make per-token decoding constant-time. The new query still has to attend over every cached position, so each new token’s attention cost still grows linearly with the cache length (the prompt plus the tokens generated so far). What caching removes is the recomputation of past K and V: a large constant factor, and the work that would have made each naive step quadratic in sequence length. In practice the linear-in-context-length cost is usually dominated by the constant per-token model cost until contexts get long, which is why streaming responses appear at a roughly steady rate after a brief initial delay; on very long contexts the per-token cost does eventually grow.
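A rough way to see the savings is to count work rather than run a model. The sketch below counts per-layer K/V computations for the 1000-token prompt and 500-token response used above. It is a counting exercise under simplified assumptions, not a benchmark.

```python
def kv_projections(prompt_len, new_tokens, use_cache):
    """Count per-position K/V computations (per layer) across the whole generation."""
    total = 0
    seq_len = prompt_len
    for step in range(new_tokens):
        if use_cache:
            # The first pass fills the cache for the whole prompt; each later
            # pass only projects the single token appended by the previous step.
            total += prompt_len if step == 0 else 1
        else:
            # Naive: recompute K and V for every position on every pass.
            total += seq_len
        seq_len += 1
    return total

def decode_attention_reads(prompt_len, new_tokens):
    """With a cache: positions the newest query attends over, per decode step."""
    return [prompt_len + step for step in range(1, new_tokens)]

naive = kv_projections(1000, 500, use_cache=False)
cached = kv_projections(1000, 500, use_cache=True)
reads = decode_attention_reads(1000, 500)

print("naive K/V computations: ", naive)    # 624750
print("cached K/V computations:", cached)   # 1499
print("attention reads, first and last decode step:", reads[0], reads[-1])
```

The cache cuts K/V computations by a factor of several hundred, while the attention reads per step still creep upward (1001 to 1499 here), which is the linear growth described above.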

Modern serving stacks layer one more trick on top: speculative decoding. A small “draft” model proposes several next tokens; the larger target model verifies them in a single forward pass. Wherever the draft matches what the target would have produced, several tokens are accepted at once instead of one per pass. As of 2026, speculative decoding is shipping natively in TensorRT-LLM, vLLM, SGLang, and most production serving stacks, and several frontier models train compatible draft heads as part of release. To the user, the generation loop looks the same, but throughput on routine outputs can improve by roughly 2-3x.
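The draft-then-verify idea can be sketched in its simplest, greedy form. This is an illustration under strong assumptions: both "models" are hypothetical functions that return one token, the verification loop stands in for what a real system does in one batched forward pass, and production implementations use a probabilistic acceptance rule instead of exact matching.

```python
SENTENCE = "the cat sat on the mat".split()

def target_next(tokens):
    """Hypothetical large model: continues a fixed sentence, then emits <eos>."""
    return SENTENCE[len(tokens)] if len(tokens) < len(SENTENCE) else "<eos>"

def draft_next(tokens):
    """Hypothetical small model: agrees with the target except for one word."""
    guess = target_next(tokens)
    return "in" if guess == "on" else guess

def speculative_step(tokens, k=4):
    """One round: the draft proposes k tokens, the target keeps the matching prefix."""
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(tokens + proposal))
    accepted = []
    for guess in proposal:
        correct = target_next(tokens + accepted)
        accepted.append(correct)  # the target's own token is always kept
        if guess != correct or correct == "<eos>":
            break                 # first mismatch (or end of text) ends the round
    return accepted

tokens, rounds = [], 0
while not tokens or tokens[-1] != "<eos>":
    tokens += speculative_step(tokens)
    rounds += 1

print(tokens)
print("verification rounds:", rounds, "for", len(tokens), "tokens")
```

The output is identical to what the target would have produced alone, but it arrives in 2 verification rounds instead of 7 one-token passes.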

Why this matters when you use AI

Three direct consequences when you use AI APIs or chat interfaces.

Temperature, top-p, top-k are knobs you can turn. Defaults vary by provider, but 0.7 is a common middle setting; lower for code, math, and structured output; higher for creative writing or brainstorming. Top-p of 0.9 or 0.95 is a common modern default. If your API call’s output feels too rigid or too random, these are the first parameters to adjust.
Streaming is real, not cosmetic. When you see tokens appear one at a time in a chat UI, that is the model genuinely producing one token per forward pass. The UI is not artificially delaying anything; the architecture is just sequential at inference. Streaming exists because the alternative (waiting for the whole response) would feel much worse.
Cost scales with output length, not just input length. Every output token is its own forward pass. APIs price input and output tokens separately, and output tokens are typically more expensive than input tokens (sometimes three to five times as much per token) because they require sequential compute. If you are optimizing API spend, shortening output via better prompting, smaller max_tokens, or sensible stop sequences is the highest-leverage move.
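The pricing point is simple arithmetic. The sketch below uses hypothetical prices (3 and 15 dollars per million input and output tokens) chosen only to illustrate a 5x gap; check your provider's actual rates.

```python
def request_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Cost in dollars, given prices per million tokens."""
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000

# Hypothetical prices, not taken from any real price list.
INPUT_PRICE = 3.00
OUTPUT_PRICE = 15.00

long_answer = request_cost(1000, 500, INPUT_PRICE, OUTPUT_PRICE)
short_answer = request_cost(1000, 150, INPUT_PRICE, OUTPUT_PRICE)

print(f"1000 in / 500 out: ${long_answer:.5f}")
print(f"1000 in / 150 out: ${short_answer:.5f}")
```

Under these assumed prices, trimming the response from 500 to 150 tokens halves the cost of the call even though the prompt is unchanged.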

Common pitfalls

A few mistakes are common enough to be worth naming.

Thinking the model “knows” its answer. It does not, at least not in the sense of holding a finished answer. At each step, all it outputs is a probability distribution over the next token. There is no completed draft waiting to be typed out; every token is sampled fresh. The coherence of the response emerges from the architecture being well-trained on coherent text, not from the model “having an answer in mind.”

Confusing temperature with intelligence. Higher temperature does not make the model “more creative” in any deep sense. It just samples from a flatter distribution. If a model is bad at a task at temperature 0.7, raising the temperature to 1.5 will make it bad in a more random way, not more competent.

Reaching for greedy on every “deterministic” task. Greedy is fine for short structured output (a single label, a short JSON field). For longer multi-step problems (math, multi-line code, multi-paragraph reasoning), pure greedy often gets stuck in suboptimal local choices. Modern best practice for these tasks is low temperature plus top-p rather than pure greedy. The slight stochasticity helps the model avoid locking into a poor early choice.

Ignoring the EOS token. When a model “rambles past where it should have stopped,” it is usually because either it did not sample EOS at the right place, or max_tokens is set higher than needed. Setting a sensible max_tokens is a hygiene practice; thinking about stop sequences is a bigger lever for structured tasks.

Mistaking streaming for a UI animation. The first-token delay is real (the model has to process the full prompt before it can emit anything). The subsequent token-by-token stream is also real (one forward pass per token). What you see is how the model actually works.

What you should remember

Generation is a loop. Forward pass → logits → softmax → sample → append → repeat, until a stop condition fires.
Decoding strategies shape the sample step. Greedy (always the top token), pure sampling (the raw distribution), top-k (the k most likely tokens), top-p (the smallest set covering probability p, also called nucleus sampling). Temperature rescales the distribution before any of these.
A common setup is top-p sampling with temperature near 0.7, though defaults vary by provider. Lower temperature for structure; higher for creativity. Top-p around 0.9 is a common default.
KV caching makes generation much cheaper per token by reusing past K and V vectors instead of recomputing them. Without it, every step would rerun attention over the full sequence, at a cost quadratic in sequence length. With it, the per-token cost still grows linearly with cache length (the new query attends over the whole cache), but the recomputation is eliminated. Per token: linear, not quadratic; not constant.
Output cost scales linearly with output length. Every output token is its own forward pass; APIs price accordingly.

You are now ready for the practice section, where you will run the sample step by hand on a tiny logits vector and compare what greedy, top-k, and top-p decoding produce from the same starting point.

If you remember one thing

The model does not write.
It predicts one token at a time.