Cheatsheet: Token by token: how a transformer generates text

forward pass → logits → softmax → sample → append → loop

Generation is a loop over the architecture, one token per iteration. The architecture does not produce a response; it produces the next token, then runs again.

| Step | Operation | Output |
| --- | --- | --- |
| 1 | Forward pass through every block | A d_model-dim vector at every position |
| 2 | Final linear (last position) | vocab_size-dim logits |
| 3 | Softmax | Probability distribution over the vocabulary |
| 4 | Sample one token | The next token (decoding strategy lives here) |
| 5 | Append to input | New, longer sequence; loop back to step 1 |
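
The five steps can be sketched as a short loop. This is a minimal illustration, not a real transformer: `toy_logits` is a hypothetical stand-in for steps 1 and 2, and `VOCAB` is an invented five-word vocabulary.

```python
import math
import random

# Invented toy vocabulary; a real model has tens of thousands of entries.
VOCAB = ["<eos>", "the", "cat", "sat", "down"]

def toy_logits(tokens):
    """Hypothetical stand-in for steps 1-2: one logit per vocabulary entry."""
    # Favor the id that follows the last token (cyclically) so output is non-trivial.
    last = tokens[-1]
    return [2.0 if i == (last + 1) % len(VOCAB) else 0.0 for i in range(len(VOCAB))]

def softmax(logits):
    """Step 3: turn logits into probabilities (max subtracted for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(prompt, steps, rng):
    """Run the loop for a fixed number of steps and return all token ids."""
    tokens = list(prompt)
    for _ in range(steps):
        logits = toy_logits(tokens)                                      # steps 1-2
        probs = softmax(logits)                                          # step 3
        next_id = rng.choices(range(len(VOCAB)), weights=probs, k=1)[0]  # step 4
        tokens.append(next_id)                                           # step 5
    return tokens

ids = generate([1], steps=5, rng=random.Random(0))
print([VOCAB[i] for i in ids])
```

The loop here runs for a fixed number of steps only to keep the sketch short; the stopping conditions are a separate concern.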

| Strategy | What it does | When to use |
| --- | --- | --- |
| Greedy | Always pick the highest-probability token | Short structured output (single label, JSON field) |
| Pure sampling | Sample from the raw distribution | Rare on its own; usually paired with top-k or top-p |
| Top-k | Restrict to top k tokens, sample from those | Reasonable middle ground; typical k is 40 to 50 |
| Top-p (nucleus) | Smallest set with cumulative probability ≥ p, sample | Modern default; p is typically 0.9 to 0.95 |
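
The strategies differ only in which tokens stay eligible before sampling. A minimal sketch, assuming the probabilities arrive as a plain Python list; the function names are illustrative and not taken from any library.

```python
import random

def greedy(probs):
    """Return the index of the highest-probability token."""
    return max(range(len(probs)), key=lambda i: probs[i])

def renormalize(probs, keep):
    """Zero out everything outside `keep` and rescale the rest to sum to 1."""
    kept = set(keep)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

def top_k_filter(probs, k):
    """Keep only the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, order[:k])

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability is >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, keep)

def sample(probs, rng):
    """Pure sampling: draw one index according to the given distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

probs = [0.5, 0.3, 0.15, 0.05]
rng = random.Random(0)
print(greedy(probs))
print(sample(top_k_filter(probs, k=2), rng))
print(sample(top_p_filter(probs, p=0.9), rng))
```

Note that top-k always keeps a fixed number of candidates, while top-p keeps more candidates when the distribution is flat and fewer when it is peaked.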

Divide all logits by T before softmax.

| T | Effect | Distribution |
| --- | --- | --- |
| T < 1.0 | Sharpen | Top tokens dominate further |
| T = 1.0 | Identity | Original distribution |
| T > 1.0 | Flatten | Low-probability tokens get a fairer shot |
| T = 0 | API shortcut for greedy | Formula is undefined at zero; convention is “approaches argmax” |
| Combined with top-p | Standard production setup | Temperature shapes the distribution; top-p restricts the candidate pool |
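
The effect of T is easy to see on a three-token example. This sketch assumes the logits are a plain list and treats T = 0 as greedy, following the convention in the table; the function name is illustrative.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by T, then apply softmax. T = 0 is handled as argmax."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # The formula divides by zero here, so put all mass on the top token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Running it shows the top token's share shrinking as T rises, while the ranking of the tokens stays the same.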

| Condition | Behavior |
| --- | --- |
| max_tokens | Hard token-count limit; always applies |
| EOS token | Model has been trained to emit a special end-of-sequence token; sampling it ends the loop cleanly |
| Custom stop sequences | User-specified strings halt the loop when generated |
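
All three conditions can be checked inside the same loop. In this sketch, `next_token_fn` is a hypothetical callable that stands in for the model, the `"<eos>"` string and the returned reason labels are invented for illustration, and trimming the stop sequence from the returned text is an assumed design choice rather than a rule.

```python
def generate_until_stop(next_token_fn, max_tokens, eos_token="<eos>", stop=()):
    """Generate until EOS, a stop sequence, or max_tokens; return (text, reason)."""
    pieces = []
    for _ in range(max_tokens):                 # hard limit; always applies
        token = next_token_fn("".join(pieces))
        if token == eos_token:                  # model signalled the end
            return "".join(pieces), "eos"
        pieces.append(token)
        text = "".join(pieces)
        for s in stop:                          # a stop string may span several tokens
            idx = text.find(s)
            if idx != -1:
                return text[:idx], "stop_sequence"
    return "".join(pieces), "max_tokens"

def scripted(tokens):
    """Hypothetical fake model that replays a fixed list of tokens."""
    it = iter(tokens)
    return lambda _text: next(it)

script = ["Hello", " world", "\n\n", "more", "<eos>"]
print(generate_until_stop(scripted(script), max_tokens=10))
print(generate_until_stop(scripted(script), max_tokens=10, stop=("\n\n",)))
print(generate_until_stop(scripted(script), max_tokens=2))
```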

The K and V vectors for previous tokens do not change between generation steps. Cache them; each new step only computes K and V for the one new token.

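| Without KV cache | With KV cache |
| --- | --- |
| Recomputes K and V for the whole sequence at every step, so total work grows roughly quadratically with output length | Computes K and V for one new token per step, so per-token cost is roughly constant after the first pass |
| Naive prediction loop | Standard production setup |
| Streaming would feel slower as the response grows | Streaming feels constant after the prefill delay |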
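
The saving can be made concrete by counting K/V projections. This is a toy sketch: `CountingProjector` is a hypothetical stand-in for the real projection, and the "next token" is a placeholder, since a real model would run attention over the cached K and V and then sample.

```python
class CountingProjector:
    """Hypothetical stand-in for the K/V projection; counts how often it runs."""
    def __init__(self):
        self.calls = 0

    def kv(self, token):
        self.calls += 1
        return (token * 2, token * 3)  # toy "K" and "V" values

def decode_without_cache(projector, prompt, new_tokens):
    tokens = list(prompt)
    for _ in range(new_tokens):
        kvs = [projector.kv(t) for t in tokens]  # recompute K/V for every position
        tokens.append(len(kvs))                  # placeholder for the sampled token
    return tokens

def decode_with_cache(projector, prompt, new_tokens):
    tokens = list(prompt)
    cache = []
    pending = list(prompt)  # positions whose K/V are not cached yet
    for _ in range(new_tokens):
        # First iteration is the prefill; later iterations add one entry each.
        cache.extend(projector.kv(t) for t in pending)
        next_token = len(cache)                  # placeholder for the sampled token
        tokens.append(next_token)
        pending = [next_token]
    return tokens

naive, cached = CountingProjector(), CountingProjector()
a = decode_without_cache(naive, [10, 11, 12, 13], new_tokens=5)
b = decode_with_cache(cached, [10, 11, 12, 13], new_tokens=5)
print(a == b, naive.calls, cached.calls)
```

Both loops produce the same sequence; only the number of projection calls differs, and the gap widens as the output gets longer.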

| Field in the API | What it controls |
| --- | --- |
| temperature | The sharpening or flattening described above |
| top_p | Nucleus sampling threshold |
| top_k | Top-k threshold (less common in modern APIs; some still expose it) |
| max_tokens | Hard output-length limit |
| stop | Custom stop sequences (string or array of strings) |
| stream | Whether to stream tokens as they generate or wait for the full response |

| Pitfall | Reality |
| --- | --- |
| The model “knows” its answer when it starts typing | No. Every token is sampled fresh from the next-token distribution. No plan, no draft. |
| Higher temperature equals more creative | No. Higher temperature equals more random. If a model is bad at a task, raising temperature makes it bad in a more random way, not more competent. |
| Greedy is always right for “deterministic” tasks | Right for short structured output. Wrong for longer multi-step reasoning, where low-temp top-p escapes bad local choices. |
| Streaming is a UI animation | No. Each streamed token is a real forward pass producing real output. The streaming is the work. |
| Output and input tokens cost the same | No. Output is sequential compute; input parallelizes during prefill. APIs typically charge 3x to 5x more per output token. |

  • Logits: unnormalized scores over the vocabulary, output of the final linear layer.
  • Softmax: turns logits into a probability distribution (all values 0 to 1, summing to 1.0).
  • Greedy: decoding strategy that always picks the highest-probability token.
  • Top-k: decoding strategy that restricts candidates to the top k tokens, then samples.
  • Top-p (nucleus): decoding strategy that restricts candidates to the smallest set with cumulative probability ≥ p, then samples.
  • Temperature (T): logit rescale before softmax; sharpens (T<1) or flattens (T>1) the distribution.
  • EOS token: the model’s special “end of text” token; sampling it stops the generation loop.
  • KV caching: reusing the K and V vectors computed for previous positions so each new token only requires new K and V for itself.
  • Prefill: the first forward pass over the full input prompt; parallelizable.
  • Generation step: one iteration of the prediction loop; produces one new token.

The model does not write.
It predicts one token at a time.