forward pass → logits → softmax → sample → append → loop
Generation is a loop over the architecture, one token per iteration. The architecture does not produce a response; it produces the next token, then runs again.
| Step | Operation | Output |
|---|---|---|
| 1 | Forward pass through every block | A d_model-dim vector at every position |
| 2 | Final linear (last position) | vocab_size-dim logits |
| 3 | Softmax | Probability distribution over the vocabulary |
| 4 | Sample one token | The next token (decoding strategy lives here) |
| 5 | Append to input | New, longer sequence; loop back to step 1 |
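The five steps can be sketched as a short loop. This is a minimal illustration, not a real model: `toy_logits` and `VOCAB` are hypothetical stand-ins for the forward pass and final linear layer, and tokens are represented as integer indices into `VOCAB`.

```python
import math
import random

VOCAB = ["<eos>", "the", "cat", "sat", "down"]  # hypothetical toy vocabulary


def toy_logits(tokens):
    """Stand-in for steps 1-2: return one logit per vocabulary entry.

    A real model would run every block and a final linear layer. This toy
    simply favors the vocabulary entry after the last token, cycling around.
    """
    favored = (tokens[-1] + 1) % len(VOCAB)
    return [4.0 if i == favored else 0.0 for i in range(len(VOCAB))]


def softmax(logits):
    """Step 3: turn logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate(prompt, steps, rng):
    """Run the prediction loop for a fixed number of steps."""
    tokens = list(prompt)
    for _ in range(steps):
        probs = softmax(toy_logits(tokens))                  # steps 1-3
        next_token = rng.choices(range(len(VOCAB)), weights=probs)[0]  # step 4
        tokens.append(next_token)                            # step 5, then loop
    return tokens


if __name__ == "__main__":
    out = generate([1], steps=4, rng=random.Random(0))
    print([VOCAB[t] for t in out])
```

Each pass through the loop produces exactly one token; nothing in `generate` ever sees or plans the whole response.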
| Strategy | What it does | When to use |
|---|---|---|
| Greedy | Always pick the highest-probability token | Short structured output (single label, JSON field) |
| Pure sampling | Sample from the raw distribution | Rare on its own; usually paired with top-k or top-p |
| Top-k | Restrict to top k tokens, sample from those | Reasonable middle ground; typical k is 40 to 50 |
| Top-p (nucleus) | Smallest set with cumulative probability ≥ p, sample | Modern default; p is typically 0.9 to 0.95 |
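The strategies differ only in how they trim the probability distribution before sampling. The sketch below assumes the distribution is a plain list of probabilities; the function names are illustrative and do not come from any particular library.

```python
import random


def _renormalize(probs, keep):
    """Zero out tokens not in `keep` and rescale the rest to sum to 1."""
    kept = set(keep)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


def greedy(probs):
    """Greedy: index of the highest-probability token."""
    return max(range(len(probs)), key=lambda i: probs[i])


def top_k_filter(probs, k):
    """Top-k: keep the k most probable tokens, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _renormalize(probs, order[:k])


def top_p_filter(probs, p):
    """Top-p: keep the smallest set whose cumulative probability is >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _renormalize(probs, keep)


def sample(probs, rng):
    """Pure sampling: draw one index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs)[0]


if __name__ == "__main__":
    probs = [0.5, 0.3, 0.15, 0.05]  # illustrative distribution
    rng = random.Random(0)
    print("greedy:", greedy(probs))
    print("top-k :", sample(top_k_filter(probs, k=2), rng))
    print("top-p :", sample(top_p_filter(probs, p=0.9), rng))
```

Top-k always keeps a fixed number of candidates, while top-p keeps more candidates when the distribution is flat and fewer when it is peaked.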
Divide all logits by T before softmax.
| T | Effect | Distribution |
|---|---|---|
| T < 1.0 | Sharpen | Top tokens dominate further |
| T = 1.0 | Identity | Original distribution |
| T > 1.0 | Flatten | Low-probability tokens get a fairer shot |
| T = 0 | API shortcut for greedy | Formula is undefined at zero; convention is “approaches argmax” |
| Combined with top-p | Standard production setup | Temperature shapes the distribution; top-p restricts the candidate pool |
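In formula form, temperature turns each logit z_i into probability exp(z_i / T) / Σ_j exp(z_j / T). The sketch below applies that to a small set of illustrative logits and handles T = 0 by the argmax convention described in the table; the function name is an assumption for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by T, then apply softmax. T = 0 is treated as greedy."""
    if temperature == 0:
        # The formula divides by zero here, so follow the convention:
        # put all probability mass on the highest logit.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.0]  # illustrative values
    for t in (0.5, 1.0, 2.0):
        probs = softmax_with_temperature(logits, t)
        print(f"T={t}:", [round(p, 3) for p in probs])
```

The ranking of tokens never changes with temperature; only the gap between them does.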
| Condition | Behavior |
|---|---|
| max_tokens | Hard token-count limit; always applies |
| EOS token | Model has been trained to emit a special end-of-sequence token; sampling it ends the loop cleanly |
| Custom stop sequences | User-specified strings halt the loop when generated |
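The three stop conditions can be checked in a single loop. In this sketch, `token_stream` is a hypothetical iterable of string tokens standing in for the sampler, and trimming the stop sequence from the returned text is an assumption; real APIs differ on whether the stop string is included.

```python
def run_until_stop(token_stream, max_tokens, eos="<eos>", stop=()):
    """Consume tokens until a stop condition fires.

    Returns (generated_text, reason), where reason names the condition hit.
    """
    pieces = []
    for token in token_stream:
        if token == eos:
            # EOS: the model itself signals the end; it is not emitted.
            return "".join(pieces), "eos"
        pieces.append(token)
        text = "".join(pieces)
        # Stop sequences are matched on text, so they may span several tokens.
        for s in stop:
            cut = text.find(s)
            if cut != -1:
                return text[:cut], "stop_sequence"
        if len(pieces) >= max_tokens:
            # Hard limit: applies even if no other condition has fired.
            return text, "max_tokens"
    return "".join(pieces), "stream_exhausted"


if __name__ == "__main__":
    print(run_until_stop(["a", "b", "<eos>", "c"], max_tokens=10))
    print(run_until_stop(["a", "b", "c", "d"], max_tokens=2))
    print(run_until_stop(["Hello", " world", "\n", "User:", " hi"],
                         max_tokens=10, stop=["\nUser:"]))
```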
The K and V vectors for previous tokens do not change between generation steps. Cache them; each new step only computes K and V for the one new token.
| Without KV cache | With KV cache |
|---|---|
| Quadratic in output length | Roughly constant per new token after the first pass |
| Naive prediction loop | Standard production setup |
| Streaming would feel slower as the response grows | Streaming feels constant after the prefill delay |
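A rough way to see the difference is to count how many per-token K/V computations each loop performs. This is a toy sketch: `project_kv` and `next_token` are hypothetical stand-ins for the real projections and the rest of the model, and only the K/V work is counted.

```python
def project_kv(token):
    """Hypothetical stand-in for computing one token's K and V vectors."""
    return (token * 2, token * 3)


def next_token(kv):
    """Hypothetical stand-in for attention + sampling over all K/V pairs."""
    return sum(k + v for k, v in kv) % 10


def generate_without_cache(prompt, new_tokens):
    """Recompute K and V for the whole sequence at every step."""
    tokens, calls = list(prompt), 0
    for _ in range(new_tokens):
        kv = [project_kv(t) for t in tokens]
        calls += len(tokens)
        tokens.append(next_token(kv))
    return tokens, calls


def generate_with_cache(prompt, new_tokens):
    """Keep K and V for earlier tokens; compute them only for unseen tokens."""
    tokens, calls, cache = list(prompt), 0, []
    for _ in range(new_tokens):
        # First step: the whole prompt (prefill). Later steps: one new token.
        for t in tokens[len(cache):]:
            cache.append(project_kv(t))
            calls += 1
        tokens.append(next_token(cache))
    return tokens, calls


if __name__ == "__main__":
    prompt = [1, 2, 3, 4]
    print(generate_without_cache(prompt, 5))
    print(generate_with_cache(prompt, 5))
```

Both loops produce the same tokens. For a prompt of length n and m new tokens, the uncached loop does n·m + m(m−1)/2 K/V computations, while the cached loop does n + (m−1).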
| Field in the API | What it controls |
|---|---|
| temperature | The sharpening or flattening described above |
| top_p | Nucleus sampling threshold |
| top_k | Top-k threshold (less common in modern APIs; some still expose it) |
| max_tokens | Hard output-length limit |
| stop | Custom stop sequences (string or array of strings) |
| stream | Whether to stream tokens as they generate or wait for the full response |
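As a rough illustration, a request body using these fields might look like the following. The field names come from the table; the `model` and `prompt` keys and all values are placeholders, and any given API may name, nest, or restrict these fields differently.

```python
import json

# Hypothetical request body; not tied to any specific provider's API.
request_body = {
    "model": "example-model",      # placeholder model name
    "prompt": "Summarize the following text: ...",  # placeholder input
    "temperature": 0.7,            # below 1.0: slightly sharpened distribution
    "top_p": 0.9,                  # nucleus sampling threshold
    "max_tokens": 256,             # hard output-length limit
    "stop": ["\n\n"],              # custom stop sequences
    "stream": True,                # stream tokens as they are generated
}

if __name__ == "__main__":
    print(json.dumps(request_body, indent=2))
```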
| Pitfall | Reality |
|---|---|
| The model “knows” its answer when it starts typing | No. Every token is sampled fresh from the next-token distribution. No plan, no draft. |
| Higher temperature equals more creative | No. Higher temperature equals more random. If a model is bad at a task, raising temperature makes it bad in a more random way, not more competent. |
| Greedy is always right for “deterministic” tasks | Right for short structured output. Wrong for longer multi-step reasoning, where low-temp top-p escapes bad local choices. |
| Streaming is a UI animation | No. Each streamed token is a real forward pass producing real output. The streaming is the work. |
| Output and input tokens cost the same | No. Output is sequential compute; input parallelizes during prefill. APIs typically charge 3x to 5x more per output token. |
Logits: unnormalized scores over the vocabulary, output of the final linear layer.
Softmax: turns logits into a probability distribution (all values 0 to 1, summing to 1.0).
Greedy: decoding strategy that always picks the highest-probability token.
Top-k: decoding strategy that restricts candidates to the top k tokens, then samples.
Top-p (nucleus): decoding strategy that restricts candidates to the smallest set with cumulative probability ≥ p, then samples.
Temperature (T): logit rescale before softmax; sharpens (T<1) or flattens (T>1) the distribution.
EOS token: the model’s special “end of text” token; sampling it stops the generation loop.
KV caching: reusing the K and V vectors computed for previous positions so each new token only requires new K and V for itself.
Prefill: the first forward pass over the full input prompt; parallelizable.
Generation step: one iteration of the prediction loop; produces one new token.
The model does not write.
It predicts one token at a time.