Cheatsheet: Inference

| Phase | What it does | Cost profile |
| --- | --- | --- |
| Prefill | Process the whole prompt in one parallel forward pass; build initial KV cache | Compute-bound (like a training step) |
| Decode | Generate one token at a time, autoregressively | Memory-bound: weights + KV cache read every step |

Most user-visible time on long responses is in decode.
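
A back-of-the-envelope sketch of why decode tends to dominate: prefill is estimated as compute-bound (roughly 2 FLOPs per parameter per prompt token, ignoring attention), and each decode step as memory-bound (all weights plus the KV cache are read once). The model size, hardware figures, and function names below are hypothetical and chosen only for illustration; both estimates are idealized lower bounds that assume full hardware utilization.

```python
def prefill_seconds(n_params: float, prompt_tokens: int, peak_flops: float) -> float:
    """Compute-bound estimate: about 2 FLOPs per parameter per prompt token."""
    return 2.0 * n_params * prompt_tokens / peak_flops


def decode_step_seconds(weight_bytes: float, kv_bytes: float, hbm_bytes_per_s: float) -> float:
    """Memory-bound estimate: each step streams the weights and the KV cache."""
    return (weight_bytes + kv_bytes) / hbm_bytes_per_s


# Hypothetical numbers, for illustration only.
N_PARAMS = 7e9                 # 7B-parameter model
WEIGHT_BYTES = N_PARAMS * 2    # fp16 weights: 2 bytes per parameter
KV_BYTES = 2e9                 # KV cache of one long request
PEAK_FLOPS = 300e12            # 300 TFLOP/s of compute
HBM_BANDWIDTH = 2e12           # 2 TB/s of memory bandwidth

prefill = prefill_seconds(N_PARAMS, prompt_tokens=1000, peak_flops=PEAK_FLOPS)
step = decode_step_seconds(WEIGHT_BYTES, KV_BYTES, HBM_BANDWIDTH)
decode = 500 * step            # 500 generated tokens at batch size 1

print(f"prefill ~ {prefill:.3f} s, decode ~ {decode:.1f} s")
```

Under these assumed numbers, a 1000-token prompt prefills in about 0.05 s, while 500 generated tokens take about 4 s.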

KV-cache memory ~ 2 (K and V) x sequence_length x n_layers x n_kv_heads x head_dim x bytes/value, per sequence
  • Stored once per token per layer; read at every later decode step.
  • At long contexts can exceed the model’s own weights.
  • Reads are the bandwidth bottleneck of decode.
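
The sizing rule above can be turned into a small calculator. This is a minimal sketch: the function name and the example configuration (32 layers, 32 KV heads, head dimension 128, fp16 values) are assumptions for illustration, not figures from the source. The factor of 2 counts keys and values separately.

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2, batch_size: int = 1) -> int:
    """Bytes needed to cache keys and values for `batch_size` sequences."""
    keys_and_values = 2  # one K tensor and one V tensor per layer
    return (keys_and_values * batch_size * seq_len * n_layers
            * n_kv_heads * head_dim * bytes_per_value)


# Hypothetical model: 32 layers, 32 KV heads, head_dim 128, fp16 cache.
full = kv_cache_bytes(seq_len=4096, n_layers=32, n_kv_heads=32, head_dim=128)
# Same model with grouped-query attention: 8 KV heads shared by the 32 query heads.
gqa = kv_cache_bytes(seq_len=4096, n_layers=32, n_kv_heads=8, head_dim=128)

print(full / 2**30, "GiB per 4096-token sequence")        # 2.0
print(gqa / 2**30, "GiB per 4096-token sequence (GQA)")   # 0.5
```

Because the cache grows linearly with both sequence length and batch size, a few dozen concurrent long requests at this assumed configuration would need more memory than the fp16 weights of a 7B-parameter model.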

| Technique | What it attacks | How |
| --- | --- | --- |
| Continuous batching | Per-token weight load | Amortize weight load across many concurrent requests; keep the running batch full by adding/removing requests live |
| Paged attention | Wasted cache memory | KV cache as virtual-memory pages; allocate only as needed; release on finish |
| Speculative decoding | Per-step autoregression | Small draft proposes -> big model verifies in one parallel pass; multiple tokens per pass, identical output distribution |
| Quantization (int8/int4) | Bytes-per-step HBM traffic | Smaller weights/cache = less HBM bandwidth per decode step |
| GQA (lesson 4) | KV-cache size itself | Share K/V across head groups; cache shrinks severalfold |
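
The "identical output distribution" property of speculative decoding can be checked numerically for a single token position. The sketch below assumes the standard accept/reject rule from the speculative sampling literature (accept a drafted token x with probability min(1, p(x)/q(x)); on rejection, resample from the renormalized residual max(0, p - q)). The source does not spell this rule out, and the function name and example probabilities are hypothetical.

```python
def speculative_token_distribution(p: list[float], q: list[float]) -> list[float]:
    """Exact distribution of the token emitted by draft-then-verify at one position.

    p: target (big) model probabilities; q: draft (small) model probabilities.
    """
    # Draft proposes x ~ q and the target accepts with probability min(1, p[x]/q[x]),
    # so the probability of proposing AND accepting x is min(p[x], q[x]).
    accepted = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_mass = 1.0 - sum(accepted)

    # On rejection, the target resamples from max(0, p - q), renormalized.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    z = sum(residual)
    if z == 0.0:  # draft and target agree exactly; nothing is ever rejected
        return accepted
    return [a + reject_mass * r / z for a, r in zip(accepted, residual)]


target = [0.6, 0.3, 0.1]   # hypothetical big-model distribution
draft = [0.2, 0.5, 0.3]    # hypothetical draft-model distribution
emitted = speculative_token_distribution(target, draft)

# The emitted distribution matches the target, however poor the draft is.
assert all(abs(e - t) < 1e-9 for e, t in zip(emitted, target))
```

A poor draft lowers the acceptance rate (here only 60% of proposals are accepted) and therefore the speedup, but under this rule it does not change what the big model would have sampled.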

Inference parallelism (different from training)

| Scheme | Training | Inference |
| --- | --- | --- |
| Data parallel | Replicate + batch-split | Replicas behind a load balancer |
| Tensor parallel | Per-layer comm, within node | Comfortable, preserves per-request latency |
| Pipeline parallel | Stages across nodes, microbatched | Less common: pipeline bubbles hurt latency |

Inference optimizes latency, not just throughput.

A “10x throughput” claim is almost always a recipe:

  • continuous batching (biggest single contributor)
  • paged attention (concurrent users)
  • quantization (bandwidth)
  • speculative decoding (tokens/pass)
  • often GQA in the model itself

Ask: which ingredients, and against what baseline?
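
To make the baseline question concrete for the first ingredient, here is a toy comparison of static and continuous batching. It is a sketch under strong assumptions: each decode step costs the same regardless of how full the batch is (the memory-bound regime), prefill and KV-cache limits are ignored, and every request needs at least one token. The function names and request lengths are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Each fixed batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Finished requests leave the batch and queued ones join at every step."""
    queue = deque(lengths)
    running: list[int] = []  # tokens still to generate for each active request
    steps = 0
    while queue or running:
        # Fill free slots from the queue before running the next step.
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        steps += 1  # one decode step emits one token for every active request
        running = [r - 1 for r in running if r > 1]  # drop requests that just finished
    return steps


# Hypothetical workload: two long requests mixed with short ones.
lengths = [10, 2, 2, 2, 10, 2, 2, 2]
print(static_batching_steps(lengths, batch_size=4))      # 20
print(continuous_batching_steps(lengths, batch_size=4))  # 12
```

In this toy workload the same 32 tokens are produced in 12 steps instead of 20, because short requests no longer hold slots idle while a long one finishes. The gain depends heavily on how varied the request lengths are, which is one reason the baseline matters.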

  • Prefill / decode: the two inference phases (parallel prompt processing / autoregressive generation).
  • KV cache: stored keys/values for previous tokens, per layer.
  • Continuous batching: dynamic in-flight batching that keeps the GPU full.
  • Paged attention: KV cache as virtual memory pages.
  • Speculative decoding: draft + verify for many tokens per big-model pass.
  • Quantization: serve at lower precision (int8/int4) to cut HBM traffic.
  • Stanford CS336, Lecture 10 (Inference), by Hashimoto and Liang. cs336.stanford.edu. Independent structural mirror in original prose; see references.