DRAFT MODEL (small): generates K tokens autoregressively (fast)
↓
TARGET MODEL (big): one forward pass on prompt + K draft tokens
↓ produces K+1 probability distributions (one per draft position + 1 next)
↓
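ACCEPTANCE-REJECTION:
  for each draft token, left to right:
    compare draft prob to target prob at that position
    accept with probability min(target_prob/draft_prob, 1)
    if rejected: resample from the corrected distribution, normalized max(0, target - draft), and discard the remaining draft tokens
  if all K accepted: sample one extra token from the (K+1)th target distribution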
↓
OUTPUT (same distribution as target-model-alone, faster)
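The acceptance-rejection step in the diagram can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the function names (`residual`, `sample`, `verify`) are hypothetical, and distributions are plain lists of probabilities over a toy vocabulary. It assumes each draft token was actually sampled from its draft distribution, so its draft probability is nonzero.

```python
import random

def residual(p, q):
    """Corrected distribution after a rejection: normalized max(0, p - q)."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    # If p == q the residual is empty, but then a rejection cannot occur; fall back to p.
    return [d / total for d in diff] if total > 0 else list(p)

def sample(dist, rng):
    """Draw one token id from a list of probabilities."""
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]

def verify(draft_tokens, q_dists, p_dists, rng):
    """Return the tokens emitted by one verification round.

    draft_tokens: K token ids proposed by the draft model
    q_dists:      K draft distributions (one per draft position)
    p_dists:      K+1 target distributions (one per draft position + 1 next)
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_dists[i], q_dists[i]
        # Accept with probability min(target_prob / draft_prob, 1).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
        else:
            # Rejected: resample from the corrected distribution and stop,
            # discarding the remaining draft tokens.
            out.append(sample(residual(p, q), rng))
            return out
    # Every draft token was accepted: take one extra token from the last target distribution.
    out.append(sample(p_dists[len(draft_tokens)], rng))
    return out
```

Each round emits between 1 and K+1 tokens. When draft and target agree exactly, every draft token is accepted; when the target assigns zero probability to a draft token, it is always rejected and replaced.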
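Why it works: at small batch sizes, LLM inference is memory-bound. Each forward pass is dominated by the time to load the model weights and KV cache from memory, not by arithmetic. A single target-model pass over K tokens therefore costs roughly the same as a pass over 1 token, as long as K is small enough that the pass stays memory-bound. Every accepted draft token is an extra token produced for the same memory round trip.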
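A back-of-the-envelope estimate makes the argument concrete. The sketch below uses a simple roofline-style model (pass time is the larger of memory time and compute time) with hypothetical hardware and model numbers chosen only for illustration; it ignores the KV cache and activations. The `expected_tokens` helper additionally assumes each draft token is accepted independently with the same probability `alpha`, which is a simplification.

```python
# All constants are illustrative assumptions, not measurements.
PARAMS = 7e9            # model parameters
BYTES_PER_PARAM = 2     # 16-bit weights
BANDWIDTH = 1e12        # memory bandwidth in bytes/s
PEAK_FLOPS = 100e12     # peak arithmetic throughput in FLOP/s

def pass_time(num_tokens):
    """Estimated seconds for one forward pass over num_tokens new tokens."""
    memory_time = PARAMS * BYTES_PER_PARAM / BANDWIDTH      # load every weight once
    compute_time = 2 * PARAMS * num_tokens / PEAK_FLOPS     # ~2 FLOPs per parameter per token
    return max(memory_time, compute_time)

def expected_tokens(alpha, k):
    """Expected tokens emitted per target pass with k draft tokens and acceptance rate alpha."""
    if alpha == 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)
```

Under these assumed numbers, `pass_time(1)` and `pass_time(5)` are identical because memory time dominates both, while a pass over 1000 tokens becomes compute-bound and takes longer.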
Autoregressive (AR): one token at a time, each from a full forward pass conditional on all previous tokens. The default LLM generation paradigm.
Speculative decoding: a serving optimization. A small draft model proposes K tokens; the big target model verifies them in one pass; acceptance-rejection guarantees the same output distribution as sampling from the target model alone.
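Multi-token prediction (MTP): the target model carries extra output heads that predict several future tokens at once. At inference time those heads can serve as the draft mechanism for speculative decoding, so no separate small model is needed.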
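Diffusion LLM (DLLM): non-autoregressive generation. Start from an all-masked sequence and refine it across T denoising steps, filling in many positions in parallel at each step. (T is the number of denoising steps, unrelated to the K draft tokens of speculative decoding.)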
Masked diffusion model (MDM): alternative term for DLLM emphasizing the masking mechanism.
[MASK] token: the discrete-text equivalent of “noise” in image diffusion. Tokens are progressively replaced with [MASK] during training’s forward process.
Memory-bound inference: the property that, at small batch sizes, the dominant cost of LLM decoding is loading weights and KV cache from memory, not compute. The reason speculative decoding works.
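The forward process described in the [MASK] entry can be illustrated with a short sketch. It assumes the common formulation in which each token is independently replaced with [MASK] with probability `t`, where `t` is the noise level; the function name `forward_mask` is hypothetical.

```python
import random

MASK = "[MASK]"

def forward_mask(tokens, t, rng):
    """Replace each token with MASK independently with probability t (0 <= t <= 1)."""
    return [MASK if rng.random() < t else tok for tok in tokens]
```

At `t = 0` the sequence is untouched and at `t = 1` it is fully masked, which is the starting point for generation.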
Autoregressive generation (one token at a time) is the default. It is not the only option. Speculative decoding makes autoregressive generation faster without changing the output distribution. Diffusion LLMs change the paradigm: start from an all-masked sequence and refine it in parallel passes.
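The diffusion-style loop can be sketched as follows. This is a toy illustration under stated assumptions: `toy_denoiser` is a hypothetical stand-in for a real model that returns a predicted token and a confidence for every position, and unmasking the highest-confidence positions first is one common heuristic, not the only schedule.

```python
MASK = "[MASK]"

def toy_denoiser(seq):
    """Hypothetical model stub: returns (token, confidence) for every position."""
    target = ["the", "cat", "sat", "down"]
    return [(target[i % len(target)], 1.0 - 0.1 * i) for i in range(len(seq))]

def diffusion_decode(length, steps, denoiser):
    """Start from an all-masked sequence and fill it in over a fixed number of parallel passes."""
    seq = [MASK] * length
    for step in range(steps):
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        if not masked:
            break
        preds = denoiser(seq)  # one parallel pass predicts every position
        # Unmask an equal share of the remaining positions each step (ceiling division).
        n_unmask = -(-len(masked) // (steps - step))
        by_confidence = sorted(masked, key=lambda i: preds[i][1], reverse=True)
        for i in by_confidence[:n_unmask]:
            seq[i] = preds[i][0]
    return seq
```

With 4 positions and 2 steps, the loop fills two positions per pass, so the whole sequence is produced in 2 model calls instead of the 4 an autoregressive decoder would need.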