Summary: New ways to generate: speculative decoding and diffusion LLMs

Standard LLM generation is autoregressive: one token at a time, each one a full forward pass through the network. Almost every LLM you currently use generates text this way. The field is exploring two specific alternatives: one changes how fast generation runs, the other changes whether it is sequential at all.

Speculative decoding speeds up autoregressive generation. A small “draft” model proposes the next K tokens autoregressively. The big “target” model then scores all K tokens in a single forward pass. An acceptance-rejection scheme keeps each draft token with a probability that reflects how well the draft and target distributions agree on it; at the first rejection, a replacement token is sampled from an adjusted target distribution and the remaining draft tokens are discarded. The math guarantees the output distribution matches what the target model would have produced alone, so quality is preserved exactly while throughput goes up substantially. Now standard in production LLM serving.
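To make the acceptance-rejection step concrete, here is a minimal Python sketch of one draft-then-verify round. It assumes toy, context-independent distributions passed in as plain lists (a real system would obtain them from model forward passes), and the names `sample`, `speculative_step`, and `first_token_freqs` are illustrative rather than taken from any library. Each draft token x is kept with probability min(1, p(x)/q(x)); at the first rejection, a replacement is drawn from the residual max(0, p - q) and the round ends.

```python
import random
from collections import Counter

def sample(dist, rng):
    """Draw an index from a list of (possibly unnormalized) weights."""
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]

def speculative_step(draft_dists, target_dists, rng):
    """One draft-then-verify round on toy distributions.

    draft_dists  : K draft distributions q_i, one per proposed position
    target_dists : K + 1 target distributions p_i; the extra one supplies
                   a bonus token when all K draft tokens are accepted
    """
    out = []
    for q, p in zip(draft_dists, target_dists):
        x = sample(q, rng)                          # draft proposes x ~ q
        if rng.random() < min(1.0, p[x] / q[x]):    # accept with prob min(1, p/q)
            out.append(x)
            continue
        # Rejected: resample from the residual max(0, p - q) and stop.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(sample(residual, rng))
        return out
    # All K accepted: the target pass also yields one extra token.
    out.append(sample(target_dists[len(draft_dists)], rng))
    return out

def first_token_freqs(trials=20000, seed=0):
    """Empirical distribution of the first emitted token (toy check)."""
    rng = random.Random(seed)
    q = [0.3, 0.3, 0.4]   # assumed draft distribution
    p = [0.6, 0.3, 0.1]   # assumed target distribution
    counts = Counter(
        speculative_step([q, q], [p, p, p], rng)[0] for _ in range(trials)
    )
    return [counts[i] / trials for i in range(3)]
```

Calling `first_token_freqs()` should return frequencies close to the target distribution `[0.6, 0.3, 0.1]` even though the draft distribution is quite different, which is the exactness property in miniature.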

Diffusion LLMs (DLLMs) abandon autoregressive generation entirely. Borrowing from image diffusion, they start from a fully-masked output sequence and refine it across K denoising steps. Each step considers the whole sequence at once. Because K can be much smaller than the output length, speedups of roughly 10× over autoregressive decoding are possible on long outputs. Bidirectional context makes fill-in-the-middle tasks natural. Quality is approaching autoregressive parity but not yet at the frontier as of late 2025.
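The mask-and-refine loop can be sketched in a few lines of Python. This is a toy under stated assumptions, not a real DLLM: `toy_denoiser` is a hypothetical stand-in for a trained model (it simply knows a reference sentence and attaches random confidences), and the schedule of committing the most confident masked positions at each step is one plausible choice, not something the lesson specifies.

```python
import math
import random

MASK = "<mask>"

def toy_denoiser(seq, reference, rng):
    """Hypothetical stand-in for a trained denoiser: returns a
    (token, confidence) guess for every position in one call."""
    return [(reference[i], rng.random()) for i in range(len(seq))]

def diffusion_decode(length, steps, denoiser):
    """Start fully masked, then fill in tokens over `steps` parallel passes."""
    seq = [MASK] * length
    for step in range(steps):
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        if not masked:
            break
        guesses = denoiser(seq)  # one pass predicts all positions at once
        # Commit an even share of the remaining masked positions each step.
        n_commit = math.ceil(len(masked) / (steps - step))
        best = sorted(masked, key=lambda i: guesses[i][1], reverse=True)
        for i in best[:n_commit]:
            seq[i] = guesses[i][0]  # unmask the most confident positions
    return seq

def demo(steps=3, seed=0):
    reference = "the cat sat on the mat this morning".split()
    rng = random.Random(seed)
    return diffusion_decode(
        len(reference), steps, lambda s: toy_denoiser(s, reference, rng)
    )
```

The point of the sketch is the cost structure: `demo()` fills eight positions with three denoiser calls, where an autoregressive loop would need eight. Positions are also filled in confidence order rather than left to right, which is why fill-in-the-middle comes naturally.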

The lecturer’s “writing a speech” analogy for DLLMs: when you write a speech, you don’t produce it linearly from word one. You sketch a plan, draft each section roughly, then refine. Diffusion generation works the same way: a coarse first pass establishes structure, subsequent passes add detail.

This summary is the scan-it-in-five-minutes version. The full lesson covers the speculative-decoding mechanism in detail, the “noise is to images what mask is to text” framing, and the practical guidance on when each alternative fits.

Core ideas

Autoregressive default. One token per forward pass. Sequential, left-to-right. Every standard LLM works this way.
Speculative decoding mechanism. Small draft model proposes K tokens. Big target model verifies all K in one pass. Acceptance-rejection scheme guarantees same output distribution as target-alone. Net: several tokens per target-model pass instead of one (up to K + 1 when every draft token is accepted).
Why speculative decoding is free quality-wise. The math is designed so the marginal distribution over generated tokens matches what the target would have produced. Not “approximately” or “usually.” Exactly.
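Why it works mechanically. LLM decoding is memory-bandwidth-bound at scale: most of the time in a forward pass goes to loading weights, not to arithmetic. A single big-model forward pass over K tokens is therefore roughly as expensive as one over a single token, for small K. So verifying K draft tokens at once is nearly free.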
Multi-token prediction variant. Embed the draft mechanism inside the target model with multiple heads. Same idea, no separate small model needed.
Speculative decoding is now standard. Much frontier API serving is believed to use it, invisibly to the user. Open-source frameworks (vLLM, TGI, TensorRT-LLM) expose it.
Diffusion LLMs replace the paradigm. Start from all-mask sequence. Run K denoising steps; each step predicts all positions in parallel. Output emerges from coarse-to-fine refinement.
“Noise is to images what mask is to text.” The translation from image diffusion to text diffusion. Tokens are discrete; you can’t add Gaussian noise to them, but you can mask them.
DLLM advantages. Speed (up to ~10× faster on long outputs). Bidirectional context (fill-in-the-middle natural). Coarse-to-fine refinement.
DLLM limitations. Quality not yet at frontier, but closing. Many post-2022 LLM techniques (CoT, reasoning chains, RLHF) need adaptation. Production tooling immature.
Pitfall: confusing speculative decoding with quantization/distillation. All improve throughput but for different reasons; they can be combined, but they are not the same.
Pitfall: treating DLLMs as production-ready. They are not for most use cases as of late 2025.
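The “many tokens per target-model pass” claim can be made roughly quantitative. The sketch below assumes each draft token is accepted independently with a fixed probability `alpha` (a simplification; real acceptance rates vary by position and content) and that verifying K tokens costs about one target pass, as the memory-bound argument above suggests. The function names and the `draft_cost` parameter are illustrative.

```python
def expected_tokens_per_pass(alpha, k):
    """Expected tokens emitted per target pass: 1 + alpha + ... + alpha**k.

    alpha : assumed per-token acceptance probability (0 <= alpha <= 1)
    k     : number of draft tokens proposed per round
    """
    if alpha == 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def estimated_speedup(alpha, k, draft_cost):
    """Rough speedup over plain autoregressive decoding.

    draft_cost : cost of one draft-model pass relative to one target pass
                 (an assumed ratio, e.g. 0.05 for a much smaller draft model)
    Each round costs k draft passes plus one target pass.
    """
    return expected_tokens_per_pass(alpha, k) / (k * draft_cost + 1)
```

Under this model, a draft that is accepted half the time with K = 2 yields 1 + 0.5 + 0.25 = 1.75 tokens per target pass. The speedup comes from the acceptance rate, not from changing the model's weights or outputs, which is one way to see why speculative decoding is a different lever from quantization or distillation.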

What changes for you

After this lesson, “frontier model is X faster now” announcements stop being mysterious. Most are speculative-decoding wins, often invisible to the user. The “one token at a time” mental model becomes one option among several. And when a code-completion tool seems eerily good at filling in the middle of a function, the underlying generation might not be autoregressive at all; diffusion LLMs are starting to show up in those niches.

Autoregressive (one token at a time) is the default. It is not the only option.
Speculative decoding makes autoregressive faster without changing the output distribution.
Diffusion LLMs change the paradigm: start from all-mask, refine in parallel passes.