Speculative decoding and diffusion LLMs

For the entire track so far, we have assumed text generation works one way: autoregressive. The model predicts the next token from everything before it. Then it predicts the next token from the new context. Repeat until the end-of-sequence token. Each token is one full forward pass through the network. This is how every LLM you have ever used produces output.

That assumption is not the only option. The field is exploring two specific alternatives. One changes how tokens are produced. The other changes whether they are produced one at a time. Both are recent enough to still count as “frontier directions.” And both have clear advantages worth understanding.

Speculative decoding keeps autoregressive generation but speeds it up. A small “draft” model proposes several tokens ahead; the big “target” model verifies all of them in a single forward pass. The acceptance rule guarantees that the output distribution matches what the target model would have produced alone, so you get the same quality at substantially higher throughput.

Diffusion LLMs (DLLMs) drop autoregressive generation entirely. They borrow from image diffusion models. They start from a fully-masked output sequence and predict all tokens in parallel, refining over a small number of denoising steps. They are reportedly around 10 times faster than autoregressive models on long outputs, and their strengths fall in different places. Fill-in-the-middle code completion benefits especially.

This is lesson 5 of Phase 7, the second-to-last lesson on the track. The closer pulls together every safety thread woven through Phases 4 to 7. By the end of this lesson, you will recognize both alternatives by name, understand why each is interesting, and know when each might matter.

Speculative decoding

The motivation is a mechanical fact about LLM inference. At the scale of frontier models, token-by-token inference is memory-bound, not compute-bound. Moving the model’s weights from GPU memory into the chip’s compute units takes longer than the matrix multiplications themselves. So a single big-model forward pass over several tokens at once costs about the same as a pass over a single token. The bottleneck is the memory load, not the compute.

This is the unlock. If you can somehow propose several tokens at once and verify them in a single big-model pass, you get many tokens per memory-bound roundtrip instead of one.
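A back-of-the-envelope calculation makes the point concrete. The sketch below uses round, assumed numbers (a 70-billion-parameter model in 16-bit weights, 3 TB/s of memory bandwidth, 10^15 operations per second); they are illustrative, not measurements of any particular GPU. It also ignores attention, the KV cache, and batching, so treat it as a rough lower bound only.

```python
# Illustrative, ASSUMED numbers; not measurements of any specific system.
params = 70e9            # model parameters
bytes_per_param = 2      # 16-bit weights
mem_bandwidth = 3e12     # bytes per second from GPU memory
peak_flops = 1e15        # floating-point operations per second

# Time to stream every weight through the compute units once.
weight_load_s = params * bytes_per_param / mem_bandwidth

# Roughly 2 operations (multiply + add) per parameter per token.
compute_s_per_token = 2 * params / peak_flops


def pass_time(n_tokens):
    """Rough lower bound on one forward pass over n_tokens positions:
    whichever is slower, loading the weights or doing the arithmetic."""
    return max(weight_load_s, n_tokens * compute_s_per_token)


# How many tokens fit in one pass before compute becomes the bottleneck.
break_even_tokens = weight_load_s / compute_s_per_token
```

With these assumed numbers, loading the weights takes about 47 ms while the arithmetic for one token takes about 0.14 ms, so `pass_time(1)` and `pass_time(8)` come out identical. Verifying eight draft tokens in one pass is, to a first approximation, free.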

The mechanism (Leviathan et al. 2023):

A small “draft” model generates the next K tokens autoregressively. Because the draft model is small, this is fast.
The big “target” model processes all K draft tokens (plus the prefix) in a single forward pass. This produces K+1 probability distributions: one for each of the K positions, and one for the next position after them.
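An acceptance-rejection scheme compares, position by position, the probability the draft model gave each proposed token with the probability the target model gives it. A token whose target probability matches or exceeds its draft probability is always accepted. Otherwise it is accepted with probability equal to the ratio of target to draft probability. At the first rejection, the remaining draft tokens are discarded and a replacement token is sampled from the positive part of the difference between the target and draft distributions, renormalized. If all K tokens are accepted, one extra token is sampled from the target’s distribution for the next position.
The resulting output is distributed exactly as if it had been sampled from the target model directly. This is the key claim of the speculative-decoding paper. The acceptance-rejection scheme is built so that the distribution of generated tokens matches what the target model would have produced on its own.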
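The acceptance-rejection step can be sketched in a few lines. This is a minimal illustration, not any framework’s implementation: distributions are plain Python lists, and the function names (`sample`, `speculative_step`) are made up for this example. Here p is the target distribution and q is the draft distribution at a given position. A draft token is accepted with probability min(1, p/q); on the first rejection, a replacement is drawn from the renormalized positive part of p - q.

```python
import random


def sample(weights, rng):
    """Draw an index from a discrete distribution (weights need not sum to 1)."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def speculative_step(draft_tokens, draft_dists, target_dists, rng):
    """One round of speculative sampling (illustrative sketch).

    draft_tokens: K token ids proposed by the draft model.
    draft_dists:  K distributions q_i that the draft tokens were sampled from.
    target_dists: K+1 distributions p_i from the single target-model pass.
    Returns the tokens to append to the output (between 1 and K+1 of them).
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_dists[i], draft_dists[i]
        # Accept the draft token with probability min(1, p[tok] / q[tok]).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # Rejected: resample from max(0, p - q) and drop later draft tokens.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(sample(residual, rng))
        return out
    # All K accepted: the same target pass also yields one extra token.
    out.append(sample(target_dists[len(draft_tokens)], rng))
    return out
```

When the draft and target distributions are identical, every draft token is accepted and the round returns K+1 tokens. When the target gives a draft token zero probability, that token is always rejected and replaced.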

Net effect: when the draft and target models agree (most tokens), the system produces multiple tokens per target-model forward pass. When they disagree (occasional tokens), the pass still yields the tokens accepted so far plus one corrected token, so the worst case is one token per pass. Generation in practice is typically a few times faster than naive autoregressive generation (Leviathan et al. report roughly 2 to 3 times), with no quality loss.
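To get a feel for the numbers, suppose each draft token is accepted independently with the same probability alpha. That is a simplifying assumption, and the function name below is made up for this example. The expected number of tokens per target pass is then 1 + alpha + alpha^2 + ... + alpha^K: the one token every pass guarantees, plus each draft token weighted by the chance that it and all earlier ones were accepted.

```python
def expected_tokens_per_pass(alpha, k):
    """Expected tokens produced per target-model pass, ASSUMING each of the
    k draft tokens is accepted independently with probability alpha.
    Ignores the (smaller) cost of running the draft model itself."""
    return sum(alpha ** i for i in range(k + 1))
```

With alpha = 0.8 and K = 4 this gives about 3.36 tokens per pass. With alpha = 0 it falls back to exactly 1, which is plain autoregressive generation.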

Speculative decoding is now standard in production LLM serving. Hosted APIs can apply it without any visible change for the user, and several open-source serving frameworks (vLLM, TGI, TensorRT-LLM) expose it as a configuration option.

A variant called multi-token prediction builds the draft step into the target model itself. The model has several “heads” on top of its final-layer representation. Each head predicts a different future position. At inference, all heads produce candidates at once, and a verification step keeps the candidates that the model’s ordinary next-token prediction agrees with. The advantage: you do not need a separate small model. The technique has been explored in Meta’s Multi-Token Prediction work (Gloeckle et al. 2024).

Diffusion LLMs (DLLMs)

The second alternative is more radical. Instead of generating one token at a time, generate all tokens at once and refine them across a few iterative steps.

The intuition borrows from image diffusion models (the architectural class behind Stable Diffusion, DALL-E 3, Flux, and most modern image generators). The Stanford lecturer’s quote from Michelangelo captures it: “The sculpture is already complete within the marble block before I start my work. I just have to chisel away the superfluous material.”

For images, diffusion works in two directions. First, you add Gaussian noise to clean training images until they are pure noise. Then you train a model to reverse the process: given a noisy image, predict the noise to remove. At inference, you start from random noise and apply the reverse process step by step. You denoise until you have a clean generated image.

For text, the question is what “noise” means. Tokens are discrete; you cannot add Gaussian noise to a sequence of discrete tokens. The current research consensus, captured in papers like LLaDA (Nie et al. 2025; reference repo ML-GSAI/LLaDA on GitHub, from Renmin University’s GSAI lab in collaboration with Ant Group), is:

Noise is to images what the mask token is to text.

Concretely, a diffusion-LLM works like this:

Forward process (training): start with a clean text sequence. Gradually replace tokens with a special [MASK] token, more aggressively at each step, until the entire sequence is masked.
Reverse process (training): train a model to predict the original tokens given a partially-masked sequence. The model has to learn to fill in masks based on the surrounding context.
Inference: start with an all-[MASK] sequence (conditioned on a prompt). Run the model for K steps (here K counts denoising steps, not draft tokens). At each step the model predicts tokens for all masked positions in parallel, keeps some of those predictions, and leaves the rest masked for later steps. After K steps, the sequence is fully unmasked and serves as the output.
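The three pieces above can be sketched as a toy loop. Everything here is a stand-in: `toy_predict` is a hypothetical placeholder for a trained model (it simply looks up a known answer and reports a made-up confidence), prompt conditioning is left out, and the “unmask the most confident positions first” schedule is one plausible choice rather than the method of any specific paper. The point is the control flow: a handful of parallel passes instead of one pass per token.

```python
import random

MASK = "[MASK]"


def forward_mask(tokens, t, rng):
    """Forward process: mask each token independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def toy_predict(seq, target):
    """HYPOTHETICAL stand-in for a trained model. Returns a (token, confidence)
    pair for every position; confidence rises as neighbours get unmasked."""
    preds = []
    for i in range(len(seq)):
        neighbours = seq[max(0, i - 1):i] + seq[i + 1:i + 2]
        conf = 0.5 + 0.25 * sum(n != MASK for n in neighbours)
        preds.append((target[i], conf))
    return preds


def generate(length, steps, predict):
    """Reverse process at inference: start all-masked, unmask over `steps` passes."""
    seq = [MASK] * length
    for step in range(steps):
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        if not masked:
            break
        preds = predict(seq)  # one parallel pass over every position
        # Unmask an equal share of what is left, most confident first.
        n_unmask = -(-len(masked) // (steps - step))  # ceiling division
        best = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:n_unmask]
        for i in best:
            seq[i] = preds[i][0]
    return seq
```

Calling `generate(8, 4, ...)` fills an eight-token sequence in four passes, two positions per pass, where autoregressive generation would need eight.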

The lecturer’s “writing a speech” analogy for why this works: when you write a speech, you don’t write linearly from the first sentence to the last. You sketch a rough plan (“I’ll cover X, then Y, then Z”), draft each section roughly, then refine. Diffusion generation works the same way: a coarse first pass produces a rough draft, subsequent passes refine.

Key advantages:

Speed. The number of forward passes is the number of diffusion steps (typically 10-50), not the number of output tokens (which could be thousands). Each pass covers the whole sequence, so a single pass costs more than one autoregressive step, but the total is reportedly around 10 times faster than autoregressive generation on long outputs.
Bidirectional context. Each step considers the entire sequence at once, including future tokens. This makes diffusion models particularly well suited to “fill-in-the-middle” tasks (where the model has to generate code in the middle of an existing function, with both prefix and suffix as context).
Coarse-to-fine refinement. The first few steps establish global structure; later steps refine local details. This is closer to how humans write than the strict left-to-right order of autoregressive generation.
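Fill-in-the-middle falls out of the same loop almost for free: keep the prefix and suffix fixed and mask only the gap. The sketch below uses integers as stand-in “tokens” and a hypothetical `toy_predict` that interpolates each masked slot from the nearest known value on both sides. A real model would be a trained network; the toy only illustrates that the suffix shapes the middle. It assumes the prefix and suffix are both non-empty.

```python
MASK = None  # stand-in for the [MASK] token in this toy example


def toy_predict(seq):
    """HYPOTHETICAL bidirectional predictor over integer 'tokens'. Each masked
    slot is interpolated from the nearest known value on BOTH sides;
    confidence is higher for slots closer to known context."""
    known = [i for i, v in enumerate(seq) if v is not MASK]
    preds = {}
    for i, v in enumerate(seq):
        if v is not MASK:
            continue
        left = max(k for k in known if k < i)
        right = min(k for k in known if k > i)
        frac = (i - left) / (right - left)
        value = round(seq[left] + frac * (seq[right] - seq[left]))
        preds[i] = (value, 1.0 / min(i - left, right - i))
    return preds


def infill(prefix, suffix, n_middle, steps, predict=toy_predict):
    """Fill n_middle masked slots between a fixed prefix and a fixed suffix."""
    seq = list(prefix) + [MASK] * n_middle + list(suffix)
    for step in range(steps):
        preds = predict(seq)  # predictions only for still-masked slots
        if not preds:
            break
        n_unmask = -(-len(preds) // (steps - step))  # ceiling division
        order = sorted(preds, key=lambda i: preds[i][1], reverse=True)
        for i in order[:n_unmask]:
            seq[i] = preds[i][0]
    return seq
```

Here `infill([1, 2], [6, 7], 3, 2)` fills the gap with 3, 4, 5, while changing the suffix to `[10, 11]` changes the filled-in middle to 4, 6, 8. A strictly left-to-right generator sees only the prefix unless the suffix is packed into the prompt by some special format.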

Current limitations:

Quality not yet at frontier. Autoregressive frontier models still beat diffusion LLMs on most benchmarks. The gap is closing. By 2026, the LLaDA2 family (a 16B “mini” model and a 100B-parameter MoE “flash” variant) shows competitive code generation. Production use is emerging. But most live code-completion (Copilot, Cursor, and similar) still runs autoregressive models. Treat diffusion text generation as an active research frontier with early traction, not a settled successor.
Existing techniques don’t transfer cleanly. Many post-2022 LLM techniques (chain-of-thought prompting, reasoning chains, RLHF-style alignment) were designed for autoregressive models. Adapting them for diffusion is active research.
Tooling is immature. Production LLM serving infrastructure assumes autoregressive generation. DLLM-specific serving is mostly research-grade as of 2025.

The technique is on the trajectory of “becoming production-ready” rather than “already production.” Worth knowing about; not yet what you reach for unless you’re specifically optimizing for fill-in-the-middle or extreme-low-latency generation.

Where each alternative fits

The two alternatives don’t compete; they solve different problems.

Speculative decoding is a serving optimization. It keeps the same model and the same output distribution. It just makes generation faster. There is no quality trade-off: the cost is some extra compute for the draft model and for rejected tokens, and the payoff is speed. Production LLM APIs use it under the hood. As a user, you benefit from it without doing anything. You cannot tell whether a given response was generated speculatively or naively.

Diffusion LLMs are a different architecture entirely. Different training, different inference, different quality profile. Worth knowing about because the field is investing in them and they may be the right answer for specific applications (especially fill-in-the-middle code generation), but they’re not a drop-in replacement for autoregressive frontier LLMs today.

A useful frame: speculative decoding is a “horizontal” innovation (faster, same outputs). DLLMs are a “vertical” innovation (different paradigm, different outputs, different applications).

Why this matters when you use AI

Three things to hold onto.

Many “frontier model is fast now” announcements are speculative-decoding wins. When a vendor says “the new version is 3 times faster,” it is often because they improved their speculative-decoding setup. Think better draft model, multi-token prediction, or smarter scheduling. The underlying model did not necessarily change.
Diffusion LLMs may be the future for some applications. Code editors that need to fill in code in the middle of an existing function, ultra-low-latency chat, and applications where the output is structured (form filling, data generation) could benefit. Deployment is still mostly at the research stage as of late 2025.
The “one token at a time” mental model is incomplete. It is right for almost everything you currently use. But it is not the only way LLMs can work. The field keeps adding new ways to generate. Your model of “what an LLM does” should leave room for the alternatives.

Common pitfalls

Three mistakes worth dodging.

Confusing speculative decoding with quantization or distillation. All three improve throughput, but for different reasons. Speculative decoding uses a small draft model alongside the big target model. Quantization uses lower-precision arithmetic for the same weights. Distillation trains a smaller model on the bigger model’s outputs to replace it entirely. These can be combined. They are not the same thing.

Treating diffusion LLMs as “ready for production today.” They are not, on most metrics. The papers report promising results that suggest they may be production-ready in the near future, but autoregressive frontier models still win on standard benchmarks. Reach for diffusion when you specifically need fill-in-the-middle or extreme low latency, not because it’s “newer.”

Overinterpreting the speech-writing analogy. The “draft, then refine” framing for DLLMs is a useful intuition, but the actual mechanism doesn’t match how humans write at every level of detail. The model isn’t producing a literal outline; it’s running through a learned denoising procedure that happens to converge from coarse to fine. The analogy explains why it works conceptually; the mechanism is its own thing.

What you should remember

Standard LLM generation is autoregressive. One token at a time, each one a full forward pass. This is what every LLM you currently use does.
Speculative decoding speeds up autoregressive generation. A small draft model proposes K tokens; the target model verifies them in a single pass. An acceptance-rejection scheme guarantees the same output distribution. Now standard in production serving.
Diffusion LLMs (DLLMs) generate non-autoregressively. Start from all-mask, refine across K steps in parallel. Reportedly around 10 times faster than autoregressive generation on long outputs. Bidirectional context. Fill-in-the-middle works naturally.
DLLMs are not yet at frontier quality but the gap is closing; they are research-stage moving toward production.
The two alternatives solve different problems. Speculative decoding is a serving optimization (same model, faster). DLLMs are an architectural alternative (different model, different outputs, different applications).

If you remember one thing

Autoregressive generation (one token at a time) is the default, but it is not the only option. Speculative decoding makes it faster without changing the output distribution; diffusion LLMs change the paradigm by starting from an all-mask sequence and refining it in parallel passes.