Speculative decoding and diffusion LLMs: brief

What you’ll learn

This is lesson 5 of Phase 7, How we judge models and where they’re going, in Track 5 (AI Foundations). So far the track has assumed that text generation works one way: autoregressively, one token at a time, with a full forward pass per token. That is not the only option, and the field is exploring two specific alternatives.

Speculative decoding keeps autoregressive generation but speeds it up: a small “draft” model proposes K tokens, and the big “target” model verifies them in a single forward pass. An acceptance-rejection scheme guarantees that the output distribution matches what the target model would produce alone, so quality is preserved. The technique is now standard in production LLM serving.

Diffusion LLMs (DLLMs) abandon autoregressive generation entirely. Borrowing from image diffusion, they start from a fully masked output sequence and refine all positions in parallel over K denoising steps. They are reportedly 10× faster than autoregressive models on long outputs, and their bidirectional context makes fill-in-the-middle code completion particularly natural. Quality is approaching that of autoregressive models but has not yet reached parity at the frontier.

By the end of this lesson you will recognize both alternatives by name, understand why each is interesting, and know when each might matter. Course materials are at cme295.stanford.edu.
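
The draft-proposes, target-verifies loop described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a production implementation: the four-token vocabulary, the functions target_probs and draft_probs, and their probabilities are all made up for the example, and the target is called in a loop where a real system would score every drafted position in one batched forward pass. The accept/reject rule shown is the commonly described one: accept a drafted token with probability min(1, p/q), and on rejection resample from the positive part of p minus q.

```python
import random

VOCAB = 4  # toy vocabulary size (assumption for this sketch)

def target_probs(context):
    """Hypothetical 'big' model: strongly prefers (last token + 1) mod VOCAB."""
    last = context[-1] if context else 0
    probs = [0.1] * VOCAB
    probs[(last + 1) % VOCAB] = 0.7
    return probs

def draft_probs(context):
    """Hypothetical 'small' model: same preference, but less confident."""
    last = context[-1] if context else 0
    probs = [0.2] * VOCAB
    probs[(last + 1) % VOCAB] = 0.4
    return probs

def sample(weights, rng):
    """Draw one token id; weights need not be normalized."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

def speculative_step(context, k, rng):
    """One draft/verify round. Returns between 1 and k + 1 new tokens."""
    # 1. The draft model proposes k tokens autoregressively.
    drafted, q_list = [], []
    ctx = list(context)
    for _ in range(k):
        q = draft_probs(ctx)
        tok = sample(q, rng)
        drafted.append(tok)
        q_list.append(q)
        ctx.append(tok)

    # 2. The target model scores all k + 1 positions.
    #    (A real system does this in a single forward pass; we loop here.)
    p_list = [target_probs(list(context) + drafted[:i]) for i in range(k + 1)]

    # 3. Accept or reject the drafted tokens left to right.
    out = []
    for i, tok in enumerate(drafted):
        p, q = p_list[i], q_list[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)  # accepted: keep the draft token
        else:
            # Rejected: resample from the residual max(0, p - q), then stop.
            residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
            out.append(sample(residual, rng))
            return out

    # 4. Every draft token was accepted: take one bonus token from the target.
    out.append(sample(p_list[k], rng))
    return out

if __name__ == "__main__":
    rng = random.Random(0)
    print(speculative_step((0,), k=4, rng=rng))
```

Each round yields at least one token, so the loop never does worse than one target call per token, and it yields up to k + 1 tokens when the draft model guesses well. In this toy setup the first emitted token follows the target distribution exactly, which is the property the acceptance-rejection step is there to preserve.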

Where this fits

This is lesson 5 of Phase 7. The previous lesson (Transformers beyond text) covered ViT and MoE: transformer adaptations for non-text modalities and for sparse parameter scaling. This lesson covers transformer-adjacent alternatives at the generation-time layer. The next lesson (Where to be careful) closes the track by pulling together every safety thread woven through Phases 4 to 7.

Before you start

Prerequisites: the transformers-beyond-text lesson is required for narrative continuity (it set up “the transformer block as a general-purpose primitive”). The decoding strategies lesson is useful since this lesson assumes you understand standard autoregressive decoding (the thing speculative decoding optimizes and DLLMs replace).

By the end, you’ll be able to

Recognize speculative decoding as a serving-time optimization that preserves the target model’s output distribution
Walk through the speculative decoding mechanism (draft model proposes, target model verifies, acceptance-rejection guarantees correctness)
Recognize diffusion LLMs as an architectural alternative that generates by denoising a masked sequence rather than by producing one token at a time autoregressively
Distinguish the kinds of problems each alternative suits (speculative decoding is “purely beneficial”; DLLMs target different applications and have a different quality profile)
Describe how the “noise is to images what mask is to text” framing translates image diffusion into text diffusion
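
To make the mask-as-noise framing concrete, here is a minimal Python sketch of the control flow only: start from a fully masked sequence, predict every masked position in parallel at each step, and commit the most confident predictions. Everything in it is an assumption made for illustration. toy_denoiser is a hypothetical stand-in that simply looks up a fixed reference string instead of running a trained model, its confidence score is invented, and the "reveal the most confident positions first" schedule is one commonly described choice rather than something this lesson specifies.

```python
import random

MASK = "_"
REFERENCE = list("hello world")  # stand-in for what a trained model would predict

def toy_denoiser(seq, rng):
    """Hypothetical model call: for every masked position, return (token, confidence).

    Confidence rises when neighbours on either side are already unmasked,
    a crude nod to bidirectional context.
    """
    preds = {}
    for i, tok in enumerate(seq):
        if tok != MASK:
            continue
        neighbours = sum(
            1 for j in (i - 1, i + 1) if 0 <= j < len(seq) and seq[j] != MASK
        )
        preds[i] = (REFERENCE[i], 0.3 * neighbours + 0.4 * rng.random())
    return preds

def diffusion_generate(length, steps, rng):
    """Unmask a fully masked sequence over a fixed number of denoising steps."""
    seq = [MASK] * length  # the text analogue of pure noise
    for step in range(steps):
        preds = toy_denoiser(seq, rng)  # all masked positions predicted at once
        if not preds:
            break
        # Reveal an even share of the remaining masks (ceiling division),
        # so everything is unmasked by the final step.
        n_reveal = -(-len(preds) // (steps - step))
        best = sorted(preds, key=lambda i: preds[i][1], reverse=True)[:n_reveal]
        for i in best:
            seq[i] = preds[i][0]
    return seq

if __name__ == "__main__":
    rng = random.Random(0)
    print("".join(diffusion_generate(len(REFERENCE), steps=4, rng=rng)))
```

The point of the sketch is the cost model: the number of model calls equals the number of denoising steps, not the number of tokens, which is where the speed advantage on long outputs would come from.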

Time and difficulty

Read time: about 12 minutes
Practice time: about 12 minutes (a self-check on both alternatives, a hands-on triage exercise on which technique fits which problem, and flashcards)
Difficulty: standard