References: New ways to generate, speculative decoding and diffusion LLMs

Source material:
• Stanford CME 295: Transformers & Large Language Models, Autumn 2025
Instructors: Afshine Amidi & Shervine Amidi, Stanford University
Course site: https://cme295.stanford.edu/
Cheatsheet: https://cme295.stanford.edu/cheatsheet/
Source lectures:
Lecture 3 (Large Language Models) for speculative decoding [01:41:36]
Lecture 9 (Current Trends) for diffusion LLMs [01:11:30 onward]
See course site at https://cme295.stanford.edu/ for the lecture URLs
License (lecture videos): as published on Stanford's public YouTube channel
License (Amidi cheatsheets): MIT
This lesson combines the speculative-decoding section of Stanford CME 295
Lecture 3 (drafting + acceptance-rejection mechanism) with the diffusion-LLM
section of Lecture 9 (Michelangelo analogy, mask-as-noise framing, LLaDA
introduction). Multi-token prediction is also covered briefly per Lecture 3.
Clawdemy provides original notes, summaries, and quizzes derived from this
material for educational purposes. All rights to the original lectures
remain with Stanford and the instructors.

The two papers at the heart of this lesson, plus the multi-token-prediction follow-up.

  • “Fast Inference from Transformers via Speculative Decoding”, Leviathan et al., 2022 (Google). The speculative-decoding paper. Section 2 has the mechanism and the acceptance-rejection scheme; the proof that the marginal distribution matches the target’s is a few lines. Worth reading even at a non-technical level; the elegance of the result is part of why the technique stuck.

  • “Better & Faster Large Language Models via Multi-token Prediction”, Gloeckle et al., 2024 (Meta). The multi-token prediction paper. Embeds the draft mechanism inside the target model via multiple heads. Useful as the natural follow-up to vanilla speculative decoding.

  • “Large Language Diffusion Models”, Nie et al., 2025. The LLaDA paper. Reportedly one of the first diffusion LLMs to approach autoregressive parity on standard benchmarks. Section 2 introduces the masked-diffusion mechanism for text, Section 3 covers the architecture and training details, and Section 4 reports benchmark results. Worth reading after this lesson; the gap between this lesson’s intuition and the paper’s mechanism is small.
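
The acceptance-rejection scheme mentioned in the speculative-decoding entry can be sketched for a single token position. This is a minimal illustration, not the paper's implementation: the toy three-token vocabulary, the function names, and the fixed probability lists `p` (target) and `q` (draft) are assumptions made for the example. In practice both distributions come from model forward passes and several drafted tokens are checked per step.

```python
import random

def speculative_step(p, q, rng):
    """One accept/reject step over a shared vocabulary.

    p: target-model probabilities, q: draft-model probabilities
    (hypothetical fixed lists here; real systems get them from the models).
    Returns (token_index, accepted_flag).
    """
    vocab = range(len(p))
    # The draft model proposes a token sampled from q.
    x = rng.choices(vocab, weights=q)[0]
    # Accept the proposal with probability min(1, p[x] / q[x]).
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    # On rejection, resample from the residual max(0, p - q).
    # rng.choices normalizes the weights, so no explicit renormalization is needed.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(vocab, weights=residual)[0], False

def empirical_distribution(p, q, n, rng):
    """Run n independent steps and return the observed token frequencies."""
    counts = [0] * len(p)
    for _ in range(n):
        token, _ = speculative_step(p, q, rng)
        counts[token] += 1
    return [c / n for c in counts]

if __name__ == "__main__":
    target = [0.6, 0.3, 0.1]   # assumed toy target distribution
    draft = [0.2, 0.5, 0.3]    # assumed toy draft distribution
    print(empirical_distribution(target, draft, 20000, random.Random(0)))
```

With enough samples the printed frequencies should sit close to the target distribution rather than the draft's, which is the property the paper proves: the draft only affects speed, not what gets sampled.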

To understand text diffusion, it helps to understand image diffusion. Two papers worth knowing.

A short list, chosen for durability.

  • Inference-time compute and “thinking time.” Speculative decoding doesn’t change the output distribution; it just makes generation faster. But other inference-time techniques (like reasoning models from Phase 6) trade compute for quality. Search terms: “test-time compute scaling,” “inference-time compute laws,” “thinking-time-vs-quality.” Active research area.

  • Quantization and distillation. Both reduce LLM inference costs, but by different means than speculative decoding. Quantization stores the same model's weights at lower precision. Distillation trains a smaller model on a bigger model’s outputs so that the smaller model can replace it. Speculative decoding keeps the big model and runs a draft model alongside it. All three can be combined; they are not interchangeable.

  • Fill-in-the-middle (FIM) for code generation. This is one of the cleaner production niches for diffusion LLMs. Search terms: “fill-in-the-middle code completion,” “bidirectional code generation,” “code editor LLM autocomplete.” Most autoregressive code models are trained with FIM-specific data; diffusion offers a different shape of the same capability.

  • Stanford CME 295 cheatsheet by the Amidi twins. MIT-licensed. Section 3 (optimizations) covers speculative decoding briefly; diffusion LLMs are not separately covered in the cheatsheet (lectures are the primary source). Worth using as a study reference for the speculative-decoding half of this lesson.
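
To make the quantization entry above concrete, here is a minimal sketch of symmetric int8 weight quantization. The per-tensor scale, the function names, and the toy weight list are assumptions for illustration; production schemes typically use per-channel or per-group scales and calibrated ranges.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: w is approximated by scale * q,
    with q an integer in [-127, 127]. (Illustrative scheme, assumed here.)"""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor; any positive scale works in that case.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integer codes."""
    return [qi * scale for qi in q]

if __name__ == "__main__":
    weights = [0.5, -1.27, 0.01, 0.0]  # assumed toy weights
    codes, scale = quantize_int8(weights)
    print(codes, scale)
    print(dequantize(codes, scale))
```

The model is unchanged apart from rounding error of at most half a quantization step per weight, which is what separates quantization from distillation (a different, smaller model) and from speculative decoding (an extra model, same outputs).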

None selected for this lesson. Vendor blog posts (Anthropic, OpenAI, DeepMind) and the academic literature are the better entry points for current state-of-the-art. The diffusion-LLM ecosystem is producing strong technical writing in research-engineering blogs that could become durable references over time.