References: New ways to generate, speculative decoding and diffusion LLMs

Source material

• Stanford CME 295: Transformers & Large Language Models, Autumn 2025
  Instructors: Afshine Amidi & Shervine Amidi, Stanford University
  Course site: https://cme295.stanford.edu/
  Cheatsheet: https://cme295.stanford.edu/cheatsheet/
  Source lectures:
    Lecture 3 (Large Language Models) for speculative decoding [01:41:36]
    Lecture 9 (Current Trends) for diffusion LLMs [01:11:30 onward]
    Lecture URLs are listed on the course site above
  License (lecture videos): as published on Stanford's public YouTube channel
  License (Amidi cheatsheets): MIT

This lesson combines the speculative-decoding section of Stanford CME 295
Lecture 3 (drafting and the acceptance-rejection mechanism) with the
diffusion-LLM section of Lecture 9 (Michelangelo analogy, mask-as-noise
framing, LLaDA introduction). Multi-token prediction is also covered
briefly, following Lecture 3.
Clawdemy provides original notes, summaries, and quizzes derived from this
material for educational purposes. All rights to the original lectures
remain with Stanford and the instructors.

Foundational papers

The two papers at the heart of this lesson, plus the multi-token-prediction follow-up.

“Fast Inference from Transformers via Speculative Decoding”, Leviathan et al., 2022 (Google). The speculative-decoding paper. Section 2 has the mechanism and the acceptance-rejection scheme; the proof that the output distribution matches the target model’s takes only a few lines. Worth reading even at a non-technical level; the elegance of the result is part of why the technique stuck.
“Better & Faster Large Language Models via Multi-token Prediction”, Gloeckle et al., 2024 (Meta). The multi-token prediction paper. Embeds the draft mechanism inside the target model via multiple heads. Useful as the natural follow-up to vanilla speculative decoding.
“Large Language Diffusion Models”, Nie et al., 2025. The LLaDA paper. Reportedly one of the first diffusion LLMs to approach autoregressive parity on standard benchmarks. Section 2 introduces the masked-diffusion mechanism for text; Section 3 has the architecture and training details; Section 4 reports benchmark results. Worth reading after this lesson; the gap between this lesson’s intuition and the paper’s mechanism is small.
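
The acceptance-rejection scheme referenced in the speculative-decoding entry above can be sketched for a single token position. This is a minimal illustration, not the paper's code: the function names and the toy dictionaries are hypothetical, and a real implementation works on logits over a full vocabulary and over several drafted tokens at once. The drafted token is accepted with probability min(1, p/q); on rejection, a replacement is drawn from the normalized positive part of (p - q). The helper at the bottom estimates the resulting distribution empirically so it can be compared with the target's.

```python
import random


def speculative_accept(p_target, q_draft, drafted_token, rng):
    """Accept or replace one drafted token (illustrative sketch).

    p_target, q_draft: dicts mapping token -> probability at this position.
    Returns (token, accepted) where accepted is True if the draft was kept.
    """
    p = p_target.get(drafted_token, 0.0)
    q = q_draft[drafted_token]  # positive, because the draft sampled this token
    if rng.random() < min(1.0, p / q):
        return drafted_token, True
    # Rejected: resample from the residual distribution norm(max(0, p - q)).
    residual = {
        tok: max(0.0, p_target[tok] - q_draft.get(tok, 0.0)) for tok in p_target
    }
    total = sum(residual.values())
    tokens = list(residual)
    weights = [residual[tok] / total for tok in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0], False


def empirical_distribution(p_target, q_draft, n_samples=100_000, seed=0):
    """Draft from q, apply the accept/reject step, and tally the final tokens."""
    rng = random.Random(seed)
    counts = {tok: 0 for tok in p_target}
    draft_tokens = list(q_draft)
    draft_weights = [q_draft[tok] for tok in draft_tokens]
    for _ in range(n_samples):
        drafted = rng.choices(draft_tokens, weights=draft_weights, k=1)[0]
        token, _accepted = speculative_accept(p_target, q_draft, drafted, rng)
        counts[token] += 1
    return {tok: count / n_samples for tok, count in counts.items()}


if __name__ == "__main__":
    # Toy distributions, chosen only for illustration.
    p = {"a": 0.6, "b": 0.3, "c": 0.1}  # target model
    q = {"a": 0.3, "b": 0.5, "c": 0.2}  # draft model
    print(empirical_distribution(p, q))
```

With enough samples, the printed frequencies should sit close to the target probabilities even though every candidate token came from the draft distribution, which is the property the paper proves.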

The image-diffusion foundation

To understand text diffusion, it helps to understand image diffusion. Two papers worth knowing.

“Denoising Diffusion Probabilistic Models”, Ho et al., 2020. The DDPM paper. Sections 2-3 introduce the forward (add noise) + reverse (denoise) framework that text diffusion adapts. The mathematical foundation everyone references.
“High-Resolution Image Synthesis with Latent Diffusion Models”, Rombach et al., 2022. The Stable Diffusion paper. Worth reading for the broader context of how image diffusion models look in practice. The core denoising-from-noise mechanism is the same one that text diffusion adapts.
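
To connect the image-diffusion forward process to the mask-as-noise framing used for text, here is a minimal sketch of a masking forward step. It assumes the simplest variant, where each token is independently replaced by a mask symbol with probability t; the `MASK` string and the `mask_forward` name are hypothetical and chosen only for illustration. Image diffusion adds Gaussian noise to pixels instead; the parallel is that t = 0 leaves the input clean and t = 1 destroys it entirely.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, used only in this sketch


def mask_forward(tokens, t, rng):
    """Corrupt a token sequence: mask each token independently with probability t.

    t = 0.0 returns the sequence unchanged; t = 1.0 masks every token.
    A denoising model would be trained to recover the masked positions.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be in [0, 1]")
    return [MASK if rng.random() < t else tok for tok in tokens]


if __name__ == "__main__":
    rng = random.Random(0)
    sentence = "the quick brown fox jumps over the lazy dog".split()
    for level in (0.0, 0.3, 0.7, 1.0):
        print(level, " ".join(mask_forward(sentence, level, rng)))
```

Generation runs this in reverse: start from a fully masked sequence and let the model fill in positions over several steps.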

Going deeper

A short list, chosen for durability.

“Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads”, Cai et al., 2024. A practical multi-token prediction framework that integrates with existing LLM inference servers. Useful for understanding what speculative-decoding deployment looks like at scale.
“EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty”, Li et al., 2024. A speculative-decoding refinement that uses internal model features to make better draft predictions. Useful for the empirical state of the art in serving optimization.
“DiffLM” and successor diffusion-LLM papers. Several research groups are pushing diffusion-LLM quality toward parity with autoregressive models. Worth tracking via arXiv if you’re following frontier research.

Adjacent topics

Inference-time compute and “thinking time.” Speculative decoding doesn’t change the output distribution; it just makes generation faster. But other inference-time techniques (like reasoning models from Phase 6) trade compute for quality. Search terms: “test-time compute scaling,” “inference-time compute laws,” “thinking-time-vs-quality.” Active research area.
Quantization and distillation. Both reduce LLM inference cost, but by different means than speculative decoding. Quantization uses lower-precision weights for the same model. Distillation trains a smaller model on a bigger model’s outputs to replace it. Speculative decoding runs a draft model alongside the big model. The three can be combined, but they are not interchangeable.
Fill-in-the-middle (FIM) for code generation. This is one of the cleaner production niches for diffusion LLMs. Search terms: “fill-in-the-middle code completion,” “bidirectional code generation,” “code editor LLM autocomplete.” Most autoregressive code models are trained with FIM-specific data; diffusion models reach the same capability by a different route, since they condition on the context on both sides of a gap.

Stanford CME 295 cheatsheet

Stanford CME 295 cheatsheet by the Amidi twins. MIT-licensed. Section 3 (optimizations) covers speculative decoding briefly; diffusion LLMs are not separately covered in the cheatsheet (lectures are the primary source). Worth using as a study reference for the speculative-decoding half of this lesson.

Community discussion

None selected for this lesson. Vendor blog posts (Anthropic, OpenAI, DeepMind) and the academic literature are the better entry points for the current state of the art. The diffusion-LLM ecosystem is producing strong technical writing in research-engineering blogs that could become durable references over time.