
Inference: serving a trained model fast

This lesson closes Phase 2 with the other half of the systems story: serving a trained model to users. The source curriculum is Stanford CS336, Lecture 10, by Tatsunori Hashimoto and Percy Liang, with lectures freely available on YouTube and the course at cs336.stanford.edu.

You will distinguish prefill (the whole prompt processed in parallel, compute-bound) from decode (one token generated at a time, memory-bound) and see why decode dominates user-visible time. You will understand the KV cache as the central object and what it grows with, and learn the five techniques that make decode efficient: continuous batching, paged attention, speculative decoding, quantization, and GQA (from lesson 4). You will also see how parallelism shows up differently at inference than at training (tensor parallelism comfortable, pipeline parallelism less so), and learn to read a serving-stack speed-up claim critically.
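The compute-bound versus memory-bound contrast between prefill and decode can be made concrete with a rough arithmetic-intensity estimate. The sketch below is illustrative only. It assumes each weight is read from memory once per forward step and contributes one multiply-add (2 FLOPs) per token processed, and it ignores activation and KV-cache traffic. The function name and the 1024-token prompt are hypothetical choices, not taken from the lecture.

```python
def weight_matmul_intensity(tokens_per_step: int, bytes_per_param: float = 2.0) -> float:
    """Approximate FLOPs per byte of weight traffic for one forward step.

    Assumption: every parameter is loaded once per step and used in one
    multiply-add (2 FLOPs) for each token in the step. Activation and
    KV-cache reads are ignored, so this is a simplified estimate.
    """
    flops_per_param = 2 * tokens_per_step
    return flops_per_param / bytes_per_param


# Prefill: a hypothetical 1024-token prompt is processed in one step.
prefill_intensity = weight_matmul_intensity(tokens_per_step=1024)

# Decode: a single sequence produces one new token per step.
decode_intensity = weight_matmul_intensity(tokens_per_step=1)

print(f"prefill: {prefill_intensity:.0f} FLOPs/byte")
print(f"decode:  {decode_intensity:.0f} FLOPs/byte")
```

Under these assumptions, prefill reuses each loaded weight across every prompt token, while single-sequence decode does about one FLOP per byte moved. Batching more sequences into each decode step raises `tokens_per_step`, which is why batching is one of the main levers this lesson covers.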

This is lesson 8 of 14, the last lesson of Phase 2 (systems and efficiency). It pairs with lessons 5 and 6 (the hardware and kernels that decode lives on) and with lesson 4 (whose KV cache returns here as the central concern). After it, Phase 3 turns from building and running the model to making it good: scaling laws, evaluation, data, and post-training.

Prerequisites: lesson 4 (the KV cache and grouped-query attention, both of which are central here) and lesson 5 (the GPU memory hierarchy and the compute-bound vs memory-bound distinction this lesson runs on). The arithmetic-intensity vocabulary from lesson 2 is the underlying frame, and lesson 6 (kernels) makes some of the implementation details (paged attention, fused decode) concrete.

New math: none. The lesson explains the cost profile of each phase and the mechanics of each technique without new formulas. The KV-cache memory expression is a simple multiplication.
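As a hedged illustration of that multiplication, the sketch below uses the standard accounting: one key tensor and one value tensor per layer, each of size KV heads × head dimension × sequence length × batch size, times bytes per element. The function name and the model shape (32 layers, head dimension 128, 4096-token context, 16-bit values) are hypothetical and chosen only to give round numbers.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int,
    bytes_per_elem: int = 2,  # assumes 16-bit keys and values
) -> int:
    """Bytes needed to cache keys and values for every layer and token."""
    k_and_v = 2  # one key tensor plus one value tensor per layer
    return (
        k_and_v * n_layers * n_kv_heads * head_dim
        * seq_len * batch_size * bytes_per_elem
    )


GIB = 1024 ** 3

# Hypothetical model with full multi-head attention (32 KV heads).
mha_bytes = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                           seq_len=4096, batch_size=1)

# Same shape with grouped-query attention sharing 8 KV heads.
gqa_bytes = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                           seq_len=4096, batch_size=1)

print(f"MHA cache: {mha_bytes / GIB:.2f} GiB")  # 2.00 GiB
print(f"GQA cache: {gqa_bytes / GIB:.2f} GiB")  # 0.50 GiB
```

The cache grows linearly with sequence length and batch size, so long contexts and large batches compete for the same memory. Cutting the number of KV heads, as GQA does, shrinks it proportionally.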

The single capability this lesson builds: explain what makes LLM inference expensive and the main techniques that make it efficient (KV cache, batching). Concretely, you will be able to:

  • Distinguish prefill from decode and their cost profiles
  • Explain the KV cache and what it grows with
  • Describe continuous batching and paged attention
  • Explain speculative decoding and why it preserves the output distribution
  • Explain why quantization is a bigger lever at inference than training

  • Read time: about 13 minutes
  • Practice time: about 10 minutes (read a serving-stack speed-up claim + diagnose a slow decode, plus flashcards)
  • Difficulty: deep (Stage C; systems-heavy lesson, reads through lessons 2/4/5)