Inference: brief

What you’ll learn

This lesson closes Phase 2 with the other half of the systems story: serving a trained model to users. The source curriculum is Stanford CS336, Lecture 10, by Tatsunori Hashimoto and Percy Liang, with lectures freely available on YouTube and the course at cs336.stanford.edu.

You will distinguish prefill (processing the whole prompt in parallel, compute-bound) from decode (generating one token at a time, memory-bound) and see why decode dominates user-visible time. You will understand the KV cache as the central object and what it grows with, and learn the five techniques that make decode efficient: continuous batching, paged attention, speculative decoding, quantization, and GQA (from lesson 4). You will also see how parallelism shows up differently at inference than at training (tensor parallelism, TP, is comfortable; pipeline parallelism, PP, less so), and learn to read a serving-stack speed-up claim critically.

Where this fits

This is lesson 8 of 14, the last lesson of Phase 2 (systems and efficiency). It pairs with lessons 5 and 6 (the hardware and kernels that decode lives on) and with lesson 4 (whose KV cache returns here as the central concern). After it, Phase 3 turns from building and running the model to making it good: scaling laws, evaluation, data, and post-training.

Before you start

Prerequisites: lesson 4 (the KV cache and grouped-query attention, both of which are central here) and lesson 5 (the GPU memory hierarchy and the compute-bound vs memory-bound distinction this lesson runs on). The arithmetic-intensity vocabulary from lesson 2 is the underlying frame; lesson 6 (kernels) makes some of the implementation details (paged attention, fused decode) concrete.

About the math

None. The lesson explains the cost profile of each phase and the mechanics of each technique without new formulas. The KV-cache memory expression is a simple multiplication.
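
As a rough illustration of that multiplication, the sketch below estimates KV-cache size from a handful of model dimensions. The function name, parameter names, and the example configuration are assumptions chosen for illustration, not values from the lecture; the factor of 2 accounts for storing both keys and values.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_value=2):
    """Approximate KV-cache size in bytes.

    One key vector and one value vector are cached per layer,
    per KV head, per token, per sequence in the batch.
    bytes_per_value=2 assumes fp16/bf16 storage.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical model: 32 layers, head_dim 128, 4096 cached tokens, batch of 1.
mha = kv_cache_bytes(32, num_kv_heads=32, head_dim=128, seq_len=4096, batch_size=1)
gqa = kv_cache_bytes(32, num_kv_heads=8, head_dim=128, seq_len=4096, batch_size=1)

print(f"32 KV heads: {mha / 2**30:.2f} GiB")  # 32 KV heads: 2.00 GiB
print(f" 8 KV heads: {gqa / 2**30:.2f} GiB")  #  8 KV heads: 0.50 GiB
```

The cache grows linearly with sequence length and batch size, and shrinks in proportion to the number of KV heads, which is why GQA and quantization (a smaller `bytes_per_value`) both appear among the decode techniques.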

By the end, you’ll be able to

The single capability this lesson builds: explain what makes LLM inference expensive and the main techniques that make it efficient (KV cache, batching). Concretely, you will be able to:

Distinguish prefill from decode and their cost profiles
Explain the KV cache and what it grows with
Describe continuous batching and paged attention
Explain speculative decoding and why it preserves the output distribution
Explain why quantization is a bigger lever at inference than training

Time and difficulty

Read time: about 13 minutes
Practice time: about 10 minutes (read a serving-stack speed-up claim + diagnose a slow decode, plus flashcards)
Difficulty: deep (Stage C; systems-heavy lesson that builds on lessons 2, 4, and 5)