References: Inference
Source material
Source curriculum (structural mirror, cited as further study):
• Stanford CS336, "Language Modeling from Scratch", Lecture 10: Inference
  Instructors: Tatsunori Hashimoto and Percy Liang (Stanford)
  Course page: https://cs336.stanford.edu/
  Lecture videos: YouTube playlist https://www.youtube.com/playlist?list=PLoROMvodv4rMqXOcazWaTUHhq-yembLCV
  License: no explicit license is published on the course site; lecture videos are on YouTube under standard terms; slides are public on GitHub without a stated license.
  Required attribution: "Based on the structure of Stanford CS336, 'Language Modeling from Scratch,' by Tatsunori Hashimoto and Percy Liang (cs336.stanford.edu). This is an independent structural mirror in original prose; it reproduces no course materials, and Stanford does not endorse it."
This lesson mirrors the structure of Lecture 10 (inference). Clawdemy's lessons are original prose that follows the pedagogical arc of the course. Because the source publishes no explicit license, we cite it as a recommended companion and reproduce none of its materials. All rights to the original course materials remain with their creators.
Watch this next
- Stanford CS336, Lecture 10: Inference by Hashimoto and Liang. The lecture this lesson mirrors. It walks through the cost analysis and the techniques with back-of-envelope numbers, and it is the natural next step once the picture here is clear.
Going deeper
A short, durable list. Each link is a specific next step, not a generic pile.
- “Efficient Memory Management for Large Language Model Serving with PagedAttention” by Kwon et al. (2023). The vLLM paper, which introduced paged attention and helped popularize continuous batching in serving stacks. The clearest worked example of the KV-cache-as-virtual-memory idea.
- “Fast Inference from Transformers via Speculative Decoding” by Leviathan et al. (2023). The paper behind speculative decoding, including the rejection step that preserves the target model’s output distribution. Short and readable.
- “GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers” by Frantar et al. (2022). A canonical post-training quantization recipe; pairs well with “AWQ: Activation-aware Weight Quantization” by Lin et al. (2023). Read them together for the modern int4 weight-quantization picture.
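The rejection step mentioned in the speculative decoding entry is small enough to sketch. The code below is a minimal single-token illustration, not the paper's full procedure: a real system drafts several tokens and verifies them in one pass of the target model. The function names and the three-token toy distributions are hypothetical and chosen only for this example. A draft token is kept with probability min(1, p/q); on rejection, a token is resampled from the normalized residual max(0, p - q). Under these assumptions, the sampled tokens follow the target distribution p.

```python
import random


def speculative_sample(p, q, rng):
    """Draw one token using a draft distribution q and a target distribution p.

    p, q: lists of probabilities over the same (toy) vocabulary.
    Returns (token_index, accepted), where accepted says whether the
    draft model's proposal was kept.
    """
    vocab = range(len(p))
    # The draft model proposes a token.
    x = rng.choices(vocab, weights=q, k=1)[0]
    # Keep the proposal with probability min(1, p[x] / q[x]).
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    # Otherwise resample from the residual max(0, p - q); random.choices
    # treats the weights as relative, so no explicit normalization is needed.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(vocab, weights=residual, k=1)[0], False


def empirical_distribution(p, q, n_samples, seed=0):
    """Estimate the output distribution of speculative_sample by simulation."""
    rng = random.Random(seed)
    counts = [0] * len(p)
    for _ in range(n_samples):
        token, _accepted = speculative_sample(p, q, rng)
        counts[token] += 1
    return [c / n_samples for c in counts]


if __name__ == "__main__":
    target = [0.6, 0.3, 0.1]  # hypothetical target-model probabilities
    draft = [0.2, 0.5, 0.3]   # hypothetical draft-model probabilities
    print(empirical_distribution(target, draft, 100_000))
```

Running the script prints frequencies that should sit close to the target probabilities even though every proposal came from the draft distribution.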
Adjacent topics
Where this connects inside the track.
- Counting the cost (lesson 2). Decode is the textbook memory-bound, low-arithmetic-intensity case; batching is the textbook fix.
- Attention alternatives and MoE (lesson 4). GQA was introduced there to shrink the KV cache; here the cache returns as the central serving concern, and the two combine for long-context inference.
- How models run on hardware (lesson 5) and Writing fast kernels (lesson 6). Decode is limited by HBM bandwidth, exactly the limit those lessons described; many serving stacks use Triton to write fused decode kernels and paged-attention bookkeeping.