References: How transformers scale to real-world data: sliding windows and KV-cache savings

Source material:
• Stanford CME 295: Transformers & Large Language Models, Autumn 2025
Instructors: Afshine Amidi & Shervine Amidi, Stanford University
Course site: https://cme295.stanford.edu/
Cheatsheet: https://cme295.stanford.edu/cheatsheet/
Source lecture (Lecture 2, Transformer-based models & tricks):
https://www.youtube.com/watch?v=yT84Y5zCnaA
License (lecture videos): as published on Stanford's public YouTube channel
License (Amidi cheatsheets): MIT
This lesson adapts the attention-efficiency section of Stanford CME 295
Lecture 2 (~3030s-3760s), which closes the lecture's first half on the
parts of the original transformer that have changed since 2017. The
KV-cache details get full coverage in the next Stanford lecture; we
preserve that division and stay within what this lecture covers.
Clawdemy provides original notes, summaries, and quizzes derived from
this material for educational purposes. All rights to the original
lectures remain with Stanford and the instructors.

A short list, chosen for durability. Each link is for a specific next step.

Topics that build on or sit beside this one.

  • The KV cache, in depth. This lesson stays within what Lecture 2 says about the KV cache (it exists, it grows, MQA/GQA shrinks it). Lecture 3 of CME 295 covers the cache mechanism in full: how it is structured, how it is paged, and how techniques like PagedAttention manage it at the system level. We will adapt that lecture in a future lesson.

  • Other attention approximations. Sliding window is one of several attention-approximation families. Others include sparse attention (BigBird), low-rank attention (Linformer), and kernel-based attention (Performer). The lecture mentions “different versions of the attention” but develops only sliding window in detail. These are useful to know about, but not load-bearing for this lesson.

  • The compute-vs-memory tradeoff in inference. The two problems this lesson covers are part of a broader picture of inference cost. Other contributors include weight memory, activation memory, and the per-token decoding latency that depends on memory bandwidth more than compute. Search terms: “transformer inference latency anatomy,” “PagedAttention,” “vLLM.”

  • Where to go next. The lesson you just finished concludes the “post-2017 changes that stuck” arc. The next lesson opens the second arc of Lecture 2: transformer-based architectures. We will cover encoder-decoder architectures and T5’s span-corruption objective, then the BERT family across two more lessons.
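
As a quick refresher on the first item above (the KV cache grows with sequence length, and MQA/GQA shrinks it), here is a minimal back-of-the-envelope sketch. It is not taken from the lecture. The function name `kv_cache_bytes` and the model dimensions (32 layers, 32 query heads, head dimension 128, 16-bit values) are hypothetical, chosen only to make the arithmetic easy to follow. It assumes the cache stores one key and one value vector per layer, per KV head, per cached token.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Rough KV-cache size estimate (illustrative assumption, not a measured figure).

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_value=2 assumes 16-bit floats.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical model: 32 layers, 32 query heads, head_dim 128, 4096 cached tokens.
N_LAYERS, N_QUERY_HEADS, HEAD_DIM, SEQ_LEN = 32, 32, 128, 4096

# Only the number of key/value heads changes between the three variants.
variants = {
    "MHA (32 KV heads)": N_QUERY_HEADS,  # one KV head per query head
    "GQA (8 KV heads)": 8,               # groups of 4 query heads share a KV head
    "MQA (1 KV head)": 1,                # all query heads share a single KV head
}

for name, n_kv_heads in variants.items():
    size = kv_cache_bytes(N_LAYERS, n_kv_heads, HEAD_DIM, SEQ_LEN)
    print(f"{name}: {size / 2**30:.4f} GiB")
```

Under these assumed numbers the cache comes to 2 GiB for MHA, 0.5 GiB for GQA with 8 KV heads, and 64 MiB for MQA. Doubling `seq_len` doubles each figure, which is the "it grows" part of the summary.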

The primary papers for the techniques covered, in chronological order.

None selected for this lesson. Attention efficiency is an active area of research and the practitioner community discusses it widely on social media and engineering blogs, but threads rotate too quickly to be worth pinning here. Durable references will be added at a future quarterly review.