References: How transformers scale to real-world data: sliding windows and KV-cache savings

Source material

• Stanford CME 295: Transformers & Large Language Models, Autumn 2025
  Instructors: Afshine Amidi & Shervine Amidi, Stanford University
  Course site: https://cme295.stanford.edu/
  Cheatsheet: https://cme295.stanford.edu/cheatsheet/
  Source lecture (Lecture 2, Transformer-based models & tricks):
    https://www.youtube.com/watch?v=yT84Y5zCnaA
  License (lecture videos): as published on Stanford's public YouTube channel
  License (Amidi cheatsheets): MIT
This lesson adapts the attention-efficiency section of Stanford CME 295
Lecture 2 (~3030s-3760s), which closes the lecture's first half on the
parts of the original transformer that have changed since 2017. The
KV-cache details get full coverage in the next Stanford lecture; we
preserve that division and stay within what this lecture covers.
Clawdemy provides original notes, summaries, and quizzes derived from
this material for educational purposes. All rights to the original
lectures remain with Stanford and the instructors.

Going deeper

A short list, chosen for durability. Each link is for a specific next step.

“Longformer: The Long-Document Transformer”, Beltagy et al., 2020. The original sliding window attention paper. Section 3 introduces the sliding window pattern, dilated sliding window, and global attention; section 5 has the empirical comparisons. Read this for the historical introduction of the pattern the lecture cites by name.
“Fast Transformer Decoding: One Write-Head is All You Need”, Shazeer, 2019. The MQA paper. Short and to the point: introduces multi-query attention as a single-write-head simplification of MHA. The motivation framing (decode-time memory bandwidth) is exactly what this lesson covers.
“GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints”, Ainslie et al., 2023. The GQA paper. Argues that GQA captures most of MQA’s speedup while preserving most of MHA’s quality, and shows you can convert an MHA-trained model into a GQA model with minimal additional training. Read this for why GQA became the modern default.
“FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness”, Dao et al., 2022. The FlashAttention paper. The lecturer’s reference to “tiling” comes from this family of approaches. It operates at a different layer of the stack than the architectural choices in this lesson (FlashAttention is an IO-aware kernel that computes exact attention, not a different attention mechanism), but the two pair naturally.
Stanford CME 295 cheatsheet by the Amidi twins. MIT-licensed. Single-page reference for the same material, in their dense visual style.
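As a rough companion to the MQA and GQA entries above, the following back-of-the-envelope sketch shows why the number of key/value heads drives KV-cache size. The function name `kv_cache_bytes` and the model shape (32 layers, head dimension 128, 4096 tokens, 2 bytes per value) are assumptions chosen for illustration; they are not taken from the lecture or from any specific model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Approximate KV-cache size in bytes.

    Each layer stores two tensors (keys and values), each of shape
    [batch_size, num_kv_heads, seq_len, head_dim].
    """
    return (2 * num_layers * batch_size * num_kv_heads
            * seq_len * head_dim * bytes_per_value)


# Hypothetical model shape, for illustration only.
config = dict(num_layers=32, head_dim=128, seq_len=4096)

mha = kv_cache_bytes(num_kv_heads=32, **config)  # one K/V head per query head
gqa = kv_cache_bytes(num_kv_heads=8, **config)   # query heads share 8 K/V groups
mqa = kv_cache_bytes(num_kv_heads=1, **config)   # all query heads share one K/V head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**20:.0f} MiB")
# MHA: 2048 MiB
# GQA: 512 MiB
# MQA: 64 MiB
```

Under these assumptions the cache shrinks in direct proportion to the number of key/value heads, and it still grows linearly with sequence length in all three cases.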

Adjacent topics

Topics that build on or sit beside this one.

The KV cache, in depth. This lesson stays within what Lecture 2 says about the KV cache (it exists, it grows, and MQA/GQA shrink it). Lecture 3 of CME 295 covers the cache mechanism in full: how it is structured, how it is paged, how techniques like PagedAttention manage it at the system level. We will adapt that material in a future lesson on Lecture 3.
Other attention approximations. Sliding window is one of several attention-approximation families. Others include sparse attention (BigBird), low-rank attention (Linformer), and kernel-based attention (Performer). The lecture mentions “different versions of the attention” but only develops sliding window in detail. It is useful to know these exist, but they are not load-bearing for this lesson.
The compute-vs-memory tradeoff in inference. The two problems this lesson covers are part of a broader picture of inference cost. Other contributors include weight memory, activation memory, and the per-token decoding latency that depends on memory bandwidth more than compute. Search terms: “transformer inference latency anatomy,” “PagedAttention,” “vLLM.”
Where to go next. This lesson concludes the “post-2017 changes that stuck” arc. The next lesson opens the second arc of Lecture 2: transformer-based architectures. We will cover encoder-decoder architectures and T5’s span-corruption objective, then the BERT family across two more lessons.
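For readers who want a concrete picture of the sliding window pattern mentioned under "Other attention approximations", here is a minimal sketch of the mask it implies. It assumes a causal (decoder-style) window in which each position attends to itself and the previous `window - 1` positions; Longformer's encoder variant also looks forward, so treat this as one illustrative variant. The function name `sliding_window_mask` is hypothetical.

```python
def sliding_window_mask(seq_len, window):
    """Return a boolean matrix where mask[i][j] is True when query
    position i may attend to key position j.

    Assumes a causal window: j must not be after i, and must lie
    within the last `window` positions ending at i.
    """
    return [[0 <= i - j < window for j in range(seq_len)]
            for i in range(seq_len)]


mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# x.....
# xx....
# xxx...
# .xxx..
# ..xxx.
# ...xxx

# Count attended (query, key) pairs: windowed vs. full causal attention.
windowed_pairs = sum(sum(row) for row in mask)
full_causal_pairs = 6 * (6 + 1) // 2
print(windowed_pairs, full_causal_pairs)
# 15 21
```

With a fixed window, the number of attended pairs grows roughly linearly with sequence length rather than quadratically, which is the saving the pattern is after.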

Original sources

The primary papers for the techniques covered, in chronological order.

“Attention Is All You Need”, Vaswani et al., 2017. The original transformer paper, which used standard MHA with full self-attention.
“Fast Transformer Decoding”, Shazeer, 2019. MQA.
“Longformer: The Long-Document Transformer”, Beltagy et al., 2020. Sliding window attention.
“FlashAttention”, Dao et al., 2022. IO-aware tiling for the attention computation.
“GQA”, Ainslie et al., 2023. Grouped-query attention as the MHA-MQA compromise.

Community discussion

None selected for this lesson. Attention efficiency is an active area of research and the practitioner community discusses it widely on social media and engineering blogs, but threads rotate too quickly to be worth pinning here. Durable references will be added at a future quarterly review.