Summary: Attention alternatives and mixture of experts

This lesson closes Phase 1 with the two variations that make modern LLMs efficient, one per sublayer. Standard attention has two scaling problems: it is quadratic in sequence length, and its KV cache dominates inference memory and bandwidth. Multi-query (MQA) and grouped-query (GQA) attention shrink the KV cache by sharing keys and values across heads, GQA being the modern default (severalfold smaller cache, little quality loss); sliding-window attention bounds long-context cost. On the FFN side, mixture of experts (MoE) replaces the single FFN with many experts plus a router that runs only a few per token, decoupling total parameters (capacity and memory) from active parameters (per-token compute). Both are resource-allocation moves in lesson 2’s terms, and neither changes the lesson-3 skeleton. This is the scan version; the lesson explains how to read these choices off a model card.
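As a rough illustration of "a router that runs only a few per token", here is a minimal top-k routing sketch in plain Python. It is not any particular model's implementation: the names `top_k_route` and `moe_forward` are hypothetical, the experts are toy functions, and the router logits are passed in directly, whereas a real model would compute them from the token with a learned linear layer.

```python
import math

def top_k_route(router_logits, k):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    # Pick the k experts with the largest router logits.
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(top, exps)]

def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the k selected experts' outputs; the others never run."""
    out = [0.0] * len(x)
    for idx, weight in top_k_route(router_logits, k):
        y = experts[idx](x)  # only the selected experts are evaluated
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

# Toy experts (assumption): each one just scales its input by a constant.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]

# Experts 1 and 3 tie for the highest score, so each receives weight 0.5.
result = moe_forward([1.0, 2.0], experts, router_logits=[0.0, 1.0, 0.0, 1.0], k=2)
print(result)
```

All four experts exist in memory, but only two are called for this token, which is the total-versus-active split in miniature.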

Core ideas

Standard attention’s costs: quadratic in sequence length, and a KV cache that dominates inference memory and bandwidth at long contexts.
MQA and GQA share keys/values across heads to shrink the KV cache. MQA: one shared set for all heads. GQA: one shared set per group of heads, the modern default; the cache shrinks by the number of query heads per group (severalfold in practice) with almost no quality loss.
Sliding-window attention makes cost linear in length for a fixed window size by letting each token attend only to a recent window; it is often interleaved with full-attention layers.
Mixture of experts (MoE) swaps the single FFN for many experts plus a router that runs a few per token. Total parameters (capacity, memory) decouple from active parameters (per-token compute).
MoE’s costs: all experts must be stored, so it saves compute at the expense of memory, and the router needs load balancing so that tokens do not pile onto a few experts. The gap between total and active parameters is the signature of an MoE model.
Resource-allocation view: GQA/MQA target memory and bandwidth (the KV cache); MoE separates total params (memory) from active params (compute). The skeleton is unchanged.
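The KV-cache claims above can be checked with back-of-the-envelope arithmetic. The sketch below is an approximation under stated assumptions: the function name `kv_cache_bytes` and the model shape (32 layers, 32 query heads of size 128, an 8192-token context, 2 bytes per value) are hypothetical, and the windowed case assumes every layer uses the window, which ignores interleaved full-attention layers.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, window=None):
    """Approximate KV-cache size in bytes for one sequence."""
    # A sliding-window layer only keeps the most recent `window` positions.
    cached_len = seq_len if window is None else min(seq_len, window)
    # Factor 2: one key vector and one value vector per position, KV head, and layer.
    return 2 * n_layers * n_kv_heads * head_dim * cached_len * bytes_per_value

# Hypothetical model shape (assumption, not taken from any real model card).
cfg = dict(n_layers=32, head_dim=128, seq_len=8192)

mha = kv_cache_bytes(n_kv_heads=32, **cfg)  # every query head has its own K/V
gqa = kv_cache_bytes(n_kv_heads=8, **cfg)   # 4 query heads share each K/V head
mqa = kv_cache_bytes(n_kv_heads=1, **cfg)   # all query heads share one K/V head
swa = kv_cache_bytes(n_kv_heads=32, window=1024, **cfg)  # window in every layer

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa), ("MHA + window", swa)]:
    print(f"{name:>12}: {size / 2**20:.0f} MiB")
```

With these numbers the cache is 4 GiB for standard multi-head attention, 1 GiB for GQA with 8 KV heads, and 128 MiB for MQA. The reduction factor is exactly the number of query heads that share each KV head, and the window caps the cached length no matter how long the context grows.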

What changes for you

These two variations explain most of what separates a textbook Transformer from the models actually shipping. A long context window offered at an affordable price usually points to GQA (often plus windowing); a huge advertised parameter count that runs surprisingly cheaply points to MoE, and the “active parameters” figure tells you the real per-token cost. The deeper takeaway sharpens the budget instinct from lesson 3: designing a model is not only choosing depth versus width, it is choosing where to spend each of three resources (compute, memory, and bandwidth) independently. GQA buys a smaller memory footprint for a small loss in quality; MoE buys capacity for more memory at roughly the same per-token compute. That is the level at which frontier models are actually engineered, and you can now read those choices off a spec sheet. With Phase 1 complete (tokenizer, accounting, architecture, and these variations), Phase 2 turns to making it all run fast on real hardware.
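To make the "total versus active" reading concrete, here is a minimal parameter-count sketch for a single MoE feed-forward sublayer. The function name `moe_ffn_params` and the dimensions are hypothetical, and it assumes each expert is a plain two-matrix FFN without biases and that the router is one linear layer; real models often use gated FFNs with a third matrix, which changes the constants but not the shape of the result.

```python
def moe_ffn_params(d_model, d_ff, n_experts, top_k):
    """Return (total, active) parameter counts for one MoE feed-forward sublayer."""
    # Assumption: each expert is an up projection plus a down projection, no biases.
    per_expert = 2 * d_model * d_ff
    # Assumption: the router is a single linear layer that scores every expert.
    router = d_model * n_experts
    total = n_experts * per_expert + router  # everything that must sit in memory
    active = top_k * per_expert + router     # what one token actually uses
    return total, active

# Hypothetical dimensions chosen for illustration only.
total, active = moe_ffn_params(d_model=4096, d_ff=14336, n_experts=8, top_k=2)

print(f"total:  {total / 1e6:.0f}M parameters per layer")
print(f"active: {active / 1e6:.0f}M parameters per layer")
print(f"ratio:  {total / active:.2f}x")
```

With 8 experts and 2 active per token, the sublayer stores roughly four times the parameters it uses on any single token. Attention and embedding weights are always active, so a whole model's total-to-active ratio is smaller than this per-sublayer figure.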

Modern LLMs keep the skeleton but vary its two sublayers to spend resources deliberately: grouped-query attention for a smaller KV cache, mixture of experts for more capacity without a matching rise in per-token compute. Those two choices explain most of a shipping model’s efficiency.