References: Writing fast kernels
Source material
Source curriculum (structural mirror, cited as further study):

- Stanford CS336, "Language Modeling from Scratch", Lecture 6: Kernels, Triton, XLA
  Instructors: Tatsunori Hashimoto and Percy Liang (Stanford)
  Course page: https://cs336.stanford.edu/
  Lecture videos: YouTube playlist https://www.youtube.com/playlist?list=PLoROMvodv4rMqXOcazWaTUHhq-yembLCV
  License: no explicit license is published on the course site; lecture videos are on YouTube under standard terms; slides are public on GitHub without a stated license.
  Required attribution: "Based on the structure of Stanford CS336, 'Language Modeling from Scratch,' by Tatsunori Hashimoto and Percy Liang (cs336.stanford.edu). This is an independent structural mirror in original prose; it reproduces no course materials, and Stanford does not endorse it."

This lesson mirrors the structure of Lecture 6 (kernels, Triton, XLA, and the FlashAttention fusion). Clawdemy's lessons are original prose that follows the pedagogical arc of the course. Because the source publishes no explicit license, we cite it as a recommended companion and reproduce none of its materials. All rights to the original course materials remain with their creators.

Watch this next
- Stanford CS336, Lecture 6: Kernels, Triton, XLA by Hashimoto and Liang. The lecture this lesson mirrors. It walks through the kernel and fusion story alongside concrete Triton code, which makes it the natural next step once the picture here is clear.
Going deeper
A short, durable list. Each link is a specific next step, not a generic pile.
- “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness” by Dao et al. (2022). The paper that made fusion famous in the LLM era. It reads cleanly; the section on the tiled softmax (running max and sum) is the key trick.
- The Triton tutorials. The official walk-throughs, building from a fused softmax to a matmul to attention. The fastest way to write your first non-trivial kernel.
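- The XLA documentation. The official reference for the compiler used by JAX and TensorFlow, and available to PyTorch through PyTorch/XLA (by default, torch.compile uses TorchInductor, which generates Triton kernels, not XLA). Worth a skim to know what kinds of fusions a compiler does for you automatically.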
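The running max and sum behind the tiled softmax mentioned in the FlashAttention entry above can be illustrated in a few lines. This is a minimal sketch in plain Python, not the paper's kernel: the function names and the `tile_size` parameter are hypothetical, and it normalizes in a separate final pass, whereas the real kernel folds the rescaling into the attention output as it goes.

```python
import math


def softmax_reference(xs):
    """Standard softmax that sees the whole row at once."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_tiled(xs, tile_size=4):
    """Softmax whose statistics are built one tile at a time.

    Only a running max and a running sum are carried between tiles,
    so no tile ever needs to see the full row.
    """
    running_max = float("-inf")
    running_sum = 0.0
    for start in range(0, len(xs), tile_size):
        tile = xs[start:start + tile_size]
        new_max = max(running_max, max(tile))
        # Rescale the old sum to the new max, then add this tile's terms.
        running_sum = running_sum * math.exp(running_max - new_max)
        running_sum += sum(math.exp(x - new_max) for x in tile)
        running_max = new_max
    # Final normalization pass (simplification for this sketch).
    return [math.exp(x - running_max) / running_sum for x in xs]
```

The rescaling step is what lets the sum stay correct when a later tile raises the maximum; the result should match the full-row softmax up to floating-point rounding.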
Adjacent topics
Where this connects inside the track.
- Counting the cost (lesson 2). That lesson introduces arithmetic intensity; fusion is the code-level lever that raises it, and this lesson is where the number becomes concrete.
- How models run on hardware (lesson 5). The memory hierarchy and tensor cores explain why fusion works: data stays in SRAM, cores stay fed.
- Attention alternatives and MoE (lesson 4). FlashAttention is fully compatible with grouped-query attention; the two combine for long-context inference, and MoE dispatch is another standard target for Triton.
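To make the link between fusion and arithmetic intensity concrete, here is a small back-of-the-envelope sketch. The function names are hypothetical, and the model is deliberately crude: it assumes one FLOP per element per operation, float32 data, one read and one write of the full tensor per unfused operation, and intermediates that stay on-chip when fused.

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte moved to or from main memory."""
    return flops / bytes_moved


def elementwise_chain(n_elements, n_ops, bytes_per_element=4, fused=False):
    """Intensity of a chain of elementwise ops under the assumptions above."""
    flops = n_elements * n_ops  # assumed: 1 FLOP per element per op
    # Unfused: every op reads and writes the whole tensor.
    # Fused: one read at the start and one write at the end.
    passes = 1 if fused else n_ops
    bytes_moved = passes * 2 * n_elements * bytes_per_element
    return arithmetic_intensity(flops, bytes_moved)


unfused = elementwise_chain(1_000_000, n_ops=4)
fused = elementwise_chain(1_000_000, n_ops=4, fused=True)
print(unfused, fused)  # 0.125 0.5
```

Under these assumptions, fusing four operations does the same arithmetic with a quarter of the memory traffic, so intensity rises fourfold.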