References: Writing fast kernels

Source material

Source curriculum (structural mirror, cited as further study):
• Stanford CS336, "Language Modeling from Scratch", Lecture 6:
    Kernels, Triton, XLA
  Instructors: Tatsunori Hashimoto and Percy Liang (Stanford)
  Course page: https://cs336.stanford.edu/
  Lecture videos: YouTube playlist
    https://www.youtube.com/playlist?list=PLoROMvodv4rMqXOcazWaTUHhq-yembLCV
  License: no explicit license is published on the course site; lecture
    videos are on YouTube under standard terms; slides are public on GitHub
    without a stated license.
  Required attribution: "Based on the structure of Stanford CS336,
    'Language Modeling from Scratch,' by Tatsunori Hashimoto and Percy Liang
    (cs336.stanford.edu). This is an independent structural mirror in
    original prose; it reproduces no course materials, and Stanford does
    not endorse it."
This lesson mirrors the structure of Lecture 6 (kernels, Triton, XLA, and the
FlashAttention fusion). Clawdemy's lessons are original prose that follows
the pedagogical arc of the course. Because the source publishes no explicit
license, we cite it as a recommended companion and reproduce none of its
materials. All rights to the original course materials remain with their
creators.

Watch this next

Stanford CS336, Lecture 6: Kernels, Triton, XLA by Hashimoto and Liang. The lecture this lesson mirrors. It walks through the kernel and fusion story alongside concrete Triton code, which makes it the natural next step once the picture here is clear.

Going deeper

A short, durable list. Each link is a specific next step, not a generic pile.

“FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness” by Dao et al. (2022). The paper that made fusion famous in the LLM era. Reads cleanly; the section on the tiled softmax (running max and sum) is the key trick.
The Triton tutorials. The official walk-throughs, building from vector addition and a fused softmax up to a matmul and fused attention. The fastest way to write your first non-trivial kernel.
The XLA documentation. The official reference for the compiler behind JAX, TensorFlow, and PyTorch/XLA. (torch.compile defaults to a different backend, TorchInductor, which emits Triton kernels on GPU.) Worth a skim to know which fusions a compiler performs for you automatically.
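
The "running max and sum" trick named in the FlashAttention entry can be previewed in a few lines of plain Python. This is a minimal sketch, not the paper's kernel: the function names and the tile_size parameter are illustrative assumptions, and it covers only the softmax normalizer, whereas the real kernel also rescales a running output accumulated against V.

```python
import math

def online_softmax(scores, tile_size=4):
    """Softmax over `scores`, scanning the input one tile at a time.

    Only a running max and a running sum are carried between tiles,
    so the full score row is never needed at once to get the normalizer.
    """
    running_max = float("-inf")
    running_sum = 0.0
    for start in range(0, len(scores), tile_size):
        tile = scores[start:start + tile_size]
        new_max = max(running_max, max(tile))
        # Rescale the old sum to the new max, then add this tile's terms.
        running_sum = running_sum * math.exp(running_max - new_max) + sum(
            math.exp(x - new_max) for x in tile
        )
        running_max = new_max
    return [math.exp(x - running_max) / running_sum for x in scores]

def reference_softmax(scores):
    """Ordinary max-subtracted softmax that sees the whole row at once."""
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [0.5, 2.0, -1.0, 3.5, 0.0, 7.0, 1.5, -2.5, 4.0]
tiled = online_softmax(scores, tile_size=4)
full = reference_softmax(scores)
```

The two results should agree to floating-point tolerance for any tile size, which is the sense in which the tiled attention in the paper is exact rather than approximate.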

Adjacent topics

Where this connects inside the track.

Counting the cost (lesson 2). That lesson defined arithmetic intensity; fusion is the code-level lever that raises it, and this lesson is where the number becomes concrete.
How models run on hardware (lesson 5). The memory hierarchy and tensor cores explain why fusion works: data stays in SRAM, cores stay fed.
Attention alternatives and MoE (lesson 4). FlashAttention is fully compatible with grouped-query attention; the two combine for long-context inference, and MoE dispatch is another standard target for Triton.
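
The link back to lesson 2 can be made concrete with a small back-of-the-envelope calculation. This is a sketch under stated assumptions: a chain of elementwise operations on float32 data, one FLOP per element per operation, and every unfused operation reading its input from and writing its output to main memory. The function name and parameters are hypothetical.

```python
def elementwise_intensity(num_ops, fused, bytes_per_element=4):
    """Arithmetic intensity (FLOPs per byte) per element for a chain of elementwise ops.

    Assumes one FLOP per element per op. Unfused, every op makes its own
    round trip to main memory (one read, one write). Fused, the whole
    chain shares a single read and a single write.
    """
    flops = num_ops
    memory_passes = 1 if fused else num_ops
    bytes_moved = memory_passes * 2 * bytes_per_element
    return flops / bytes_moved

unfused = elementwise_intensity(4, fused=False)  # 4 FLOPs / 32 bytes = 0.125
fused = elementwise_intensity(4, fused=True)     # 4 FLOPs /  8 bytes = 0.5
```

Under these assumptions, fusing four operations does the same arithmetic with a quarter of the memory traffic, so the intensity rises fourfold.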