References: Counting the cost

Source material

Source curriculum (structural mirror, cited as further study):
• Stanford CS336, "Language Modeling from Scratch", Lecture 2:
    PyTorch (einops), resource accounting (FLOPs, memory, arithmetic intensity)
  Instructors: Tatsunori Hashimoto and Percy Liang (Stanford)
  Course page: https://cs336.stanford.edu/
  Lecture videos: YouTube playlist
    https://www.youtube.com/playlist?list=PLoROMvodv4rMqXOcazWaTUHhq-yembLCV
  License: no explicit license is published on the course site; lecture
    videos are on YouTube under standard terms; slides are public on GitHub
    without a stated license.
  Required attribution: "Based on the structure of Stanford CS336,
    'Language Modeling from Scratch,' by Tatsunori Hashimoto and Percy Liang
    (cs336.stanford.edu). This is an independent structural mirror in
    original prose; it reproduces no course materials, and Stanford does
    not endorse it."
This lesson mirrors the structure of Lecture 2 (resource accounting and
einops). Clawdemy's lessons are original prose that follows the pedagogical
arc of the course. Because the source publishes no explicit license, we cite
it as a recommended companion and reproduce none of its materials. All rights
to the original course materials remain with their creators.

Watch this next

Stanford CS336, Lecture 2: resource accounting, by Hashimoto and Liang. This is the lecture this lesson mirrors. It works through FLOP and memory accounting in detail and uses einops throughout, so it serves as the video counterpart to the counting done here.

Going deeper

A short, durable list. Each entry is a specific next step, not a generic pile.

• “Scaling Laws for Neural Language Models” by Kaplan et al. (2020). The paper behind the FLOP-and-parameters accounting (the 6ND-style estimates), and the bridge to the scaling-laws lesson later in this track.
• The einops documentation. The reference for rearrange, reduce, and einsum-style operations. It is short, example-driven, and the fastest way to make your tensor code readable.
• NVIDIA's GPU Performance Background User's Guide. A clear primer on compute-bound versus memory-bound operations and arithmetic intensity (the roofline idea), written from the hardware side.
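
As a quick warm-up before the NVIDIA primer, the arithmetic-intensity idea can be sketched in a few lines of Python. This is a minimal sketch under simplifying assumptions: each matrix is read or written exactly once, elements are 2 bytes (as in bf16), and the peak compute and bandwidth figures are made-up round numbers, not the specification of any real GPU. The function name and parameters are hypothetical, chosen for illustration.

```python
def matmul_arithmetic_intensity(m, k, n, bytes_per_element=2):
    """FLOPs per byte for C = A @ B, with A of shape (m, k) and B of shape (k, n).

    Assumes each of A, B, and C crosses the memory bus exactly once,
    which is an idealization; real kernels may move more data.
    """
    flops = 2 * m * k * n  # one multiply and one add per (m, k, n) triple
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


# Illustrative hardware numbers (assumptions, not a real device):
PEAK_FLOPS = 100e12      # 100 TFLOP/s
PEAK_BANDWIDTH = 1e12    # 1 TB/s
ridge_point = PEAK_FLOPS / PEAK_BANDWIDTH  # FLOPs per byte needed to be compute-bound

for shape in [(4096, 4096, 4096), (1, 4096, 4096)]:
    intensity = matmul_arithmetic_intensity(*shape)
    regime = "compute-bound" if intensity >= ridge_point else "memory-bound"
    print(shape, f"{intensity:.2f} FLOPs/byte", regime)
```

Under these assumptions the large square matmul lands far above the ridge point, while the matrix-vector case sits near 1 FLOP per byte and is limited by memory bandwidth.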

Adjacent topics

Where this connects inside the track.

• What “from scratch” means, and the tokenizer (lesson 1). That lesson named efficiency as the through-line and left the vocabulary-size trade-off to be quantified; this lesson is the accounting that quantifies it.
• Writing fast kernels: Triton and XLA (lesson 6). The arithmetic-intensity idea here is exactly what kernel fusion exploits; that lesson turns “raise intensity by fusing” into real code.
• Scaling laws (lesson 9). The 6ND compute estimate is the input to scaling-law reasoning, which decides how to spend a fixed compute budget across model size and data.
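
For reference, the 6ND estimate mentioned above fits in a couple of lines. This is a rough sketch of the usual approximation (about 2 FLOPs per parameter per token for the forward pass and about 4 for the backward pass), ignoring attention-specific terms. The function names, the model and dataset sizes, and the sustained throughput are illustrative assumptions, not figures from the course.

```python
def training_flops(num_params, num_tokens):
    """Approximate training compute with the 6ND rule of thumb."""
    return 6 * num_params * num_tokens


def training_days(total_flops, sustained_flops_per_second):
    """Wall-clock days at a given sustained throughput (an assumed input)."""
    seconds = total_flops / sustained_flops_per_second
    return seconds / 86_400


# Hypothetical run: a 1B-parameter model trained on 20B tokens.
flops = training_flops(1_000_000_000, 20_000_000_000)
print(f"{flops:.2e} FLOPs")

# Assumed sustained throughput of 1e14 FLOP/s across the whole job.
print(f"{training_days(flops, 1e14):.1f} days")
```

The same two functions can be run in reverse: fix a FLOP budget, then ask which combinations of N and D fit inside it, which is the question the scaling-laws lesson takes up.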