Scaling laws: cheatsheet

The empirical form

loss(compute)     ~ C^(-alpha_C)
loss(parameters)  ~ N^(-alpha_N)
loss(data)        ~ D^(-alpha_D)

Each is a straight line on a log-log plot; doubling C, N, or D multiplies loss by 2^(-alpha), a predictable but modest factor.
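To make the "modest factor" concrete, here is a minimal Python sketch of what one doubling buys under a pure power law. The exponent values are illustrative placeholders, not fitted results, and `doubling_factor` is a hypothetical helper name.

```python
def doubling_factor(alpha: float) -> float:
    """Multiplicative change in loss when x doubles, for loss ~ x^(-alpha)."""
    return 2.0 ** (-alpha)


# Illustrative exponents only (assumed, not taken from any fitted run).
for alpha in (0.05, 0.1, 0.3):
    factor = doubling_factor(alpha)
    reduction_pct = (1.0 - factor) * 100.0
    print(f"alpha={alpha}: loss x {factor:.3f} per doubling ({reduction_pct:.1f}% lower)")
```

With small exponents, each doubling removes only a few percent of the loss, which is why meaningful gains need order-of-magnitude increases.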

	Kaplan (2020)	Chinchilla (2022)
Recommendation	Put most extra compute into parameters; grow data more slowly	Scale N and D in equal proportion (each ~ sqrt(C))
Rule of thumb	Big model, moderate tokens	`D* ~= 20 * N*` (~20 tokens per parameter)
Canonical model	GPT-3: 175B / 300B tokens	Chinchilla: 70B / 1.4T tokens
Same-compute outcome		Chinchilla (70B) beat the larger Gopher (280B); also outperformed GPT-3

Given budget C:
  6 * N * D = C       (from lesson 2)
  D / N    = 20       (Chinchilla)
  120 * N^2 = C       (substitute D = 20 * N)
->  N ~= sqrt(C / 120)
    D ~= 20 * N

Example: C = 6e22 FLOPs -> N^2 = C / 120 = 5e20 -> N ~= 22.4B parameters, D ~= 447B tokens.
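The allocation above can be sketched as a small Python function. The name `chinchilla_allocation` and its parameters are hypothetical; the constants 6 and 20 are the ones used in this cheatsheet and are exposed as arguments so other ratios can be tried.

```python
import math


def chinchilla_allocation(
    compute_flops: float,
    tokens_per_param: float = 20.0,
    flops_per_param_token: float = 6.0,
) -> tuple[float, float]:
    """Solve 6 * N * D = C with D = r * N, giving N = sqrt(C / (6 * r))."""
    if compute_flops <= 0:
        raise ValueError("compute_flops must be positive")
    n_params = math.sqrt(compute_flops / (flops_per_param_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


n, d = chinchilla_allocation(6e22)
print(f"N ~= {n / 1e9:.1f}B params, D ~= {d / 1e9:.0f}B tokens")
```

For C = 6e22 this gives roughly 22.4B parameters and 447B tokens, matching the worked example up to rounding.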

Chinchilla minimizes:   training loss for a training budget
Reality minimizes:      total cost = training + inference-over-lifetime

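For models that will serve many inference tokens, over-train: pick a smaller N and a larger D than the Chinchilla rule gives, paying more in training to save on every inference token.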
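A rough way to see the trade-off is to count lifetime FLOPs. The sketch below assumes inference costs about 2 * N FLOPs per served token (a common forward-pass approximation that this cheatsheet does not state) and compares two hypothetical configurations that are simply assumed to reach the same loss. The function names and all numbers are illustrative, not measurements.

```python
def lifetime_flops(n_params: float, train_tokens: float, inference_tokens: float) -> float:
    """Training cost (6 * N * D) plus serving cost (assumed ~2 * N per token)."""
    train = 6.0 * n_params * train_tokens
    serve = 2.0 * n_params * inference_tokens  # assumption: forward pass only
    return train + serve


def break_even_inference_tokens(big: tuple[float, float], small: tuple[float, float]) -> float:
    """Served tokens beyond which the smaller model has lower lifetime FLOPs."""
    (n_big, d_big), (n_small, d_small) = big, small
    if n_small >= n_big:
        raise ValueError("'small' must have fewer parameters than 'big'")
    extra_train = 6.0 * (n_small * d_small - n_big * d_big)
    saved_per_token = 2.0 * (n_big - n_small)
    return max(extra_train, 0.0) / saved_per_token


# Hypothetical pair, ASSUMED to reach equal loss (not a measured result).
compute_optimal = (20e9, 400e9)   # (N, D) at 20 tokens per parameter
over_trained = (10e9, 1.2e12)     # half the parameters, three times the tokens

tokens = break_even_inference_tokens(compute_optimal, over_trained)
print(f"Over-training pays off after ~{tokens:.2e} served tokens")
```

Whether a given smaller model actually matches the larger one's loss has to come from a fitted scaling law or from experiments; the code only does the cost accounting.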

Use	How
Pick `(N, D)`	Apply the Chinchilla rule + 6ND = C
Predict loss before spending	Fit on small/medium, extrapolate
Sanity-check a run	If loss diverges from the fit, look at data/optimizer/code
Judge architecture changes	A real win improves the exponent, not just the prefactor
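The "fit on small/medium, extrapolate" and "sanity-check a run" rows can be sketched with a least-squares line in log-log space. This is a minimal illustration, assuming a pure power law with no irreducible-loss offset. The data points are synthetic, generated from an assumed law, and `fit_power_law`, `predict`, and `diverged` are hypothetical helper names.

```python
import numpy as np


def fit_power_law(x, loss):
    """Fit log(loss) = log(a) - alpha * log(x); return (a, alpha)."""
    slope, intercept = np.polyfit(np.log(x), np.log(loss), deg=1)
    return float(np.exp(intercept)), float(-slope)


def predict(x, a, alpha):
    """Loss predicted by the fitted law a * x^(-alpha)."""
    return a * np.asarray(x, dtype=float) ** (-alpha)


def diverged(observed, x, a, alpha, rel_tol=0.05):
    """True if an observed loss is more than rel_tol away from the fit."""
    expected = predict(x, a, alpha)
    return bool(abs(observed - expected) / expected > rel_tol)


# Synthetic "small/medium" runs from an assumed law (a=50, alpha=0.07).
compute = np.array([1e17, 1e18, 1e19, 1e20])
loss = 50.0 * compute ** (-0.07)

a, alpha = fit_power_law(compute, loss)
target = 1e22
print(f"fit: a={a:.2f}, alpha={alpha:.3f}")
print(f"predicted loss at C={target:.0e}: {float(predict(target, a, alpha)):.3f}")
```

Real loss curves are noisier and usually flatten toward a floor, so a fit like this is best treated as a first check rather than a forecast to rely on far outside the fitted range.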

Power law: y proportional to x^(-alpha); a straight line on a log-log plot.
Compute-optimal: the (N, D) minimizing loss for a fixed compute budget.
Chinchilla-optimal: the specific compute-optimal answer (D ~ 20N).
Over-training: training on more data than is compute-optimal for a given model size, trading extra training cost for inference savings.
Exponent vs prefactor: a scaling-law fit has both; a prefactor gain stays a constant factor at every scale, while an exponent gain grows with scale.
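The exponent-versus-prefactor distinction can be checked numerically. The sketch below compares a baseline law against one variant with a better prefactor and one with a better exponent; all constants are made up for illustration and `power_law_loss` is a hypothetical helper.

```python
def power_law_loss(x: float, a: float, alpha: float) -> float:
    """Loss under a pure power law a * x^(-alpha)."""
    return a * x ** (-alpha)


# Illustrative (a, alpha) pairs; none of these come from real fits.
baseline = (10.0, 0.10)
better_prefactor = (9.0, 0.10)   # 10% lower prefactor, same slope
better_exponent = (10.0, 0.11)   # same prefactor, steeper slope

for x in (1e18, 1e21, 1e24):
    base = power_law_loss(x, *baseline)
    ratio_prefactor = power_law_loss(x, *better_prefactor) / base
    ratio_exponent = power_law_loss(x, *better_exponent) / base
    print(f"x={x:.0e}: prefactor variant {ratio_prefactor:.3f}x, "
          f"exponent variant {ratio_exponent:.3f}x of baseline loss")
```

The prefactor variant's ratio stays fixed across scales, while the exponent variant's ratio keeps shrinking as x grows.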

Stanford CS336, Lectures 9 and 11 (Scaling laws), by Hashimoto and Liang. cs336.stanford.edu. Independent structural mirror in original prose; see references.