No. They predict the shape of the curve. The constants depend on architecture, data, tokenizer, and optimization setup.
Chinchilla means smaller models are always better
No. It says that for a fixed compute budget, a smaller model trained on more data is more efficient than the parameter-heavy allocations that were common before it. With more compute, the optimal model is still larger; the data target grows in proportion.
FLOPs and FLOP/s are interchangeable
No. FLOPs is a count of operations performed; FLOP/s is the rate at which hardware such as a GPU can perform them. Use context to disambiguate.
Lower pretraining loss = better at my task
Correlated, not equal. Pretraining loss is next-token-prediction loss on the corpus distribution. Task-specific quality is measured downstream and depends on tuning and on how the task is evaluated.
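To make the FLOPs versus FLOP/s distinction concrete, here is a minimal sketch that divides a count (FLOPs) by a rate (FLOP/s) to get wall-clock time. The function name, the utilization factor, and every number in the usage lines are hypothetical values chosen for illustration, not figures from any real training run or GPU.

```python
SECONDS_PER_DAY = 86_400


def training_days(total_flops: float,
                  peak_flop_per_s: float,
                  utilization: float,
                  num_gpus: int) -> float:
    """Convert a FLOPs count into wall-clock days.

    total_flops      -- a count of operations (FLOPs)
    peak_flop_per_s  -- a per-GPU rate (FLOP/s)
    utilization      -- assumed fraction of peak actually achieved, in (0, 1]
    num_gpus         -- number of GPUs working in parallel
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    # Rate of the whole cluster, in FLOP/s.
    effective_rate = peak_flop_per_s * utilization * num_gpus
    # count / rate = seconds
    seconds = total_flops / effective_rate
    return seconds / SECONDS_PER_DAY


# Hypothetical inputs: 1e21 FLOPs of work, 1e14 FLOP/s peak per GPU,
# 50% utilization, 100 GPUs.
days = training_days(1e21, 1e14, 0.5, 100)
print(f"{days:.2f} days")
```

The units do the work here: a budget in FLOPs never changes with the hardware, while the time it takes depends entirely on the FLOP/s available.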
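FLOPs (lowercase s): floating-point operations, the count-based unit of pretraining cost. The all-caps form FLOPS is sometimes used for the same count, but it more often denotes the per-second rate, so check context.
FLOP/s: floating-point operations per second; the rate at which hardware such as a GPU can compute.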
Scaling laws: the empirical finding (Kaplan et al. 2020) that loss falls predictably with more compute, more parameters, and more training data, in smooth power-law curves.
Sample efficiency (in this context): how much loss reduction a model gets from each training token. Bigger models are more sample-efficient (Kaplan et al. 2020).
Chinchilla rule: the compute-optimal balance of roughly 20 training tokens per model parameter (Hoffmann et al. 2022).
Compute-optimal: the model-and-data combination that minimizes loss for a fixed FLOPs budget.
Data-undertrained: a model trained on fewer tokens than the Chinchilla rule recommends for its parameter count. Most pre-2022 frontier models fit this description.
Pre-Chinchilla era: the period (roughly 2019-2022) before the Chinchilla rule was widely adopted, characterized by parameter-scaling outpacing data-scaling.
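The Chinchilla rule, compute-optimal, and data-undertrained entries can be tied together in a short sketch. It assumes the common approximation that training cost is about 6 FLOPs per parameter per token (C ≈ 6·N·D), which is not stated in these notes, and it treats the 20 tokens-per-parameter figure as an exact constant even though the rule is only approximate. All function names and model sizes are hypothetical.

```python
import math

TOKENS_PER_PARAM = 20.0       # Chinchilla rule of thumb (approximate)
FLOPS_PER_PARAM_TOKEN = 6.0   # assumption: C ~= 6 * N * D


def compute_optimal(flops_budget: float) -> tuple[float, float]:
    """Return (parameters, tokens) for a fixed FLOPs budget.

    Substituting D = 20 * N into C = 6 * N * D gives C = 120 * N**2,
    so N = sqrt(C / 120) and D = 20 * N.
    """
    params = math.sqrt(flops_budget / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens


def is_data_undertrained(params: float, tokens: float) -> bool:
    """True if the model saw fewer tokens than the rule recommends."""
    return tokens / params < TOKENS_PER_PARAM


# Hypothetical budget of 1.2e22 FLOPs.
n, d = compute_optimal(1.2e22)
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")

# Quadrupling the budget doubles both the model and the data target.
n4, d4 = compute_optimal(4 * 1.2e22)
print(f"params ~ {n4:.2e}, tokens ~ {d4:.2e}")

# A hypothetical 1e10-parameter model trained on 5e10 tokens (5 per parameter).
print(is_data_undertrained(1e10, 5e10))
```

Under these assumptions, both the model size and the token count grow with the square root of the budget, which is why more compute still means a larger model.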
Scale produces predictable improvement. Chinchilla pinned the compute-optimal split at roughly 20 tokens per parameter. Pre-Chinchilla pretraining left compute on the table by growing parameters faster than data.
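What "predictable" means for a power law can be shown with a small sketch. It assumes the simplest form, loss = a · compute^(-b), with no irreducible-loss term, and the constants a and b below are made up purely for illustration; they are not fitted values from Kaplan et al. or any other paper.

```python
def power_law_loss(compute: float, a: float, b: float) -> float:
    """Loss under a pure power law: a * compute ** (-b)."""
    return a * compute ** (-b)


# Illustrative constants only; real values depend on architecture,
# data, tokenizer, and optimization setup.
A, B = 10.0, 0.05

# Doubling compute multiplies loss by the same factor, 2 ** (-B),
# whether the starting budget is small or large.
ratio_small = power_law_loss(2e18, A, B) / power_law_loss(1e18, A, B)
ratio_large = power_law_loss(2e22, A, B) / power_law_loss(1e22, A, B)
print(f"{ratio_small:.4f} {ratio_large:.4f}")
```

The constant ratio is the curve shape that carries across setups; the absolute loss values do not.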