References: Token by token: how a transformer generates text

Source material

• Stanford CME 295: Transformers & Large Language Models, Autumn 2025
  Instructors: Afshine Amidi & Shervine Amidi, Stanford University
  Course site: https://cme295.stanford.edu/
  Cheatsheet: https://cme295.stanford.edu/cheatsheet/
  Source lecture (Lecture 3, Large Language Models): https://www.youtube.com/watch?v=Q5baLehv5So
  License (lecture videos): as published on Stanford's public YouTube channel
  License (Amidi cheatsheets): MIT
This lesson adapts the sampling-strategies, context-length, and temperature
sections of Stanford CME 295 Lecture 3 (Large Language Models). The earlier
Clawdemy lessons in this track adapt Lecture 1 (Transformer architecture).
Clawdemy provides original notes, summaries, and quizzes derived from this
material for educational purposes. All rights to the original lectures remain
with Stanford and the instructors.

Going deeper

A short list, chosen for durability. Each link is for a specific next step, not a generic “learn more.”

“How to generate text: using different decoding methods for language generation with Transformers” by Patrick von Platen on the Hugging Face blog. Side-by-side examples of greedy, beam search, pure sampling, top-k, and top-p on the same prompt, with the actual generated text shown for each. Read this if you want to see the strategies behave on real prompts rather than just understand them in the abstract.
Hugging Face generate() documentation. The reference for the most-used open-source generation API. Lists every decoding parameter, what it does, and how they compose. Useful when you read someone’s inference code and need to look up what a flag does.
“Let’s build the GPT Tokenizer” by Andrej Karpathy. The clearest from-scratch implementation of the prediction loop on the public web. The episode title is about tokenization, but the back half walks through generation, KV caching, and the loop itself in Python. Two hours; watch after this lesson if you want to see the loop running.
Tiktokenizer. The same interactive tool we used in the tokens lesson. Useful here too: paste any prompt, see how it tokenizes, and count its input tokens against the model's context window (in most APIs, max_tokens caps the generated output, not the prompt). The number of tokens in your prompt determines prefill cost.
Stanford CME 295 cheatsheet by the Amidi twins. MIT-licensed. The inference and decoding section gives a one-page reference for the same material this lesson covers, in their dense visual style.

Adjacent topics

Topics that build on or sit beside this one.

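Speculative decoding. A small draft model proposes several tokens at a time; the main model checks them all in one parallel forward pass. Accepted tokens are kept; at the first rejection, a corrected token is sampled using the main model's probabilities, so every pass still yields at least one token. With the accept/reject rule from the original paper, the output distribution matches the main model's exactly, so the speedup is paid for with extra draft-model compute rather than with output quality. Worth understanding once you have the basic loop in your head; a future lesson candidate.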
Beam search. An older decoding alternative that keeps the K highest-scoring candidate sequences in parallel and picks the one with the highest joint probability at the end. Standard in older translation systems and still useful when you want a deterministic “best” output rather than a sampled one. Less common in chat models because it tends to produce flat or repetitive prose on long, open-ended outputs.
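Constrained decoding and structured output. Forcing generation to follow a grammar (JSON, a regex, a specific schema) by masking the logits of tokens that would violate the constraint, that is, setting them to negative infinity so they receive zero probability. (Setting a logit to zero is not enough: a zero logit still gets nonzero probability after the softmax.) The basis for “JSON mode” and “tool use” features in modern APIs. The mechanism is exactly the lesson’s sample step, with the candidate set restricted before the softmax.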
Encoder-only versus decoder-only architectures. This lesson assumes a decoder-only setup (the GPT-style architecture that generates tokens autoregressively). Encoder-only models (the BERT-style architecture) produce embeddings of full inputs and do not generate token-by-token. The block-level mechanics are largely the same, except that the decoder's attention is causally masked while the encoder attends in both directions; the inference behavior is different.
Where to go next. Within Lecture 3, the natural next stop is the prompting lesson, which covers how to talk to the trained model. The Lecture 5 lesson on fine-tuning and RLHF covers what happens between pretraining and the chat assistant you actually talk to (the post-training step that bolts instruction-following onto the base model). If you want to broaden out instead, Track 1 covers the human side of working with AI; the two tracks are independent.
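To make the constrained-decoding item above concrete, here is a minimal sketch of logit masking for a single decoding step. The five-token vocabulary, the logit values, and the allowed set are hypothetical, and the helper names (softmax, mask_logits) are illustrative rather than taken from any library. A real system would derive the allowed set from a grammar or schema at every step.

```python
import math

def softmax(logits):
    """Numerically stable softmax; a logit of -inf maps to probability 0."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mask_logits(logits, allowed_ids):
    """Keep logits of allowed token ids; set all others to -inf."""
    allowed = set(allowed_ids)
    if not allowed:
        raise ValueError("the constraint must allow at least one token")
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]

# Hypothetical vocabulary and logits for one step (illustration only).
vocab = ["{", "}", '"name"', "hello", ":"]
logits = [1.0, 0.5, 2.0, 3.0, 0.1]

# Assumption: the grammar permits only "{" or '"name"' as the next token.
allowed_ids = [0, 2]

unconstrained = softmax(logits)
constrained = softmax(mask_logits(logits, allowed_ids))

for token, before, after in zip(vocab, unconstrained, constrained):
    print(f"{token!r:>8}  before={before:.3f}  after={after:.3f}")
```

The unconstrained favorite ("hello") ends up with probability exactly zero, and the remaining mass is renormalized over the allowed tokens, so sampling or greedy selection afterward can only pick a valid token.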

Original sources

The primary sources for the decoding strategies covered.

“The Curious Case of Neural Text Degeneration”, Holtzman et al., 2019. The paper that introduced top-p (nucleus) sampling. The titular “degeneration” refers to the repetitive, low-quality output of models decoded with likelihood-maximizing methods such as greedy decoding and beam search; nucleus sampling was the proposed fix. Required reading if you want to understand why “just use greedy” is bad advice for long outputs.
“Hierarchical Neural Story Generation”, Fan et al., 2018. The paper that introduced top-k sampling for neural language models. Predates nucleus sampling by a year and walks through why pure sampling and beam search both fall short for long-form generation.
“Fast Inference from Transformers via Speculative Decoding”, Leviathan et al., 2022. The speculative-decoding paper. Read it if you want to understand the inference-time optimization that several recent open-source models advertise.
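The two truncation strategies from the first two papers above can be sketched in a few lines. This is a simplified illustration over a hypothetical next-token distribution, not the papers' reference code; the function names (top_k_filter, top_p_filter, sample) and the example probabilities are assumptions made for the example.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable token ids and renormalize their mass."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = order[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose mass reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return {i: probs[i] / mass for i in keep}

def sample(dist, rng):
    """Draw one token id from a {token_id: probability} mapping."""
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]

# Hypothetical next-token probabilities over a 5-token vocabulary.
probs = [0.5, 0.3, 0.1, 0.06, 0.04]

top_k = top_k_filter(probs, k=2)     # fixed-size candidate set
top_p = top_p_filter(probs, p=0.85)  # candidate set sized by probability mass

rng = random.Random(0)
next_id = sample(top_p, rng)
```

With these numbers, top-k with k=2 keeps tokens 0 and 1, while top-p with p=0.85 keeps tokens 0, 1, and 2 because the first two only cover 0.8 of the mass. The difference is the point: top-k always keeps the same number of candidates, while top-p keeps more when the distribution is flat and fewer when it is peaked.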

Community discussion

None selected for this lesson. The public discussion of decoding strategies has consolidated into the Hugging Face blog post above, the Karpathy video, and the academic papers. If a canonical thread surfaces, it will be added at the next quarterly review.