Cheatsheet: LLM foundations for production
The three properties (at the API level)
Section titled “The three properties (at the API level)”| Property | Implication |
|---|---|
| Works in tokens | Cost, context, latency are denominated in tokens (not characters/words) |
| Stateless between calls | Your app keeps conversation state; replay relevant pieces into the prompt |
| Trained on a fixed corpus to some date | Newer or domain knowledge must come through prompt or retrieval |
Autoregressive generation
Section titled “Autoregressive generation”1. Tokenize the prompt2. Network -> distribution over next tokens3. Sample one (per your sampling settings)4. Append; repeat step 25. Stop on EOS token, max_tokens, or stop sequenceOutput streams one token at a time -> streaming is possible.
Sampling controls
Section titled “Sampling controls”| Knob | What it does | Typical |
|---|---|---|
temperature | Peakiness of distribution (low = deterministic, high = varied) | Answer apps 0-0.4; creative higher; 0 = closest to deterministic |
top_p (nucleus) | Smallest set whose probs sum to p | ~0.9-1.0 common |
max_tokens | Cap on output length | Cap deliberately; controls cost + latency |
| stop sequences | Strings that halt generation | Set when format requires it |
Defaults usually suffice; tune for consistency, variance, or format needs.
The three productive limits
Section titled “The three productive limits”Context length
Section titled “Context length”- Hard input cap per model (tens of K to hundreds of K tokens).
- Shared budget: system + user + retrieved context + few-shot + history +
max_tokensoutput. - Hitting it -> tighter prompts (lesson 3), better retrieval (lesson 4), summarized history, or a longer-context model.
Cost per token
Section titled “Cost per token”- Input and output priced separately; output usually several times more per token.
- Long system prompts compound at scale. A 4K-token system prompt is paid for every request.
- Biggest levers: cap
max_tokens, prompt for conciseness, pick a cheaper model for sub-tasks.
Latency
Section titled “Latency”total time ≈ TTFT + (output_tokens / tokens_per_second)- TTFT rises with input length.
- Total time rises linearly with output length.
- Streaming masks it at the UX layer (lesson 6): user sees tokens within ~1s, not after 5s.
Back-of-envelope template
Section titled “Back-of-envelope template”inputs/req = sys + retrieved + user (+ history)input_cost/day = requests/day x inputs/req x $/M_inputoutput_cost/day= requests/day x avg_output x $/M_outputtotal_time = TTFT + (output_tokens / tps)The reframing for the rest of the track
Section titled “The reframing for the rest of the track”Prompt design, retrieval, UX, and ops decisions are largely deliberate moves against context, cost, or latency. Name the limits; the techniques look like targeted moves rather than recipes.
Words to use precisely
Section titled “Words to use precisely”- Token: the integer unit the model processes; the unit of cost, context, and latency.
- Stateless: each API call stands alone; state lives in your application.
- TTFT: time-to-first-token; user-perceived “the model is responding.”
- Context window: the maximum input tokens the model can take in one call.
Source
Section titled “Source”- Full Stack Deep Learning, LLM Bootcamp (Spring 2023): LLM Foundations.
fullstackdeeplearning.com/llm-bootcamp. Independent structural mirror in original prose; see references.