
Summary: LLM foundations for production

This is the working picture a production builder needs. A hosted LLM is a function from token sequences to next tokens. Three properties matter for design: it works in tokens (cost, context, and latency are all denominated in tokens), it is stateless between calls (your app keeps the conversation state), and it was trained on a fixed corpus up to some date (newer or domain-specific knowledge must come in through the prompt or retrieval). It generates autoregressively: it samples one token at a time from a distribution until it hits end-of-sequence, max_tokens, or a stop sequence. temperature and top_p control that sampling. Three production limits bound every design decision: context length (the hard token budget, shared between the input and the max_tokens output), cost per token (input and output are priced separately, with output usually several times more expensive; it compounds at scale), and latency (total ≈ TTFT + output_tokens / tokens_per_second, where TTFT is time to first token; streaming masks it at the UX layer). This is the scan version; the lesson works the numbers.
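
The generation loop and the two sampling controls described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not how any particular provider implements decoding: `sample_next`, `generate`, and `toy_logits` are hypothetical names, and the "model" is a stand-in function that returns raw scores for a four-token vocabulary.

```python
import math
import random


def sample_next(logits, temperature=1.0, top_p=1.0, rng=random):
    """Pick one token id from raw scores using temperature and top_p (nucleus) sampling."""
    if temperature <= 0:
        # Assumption: treat temperature 0 as greedy decoding (take the top score).
        return max(range(len(logits)), key=lambda i: logits[i])

    # Temperature rescales the scores: low values sharpen the distribution,
    # high values flatten it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract peak for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Nucleus: keep the smallest set of most-likely tokens whose mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Sample among the kept tokens, weighted by their probabilities.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def generate(next_logits, prompt_ids, eos_id, max_tokens,
             temperature=1.0, top_p=1.0, rng=random):
    """Autoregressive loop: score, sample, append, repeat until EOS or max_tokens."""
    ids = list(prompt_ids)
    output = []
    for _ in range(max_tokens):
        token = sample_next(next_logits(ids), temperature, top_p, rng)
        if token == eos_id:
            break
        ids.append(token)      # the sampled token becomes input for the next step
        output.append(token)   # a streaming API would emit the token here
    return output


def toy_logits(ids):
    """Hypothetical stand-in for a model: strongly prefers (last token + 1) mod 4."""
    scores = [0.0, 0.0, 0.0, 0.0]
    scores[(ids[-1] + 1) % 4] = 5.0
    return scores


if __name__ == "__main__":
    # Token 3 plays the role of end-of-sequence in this toy vocabulary.
    print(generate(toy_logits, [0], eos_id=3, max_tokens=10, temperature=0))
    print(generate(toy_logits, [0], eos_id=3, max_tokens=10, temperature=1.5, top_p=0.9))
```

Because each token is produced before the next one is computed, the loop has a natural place to emit partial output, which is why streaming is possible.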

  • Three properties: tokens (not characters/words), stateless between calls (app keeps state), trained to a date (anything else comes through prompt or retrieval).
  • Autoregressive generation: tokenize -> next-token distribution -> sample -> append -> repeat -> stop. One token at a time; streaming is therefore possible.
  • Sampling controls: temperature (peakiness; low = near-deterministic, high = varied), top_p (nucleus; sample only from the smallest set of tokens whose probabilities sum to at least p). Defaults usually suffice.
  • Context length: hard token budget per model; system + user + retrieved context + few-shot + history + max_tokens all share it; hitting it forces tighter prompts, better retrieval, summarized history, or a longer-context model.
  • Cost per token: input and output priced separately, output usually several times more. Long system prompts compound at scale; capping max_tokens and keeping prompts concise are the biggest cost levers. Models vary widely in price.
  • Latency: total ≈ TTFT + (output_tokens / tokens_per_second). Long inputs raise TTFT; long outputs raise total time linearly. Streaming masks it at the UX layer.
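
The three limits can be checked with back-of-the-envelope arithmetic before any code ships. The sketch below is one way to do that. Every name in it (`ModelLimits`, `fits_context`, `cost_per_call`, `latency_seconds`) is hypothetical, and the numbers in the example are made up for illustration, not taken from any real model's pricing or speed. It also assumes a single window shared by input and output; some providers apply a separate output cap as well.

```python
from dataclasses import dataclass


@dataclass
class ModelLimits:
    """Hypothetical per-model numbers; fill in from your provider's documentation."""
    context_window: int           # total tokens shared by input and max_tokens
    input_price_per_mtok: float   # currency units per million input tokens
    output_price_per_mtok: float  # currency units per million output tokens
    ttft_seconds: float           # typical time to first token
    tokens_per_second: float      # typical output generation speed


def fits_context(limits: ModelLimits, input_tokens: int, max_tokens: int) -> bool:
    """Input and the reserved output budget must fit in the window together."""
    return input_tokens + max_tokens <= limits.context_window


def cost_per_call(limits: ModelLimits, input_tokens: int, output_tokens: int) -> float:
    """Input and output are priced separately, per million tokens."""
    return (input_tokens * limits.input_price_per_mtok
            + output_tokens * limits.output_price_per_mtok) / 1_000_000


def latency_seconds(limits: ModelLimits, output_tokens: int) -> float:
    """total ≈ TTFT + output_tokens / tokens_per_second."""
    return limits.ttft_seconds + output_tokens / limits.tokens_per_second


if __name__ == "__main__":
    # Illustrative values only.
    limits = ModelLimits(context_window=8000, input_price_per_mtok=1.0,
                         output_price_per_mtok=4.0, ttft_seconds=0.5,
                         tokens_per_second=50.0)
    # system + user + retrieved context + history
    input_tokens = 1500 + 200 + 3000 + 1000
    max_tokens = 500

    print("fits:", fits_context(limits, input_tokens, max_tokens))
    per_call = cost_per_call(limits, input_tokens, max_tokens)
    print("cost per call:", per_call)
    print("cost per million calls:", per_call * 1_000_000)
    print("worst-case latency (s):", latency_seconds(limits, max_tokens))
```

Running the same arithmetic with a shorter system prompt or a lower max_tokens shows how the two levers named above move the per-call cost and the worst-case latency.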

These three limits are the constraints under which every later design decision lives, and naming them early prevents a lot of expensive mistakes. Teams that have not internalized the limits plan unbounded retrieval, get surprised by the bill, or over-optimize the wrong latency number. The techniques the rest of the track teaches (better prompts, retrieval, UX, observability) are largely responses to context, cost, and latency, and once you have the picture they read as targeted choices rather than recipes to memorize. With foundations in hand, the next lesson is the prompt-engineering toolkit, the highest-leverage way to spend tokens better.

A hosted LLM is a stateless next-token function bounded by context, cost, and latency. Hold that picture, and the prompt, retrieval, UX, and ops decisions that fill the rest of this track stop being arbitrary and start looking like deliberate moves against those three limits.