LLM foundations for production

You shipped a working app in lesson 1. To make sane decisions from here, you need a working picture of what the model actually does behind the API call, not a deep theory, just enough to make sensible choices about prompts, costs, latency, and what the model can and cannot reach. This lesson is that picture. It is short and concrete on purpose: foundations for builders, not for researchers.

What an LLM is, at the level you call it

A modern hosted large language model is, at the level you interact with it, a function that takes a sequence of text tokens and returns the next one. Repeat that step (each new token feeding back into the input) and you get a generated response. Three things about the model matter for your design choices, even before you look at any architecture:

It works in tokens, not characters or words. A tokenizer turns text into integer tokens; the model operates on those. Common English words are usually one token; rare words, code, and non-English text often take more. Cost, context limits, and latency are all denominated in tokens, not in characters or in your sense of “how long the text feels.”
It has no memory between calls. A request is stateless. Whatever the model “knows” in this turn must be present in the prompt for this call. Conversation history, retrieved context, system instructions: if it is not in the prompt, the model cannot see it. (Your application keeps the state; you replay relevant pieces into the prompt.)
It was trained on a fixed corpus up to some date, then post-trained to follow instructions. Its broad knowledge is what was in that training data; anything more recent, or anything specific to your domain, must come through the prompt or through retrieval (lesson 4).

Those three points are the entire framing you need to use a hosted LLM well. Everything below adds detail.

How it actually generates

When your application calls the API, the model does this loop:

1. Tokenize the prompt into integer tokens.
2. Run those through the network; produce a probability distribution
   over possible next tokens.
3. Sample a single next token from that distribution (according to your
   sampling settings).
4. Append the new token to the sequence; repeat from step 2.
5. Stop when an end-of-sequence token is sampled, or when max_tokens
   is reached, or when a stop sequence matches.

That is autoregressive generation: each token is sampled from a distribution conditioned on everything before it. Two practical implications:

The output is one token at a time. This is why streaming (lesson 6) is possible and why time-to-first-token (TTFT) and tokens-per-second are the two latency numbers that matter, not just total time.
Sampling is controllable. Two arguments you will see on every provider’s API:
- Temperature scales how peaky the distribution is. Lower temperature (closer to 0) makes the model pick the highest-probability tokens; output is more deterministic. Higher temperature flattens the distribution; output is more varied. Production answer apps usually run low (0 to 0.4); creative tasks run higher. Setting temperature to 0 is the closest you can get to deterministic responses (within the provider’s implementation).
- Top-p (nucleus sampling) limits sampling to the smallest set of tokens whose cumulative probability reaches p. A common combination is moderate temperature plus top-p around 0.9 or 1.0.

You do not need to tune these often; defaults usually suffice. You do need to know they exist and what they do, so you can dial them when an app needs more or less variance.

The three productive limits

This is the part of the lesson you will think about every week as a builder. A hosted LLM has three limits that bound every design decision:

Context length

The context length (sometimes called the context window) is the maximum number of tokens the model can take as input in a single call. It is a hard limit set by the model: trying to send more rejects the request. Common values today range from tens of thousands of tokens (older models) into the hundreds of thousands or more (current frontier models). Two practical consequences:

Everything you want the model to consider must fit. System prompt + user message + retrieved context + few-shot examples + conversation history all share the same budget.
The output is also bounded by context. Max-tokens for the response is part of the same window; a model with a 200K context will not give you a 200K response on top of a 200K prompt.

When you hit the context wall, your options are to shrink the prompt (better retrieval, lesson 4; tighter prompts, lesson 3), summarize prior turns, or move to a longer-context model.

Cost per token

Hosted APIs price per token, separately for input tokens (your prompt) and output tokens (the model’s response). Output is usually two to several times more expensive per token than input. Practical consequences:

Long prompts are cheap per call but compound at scale. If your application sends a 4,000-token system prompt to every user request, you are paying for those 4,000 tokens every time. Optimizing prompt length is real money at any reasonable volume.
Generation is the more expensive side. Capping max-tokens and prompting for concise responses are the largest cost levers most applications have.
Different models cost very different amounts. Cheaper, smaller models can be wildly more economical for sub-tasks; the build-vs-buy and which-model decisions are economic as much as quality decisions.

Latency

Two numbers matter to a user: time-to-first-token (TTFT), the wall-clock from request to the first streamed token, and tokens-per-second, the streaming rate after that. Total response time is approximately the time-to-first-token plus the output token count divided by the tokens-per-second rate. Practical consequences:

Long inputs raise TTFT. Processing a long prompt takes time before any output begins. The prefill phase from a generation engine’s perspective is the input-handling cost.
Long outputs raise total time linearly. Concise responses do not just save money; they save user time.
Streaming is how you mask latency. Showing tokens as they arrive turns a six-second total into “first words in a second,” which is the difference between a usable app and a slow one. UX details belong in lesson 6, but the latency profile is set here.

Why this matters when you build AI

These three limits are the constraints under which every design decision lives, and naming them early stops a lot of expensive mistakes. The team that does not know context length is bounded plans for unlimited retrieval; the team that does not track cost per token is surprised by the bill; the team that does not separate TTFT from total time over-optimizes the wrong number. Most production-quality LLM applications you admire are well-tuned along these three axes, and the techniques in the rest of the track (better prompts, retrieval, UX, observability) are largely ways to stay within them while still doing useful work. With this picture in hand, the next lesson is the prompt-engineering toolkit, which is the highest-leverage way to spend tokens better.

What you should remember

A hosted LLM is, at the level you call it, a function from a token sequence to the next token, applied repeatedly. It works in tokens (not characters or words), is stateless between calls (your app keeps state), and was trained on a fixed corpus up to some date (anything outside must come through prompt or retrieval).
Generation is autoregressive: sample one token from a probability distribution, append, repeat, until an end token, the max-tokens limit, or a stop sequence stops it. Output streams one token at a time, which is why streaming is possible.
Temperature and top-p control sampling. Lower temperature is more deterministic; higher temperature is more varied. Top-p (nucleus) limits sampling to the smallest set whose probability sums to p. Defaults usually suffice.
Context length is the hard input limit. System plus user plus retrieved context plus few-shot plus history plus max-tokens all share one budget. Hitting it forces tighter prompts, better retrieval, or a longer-context model.
Cost per token differs for input vs output, and output is usually several times more expensive. Long system prompts compound at scale; concise responses are the biggest cost lever.
Latency is the time-to-first-token plus the output token count divided by the tokens-per-second rate. Long inputs raise time-to-first-token; long outputs raise total time linearly. Streaming is how you mask latency at the UX layer (lesson 6).

A hosted LLM is a stateless next-token function bounded by context, cost, and latency. Hold that picture, and the prompt, retrieval, UX, and ops decisions that fill the rest of this track stop being arbitrary and start looking like deliberate moves against those three limits.