LLM foundations: brief

What you’ll learn

You shipped the minimum app in lesson 1. This lesson is the working picture that makes the rest of the track’s decisions sane. The source curriculum is the Full Stack Deep Learning LLM Bootcamp (Spring 2023), by Charles Frye, Sergey Karayev, and Josh Tobin, freely available at fullstackdeeplearning.com/llm-bootcamp with recorded lectures on the Full Stack Deep Learning YouTube channel.

You will state the three properties of a hosted LLM at the API level (works in tokens, stateless between calls, trained to a fixed date); walk the autoregressive generation loop and identify what temperature and top_p control; apply the three productive limits to design decisions (context length as the hard input budget shared with max_tokens; cost per token with input/output priced separately and output usually more; latency as TTFT + output_tokens / tokens_per_second); estimate API spend and total response time for a realistic application; and recognize that the rest of the track’s techniques (prompts, retrieval, UX, ops) are deliberate moves against these three limits.
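As a rough illustration of the generation loop and of what the two sampling controls do, here is a minimal Python sketch. It is not any provider's implementation: toy_model, next_token_distribution, and generate are hypothetical names, the three-token vocabulary and its scores are made up, and the sketch assumes temperature is greater than zero.

```python
import math
import random


def next_token_distribution(logits, temperature=1.0, top_p=1.0):
    """Turn raw token scores into the distribution that is actually sampled."""
    # Temperature divides the scores before softmax:
    # values below 1 sharpen the distribution, values above 1 flatten it.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(weights.values())
    probs = {tok: w / total for tok, w in weights.items()}

    # top_p keeps the smallest set of most-likely tokens whose cumulative
    # probability reaches top_p, then renormalizes over that set.
    kept, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    kept_total = sum(kept.values())
    return {tok: p / kept_total for tok, p in kept.items()}


def generate(model, prompt_tokens, max_tokens, temperature=1.0, top_p=1.0, seed=0):
    """Autoregressive loop: score, sample one token, append, repeat."""
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):  # max_tokens caps the output length
        dist = next_token_distribution(model(tokens), temperature, top_p)
        choice = rng.choices(list(dist), weights=list(dist.values()))[0]
        if choice == "<eos>":  # a stop token ends generation early
            break
        tokens.append(choice)  # the new token becomes part of the next input
    return tokens


def toy_model(tokens):
    """Hypothetical stand-in for a model: fixed scores regardless of input."""
    return {"a": 2.0, "b": 1.0, "<eos>": 0.0}


# With top_p=0.5 only the most likely token ("a") survives the cutoff,
# so the output is deterministic even though sampling is random.
print(generate(toy_model, ["Hello"], max_tokens=5, top_p=0.5))
```

The point of the sketch is the shape of the loop: every output token requires another pass over everything generated so far, which is why output length shows up in both cost and latency.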

Where this fits

This is lesson 2 of 11, the second lesson of Phase 1 (foundations and the first app). It sits between the minimum app of lesson 1 (which surfaced the practical questions) and the prompt-engineering toolkit of lesson 3 (the first deliberate move against the constraints this lesson names). Every later lesson in the track is, in some sense, an answer to context, cost, or latency.

Before you start

Prerequisites: lesson 1 of this track (the minimum-viable app’s five components and pipeline shape, to which this lesson adds the underlying constraints). Familiarity with reading API documentation helps; the lesson references messages.create-style calls but does not require running any code.

About the math

Arithmetic, not calculus. The practice section computes daily/monthly API spend from per-request token counts and per-million pricing, and decomposes total response time as TTFT + output_tokens / tokens_per_second. No derivations; the only formulas are counting arguments.
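The two calculations described above can be sketched in a few lines of Python. The function names and every number in the example are illustrative assumptions, not any provider's actual prices or measured speeds; substitute the figures for the model you use.

```python
def monthly_cost_usd(requests_per_day, input_tokens, output_tokens,
                     input_price_per_million, output_price_per_million,
                     days=30):
    """Estimate spend from per-request token counts and per-million pricing."""
    # Input and output tokens are priced separately.
    per_request = (input_tokens * input_price_per_million
                   + output_tokens * output_price_per_million) / 1_000_000
    return per_request * requests_per_day * days


def response_time_seconds(ttft_seconds, output_tokens, tokens_per_second):
    """Total response time = TTFT + output_tokens / tokens_per_second."""
    return ttft_seconds + output_tokens / tokens_per_second


# Hypothetical application: 10,000 requests a day, 1,500 input tokens and
# 300 output tokens per request, priced at $3 and $15 per million tokens.
cost = monthly_cost_usd(10_000, 1_500, 300,
                        input_price_per_million=3.0,
                        output_price_per_million=15.0)
print(f"monthly spend: about ${cost:,.0f}")

# Hypothetical speed: 0.5 s to first token, then 50 tokens per second.
latency = response_time_seconds(0.5, 300, 50)
print(f"total response time: {latency:.1f} s")
```

With these made-up numbers the 300 output tokens cost as much as the 1,500 input tokens and account for 6 of the 6.5 seconds, which is the pattern the lesson asks you to look for in your own estimates.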

By the end, you’ll be able to

The single capability this lesson builds: explain at a working level what an LLM is, how it generates, and the productive limits (context length, cost per token, latency) that bound every design decision. Concretely, you will be able to:

State the three properties of a hosted LLM at the API level
Walk the autoregressive generation loop and identify what temperature and top_p control
Apply the context-length, cost-per-token, and latency limits to design decisions
Estimate API spend and total response time for a realistic application
Recognize that later production techniques are deliberate moves against these limits

Time and difficulty

Read time: about 12 minutes
Practice time: about 12 minutes (a back-of-envelope cost-and-latency exercise on a realistic application, plus flashcards)
Difficulty: standard (no math beyond arithmetic; the work is internalizing the three limits)