LLM foundations for production
What you’ll learn
You shipped the minimum app in lesson 1. This lesson is the working picture that makes the rest of the track’s decisions sane. The source curriculum is the Full Stack Deep Learning LLM Bootcamp (Spring 2023), by Charles Frye, Sergey Karayev, and Josh Tobin, freely available at fullstackdeeplearning.com/llm-bootcamp with recorded lectures on the Full Stack Deep Learning YouTube channel.
You will state the three properties of a hosted LLM at the API level: it works in tokens, is stateless between calls, and is trained to a fixed date. You will walk the autoregressive generation loop and identify what temperature and top_p control. You will apply the three productive limits to design decisions: context length as the hard input budget shared with max_tokens; cost per token, with input and output priced separately and output usually more expensive; and latency as time to first token (TTFT) + output_tokens / tokens_per_second. Finally, you will estimate API spend and total response time for a realistic application, and recognize that the rest of the track’s techniques (prompts, retrieval, UX, ops) are deliberate moves against these three limits.
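To make the context-length limit concrete, here is a minimal sketch of the budget check it implies: the prompt and the reserved output (max_tokens) have to fit in the same window. The function names and the 8,000-token window are illustrative assumptions, not any provider's real limits or API.

```python
def remaining_input_budget(context_window: int, max_tokens: int) -> int:
    """Tokens left for the prompt once max_tokens is reserved for the output."""
    return context_window - max_tokens


def fits_context(input_tokens: int, max_tokens: int, context_window: int) -> bool:
    """True if the prompt plus the reserved output fits in the window."""
    return input_tokens + max_tokens <= context_window


# Hypothetical numbers: an 8,000-token window with 1,000 tokens reserved for output.
CONTEXT_WINDOW = 8_000
MAX_TOKENS = 1_000

print(remaining_input_budget(CONTEXT_WINDOW, MAX_TOKENS))  # room left for the prompt
print(fits_context(6_500, MAX_TOKENS, CONTEXT_WINDOW))     # a 6,500-token prompt
print(fits_context(7_500, MAX_TOKENS, CONTEXT_WINDOW))     # a 7,500-token prompt
```

Raising max_tokens shrinks the room for instructions, history, and retrieved documents, which is why the two are treated as one shared budget.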
Where this fits
This is lesson 2 of 11, the second lesson of Phase 1 (foundations and the first app). It sits between the minimum-app of lesson 1 (which surfaced the practical questions) and the prompt-engineering toolkit of lesson 3 (the first deliberate move against the constraints this lesson names). Every later lesson in the track is, in some sense, an answer to context, cost, or latency.
Before you start
Prerequisites: lesson 1 of this track (the minimum-viable app’s five components and pipeline shape; this lesson adds the constraints that sit under them). Familiarity with reading API documentation helps; the lesson references messages.create-style calls but does not require running any code.
About the math
Arithmetic, not calculus. The practice section computes daily/monthly API spend from per-request token counts and per-million pricing, and decomposes total response time as TTFT + output_tokens / tokens_per_second. No derivations; the only formulas are counting arguments.
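As a rough illustration of the two calculations described above, the sketch below estimates spend and response time. Every number in it (prices, token counts, request volume, TTFT, and generation speed) is a made-up assumption for the arithmetic, not any provider's actual pricing or performance, and the names Pricing, request_cost, and response_time are hypothetical helpers.

```python
from dataclasses import dataclass


@dataclass
class Pricing:
    """Hypothetical prices in USD per million tokens (not real provider rates)."""
    input_per_million: float
    output_per_million: float


def request_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Cost of one request, with input and output tokens priced separately."""
    total = (input_tokens * pricing.input_per_million
             + output_tokens * pricing.output_per_million)
    return total / 1_000_000


def response_time(ttft_s: float, output_tokens: int, tokens_per_second: float) -> float:
    """Total response time = TTFT + output_tokens / tokens_per_second."""
    return ttft_s + output_tokens / tokens_per_second


# Assumed workload: 2,000 input and 500 output tokens per request, 10,000 requests a day.
pricing = Pricing(input_per_million=3.0, output_per_million=15.0)
per_request = request_cost(input_tokens=2_000, output_tokens=500, pricing=pricing)
daily = per_request * 10_000
monthly = daily * 30  # 30-day month

# Assumed serving speed: 0.5 s to first token, then 50 tokens per second.
total_s = response_time(ttft_s=0.5, output_tokens=500, tokens_per_second=50.0)

print(f"per request: ${per_request:.4f}")
print(f"daily: ${daily:.2f}, monthly: ${monthly:.2f}")
print(f"response time: {total_s:.1f} s")
```

With these assumed inputs, the 500 output tokens account for more of the per-request cost than the 2,000 input tokens, and nearly all of the response time is generation rather than TTFT.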
By the end, you’ll be able to
The single capability this lesson builds: explain at a working level what an LLM is, how it generates, and the productive limits (context length, cost per token, latency) that bound every design decision. Concretely, you will be able to:
- State the three properties of a hosted LLM at the API level
- Walk the autoregressive generation loop and identify what temperature and top_p control
- Apply the context-length, cost-per-token, and latency limits to design decisions
- Estimate API spend and total response time for a realistic application
- Recognize that later production techniques are deliberate moves against these limits
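The generation loop and the two sampling parameters named in these objectives can be sketched as a toy. This is an illustration under stated assumptions, not how any hosted model is implemented: toy_logits is a hypothetical stand-in for a real model, and next_token_distribution and generate are made-up helper names. The sketch also requires temperature > 0.

```python
import math
import random


def toy_logits(tokens):
    """Hypothetical stand-in for a model: fixed scores regardless of context."""
    return {"the": 2.0, "a": 1.0, "cat": 0.5, "<eos>": 0.0}


def next_token_distribution(logits, temperature=1.0, top_p=1.0):
    """Return {token: probability} after temperature scaling and top_p filtering."""
    # Temperature divides the logits: values below 1 sharpen the distribution,
    # values above 1 flatten it. This sketch assumes temperature > 0.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}

    # top_p keeps the smallest set of most-likely tokens whose cumulative
    # probability reaches top_p, then renormalizes over that set.
    kept, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    kept_total = sum(kept.values())
    return {tok: p / kept_total for tok, p in kept.items()}


def generate(logits_for, prompt_tokens, max_tokens, temperature=1.0, top_p=1.0, seed=0):
    """Autoregressive loop: sample one token, append it, feed the result back in."""
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        dist = next_token_distribution(logits_for(tokens), temperature, top_p)
        choice = rng.choices(list(dist), weights=list(dist.values()))[0]
        if choice == "<eos>":  # the end-of-sequence token stops generation early
            break
        tokens.append(choice)
    return tokens


print(next_token_distribution(toy_logits([]), temperature=1.0))
print(next_token_distribution(toy_logits([]), temperature=0.5))
print(generate(toy_logits, ["hello"], max_tokens=5, temperature=1.0, top_p=0.9))
```

Comparing the first two printed distributions shows lower temperature concentrating probability on the top token, while lowering top_p removes the unlikely tail from consideration entirely.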
Time and difficulty
- Read time: about 12 minutes
- Practice time: about 12 minutes (a back-of-envelope cost-and-latency exercise on a realistic application, plus flashcards)
- Difficulty: standard (no math beyond arithmetic; the work is internalizing the three limits)