Summary: Augmented language models

Phase 2 opens here. The two patterns that take an LLM beyond what it was trained on: RAG feeds the model fetched context; tool use lets the model call external systems. RAG has seven moving parts: knowledge source, chunking, embedding model, vector store, retriever, prompt composition with sources, and generation with citations. The real work is in the trade-offs: chunk size and overlap, top-k, embedding-model choice, re-ranking (cost vs quality), hybrid search (dense + sparse), and metadata filtering. The recurring failure mode is bad retrieval the model cannot detect; a held-out retrieval-evaluation set is the cheapest defense. Tool use is four steps: declare tools, model emits a tool-call request, your code executes and returns, model continues. RAG is often implemented as a tool, letting the model decide when retrieval is needed (cleaner and cheaper than always-retrieve). Every move respects lesson 2’s three productive limits: context, cost, latency. This is the scan version; the lesson designs the pipeline.

Core ideas

Two augmentation patterns: RAG (fetch context into the prompt) and tool use (let the model call external systems). Modern applications use both, often with RAG implemented as a tool.
RAG’s seven moving parts: knowledge source -> chunking -> embedding model -> vector store -> retriever (top-k) -> prompt composition (with sources) -> generation (with citations).
RAG trade-offs: chunk size and overlap; top-k; embedding-model choice; re-ranking with a more expensive model; hybrid search (dense + sparse / BM25) for both semantic and exact-term matches; metadata filtering for structured data.
Recurring RAG failure mode: bad retrieval the model cannot detect; the model answers confidently from wrong chunks. Defense: a held-out retrieval-evaluation set, measured separately from end-to-end answer quality.
Tool use is four steps: declare tool schemas; model decides (reply or emit a tool-call request); your code executes and returns the result; model continues with another tool, refinement, or final answer.
RAG-as-a-tool > always-retrieve. The model decides when retrieval is needed; simple requests avoid the retrieval cost. Same shape underlies agent behavior (lesson 10).
All three productive limits apply: retrieval shares the context budget, every retrieved chunk per request is paid every time, retrieval is wall-clock latency before generation.

What changes for you

This lesson is where applied LLM work actually lives. The model is the easy part (you call an API); the application’s quality is decided by how well you fetch the right context, how cleanly you let the model call your systems, and how well you evaluate both. Teams shipping strong products spend more time on retrieval quality and tool design than on prompts at this stage of maturity, and the prompts they do iterate on are about how retrieved context or tool results are presented. The next lesson reads a real application end-to-end so the parts you just learned have a worked-example shape; Phase 2 then turns to the UX layer (lesson 6) and the operational layer (lesson 7) that wrap all of this.

RAG and tools are where applied LLM work actually lives. The model is the easy part; the application’s quality is decided by how well you fetch the right context, how cleanly you let the model call your systems, and how well you evaluate both.