How RAG works, in brief

What you’ll learn

This lesson belongs to Phase 6 (How models reason and act) of Track 5 (AI Foundations). Earlier lessons covered how the transformer works (Phase 2), how it was trained to follow instructions (Phase 4), and how it generates text and how to prompt it (Phase 5). Course materials are at cme295.stanford.edu.

This lesson covers how to give the trained model knowledge it does not already have, in the right shape, at the right moment. Retrieval-augmented generation (RAG) is the canonical pattern for grounding a language model in fresh, private, or domain-specific information without retraining it. The lesson walks the pipeline stage by stage: chunking documents into 200-800-token pieces, embedding each chunk with a bi-encoder, vector-searching for the chunks closest to the query embedding, constructing a prompt that includes the retrieved chunks as context, and generating a grounded response. It then explains the two-stage retrieval pattern that production systems use: a bi-encoder for fast recall over millions of chunks, followed by a cross-encoder for precision rescoring over a few hundred candidates. It names HyDE (hypothetical document embeddings) as the standard fix for the question-vs-answer shape mismatch in dense retrieval, and frames how 1M-2M-token context windows in 2026 (with Llama 4 Scout’s 10M as the exception) complement rather than displace RAG. It closes on indirect prompt injection, the structural security issue RAG creates: an attacker poisons content that the application later retrieves on a benign user’s behalf, and the operator never sees the attack land.
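
As a rough orientation, the pipeline stages named above can be sketched in a few lines of Python. This is a minimal sketch under loud assumptions, not a production design: `embed` is a toy bag-of-words stand-in for a real bi-encoder, tokens are whitespace-separated words, and the function names, sample documents, and prompt wording are all hypothetical. The final generation step is left out; the constructed prompt is what would be sent to the model.

```python
import math
import re
from collections import Counter


def chunk(text, max_tokens=12):
    """Split text into pieces of at most max_tokens whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]


def embed(text):
    """Toy stand-in for a bi-encoder (assumption): a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, index, k=2):
    """Return the k chunk texts whose embeddings are closest to the query embedding."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(query, chunks):
    """Place the retrieved chunks in the prompt as numbered context."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


# Hypothetical corpus, indexed once ahead of query time.
documents = [
    "The refund window is 30 days from delivery.",
    "Support is available on weekdays from 9 to 5.",
    "Shipping to Canada takes about one week.",
]
index = [(c, embed(c)) for doc in documents for c in chunk(doc)]

query = "How long is the refund window?"
top = retrieve(query, index, k=1)
prompt = build_prompt(query, top)  # this string would be passed to the model
```

The index is built once, while retrieval and prompt construction run per query; that split is what lets the corpus change without retraining the model.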

Where this fits

This is a lesson in Phase 6, How models reason and act. Phase 5 covered inference-time steering (generation loop, prompting mechanics, few-shot, chain-of-thought). This lesson adds retrieval: how to augment a prompt with fresh or private information the model was not trained on. It builds directly on the prompting lesson (RAG is prompting plus retrieval) and the embeddings lesson (vector search is embedding lookup with a similarity step on top). The other Phase 6 lessons cover reasoning models, function calling, and agent loops; together they trace what happens when a single LLM call gets supplemented with thinking time, retrieval, tools, or chained tool sequences.

Before you start

Prerequisites: the prompting lesson and the embeddings lesson are required. RAG is prompting plus retrieval; if you cannot read a prompt and identify what is doing the work in it, the RAG flow will not make sense. Vector search uses the same embedding mechanics covered in the embeddings lesson. Phase 4 (SFT and RLHF) is helpful background for understanding why RAG is often the right alternative to fine-tuning, but is not required.

By the end, you’ll be able to

Explain what RAG is and the three problems it solves (knowledge cutoff, private data, hallucination grounding)
Trace the canonical RAG pipeline (chunking, embedding, retrieval, prompt construction, generation) and identify the role of each stage
Distinguish bi-encoder retrieval (fast recall over millions of chunks) from cross-encoder rerankers (slow precision over the bi-encoder’s top candidates), and recognize HyDE (hypothetical document embeddings) as the standard fix for the question-vs-answer shape mismatch in dense retrieval
Identify indirect prompt injection as the structural security issue RAG introduces and explain why it is harder to defend against than direct injection
Compare RAG with long-context prompting in 2026, when 1M-2M-token windows are mainstream and Llama 4 Scout reaches 10M: long context complements RAG rather than displacing it, and the choice between them depends on per-query cost, latency, traceability, freshness, and corpus scale
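
To make the two-stage retrieval objective concrete, here is a minimal sketch of the recall-then-rerank shape. Both scorers are toy stand-ins and are assumptions for illustration only: `recall_score` imitates a bi-encoder by comparing representations that were computed independently, and `rerank_score` imitates a cross-encoder by reading the query and chunk together (here it just counts shared adjacent word pairs). The function names and sample chunks are hypothetical.

```python
import re


def tokens(text):
    """Lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def recall_score(query_vec, chunk_vec):
    """Stage 1 stand-in (assumption): cheap comparison of independent representations."""
    return len(query_vec & chunk_vec)


def rerank_score(query, chunk):
    """Stage 2 stand-in (assumption): joint reading of query and chunk,
    approximated here by counting shared adjacent word pairs."""
    q, c = tokens(query), tokens(chunk)
    return len(set(zip(q, q[1:])) & set(zip(c, c[1:])))


def two_stage(query, chunks, recall_k=2, final_k=1):
    # Chunk representations would be precomputed offline in a real system.
    chunk_vecs = [(c, set(tokens(c))) for c in chunks]
    query_vec = set(tokens(query))
    # Stage 1: fast recall over the whole corpus.
    candidates = sorted(
        chunk_vecs, key=lambda cv: recall_score(query_vec, cv[1]), reverse=True
    )[:recall_k]
    # Stage 2: slower, more precise rescoring over the short candidate list only.
    reranked = sorted(
        (c for c, _ in candidates), key=lambda c: rerank_score(query, c), reverse=True
    )
    return reranked[:final_k]


chunks = [
    "Window cleaning: refund the deposit when the policy ends.",
    "The refund policy allows returns within 30 days.",
    "Our office is closed on public holidays.",
    "Shipping takes one week.",
]
result = two_stage("what is the refund policy", chunks)
```

In this toy corpus the first two chunks tie on word overlap in stage 1, and only the joint scorer in stage 2 separates the chunk that actually discusses the refund policy from the one that merely shares its vocabulary.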

Time and difficulty

Read time: about 25 minutes
Practice time: about 15 minutes (a pipeline-tracing exercise: walk a query through each RAG stage and predict where it would fail)
Difficulty: standard