References: How RAG works: feeding the model what it does not already know

Source material

• Stanford CME 295: Transformers & Large Language Models, Autumn 2025
  Instructors: Afshine Amidi & Shervine Amidi, Stanford University
  Course site: https://cme295.stanford.edu/
  Cheatsheet: https://cme295.stanford.edu/cheatsheet/
  Source lecture (Lecture 7, Agentic LLMs): https://www.youtube.com/watch?v=h-7S6HNq0Vg
  License (lecture videos): as published on Stanford's public YouTube channel
  License (Amidi cheatsheets): MIT
This lesson adapts the retrieval-augmented-generation portion of Stanford
CME 295 Lecture 7 (Agentic LLMs). The lecture also covers function calling,
agents, and the ReAct framework, which we have not yet adapted into Clawdemy
lessons. Clawdemy provides original notes, summaries, and quizzes derived
from this material for educational purposes. All rights to the original
lectures remain with Stanford and the instructors.

Going deeper

A short list, chosen for durability. Each link is for a specific next step, not a generic “learn more.”

“Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”, Lewis et al., 2020. The original RAG paper, from Facebook AI Research (now Meta AI) and collaborators. Establishes the pattern of conditioning a generator on retrieved documents and shows it outperforms parametric-knowledge-only baselines on open-domain question answering. The methodology section is readable; the experiments are dated but the framing has held up.
“Dense Passage Retrieval for Open-Domain Question Answering”, Karpukhin et al., 2020. The DPR paper. Establishes the embedding-based retrieval approach that became the default for RAG. Read this if you want to understand how the embedding side of retrieval moved from sparse (BM25-only) to dense (vector-based) and why hybrid search exists today.
“Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks”, Reimers and Gurevych, 2019. The Sentence-BERT paper. Defines the bi-encoder architecture used for stage-1 candidate retrieval in this lesson, plus the siamese and triplet training setups that make embedding-based similarity work. The paper is short and very readable; the comparison table of bi-encoder vs cross-encoder in section 4 is the canonical reference for the architectural distinction this lesson covers.
“Precise Zero-Shot Dense Retrieval without Relevance Labels” (HyDE), Gao et al., 2022. The HyDE paper. Introduces the hypothetical-document-embedding technique. Useful for the empirical evidence that the query-vs-document shape gap is real and addressable; section 3 has the mechanism, section 4 has the experiments.
“Lost in the Middle: How Language Models Use Long Contexts”, Liu et al., 2023. Important empirical finding: when retrieved context is long, models often pay disproportionate attention to the beginning and end and miss content in the middle. Direct implication for prompt construction in RAG (small K with high-precision retrieval often beats large K with more recall).
“RAGAS: Automated Evaluation of Retrieval Augmented Generation”, Es et al., 2023. The RAGAS evaluation framework. Provides metrics for the two distinct axes of RAG quality (retrieval and generation) and an LLM-judge approach to scoring them. The repo and the methodology are both worth time if you are evaluating any RAG system.
“Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection”, Greshake et al., 2023. The paper that named indirect prompt injection and demonstrated it against deployed systems. Required reading before you build anything that puts a model on top of retrieved or user-supplied text. Linked in the prompting lesson too; it is the canonical citation for both lessons.
Simon Willison’s RAG and prompt-injection writeups. The most readable running coverage of indirect-injection attacks in the wild, plus regular RAG-architecture commentary. Good practitioner companion to the academic literature.
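
To make the bi-encoder retrieval idea from the DPR and Sentence-BERT entries above concrete, here is a minimal sketch of stage-1 retrieval: documents and the query are each mapped to a vector independently, and the top K documents by cosine similarity are returned. The three-dimensional vectors, the document ids, and the function names cosine and top_k are all illustrative assumptions; a real system would get its vectors from a trained embedding model and would typically use a vector index rather than a full sort.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]

# Hand-made vectors standing in for real embeddings (assumption, not model output).
DOCS = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.1],
    "api-rate-limits": [0.0, 0.1, 0.9],
}
# Pretend embedding of a question about getting money back.
QUERY = [0.8, 0.3, 0.0]

print(top_k(QUERY, DOCS, k=2))
```

Keeping k small here mirrors the “Lost in the Middle” point: fewer, more relevant passages in the prompt are often easier for the model to use than a long tail of weak matches.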

Adjacent topics

Topics that build on or sit beside this one.

Reranking. A second-stage model rescores the top-K from retrieval before they reach the language model. Cross-encoder rerankers (e.g., the Cohere Rerank API or open-source equivalents like bge-reranker) often produce a clear win for modest cost. Search terms: “cross-encoder reranker,” “two-stage retrieval.”
Chunking strategies. An area of active research and practitioner discussion. Common options include sliding-window chunking with overlap, semantic chunking (split on embedding-space discontinuities), recursive structure-aware chunking, and late chunking (embed long passages first, then chunk). The right choice depends on document type; experiment with a labeled query set.
Long-context models vs RAG. Models with very long context windows (hundreds of thousands of tokens) raise the question: why retrieve at all when you could put the whole corpus in the prompt? The honest answer is that long context is expensive per token, attention cost scales unfavorably with sequence length, and the “lost in the middle” effect persists at length. RAG remains preferable for cost, latency, and traceability reasons even when long context is technically available.
Agentic retrieval. Approaches where the model itself decides what to search for, possibly across multiple turns, instead of running retrieval once on the user query. Search terms: “ReAct,” “agentic RAG,” “self-RAG,” “tool-using language models.” Sits at the intersection of this lesson and the planned agents lesson.
Knowledge graphs and structured retrieval. RAG over unstructured text is the default; RAG over knowledge graphs, SQL databases, or structured APIs is increasingly common. The pipeline shape is similar (retrieve, frame, generate); the retrieval substrate is different.
Where to go next. This lesson is the first piece of Stanford CME 295 Lecture 7 (Agentic LLMs) we have adapted; the rest of that lecture covers function calling, agents, and the ReAct framework, which are queued for future lessons. Check the tracks index for the latest published lessons.
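
As one concrete reference point for the chunking strategies listed above, the sliding-window option can be sketched in a few lines. This is a simplified illustration, not a recommended production chunker: the function name sliding_window_chunks is hypothetical, and splitting on whitespace is an assumption made for readability (real pipelines usually count tokens with the embedding model's tokenizer).

```python
def sliding_window_chunks(tokens, size, overlap):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks

# Whitespace "tokens" are an assumption for the sake of a readable demo.
words = "retrieval works best when each chunk carries one complete idea".split()
for chunk in sliding_window_chunks(words, size=4, overlap=1):
    print(" ".join(chunk))
```

The overlap means a sentence that straddles a window boundary still appears intact in at least part of a neighboring chunk, at the cost of storing and embedding some text twice.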

Original sources

The primary papers for the techniques covered, in chronological order.

“Reading Wikipedia to Answer Open-Domain Questions”, Chen et al., 2017. Pre-RAG and pre-transformer-LLM, but the conceptual root: a retriever-plus-reader architecture for open-domain question answering. The architectural ancestor of modern RAG.
“Dense Passage Retrieval for Open-Domain Question Answering”, Karpukhin et al., 2020. Dense retrieval as we now know it.
“Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”, Lewis et al., 2020. The RAG paper.
“REALM: Retrieval-Augmented Language Model Pre-Training”, Guu et al., 2020. A parallel approach that integrates retrieval into the pretraining objective itself rather than adding it at inference time. Less common in production today (the inference-time RAG approach won on operational simplicity) but conceptually clarifying.
“Atlas: Few-shot Learning with Retrieval Augmented Language Models”, Izacard et al., 2022. End-to-end retrieval-augmented training at scale; useful counterpoint when thinking about whether to integrate retrieval at training time or only at inference time.
“Not what you’ve signed up for”, Greshake et al., 2023. Indirect prompt injection.

Community discussion

None selected for this lesson. The public discussion of RAG is moving quickly; vendor-published blog posts (LangChain, LlamaIndex, the major LLM providers) are useful for current implementation patterns but rotate too quickly to be worth pinning here. Simon Willison’s writeups (linked in Going deeper) are the most stable practitioner voice on the security side. If a durable practitioner thread on the architecture side surfaces, it will be added at the next quarterly review.