
Cheatsheet: How RAG works: feeding the model what it does not already know

Retrieve → put in prompt → generate.
The G is generation. The new piece is what you put in front of the model
when "what to put" is buried in documents the model has never seen.
| Problem | Why direct prompting fails | How RAG fixes it |
| --- | --- | --- |
| Knowledge cutoff | Pretraining ended at a fixed date | Index has documents from after the cutoff; model uses them at query time |
| Private data | Your corpus was not in pretraining and should not be | Index your private corpus; model never trains on it, only reads it at query time |
| Hallucination grounding | A confident no-source answer is hard to verify | Model returns answer alongside the chunks it was given; verification is now possible |
| Stage | Input | Output | Runs when |
| --- | --- | --- | --- |
| 1. Chunking | Raw documents | Coherent text chunks (200-800 tokens typical) | Index time |
| 2. Embedding | Each chunk | A vector in the embedding model’s space; stored in vector DB | Index time |
| 3. Retrieval | User query embedded with the same model | Top-K nearest chunks (often K is 5-10) | Query time |
| 4. Prompt construction | Retrieved chunks + grounding instruction + user query | The prompt the model will see | Query time |
| 5. Generation | Constructed prompt | Model’s answer (ideally with citations to chunks) | Query time |
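Stages 1 to 3 can be sketched in a few lines. This is a toy illustration under loud assumptions: `embed` is a bag-of-words counter standing in for a real embedding model, the "vector DB" is a plain Python list, and every function name here (`chunk_text`, `embed`, `cosine`, `build_index`, `retrieve`) is hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, max_words: int = 40) -> list[str]:
    """Stage 1: split on sentence ends, then pack sentences into chunks."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def embed(text: str) -> Counter:
    """Stage 2 stand-in: sparse word counts (NOT a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(documents: list[str]) -> list[tuple[str, Counter]]:
    """Index time: chunk every document and store (chunk, vector) pairs."""
    return [(c, embed(c)) for doc in documents for c in chunk_text(doc)]


def retrieve(query: str, index: list[tuple[str, Counter]], k: int = 5) -> list[str]:
    """Stage 3: embed the query with the same function, return the top-K chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]
```

Note that `retrieve` calls the same `embed` used at index time, which is the "same model on both sides" requirement in miniature.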
| Stage | Symptom | Common cause |
| --- | --- | --- |
| Chunking | Retrieved chunks lack context, or the answer is split across chunks | Wrong boundary strategy for the document type; chunks too large or too small |
| Embedding | Semantically similar chunks land far apart, or unrelated chunks cluster together | Domain mismatch between embedding model and corpus |
| Retrieval | Model says “I don’t know” but the answer is in the corpus | Pure semantic search missing keyword matches; top-K too small; query phrasing very different from document phrasing |
| Prompt construction | Model uses prior knowledge despite having context | Missing or weak grounding instruction; context buried under boilerplate |
| Generation | Model cites a chunk that does not actually support the claim | Confabulating a citation to satisfy the instruction; right answer was not retrieved |
| Lever | Practical guidance |
| --- | --- |
| Same embedding model on both sides | Documents and queries must go through the same model. Different models = meaningless distances. |
| Hybrid search | Combine semantic (vector) and lexical (BM25) search, merge rankings. Catches both paraphrases and literal-string queries. |
| K (top-K) | Often 5 or 10. Larger K = more recall but more tokens in the prompt and more noise. |
| Reranking | An optional second-stage model rescores the top-K from retrieval before passing them to generation. Often cheap, often a clear win. |
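The "merge rankings" step in hybrid search can be done several ways. One common choice is reciprocal rank fusion (RRF); the source does not name a merge method, so treat RRF, the function name, and the constant `k=60` (a conventional default) as assumptions. The chunk IDs are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of chunk IDs; each list contributes 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


semantic = ["c3", "c1", "c7"]  # hypothetical vector-search ranking
lexical = ["c1", "c9", "c3"]   # hypothetical BM25 ranking
merged = reciprocal_rank_fusion([semantic, lexical])
```

Chunks that appear in both rankings (`c1`, `c3`) rise above chunks found by only one side, which is the point of the merge.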
| | Bi-encoder | Cross-encoder |
| --- | --- | --- |
| Architecture | Query and chunk encoded separately | Query and chunk encoded together with self-attention across both |
| Output | Two vectors, compared via cosine similarity | Single relevance score |
| Speed | Fast (chunk vectors precomputed at index time) | Slow (full encoder per query-chunk pair) |
| Accuracy | Recall-heavy (good first pass) | Precision-heavy (good final ranking) |
| Used for | Stage 1: candidate retrieval (millions of chunks → ~100) | Stage 2: reranking (~100 candidates → top 5-10) |
| Canonical model | Sentence-BERT | Often a smaller transformer trained for relevance scoring |

Production RAG uses both. Bi-encoder pulls a wide net; cross-encoder rescores for precision.
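The two-stage shape can be sketched with pluggable scorers. This is an illustration only: `word_overlap` and `phrase_match` are toy stand-ins for a bi-encoder and a cross-encoder, all names are hypothetical, and a real bi-encoder would compare against chunk vectors precomputed at index time rather than scoring text pairs on the fly.

```python
from typing import Callable, Sequence


def two_stage_retrieve(
    query: str,
    chunks: Sequence[str],
    cheap_score: Callable[[str, str], float],
    pair_score: Callable[[str, str], float],
    n_candidates: int = 100,
    top_k: int = 5,
) -> list[str]:
    # Stage 1: the cheap scorer (bi-encoder stand-in) casts a wide net.
    candidates = sorted(chunks, key=lambda c: cheap_score(query, c), reverse=True)
    candidates = candidates[:n_candidates]
    # Stage 2: the expensive scorer (cross-encoder stand-in) sees only the candidates.
    return sorted(candidates, key=lambda c: pair_score(query, c), reverse=True)[:top_k]


def word_overlap(query: str, chunk: str) -> float:
    """Toy stage-1 scorer: number of shared lowercase words."""
    return float(len(set(query.lower().split()) & set(chunk.lower().split())))


def phrase_match(query: str, chunk: str) -> float:
    """Toy stage-2 scorer: rewards the query appearing as an exact phrase."""
    return 1.0 if query.lower() in chunk.lower() else 0.0
```

The expensive scorer runs at most `n_candidates` times per query, however large the corpus is; that cap is what makes the slow model affordable.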

HyDE: closing the query-vs-document shape gap

Problem: queries are short questions; documents are longer answers.
Their embeddings often live in different regions of vector space
even when the underlying topic matches.
HyDE solution:
1. Take the user's query
2. Make one extra LLM call: "write a plausible answer document for this query"
3. Embed the HYPOTHETICAL DOCUMENT (not the original query)
4. Use that embedding for retrieval
Why it works: the hypothetical document is shaped like real documents,
so its embedding lands in the right region of vector space.
Costs: one extra LLM call per query.
Wins: ambiguous or short queries against diverse corpora.
Doesn't always win: well-formed queries see no improvement.
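The four HyDE steps above can be sketched as a single function. The LLM, the embedding model, and the vector search are passed in as callables because the source names no specific library; `hyde_retrieve`, `generate`, `embed`, and `search` are all hypothetical names, and the prompt wording is only one possible phrasing.

```python
from typing import Callable, Sequence


def hyde_retrieve(
    query: str,
    generate: Callable[[str], str],
    embed: Callable[[str], Sequence[float]],
    search: Callable[[Sequence[float], int], list[str]],
    k: int = 5,
) -> list[str]:
    # Step 2: one extra LLM call drafts a plausible answer document.
    hypothetical_doc = generate(
        f"Write a plausible answer document for this query:\n{query}"
    )
    # Step 3: embed the hypothetical document, not the original query.
    vector = embed(hypothetical_doc)
    # Step 4: use that embedding for nearest-neighbor retrieval.
    return search(vector, k)
```

The hypothetical document is thrown away after embedding; only the real chunks returned by `search` go into the prompt, so factual errors in the draft do not reach the answer directly.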
You are a helpful assistant. Answer the user's question using only the
context below. If the context does not contain the answer, say "I don't
know based on the provided context." Cite the chunk number for each
claim you make.
[Context]
[1] {chunk_1_text}
[2] {chunk_2_text}
[3] {chunk_3_text}
[User question]
{user_query}
| Element | Why it is there |
| --- | --- |
| Source labels ([1], [2], …) | Lets the model cite per-chunk and makes ungrounded answers visible |
| Grounding instruction | Reduces (does not eliminate) the model falling back on pretraining when context is silent |
| Refusal clause (“say I don’t know”) | Gives the model a graceful fallback when the answer is not in context |
| Order of context vs question | Test both; the right order varies by model |
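Filling the template is plain string assembly. A minimal sketch, assuming the retrieved chunks arrive as a list of strings already in rank order; `build_prompt` is a hypothetical name, and the context-before-question order simply mirrors the template above rather than a tested recommendation.

```python
def build_prompt(chunks: list[str], user_query: str) -> str:
    """Assemble the grounded prompt: instruction, numbered context, question."""
    instruction = (
        "You are a helpful assistant. Answer the user's question using only the\n"
        "context below. If the context does not contain the answer, say \"I don't\n"
        "know based on the provided context.\" Cite the chunk number for each\n"
        "claim you make."
    )
    # Source labels [1], [2], ... let the model cite per chunk.
    context = "\n".join(f"[{i}] {text}" for i, text in enumerate(chunks, start=1))
    return f"{instruction}\n[Context]\n{context}\n[User question]\n{user_query}"
```

To test the other ordering, swap the `[Context]` and `[User question]` sections in the final f-string and compare results on the same query set.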
| Use RAG when… | Use fine-tuning when… |
| --- | --- |
| Facts change over time | You want a behavior or style baked in once |
| You need to cite sources | You want consistency without per-call token cost |
| Corpus is private | Change is about how the model writes, not what it knows |
| Corpus is too large to train on | Volume of examples justifies the training run |
| Documents added/removed on the fly | The behavior should not depend on what is in the prompt |

They stack: fine-tune for style, use RAG for facts.

Direct injection: user types the attack into the input field
→ operator can at least see the input
Indirect injection: attacker plants the attack inside a document that gets indexed
→ operator may not have written the document, may not control it, and may not have read it
| Mitigation | What it does | What it does not do |
| --- | --- | --- |
| Source provenance (prefer known authors; risk-flag open-submission sources) | Reduces the chance an attacker-controlled document is in the index | Eliminate the attack surface (any fresh content can be malicious) |
| Content sanitization at index time (strip instruction-frame patterns) | Catches lazy attacks | Stop adversarial wording the sanitizer did not anticipate |
| Output filtering (watch for sudden topic changes, unfounded refusals) | Catches some compromises after the fact | Prevent the compromise from happening |
| Action sandboxing (gate side-effecting actions on confirmations) | Limits damage if the model is compromised | Stop the model from saying compromising things |
| Visible citations | Makes injection visible to the user | Prevent the injection itself |

Design rule: treat the entire retrieval index as untrusted input. Do not give a model on top of it access to anything you would not let the indexed content control directly.
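Action sandboxing is the mitigation that most directly enforces this rule. A minimal sketch, assuming the application routes every model-requested action through one gate; the action names in `SIDE_EFFECTING`, the `run_action` function, and the `confirm` callback are all hypothetical.

```python
from typing import Callable

# Hypothetical action names; anything with side effects needs confirmation.
SIDE_EFFECTING = {"send_email", "delete_record", "issue_refund"}


def run_action(
    name: str,
    handler: Callable[[], str],
    confirm: Callable[[str], bool],
) -> str:
    """Run read-only actions directly; gate side-effecting ones on a human yes."""
    if name in SIDE_EFFECTING and not confirm(name):
        return f"blocked: {name} was not confirmed"
    return handler()
```

The gate limits what a compromised model can do; it does nothing about what the model says, which is why the table pairs it with output filtering and visible citations.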

| Pitfall | Reality |
| --- | --- |
| RAG is a magic upgrade | RAG works when retrieval works. Bad chunking + bad embeddings + smartest model = bad answers. |
| Use RAG for everything | Translation, summarization of provided text, and format conversion need no retrieval. RAG adds latency and surface area without benefit on these. |
| Skimping on chunking | Document-aware chunkers (sentence, paragraph, structure-aware) consistently beat naive fixed-length splits. |
| Trusting retrieved content blindly | Anything in your index is, in effect, in your prompt. Treat it as untrusted. |
| Skipping evaluation | RAG quality has two axes (did retrieval find it, did the model use it). Measure both with a labeled query set. |
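The two evaluation axes can be measured separately over a labeled query set. A rough sketch, assuming each labeled query comes with the IDs of its relevant chunks and one expected answer string; both function names are hypothetical, and substring matching is a crude proxy for answer correctness, not a standard metric.

```python
def hit_rate_at_k(retrieved: list[list[str]], relevant: list[set[str]], k: int) -> float:
    """Axis 1: fraction of queries with at least one relevant chunk in the top K."""
    hits = sum(1 for got, want in zip(retrieved, relevant) if want & set(got[:k]))
    return hits / len(relevant)


def answer_accuracy(answers: list[str], expected: list[str]) -> float:
    """Axis 2: fraction of answers that contain the labeled expected string."""
    correct = sum(1 for a, e in zip(answers, expected) if e.lower() in a.lower())
    return correct / len(expected)
```

Reading the two numbers together localizes the failure: a low hit rate points at chunking, embedding, or retrieval, while a high hit rate with low answer accuracy points at prompt construction or generation.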
| Vendor claim | What it usually means |
| --- | --- |
| “Our assistant knows your company docs” | RAG with your docs in the index. Almost never fine-tuning. |
| “95% accuracy on customer questions” | Some weighted combination of retrieval recall + generation correctness. Ask which. |
| “Built-in citations” | Source labels in the prompt template + a grounding instruction. Useful; not novel. |
| “We use [vector DB]” | Indexing infrastructure choice. Does not by itself say anything about retrieval quality. |
  • RAG (retrieval-augmented generation): the pattern of retrieving relevant documents from a knowledge base and inserting them into the prompt so the model can answer using them.
  • Chunk: a coherent text segment (typically a few hundred tokens) extracted from a source document for indexing.
  • Embedding model: a separate, usually smaller model that maps text to vectors so semantically similar text ends up nearby.
  • Vector database: a database optimized for fast nearest-neighbor lookup over high-dimensional vectors. Examples: Pinecone, Weaviate, Chroma. FAISS is a similarity-search library rather than a full database, but it often fills the same role.
  • Top-K retrieval: returning the K vector-DB entries closest to a query vector. K is usually 5 or 10 in production RAG.
  • Cosine similarity: the most common similarity metric for text embeddings; measures the angle between two vectors regardless of magnitude.
  • Hybrid search: combining vector similarity with lexical (keyword) search and merging the rankings.
  • BM25: the most common lexical-search ranking function; used as the lexical half of hybrid search.
  • Reranking: an optional second pass that uses a more expensive model to rescore the top-K before they reach the language model.
  • Bi-encoder: retrieval architecture where the query and each chunk pass through the embedding model separately; comparison happens after via cosine similarity. Fast, recall-heavy. Sentence-BERT is the canonical example.
  • Cross-encoder: reranking architecture where the query and chunk pass through the encoder together with self-attention across both. Slow, precision-heavy. Used for the second stage of two-stage retrieval.
  • HyDE (Hypothetical Document Embeddings): retrieval technique that addresses query-vs-document shape mismatch by generating a hypothetical answer document with an LLM call and embedding that document instead of the user’s query.
  • Grounding instruction: the part of the prompt template that tells the model to use only the provided context and refuse otherwise.
  • Ungrounded generation: the model producing an answer using its pretraining knowledge instead of, or in addition to, the retrieved context.
  • Indirect prompt injection: prompt-injection attacks delivered via documents in the retrieval index rather than via the user’s direct input.

Retrieval finds it.
The prompt frames it.
The model writes it.