Skip to content

Cheatsheet: Augmented language models

PatternWhat it doesWhen to use
RAGFetch context into the promptMissing knowledge
Tool useLet the model call external systemsMissing capabilities (search, compute, write actions)

Modern apps use both, often with RAG implemented as a tool.

knowledge source
-> chunking (size + overlap)
-> embedding model
-> vector store (Pinecone / Weaviate / Chroma / pgvector / ...)
-> retriever (embed query -> top-k similar chunks)
-> prompt composition (system + chunks-with-sources + user)
-> generation (with citations)
KnobRule of thumb
Chunk sizeA few hundred tokens; small overlap (50-100). Tune empirically.
Top-kStart 5-10; rarely above 20. More = better recall but bigger prompt.
Embedding modelTry a couple on your data; quality varies by domain.
Re-rankingBig-set cheap retrieval -> small-set expensive re-rank. Adds latency, lifts hard-query quality.
Hybrid searchDense + BM25. Beats either alone for queries that mix semantic and exact terms.
Metadata filteringTag chunks (date, type, customer ID, …); filter at retrieval. Often the biggest single win.

Failure: bad retrieval the model cannot detect; answer is confidently wrong.

Defense: held-out retrieval-evaluation set (queries with their expected relevant chunks); measure retrieval quality separately from end-to-end answer quality.

1. Declare tool schemas (name, description, parameter types)
2. Model decides: reply directly OR emit tool-call request
3. Your code executes the tool; returns result
4. Model continues: more tools, refine, or final answer

The model’s decision (which tool, when, with what args) is itself prompt-engineered: clear tool descriptions are the equivalent of clear instructions.

Always-retrieve: pays retrieval cost + latency on EVERY request
RAG-as-a-tool: model decides per request; simple requests skip retrieval

Cleaner, cheaper, matches the agent shape (lesson 10).

How it lives against the three productive limits

Section titled “How it lives against the three productive limits”
LimitThis lesson’s move
ContextRetrieved chunks share the budget; tighter retrieval / re-rank / metadata filter buys back budget
CostEvery chunk per request paid every time; RAG-as-a-tool eliminates retrieval on requests that don’t need it
LatencyRetrieval adds wall-clock before generation; cache common retrievals, async-fetch, right-size top-k
  • Chunk: a retrieval-sized piece of a source document.
  • Embedding: dense vector representing a chunk’s meaning.
  • Vector store: database for nearest-neighbor search over embeddings.
  • Top-k: how many chunks to retrieve per query.
  • Re-ranking: a second-stage scoring of an initial retrieval set with a more expensive model.
  • Hybrid search: dense (embedding) + sparse (BM25) retrieval combined.
  • Tool use / function calling: model emits a structured request to call a declared function; your code executes; result feeds back.
  • Full Stack Deep Learning, LLM Bootcamp (Spring 2023): Augmented Language Models. fullstackdeeplearning.com/llm-bootcamp. Independent structural mirror in original prose; see references.