Cheatsheet: Augmented language models
Two augmentation patterns
Section titled “Two augmentation patterns”| Pattern | What it does | When to use |
|---|---|---|
| RAG | Fetch context into the prompt | Missing knowledge |
| Tool use | Let the model call external systems | Missing capabilities (search, compute, write actions) |
Modern apps use both, often with RAG implemented as a tool.
RAG pipeline (seven moving parts)
Section titled “RAG pipeline (seven moving parts)”knowledge source -> chunking (size + overlap) -> embedding model -> vector store (Pinecone / Weaviate / Chroma / pgvector / ...) -> retriever (embed query -> top-k similar chunks) -> prompt composition (system + chunks-with-sources + user) -> generation (with citations)RAG trade-offs (where the real work is)
Section titled “RAG trade-offs (where the real work is)”| Knob | Rule of thumb |
|---|---|
| Chunk size | A few hundred tokens; small overlap (50-100). Tune empirically. |
| Top-k | Start 5-10; rarely above 20. More = better recall but bigger prompt. |
| Embedding model | Try a couple on your data; quality varies by domain. |
| Re-ranking | Big-set cheap retrieval -> small-set expensive re-rank. Adds latency, lifts hard-query quality. |
| Hybrid search | Dense + BM25. Beats either alone for queries that mix semantic and exact terms. |
| Metadata filtering | Tag chunks (date, type, customer ID, …); filter at retrieval. Often the biggest single win. |
RAG failure mode + defense
Section titled “RAG failure mode + defense”Failure: bad retrieval the model cannot detect; answer is confidently wrong.
Defense: held-out retrieval-evaluation set (queries with their expected relevant chunks); measure retrieval quality separately from end-to-end answer quality.
Tool use (four steps)
Section titled “Tool use (four steps)”1. Declare tool schemas (name, description, parameter types)2. Model decides: reply directly OR emit tool-call request3. Your code executes the tool; returns result4. Model continues: more tools, refine, or final answerThe model’s decision (which tool, when, with what args) is itself prompt-engineered: clear tool descriptions are the equivalent of clear instructions.
RAG-as-a-tool > always-retrieve
Section titled “RAG-as-a-tool > always-retrieve”Always-retrieve: pays retrieval cost + latency on EVERY requestRAG-as-a-tool: model decides per request; simple requests skip retrievalCleaner, cheaper, matches the agent shape (lesson 10).
How it lives against the three productive limits
Section titled “How it lives against the three productive limits”| Limit | This lesson’s move |
|---|---|
| Context | Retrieved chunks share the budget; tighter retrieval / re-rank / metadata filter buys back budget |
| Cost | Every chunk per request paid every time; RAG-as-a-tool eliminates retrieval on requests that don’t need it |
| Latency | Retrieval adds wall-clock before generation; cache common retrievals, async-fetch, right-size top-k |
Words to use precisely
Section titled “Words to use precisely”- Chunk: a retrieval-sized piece of a source document.
- Embedding: dense vector representing a chunk’s meaning.
- Vector store: database for nearest-neighbor search over embeddings.
- Top-k: how many chunks to retrieve per query.
- Re-ranking: a second-stage scoring of an initial retrieval set with a more expensive model.
- Hybrid search: dense (embedding) + sparse (BM25) retrieval combined.
- Tool use / function calling: model emits a structured request to call a declared function; your code executes; result feeds back.
Source
Section titled “Source”- Full Stack Deep Learning, LLM Bootcamp (Spring 2023): Augmented Language Models.
fullstackdeeplearning.com/llm-bootcamp. Independent structural mirror in original prose; see references.