
References: Prompt caching and context management

Source curriculum (structural mirror, cited as further study):
• Anthropic Academy (https://anthropic.skilljar.com/):
"Building with the Claude API" course
(prompt-caching and context-management sections)
License: Anthropic Academy course content is account-gated;
Clawdemy structurally mirrors the Academy's lesson progression
as inspiration and cites it as further study. Every substantive
claim in this lesson is verifiable against the public Anthropic
documentation.
Primary public-doc anchors (every substantive claim is verified against these):
• Anthropic, "Prompt caching" (the cache_control mechanism,
placement layers, pricing multipliers, the four-breakpoint rule,
minimum-cacheable sizes, usage field meanings, workspace
isolation)
https://platform.claude.com/docs/en/build-with-claude/prompt-caching
• Anthropic, "Context windows" (per-model context-window sizes,
context rot, context awareness, overflow stop reason,
token-counting API pointer)
https://platform.claude.com/docs/en/build-with-claude/context-windows
• Anthropic, "Compaction" (the compact_20260112 edit shape, the
compact-2026-01-12 beta header, the default 150K trigger, the
pause_after_compaction pattern, the system-prompt-cache-survival
recommendation, the usage.iterations billing breakdown,
the tool-call-during-summary workaround)
https://platform.claude.com/docs/en/build-with-claude/compaction
• Anthropic, "Context editing" (clear_tool_uses_20250919 and
clear_thinking_20251015, the context-management-2025-06-27
beta header, clear_at_least and keep parameters, the cache-
interaction asymmetry, client-side SDK compaction note)
https://platform.claude.com/docs/en/build-with-claude/context-editing
• Anthropic, "Effective context engineering for AI agents"
(the broader curation principles)
https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
Verbatim claims sourced from the public docs:
• "Prompt caching optimizes your API usage by allowing resuming
from specific prefixes in your prompts. This significantly
reduces processing time and costs for repetitive tasks or
prompts with consistent elements" (prompt-caching docs opener)
• "By default, the cache has a 5-minute lifetime. The cache is
refreshed for no additional cost each time the cached content
is used" (prompt-caching docs)
• Pricing multipliers verbatim: "5-minute cache write tokens are
1.25 times the base input tokens price; 1-hour cache write
tokens are 2 times the base input tokens price; Cache read
tokens are 0.1 times the base input tokens price"
(prompt-caching docs pricing section)
• "Shorter prompts cannot be cached, even if marked with
cache_control. Any requests to cache fewer than this number
of tokens will be processed without caching, and no error is
returned" (prompt-caching docs minimum-token section)
• "As token count grows, accuracy and recall degrade, a phenomenon
known as context rot. This makes curating what's in context just
as important as how much space is available" (context-windows
docs)
• "Server-side context compaction for managing long conversations
that approach context window limits" (compaction docs purpose)
• "As conversations get longer, models struggle to maintain focus
across the full history. Compaction keeps the active context
focused and performant by replacing stale content with concise
summaries" (compaction docs)
• "Context editing allows you to selectively clear specific
content from conversation history as it grows. Beyond optimizing
costs and staying within limits, this is about actively curating
what Claude sees" (context-editing docs)
• "For most use cases, server-side compaction is the primary
strategy for managing context in long-running conversations.
The strategies on this page are useful for specific scenarios
where you need more fine-grained control over what content is
cleared" (context-editing docs)
Required attribution: "Based on the structure of the Anthropic Academy
'Building with the Claude API' course
(https://anthropic.skilljar.com/). This lesson is an independent
structural mirror in original prose; every substantive claim is
verified against the public Anthropic Claude documentation at
https://platform.claude.com/docs/. Anthropic does not endorse it."
  • Anthropic, “Prompt caching”. The canonical reference: full pricing tables per model, the automatic-vs-explicit caching modes, the precedence rules for placement order, the workspace-isolation policy, and the per-model minimum-token table.
  • Anthropic, “Compaction”. The full beta-feature reference: every parameter of the compact_20260112 edit, the pause_after_compaction pattern in detail, the recommended-system-prompt-cache-breakpoint pattern, the usage.iterations shape, the tool-call-during-summary workaround.
  • Anthropic, “Context editing”. Both server-side strategies in depth (clear_tool_uses_20250919, clear_thinking_20251015), the client-side SDK compaction alternative, configuration options for each strategy.
  • Anthropic, “Effective context engineering for AI agents”. The engineering essay on what to put in context and what to leave out; the principled foundation under the surgical levers this lesson covers.

A short, durable list. Each link is a specific next step inside Track 22.

  • Lesson 8 of this track, “From single call to agent loop.” Where the cached + curated capability set from this lesson becomes the per-step inventory inside a multi-turn loop. Compaction and tool result clearing become load-bearing for any agent that runs for more than a few iterations.
  • Lesson 9 of this track, “Six effective-agent patterns.” Where session shape (single-shot, chained, parallel, orchestrator-workers, evaluator-optimizer, autonomous) decides which context-management posture fits.
  • Lesson 12 of this track, “Shipping a Claude application.” Where the cache-hit ratio and usage.iterations breakdown become production monitoring signals; the eval-set discipline measures whether each cached prefix is actually paying back.
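
The cache-hit ratio mentioned above can be sketched as a small helper over a response's usage object. This is a minimal sketch, not a definitive monitoring implementation: the function name cache_hit_ratio and the sample numbers are hypothetical, and it assumes the denominator is the sum of the three input-side usage fields (input_tokens, cache_creation_input_tokens, cache_read_input_tokens).

```python
def cache_hit_ratio(usage: dict) -> float:
    """Fraction of input tokens served from cache for one response.

    Assumption: total input = uncached + cache writes + cache reads.
    Missing or None fields are treated as zero.
    """
    read = usage.get("cache_read_input_tokens") or 0
    written = usage.get("cache_creation_input_tokens") or 0
    uncached = usage.get("input_tokens") or 0
    total = read + written + uncached
    return read / total if total else 0.0


# Hypothetical usage payload for a request that hit a warm cache.
sample_usage = {
    "input_tokens": 50,
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 9950,
    "output_tokens": 300,
}

print(f"{cache_hit_ratio(sample_usage):.3f}")  # 0.995
```

Tracking this value per request, rather than only in aggregate, makes it easier to spot a prefix that silently stopped caching (for example, one that fell below the minimum cacheable size).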

Adjacent tracks (the natural next destinations)

  • Track 20 (AI Agents and Tool Use): pick this if you want the full track-level depth on context engineering for agents, including the principles essay this lesson cites (Effective context engineering for AI agents).
  • Track 21 (LLM Ops and Production): pick this if you want the provider-agnostic view of cost-and-latency monitoring (lesson 7, LLMOps). The cache-hit-ratio telemetry this lesson recommends maps onto the observability pillar covered there.

The cost-and-staleness layer the rest of T22 relies on:

  • Lesson 1 (first call): the iterate-the-content-array discipline now includes compaction blocks as another content type that lands in the response, and usage.iterations as the new place where compaction-step billing surfaces.
  • Lesson 2 (production patterns): L2’s stop_reason dispatch enumerated end_turn, max_tokens, stop_sequence, and tool_use; L5 added pause_turn (a server-tool turn paused partway through its iterations); this lesson adds model_context_window_exceeded and the “compaction” value. Per-request error handling from L2 also extends to the content: null compaction-summary edge case introduced here.
  • Lesson 3 (model selection): the cache multipliers stack onto the per-model input price from L3; the 1.25x (5-minute write), 2.0x (1-hour write), and 0.1x (read) multipliers apply to whatever per-MTok rate the chosen model has. The minimum-cacheable-size floor varies by model (4,096 on Opus 4.7 + Haiku 4.5; 1,024 on Sonnet 4.6 + Opus 4.8).
  • Lesson 4 (custom client tools): cache_control on tool definitions in the tools array applies to L4’s custom tools, often the second-highest-leverage breakpoint after the system prompt.
  • Lesson 5 (server tools + Anthropic-schema + tool_search): cache breakpoints apply to L5’s tool entries too; defer_loading keeps tool definitions out of the prefix when unused, which is complementary to caching.
  • Lesson 6 (MCP connector): the mcp_toolset entry’s cache_control field is the seam where MCP tool definitions enter the cached prefix.
  • Lesson 8 onward (agent loop): everything cached and curated here keeps the loop affordable across many iterations; tool result clearing becomes especially load-bearing in tool-heavy agent loops.
  • Lesson 12 (shipping): cache_read_input_tokens / total and usage.iterations are the production telemetry signals for measuring whether the cost-and-staleness posture is actually working.
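
How the multipliers from Lesson 3 stack onto a base input price can be worked through numerically. The sketch below is illustrative only: the 3.00 USD per-MTok base rate, the token counts, and the function name input_cost_usd are hypothetical; only the 1.25x, 2.0x, and 0.1x multipliers come from the prompt-caching docs quoted in this lesson.

```python
# Multipliers quoted from the prompt-caching docs.
WRITE_5M = 1.25   # 5-minute cache write
WRITE_1H = 2.0    # 1-hour cache write
READ = 0.1        # cache read


def input_cost_usd(base_per_mtok, uncached, cache_write, cache_read,
                   write_multiplier=WRITE_5M):
    """Input-side cost of one request, in USD, under the multipliers above."""
    per_token = base_per_mtok / 1_000_000
    weighted = uncached + cache_write * write_multiplier + cache_read * READ
    return per_token * weighted


BASE = 3.00       # hypothetical per-MTok input price, not a real quote
PREFIX = 10_000   # hypothetical cached prefix (system prompt + tools)
FRESH = 200       # hypothetical uncached tokens per request

first = input_cost_usd(BASE, FRESH, cache_write=PREFIX, cache_read=0)
later = input_cost_usd(BASE, FRESH, cache_write=0, cache_read=PREFIX)
no_cache = input_cost_usd(BASE, FRESH + PREFIX, cache_write=0, cache_read=0)

print(f"first request (cache write): {first:.4f}")    # 0.0381
print(f"later request (cache read):  {later:.4f}")    # 0.0036
print(f"same request, no caching:    {no_cache:.4f}")  # 0.0306
```

Under these assumed numbers the first request costs more than an uncached one, and the write premium is recovered on the second request: two cached requests total about 0.0417 USD against about 0.0612 USD uncached.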