Skip to content

Summary: Prompt caching and context management

The cost-and-staleness lever for a tool-heavy session. Prompt caching (the docs verbatim: Prompt caching optimizes your API usage by allowing resuming from specific prefixes in your prompts) attaches the cache_control field with type set to ephemeral (optional ttl of 1h) to the system prompt, tool definitions across L4 + L5 + L6 (including mcp_toolset entries), and stable message blocks. Up to four explicit breakpoints per request; placement order is toolssystemmessages. Pricing multipliers off base input: 1.25x for the 5-minute write, 2.0x for the 1-hour write, 0.1x for cache hits (about 90 percent off on reads). Break-even is one hit at 5 minutes, roughly two to four at 1 hour. Minimum cacheable size: 4,096 tokens on Opus 4.7 / 4.6 / 4.5 / Haiku 4.5; 1,024 on Opus 4.8 / Sonnet 4.6 / Sonnet 4.5. Caching is exact-match (one character change invalidates the prefix) and workspace-isolated on Claude API + AWS + Foundry. Usage tells you what hit: cache_creation_input_tokens + cache_read_input_tokens + input_tokens = total input. Context windows: 1M tokens on Opus 4.8 / Mythos / 4.7 / 4.6 / Sonnet 4.6, 200K on Sonnet 4.5 and older. The docs name context rot explicitly: as token count grows, accuracy and recall degrade … curating what’s in context just as important as how much space is available. Overflow on Claude 4.5+ models is a clean stop_reason: model_context_window_exceeded. Compaction (beta compact-2026-01-12; edit compact_20260112; default trigger 150K input tokens) is the recommended long-session strategy: the API summarizes older history into a compaction block, then drops prior messages on subsequent turns. Cache the system prompt separately so it survives compaction; an extra sampling step is billed under usage.iterations. Context editing (beta context-management-2025-06-27) is the surgical alternative: clear_tool_uses_20250919 for agentic workflows (use clear_at_least so cache invalidation pays back; placeholder text replaces cleared tool_result / mcp_tool_result blocks); clear_thinking_20251015 for extended-thinking sessions (the keep parameter; thinking-block clearing preserves more cache than tool-result clearing). Unifying frame: cache the stable parts (cost), compact the long parts (staleness), clear the heavy parts (surgical). Reach for each deliberately; do not turn them all on by default. Phase 2 closes here; L8 opens Phase 3 by turning the per-step capability set into an agent loop where everything cached and pruned in this lesson keeps the loop affordable across many iterations.

  • Prompt caching is the cost lever. cache_control on system text blocks, tools entries (custom + server + Anthropic-schema + mcp_toolset), and messages content blocks. Four explicit breakpoints max; placement order toolssystemmessages. 20-block lookback from each breakpoint.
  • Pricing math: 1.25x write (5m), 2.0x write (1h), 0.1x hits/refreshes. 5-minute pays back at one hit; 1-hour pays back at roughly two to four hits. Minimum cacheable size 4,096 tokens (Opus 4.7 / 4.6 / 4.5 / Haiku 4.5) or 1,024 (Opus 4.8 / Sonnet 4.6 / Sonnet 4.5). Exact-match; workspace-isolated on Claude API + AWS + Foundry.
  • usage fields: cache_creation_input_tokens + cache_read_input_tokens + input_tokens = total input. Watch the read-to-total ratio (often 90 percent+ on a well-cached production stack).
  • Context windows: 1M tokens on Opus 4.8 / Mythos / 4.7 / 4.6 / Sonnet 4.6 (Claude API + Bedrock + Vertex); 200K on Sonnet 4.5 and older. Context rot is real: curation matters as much as raw size. Context awareness on Sonnet 4.6 / Sonnet 4.5 / Haiku 4.5 gives the model an explicit token budget. Overflow on 4.5+: stop_reason: model_context_window_exceeded.
  • Compaction (recommended long-session default): beta compact-2026-01-12 + compact_20260112 edit; default trigger 150K input tokens. Replaces older history with a compaction block; on subsequent requests, prior messages are dropped automatically. pause_after_compaction lets you preserve recent turns verbatim. Cache the system prompt at the end with cache_control so it survives compaction. Same model writes the summary; cost shows up under usage.iterations.
  • Context editing (surgical alternative): beta context-management-2025-06-27. clear_tool_uses_20250919 clears old tool_result / mcp_tool_result blocks in chronological order (with placeholder text); clear_at_least keeps cache invalidation worth the write cost. clear_thinking_20251015 clears thinking blocks; the keep parameter controls depth. Default to compaction; reach for editing when a specific block class needs to leave while the rest stays verbatim.
  • Cache-interaction summary: putting a cache_control breakpoint at the end of the system prompt is the move that makes compaction + caching coexist (system survives compaction). Tool result clearing invalidates the cache at the clear point; thinking-block clearing preserves cache when blocks are kept.
  • Trade-off: caching saves money on hits but costs 25 to 100 percent more on the write turn; compaction costs an extra sampling step per crossing; tool result clearing costs cache invalidation on each clear. None is free; pick the lever that matches the session’s actual shape.
  • Where this fits: Phase 2 closer; lesson 8 opens Phase 3 with the agent loop where cached + curated context keeps long-running iterations affordable.

Before this lesson, every turn of a tool-heavy Claude application paid the full input-token price for the entire stable prefix, and any conversation that ran long either hit the context-window limit hard or quietly suffered context rot. After this lesson, three production-grade levers are available: caching for the parts that do not change (the system prompt and tool stack are usually the highest-leverage breakpoints; about a 90 percent input-cost reduction on hits is realistic with stable prefixes), compaction for sessions that will run long (enabled with one beta header and one context_management entry; pair with a system-prompt cache breakpoint so the system survives), and context editing for agentic workflows where old tool results dominate the context (server-side clearing with placeholder substitution; pair with clear_at_least so cache invalidation pays back). The highest-leverage move this week: instrument usage.cache_read_input_tokens / total in your production telemetry; if the ratio is below 50 percent on stable-prefix workloads, add a single cache_control breakpoint to your system prompt and watch the bill drop. The second-highest: if your sessions regularly approach the 100K-token mark, opt in to compaction with a 150K trigger and a cached system, and measure the focus improvement (production users notice the model staying on task in long sessions). Lesson 8 (next) opens Phase 3 by turning your now-cached, curatable capability set into a multi-turn agent loop.