Summary: Prompt caching and context management

The cost-and-staleness lever for a tool-heavy session. Prompt caching (the docs verbatim: Prompt caching optimizes your API usage by allowing resuming from specific prefixes in your prompts) attaches the cache_control field with type set to ephemeral (optional ttl of 1h) to the system prompt, tool definitions across L4 + L5 + L6 (including mcp_toolset entries), and stable message blocks. Up to four explicit breakpoints per request; placement order is tools → system → messages. Pricing multipliers off base input: 1.25x for the 5-minute write, 2.0x for the 1-hour write, 0.1x for cache hits (about 90 percent off on reads). Break-even is one hit at 5 minutes, roughly two to four at 1 hour. Minimum cacheable size: 4,096 tokens on Opus 4.7 / 4.6 / 4.5 / Haiku 4.5; 1,024 on Opus 4.8 / Sonnet 4.6 / Sonnet 4.5. Caching is exact-match (one character change invalidates the prefix) and workspace-isolated on Claude API + AWS + Foundry. Usage tells you what hit: cache_creation_input_tokens + cache_read_input_tokens + input_tokens = total input. Context windows: 1M tokens on Opus 4.8 / Mythos / 4.7 / 4.6 / Sonnet 4.6, 200K on Sonnet 4.5 and older. The docs name context rot explicitly: as token count grows, accuracy and recall degrade … curating what’s in context just as important as how much space is available. Overflow on Claude 4.5+ models is a clean stop_reason: model_context_window_exceeded. Compaction (beta compact-2026-01-12; edit compact_20260112; default trigger 150K input tokens) is the recommended long-session strategy: the API summarizes older history into a compaction block, then drops prior messages on subsequent turns. Cache the system prompt separately so it survives compaction; an extra sampling step is billed under usage.iterations. Context editing (beta context-management-2025-06-27) is the surgical alternative: clear_tool_uses_20250919 for agentic workflows (use clear_at_least so cache invalidation pays back; placeholder text replaces cleared tool_result / mcp_tool_result blocks); clear_thinking_20251015 for extended-thinking sessions (the keep parameter; thinking-block clearing preserves more cache than tool-result clearing). Unifying frame: cache the stable parts (cost), compact the long parts (staleness), clear the heavy parts (surgical). Reach for each deliberately; do not turn them all on by default. Phase 2 closes here; L8 opens Phase 3 by turning the per-step capability set into an agent loop where everything cached and pruned in this lesson keeps the loop affordable across many iterations.

Core ideas

Prompt caching is the cost lever. cache_control on system text blocks, tools entries (custom + server + Anthropic-schema + mcp_toolset), and messages content blocks. Four explicit breakpoints max; placement order tools → system → messages. 20-block lookback from each breakpoint.
Pricing math: 1.25x write (5m), 2.0x write (1h), 0.1x hits/refreshes. 5-minute pays back at one hit; 1-hour pays back at roughly two to four hits. Minimum cacheable size 4,096 tokens (Opus 4.7 / 4.6 / 4.5 / Haiku 4.5) or 1,024 (Opus 4.8 / Sonnet 4.6 / Sonnet 4.5). Exact-match; workspace-isolated on Claude API + AWS + Foundry.
usage fields: cache_creation_input_tokens + cache_read_input_tokens + input_tokens = total input. Watch the read-to-total ratio (often 90 percent+ on a well-cached production stack).
Context windows: 1M tokens on Opus 4.8 / Mythos / 4.7 / 4.6 / Sonnet 4.6 (Claude API + Bedrock + Vertex); 200K on Sonnet 4.5 and older. Context rot is real: curation matters as much as raw size. Context awareness on Sonnet 4.6 / Sonnet 4.5 / Haiku 4.5 gives the model an explicit token budget. Overflow on 4.5+: stop_reason: model_context_window_exceeded.
Compaction (recommended long-session default): beta compact-2026-01-12 + compact_20260112 edit; default trigger 150K input tokens. Replaces older history with a compaction block; on subsequent requests, prior messages are dropped automatically. pause_after_compaction lets you preserve recent turns verbatim. Cache the system prompt at the end with cache_control so it survives compaction. Same model writes the summary; cost shows up under usage.iterations.
Context editing (surgical alternative): beta context-management-2025-06-27. clear_tool_uses_20250919 clears old tool_result / mcp_tool_result blocks in chronological order (with placeholder text); clear_at_least keeps cache invalidation worth the write cost. clear_thinking_20251015 clears thinking blocks; the keep parameter controls depth. Default to compaction; reach for editing when a specific block class needs to leave while the rest stays verbatim.
Cache-interaction summary: putting a cache_control breakpoint at the end of the system prompt is the move that makes compaction + caching coexist (system survives compaction). Tool result clearing invalidates the cache at the clear point; thinking-block clearing preserves cache when blocks are kept.
Trade-off: caching saves money on hits but costs 25 to 100 percent more on the write turn; compaction costs an extra sampling step per crossing; tool result clearing costs cache invalidation on each clear. None is free; pick the lever that matches the session’s actual shape.
Where this fits: Phase 2 closer; lesson 8 opens Phase 3 with the agent loop where cached + curated context keeps long-running iterations affordable.

What changes for you

Before this lesson, every turn of a tool-heavy Claude application paid the full input-token price for the entire stable prefix, and any conversation that ran long either hit the context-window limit hard or quietly suffered context rot. After this lesson, three production-grade levers are available: caching for the parts that do not change (the system prompt and tool stack are usually the highest-leverage breakpoints; about a 90 percent input-cost reduction on hits is realistic with stable prefixes), compaction for sessions that will run long (enabled with one beta header and one context_management entry; pair with a system-prompt cache breakpoint so the system survives), and context editing for agentic workflows where old tool results dominate the context (server-side clearing with placeholder substitution; pair with clear_at_least so cache invalidation pays back). The highest-leverage move this week: instrument usage.cache_read_input_tokens / total in your production telemetry; if the ratio is below 50 percent on stable-prefix workloads, add a single cache_control breakpoint to your system prompt and watch the bill drop. The second-highest: if your sessions regularly approach the 100K-token mark, opt in to compaction with a 150K trigger and a cached system, and measure the focus improvement (production users notice the model staying on task in long sessions). Lesson 8 (next) opens Phase 3 by turning your now-cached, curatable capability set into a multi-turn agent loop.