Skip to content

Summary: Shipping a Claude application

The track closer. L1-L11 built the substrate; this lesson is what changes when the result runs for real users behind a deploy. Five disciplines: (1) Cost monitoring. Per-call telemetry on the usage object (log request_id / model / stop_reason / usage.input_tokens / usage.output_tokens / usage.cache_creation_input_tokens / usage.cache_read_input_tokens / usage.iterations / latency on every response). Cache-hit ratio (cache_read_input_tokens / total input) is the headline metric, often 90 percent+ on well-cached production. Plus the organization-level Usage and Cost Admin API (verbatim: programmatic and granular access to historical API usage and cost data for your organization): GET /v1/organizations/usage_report/messages (token counts; group by model / workspace_id / api_key_id / service_tier / context_window / inference_geo / speed; bucket widths 1m / 1h / 1d) and GET /v1/organizations/cost_report (USD costs; daily granularity). Requires an Admin API key (sk-ant-admin…; distinct from regular API keys). Data appears within about 5 minutes; one-per-minute polling sustained. Priority Tier costs never appear in the Cost endpoint (different billing model); not available on Claude Platform on AWS; not available on individual accounts. The four cost levers (model selection L3, effort dial L3, prompt caching L7, batches L2) plus per-subagent model L11 cover the cost picture. (2) Latency budgets per surface (TTFT for chat; full-response for long generations; throughput for batches). Levers: streaming (L2), effort (L3), caching (L7), routing (L9.2), Subagent parallelization (L9.3 + L11), batches (L2). Never tune latency without a budget. (3) Eval-set discipline. Held-out test set as the deploy gate; cross-ref T21 L7 LLMOps; pattern 5 (evaluator-optimizer from L9) is the loop. Every model swap, prompt edit, tool addition runs through the eval set before traffic shifts. (4) Rollout = feature flags + canary (1 percent → 10 percent → 100 percent) + A/B against current production + rehearsed rollback. request_id logging (L2) makes incident response tractable; model name + system prompt + tool list + cache_control placement + MCP servers + Subagent configs all live behind flags. (5) Lifecycle handling. Anthropic deprecation policy: Active → Legacy → Deprecated → Retired. At least 60 days notice for publicly released models (verbatim). Audit via Console Usage Export. Date-pinned IDs in production (4.6 generation and later are dateless and already pinned; pre-4.6 use date-suffixed). One smaller deprecation: temperature / top_p / top_k return 400 on Opus 4.7+ at non-default values. The rollout checklist pulls L1-L11 into a single deploy gate: smallest primitive solid (L1); production patterns (L2: streaming + batches + stop_reason + request_id); model + effort by eval (L3); tools deliberate (L4-L6); caching + compaction + clearing live (L7); agent loop disciplines (L8); pattern by decision tree (L9); Skills + harness chosen (L10); right primitive per workload (L11); cost dashboards live; latency budget tracked; eval set as deploy gate; feature flags + canary + rollback; deprecation watch. The track is closed. Lesson 1 had you make a call. Lesson 12 ships the application.

  • Per-call telemetry on every Claude response. Log request_id (L2) + model + stop_reason + usage.input_tokens + usage.output_tokens + usage.cache_creation_input_tokens + usage.cache_read_input_tokens + usage.iterations (when compaction enabled) + request latency. Pair with user_id, session_id, feature_flag_state.
  • Cache-hit ratio = cache_read_input_tokens / total input tokens. Headline metric; often 90 percent+ on well-cached production with stable prefixes. Per-workload tracking is the highest-leverage cost-monitoring habit.
  • Usage and Cost Admin API: GET /v1/organizations/usage_report/messages (token counts; bucket 1m / 1h / 1d) + GET /v1/organizations/cost_report (USD; daily). Admin API key (sk-ant-admin…) required, distinct from regular keys. Data ~5 minutes fresh; polling ≤ once per minute. Group by model / workspace_id / api_key_id / service_tier / context_window / inference_geo / speed.
  • Two caveats: Priority Tier costs use a different billing model and never appear in the Cost endpoint (track Priority Tier usage in the Usage endpoint); programmatic Usage and Cost API NOT available on Claude Platform on AWS (Console only there).
  • Four cost levers from earlier lessons: model selection (L3), effort dial (L3), prompt caching (L7), batches (L2). Plus per-subagent model (L11) for orchestrator-workers.
  • Latency budgets per surface, not per call. Streaming (L2) cuts TTFT perceptually; effort (L3) trades cost-quality-latency; caching (L7) cuts TTFT on hits; routing (L9.2) sends easy queries to faster models; Subagents parallelize (L11); batches are the “no latency requirement” lever.
  • Eval-set discipline as deploy gate. Build held-out test set; run every change against it before shipping. T21 L7 LLMOps is the deeper playbook; L9 pattern 5 (evaluator-optimizer) is the loop.
  • Rollout four moves: feature flags (config-not-deploy), canary (1 percent → 10 percent → 100 percent gated by eval + dashboards), A/B against current production (pick cheapest that passes), rehearsed rollback (flag toggles back in one motion). request_id logging from L2 enables incident response.
  • Anthropic deprecation policy: four states (Active, Legacy, Deprecated, Retired); at least 60 days notice for publicly released models (verbatim); audit via Console Usage Export. Date-pinned IDs for production (L3): 4.6 generation + later are dateless and already pinned; pre-4.6 use date-suffixed.
  • temperature / top_p / top_k deprecated on Opus 4.7+ (including 4.8): return 400 on non-default values. Omit; use prompting to guide behavior.
  • The rollout checklist is the deploy gate. Pulls L1-L11 plus the five disciplines into one list. Treat as required, not aspirational.
  • Where this fits: Phase 4 / lesson 12 / track closer. Closes the track.

Before this lesson, what you had was a working prototype. After this lesson, you have the production disciplines to ship and maintain it: per-call telemetry on every Claude response (with request_id and cache-hit ratio as the headline metrics), an organization-level Usage and Cost API for cross-call aggregation, per-surface latency budgets, an eval set as the deploy gate, a feature-flag plus canary plus rollback rollout pattern, and a deprecation watch with at least 60 days runway. The two highest-leverage takeaways for the week: (1) add cache-hit-ratio tracking to your production telemetry today (it is one line per request in your existing logging). If the ratio is below 70 percent on stable-prefix workloads, fix the cache placement before any further optimization, because it is almost certainly the largest single cost lever you have not pulled. (2) If you do not have an eval set yet, build the smallest possible one (5-10 prompts, a rubric per prompt) and put it in front of the next change you make, even if it is “just” a prompt tweak. The discipline of running changes through an eval before shipping is what prevents the silent-quality-regression class of production failure that dominates Claude-integration incidents. Track 22 is closed; Track 21 lesson 7 LLMOps and Track 20 (AI Agents and Tool Use) are the deeper next destinations.