Cheatsheet: LLMOps

| Pillar | What it does |
| --- | --- |
| Observability | Log every request with enough fields to debug |
| Evaluation in production | Sample + score live; A/B test changes; surface poor responses for human labeling |
| Prompt versioning | Treat prompts as code (source control + version + registry); record in logs |
| Cost + latency monitoring | Dashboards + budget alerts + p50/p95 latency regression alerts |
| Regression testing | Suite of 100s-1000s of examples; run before every change |
- prompt (or prompt id)
- model + parameters (temperature, max_tokens, ...)
- prompt version
- retrieved context ids + tool calls + their results
- response
- token counts (input + output)
- latency (TTFT + total)
- cost (computed from tokens × provider rate)
- user feedback signal (thumbs / edit / none)
- anonymized trace id

Without these fields, debugging and evaluation are guesswork.
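As a minimal sketch of the field list above, one log record could be modeled as a dataclass and written as a single JSON line. The model name `example-model`, the rates in `RATES_PER_1K`, and every field name are illustrative assumptions, not any provider's or platform's schema.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field

# Hypothetical per-1K-token rates in USD; substitute your provider's price list.
RATES_PER_1K = {"example-model": {"input": 0.001, "output": 0.002}}


def compute_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost = tokens x provider rate, summed over input and output tokens."""
    rate = RATES_PER_1K[model]
    return (input_tokens / 1000) * rate["input"] + (output_tokens / 1000) * rate["output"]


@dataclass
class LLMLogRecord:
    prompt_id: str
    prompt_version: str
    model: str
    params: dict            # temperature, max_tokens, ...
    context_ids: list       # ids of retrieved context
    tool_calls: list        # tool calls and their results
    response: str
    input_tokens: int
    output_tokens: int
    ttft_ms: float          # time to first token
    total_ms: float         # total latency
    feedback: str = "none"  # "thumbs_up" / "thumbs_down" / "edit" / "none"
    # Random id, so the trace carries no user identity.
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)

    def to_json(self) -> str:
        """Serialize the record as one JSON log line, with cost attached."""
        record = asdict(self)
        cost = compute_cost(self.model, self.input_tokens, self.output_tokens)
        record["cost_usd"] = round(cost, 6)
        return json.dumps(record)


record = LLMLogRecord(
    prompt_id="support_answer",
    prompt_version="v3",
    model="example-model",
    params={"temperature": 0.2, "max_tokens": 512},
    context_ids=["doc-17"],
    tool_calls=[],
    response="You can request a refund from the orders page.",
    input_tokens=1200,
    output_tokens=300,
    ttft_ms=180.0,
    total_ms=950.0,
)
print(record.to_json())
```

Computing cost at write time keeps dashboards simple, at the price of re-deriving it if provider rates change later.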

| Move | What it gives you |
| --- | --- |
| Sample 1-5% of responses | A small, manageable stream of live examples to review |
| LLM-as-judge with rubric | Cheap automatic scoring per sample |
| Human review for low scores / negative feedback | Ground truth for the edge cases |
| Labels feed back into offline test set | Suite grows with the application |
| A/B test prompt/pipeline changes | Causal evidence that the change helped |
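
The sampling and review moves above might be wired together roughly as follows. This is a sketch under assumptions: the 2% rate, the 1-5 rubric scale, the review threshold, and `judge_stub` (a stand-in for a real LLM-as-judge call) are all hypothetical.

```python
import hashlib

SAMPLE_RATE = 0.02    # 2%, inside the 1-5% range above
REVIEW_THRESHOLD = 3  # hypothetical: rubric scores run 1-5; below 3 goes to a human


def in_sample(trace_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Deterministically select roughly `rate` of traces by hashing the trace id."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


def judge_stub(prompt: str, response: str) -> int:
    """Placeholder for an LLM-as-judge call returning a 1-5 rubric score."""
    return 1 if not response.strip() else 4


def triage(trace_id, prompt, response, feedback="none",
           judge=judge_stub, rate=SAMPLE_RATE) -> str:
    """Return 'skipped', 'ok', or 'human_review' for one logged response."""
    if feedback == "thumbs_down":
        return "human_review"  # negative feedback is always reviewed
    if not in_sample(trace_id, rate):
        return "skipped"
    score = judge(prompt, response)
    return "human_review" if score < REVIEW_THRESHOLD else "ok"
```

Hashing the trace id rather than calling a random number generator means the same request is always in or out of the sample, which makes reruns and audits reproducible.
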
prompt = constant or registry entry (versioned)
prompt_version = field in every log line
deployment = staged (canary -> full), rollback ready
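
A minimal sketch of the three lines above, assuming an in-process dictionary as the registry. The prompt name, templates, and 10% canary share are hypothetical; a real registry would more likely live in source control or a dedicated service.

```python
import hashlib

# Hypothetical registry: (prompt name, version) -> template.
PROMPTS = {
    ("support_answer", "v1"): "Answer the question: {question}",
    ("support_answer", "v2"): "Answer concisely, citing sources: {question}",
}

# Deployment state per prompt: stable version, canary version, canary traffic share.
DEPLOYMENTS = {
    "support_answer": {"stable": "v1", "canary": "v2", "canary_share": 0.1},
}


def pick_version(name: str, trace_id: str) -> str:
    """Route a fixed share of traffic to the canary, the rest to stable."""
    dep = DEPLOYMENTS[name]
    if dep["canary"] is None:
        return dep["stable"]
    digest = hashlib.sha256(f"{name}:{trace_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return dep["canary"] if bucket < dep["canary_share"] else dep["stable"]


def render(name: str, trace_id: str, **variables):
    """Return (prompt text, prompt_version); log the version with every request."""
    version = pick_version(name, trace_id)
    return PROMPTS[(name, version)].format(**variables), version


def rollback(name: str) -> None:
    """Send all traffic back to the stable version."""
    DEPLOYMENTS[name]["canary"] = None
    DEPLOYMENTS[name]["canary_share"] = 0.0


text, version = render("support_answer", "trace-1", question="Where is my order?")
```

Returning the version alongside the rendered text makes it hard to forget the `prompt_version` field when the log line is written.
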
| Metric | Breakdown |
| --- | --- |
| Per-request cost | by route, model, prompt version, user segment |
| Aggregate cost | daily / weekly / monthly with budget alerts |
| Latency p50 + p95 | regression alerts |
| Cost-per-resolution | when “per job” is the business metric |
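
The metrics above reduce to a few small calculations. The sketch below assumes a nearest-rank percentile, a 20% tolerance before a latency regression alert fires, and a simple monthly budget check; all three are illustrative choices rather than a standard.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("percentile of empty data")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_regressed(baseline_ms, current_ms, pct=95, tolerance=1.2):
    """Alert when the current percentile exceeds the baseline by more than the tolerance."""
    return percentile(current_ms, pct) > tolerance * percentile(baseline_ms, pct)


def over_budget(daily_costs_usd, monthly_budget_usd):
    """Alert when month-to-date spend passes the budget."""
    return sum(daily_costs_usd) > monthly_budget_usd


def cost_per_resolution(total_cost_usd, resolved_count):
    """Cost per unit of business work (for example, per resolved ticket)."""
    if resolved_count == 0:
        return math.inf
    return total_cost_usd / resolved_count


baseline = [100.0] * 100
current = [100.0] * 90 + [500.0] * 10  # the slowest 10% of requests got much slower
print(latency_regressed(baseline, current))  # True: p95 moved from 100 ms to 500 ms
```

Note that the median of `current` is unchanged at 100 ms, which is why the table pairs p50 with p95: a tail regression is invisible in the median.
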
build the suite from production failures (start at 50, grow to 100s-1000s)
score automatically where possible (regex, structured, LLM-as-judge)
run before every prompt / model / pipeline change
refuse merges that regress without explicit override
new production failure -> new test example

A suite run this way makes model upgrades safe.
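
The regression loop above can be sketched as a small runner plus a merge gate. The two examples, the regex checks, and the `generate` callable (standing in for the real prompt + model pipeline) are hypothetical; LLM-as-judge scoring would slot in where the regex check is.

```python
import re

# Hypothetical suite: each example pairs an input with a regex the output must match.
SUITE = [
    {"id": "refund-001", "input": "How do I get a refund?", "must_match": r"refund"},
    {"id": "greeting-001", "input": "hello", "must_match": r"hello|hi"},
]


def run_suite(generate, suite):
    """Run every example through `generate`; return (pass rate, failing ids)."""
    failures = [
        ex["id"]
        for ex in suite
        if not re.search(ex["must_match"], generate(ex["input"]), re.IGNORECASE)
    ]
    return 1 - len(failures) / len(suite), failures


def gate(candidate_rate, baseline_rate, override=False):
    """Allow the merge unless the candidate regresses and nobody explicitly overrides."""
    return override or candidate_rate >= baseline_rate


def add_failure(suite, example_id, user_input, must_match):
    """Turn a new production failure into a permanent test example."""
    suite.append({"id": example_id, "input": user_input, "must_match": must_match})


# Stand-in pipelines: one echoes the input, one returns nothing.
baseline_rate, _ = run_suite(lambda text: text, SUITE)
candidate_rate, failing = run_suite(lambda text: "", SUITE)
print(gate(candidate_rate, baseline_rate))  # False: the candidate fails both examples
```
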

Smallest practical first stack (days, not months)

  1. Expand logging to 7-10 fields
  2. Version the prompt (move out of app.py to a constant or registry)
  3. Build a 50-example regression suite from early production traffic
  4. Adopt an LLM-observability platform (LangSmith / Langfuse / Helicone / Arize Phoenix / WandB Prompts)
  5. Sample 1-5% for in-production evaluation
  6. Add budget + p95 latency regression alerts
  7. Run a first A/B test on the next change
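
For step 7, a first A/B test needs two pieces: stable assignment of users to variants, and a comparison of outcomes. The sketch below assumes the outcome is a binary success signal (for example, a thumbs-up) and uses a two-proportion z-test with a normal approximation; the experiment name and counts are made up for illustration.

```python
import hashlib
import math


def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically assign a user to 'A' (control) or 'B' (treatment)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "B" if bucket < treatment_share else "A"


def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference rate_b - rate_a."""
    rate_a, rate_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 0.0, 1.0
    z = (rate_b - rate_a) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Made-up counts: 400/1000 thumbs-up on prompt A, 460/1000 on prompt B.
z, p_value = two_proportion_z(400, 1000, 460, 1000)
print(f"z={z:.2f}, p={p_value:.4f}")
```

Keying the hash on the experiment name as well as the user id keeps assignments independent across experiments.
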
  • LangSmith (LangChain ecosystem)
  • Langfuse (open-source, self-hostable)
  • Helicone (proxy-based; minimal-code)
  • Arize Phoenix (open-source)
  • Weights & Biases Prompts
  • Gantry (an FSDL instructor’s company; production-LLM-focused)

Plus platform observability (Datadog, Honeycomb, Grafana) for the broader application.

The tools matter less than the discipline.

  • Incident-disclosure policy
  • Vendor-failure / SLA / liability questions
  • Compliance frameworks (SOC 2, ISO, sector-specific regimes)

These topics are real and important, but they require their own framing, in their own forum, with the right people. This lesson covers only the engineering discipline.

  • Observability: enough logs and traces to debug after the fact.
  • Evaluation in production: scoring live responses (not just offline test sets).
  • A/B testing: parallel deployment of two versions to compare causally.
  • Regression suite: held-out examples rerun on every change.
  • Cost-per-resolution: cost per business unit-of-work (vs per token).
  • Full Stack Deep Learning, LLM Bootcamp (Spring 2023): LLMOps. fullstackdeeplearning.com/llm-bootcamp. Independent structural mirror in original prose; see references.