LLMOps: cheatsheet

Five pillars of LLMOps

Pillar	What it does
Observability	Log every request with enough fields to debug
Evaluation in production	Sample + score live; A/B test changes; surface poor responses for human labeling
Prompt versioning	Treat prompts as code (source control + version + registry); record in logs
Cost + latency monitoring	Dashboards + budget alerts + p50/p95 latency regression alerts
Regression testing	Suite of 100s-1000s of examples; run before every change

Per-request log fields (7-10)

- prompt (or prompt id)
- model + parameters (temperature, max_tokens, ...)
- prompt version
- retrieved context ids + tool calls + their results
- response
- token counts (input + output)
- latency (time to first token (TTFT) + total)
- cost (computed from tokens × provider rate)
- user feedback signal (thumbs / edit / none)
- anonymized trace id

Without these fields, debugging and evaluation are guesswork.
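The field list above can be captured in a single record type. The following is a minimal sketch, not a prescribed schema: the field names, the `RequestLog` class, the `example-model` name, and the per-1K-token rates are all illustrative assumptions.

```python
import json
import uuid
from dataclasses import dataclass, field, asdict

# Hypothetical per-1K-token rates in USD; substitute your provider's real prices.
RATES_PER_1K = {"example-model": {"input": 0.0005, "output": 0.0015}}


def compute_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost = tokens x provider rate, computed separately for input and output."""
    rate = RATES_PER_1K[model]
    return (input_tokens / 1000) * rate["input"] + (output_tokens / 1000) * rate["output"]


@dataclass
class RequestLog:
    prompt_id: str
    prompt_version: str
    model: str
    parameters: dict      # temperature, max_tokens, ...
    context_ids: list     # ids of retrieved documents
    tool_calls: list      # tool calls and their results
    response: str
    input_tokens: int
    output_tokens: int
    ttft_ms: float        # time to first token
    total_ms: float       # total latency
    feedback: str = "none"  # "thumbs_up" / "thumbs_down" / "edit" / "none"
    # Random id with no user information in it.
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    cost_usd: float = field(init=False, default=0.0)

    def __post_init__(self):
        # Derive cost at log time so dashboards never have to recompute it.
        self.cost_usd = compute_cost(self.model, self.input_tokens, self.output_tokens)

    def to_json(self) -> str:
        """One JSON object per request, suitable for a line-oriented log."""
        return json.dumps(asdict(self))
```

Emitting `to_json()` once per request gives a log line that can be filtered by prompt version, model, or feedback signal later.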

Evaluation-in-production moves

Move	What it gives you
Sample 1-5% of responses	A small, manageable stream of live examples to review
LLM-as-judge with rubric	Cheap automatic scoring per sample
Human review for low scores / negative feedback	Ground truth for the edge cases
Labels feed back into offline test set	Suite grows with the application
A/B test prompt/pipeline changes	Causal evidence that the change helped
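The sampling and review moves above might be wired together roughly as follows. This is a sketch under assumptions: the `judge` callable stands in for an LLM-as-judge call returning a rubric score (a 1-5 scale is assumed here), and the record keys, the `thumbs_down` label, and the 3.0 threshold are hypothetical.

```python
import hashlib

SAMPLE_RATE = 0.02  # inside the 1-5% range from the table above


def in_sample(trace_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Deterministically select a fraction of traces by hashing the trace id."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


def evaluate(records, judge, rate: float = SAMPLE_RATE, threshold: float = 3.0):
    """Return the records that should go to human review.

    records: iterable of dicts with "trace_id", "response", "feedback".
    judge:   callable(record) -> rubric score (assumed scale: 1-5).
    """
    review_queue = []
    for rec in records:
        negative = rec.get("feedback") == "thumbs_down"
        # Negative feedback is always looked at; everything else is sampled.
        if not (negative or in_sample(rec["trace_id"], rate)):
            continue
        score = judge(rec)
        if negative or score < threshold:
            review_queue.append({**rec, "judge_score": score})
    return review_queue
```

Hashing the trace id, rather than calling a random number generator, keeps the sample reproducible: the same request is always in or out of the sample, which helps when re-running an analysis.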

Prompt versioning (lesson 3 + lesson 7)

prompt = constant or registry entry (versioned)
prompt_version = field in every log line
deployment = staged (canary -> full), rollback ready
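One way to realize the three lines above is a small in-code registry plus a deployment table. The sketch below is illustrative only: `PromptEntry`, `REGISTRY`, `DEPLOYMENT`, the `summarize` prompt, and the 5% canary share are assumptions, and a real registry might live in a file or a service instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptEntry:
    name: str
    version: str
    template: str


# Hypothetical registry keyed by (name, version); entries are immutable.
REGISTRY = {
    ("summarize", "1.0.0"): PromptEntry("summarize", "1.0.0", "Summarize:\n{text}"),
    ("summarize", "1.1.0"): PromptEntry(
        "summarize", "1.1.0", "Summarize in three bullets:\n{text}"
    ),
}

# Staged rollout: stable version, canary version, and the canary's traffic share.
# Rollback is a one-line change: set canary_share to 0.0.
DEPLOYMENT = {"summarize": {"stable": "1.0.0", "canary": "1.1.0", "canary_share": 0.05}}


def pick_version(name: str, bucket: float) -> str:
    """bucket is a number in [0, 1), e.g. derived from a hash of the user id."""
    d = DEPLOYMENT[name]
    return d["canary"] if bucket < d["canary_share"] else d["stable"]


def render(name: str, bucket: float, **variables):
    """Return (prompt_text, prompt_version); the version goes into the log line."""
    entry = REGISTRY[(name, pick_version(name, bucket))]
    return entry.template.format(**variables), entry.version
```

Returning the version alongside the rendered text makes it hard to forget the `prompt_version` field when writing the log line.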

Cost + latency dashboards

Metric	Breakdown
Per-request cost	by route, model, prompt version, user segment
Aggregate cost	daily / weekly / monthly with budget alerts
Latency p50 + p95	regression alerts
Cost-per-resolution	when “per job” is the business metric
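The metrics in this table reduce to a few small calculations. The sketch below uses only the Python standard library; the function names, the 10% regression tolerance, and the 80% budget warning level are assumptions chosen for illustration.

```python
import statistics


def latency_percentiles(latencies_ms):
    """p50 and p95 from a list of per-request latencies (needs >= 2 samples)."""
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94]}  # cuts[k-1] is the k-th percentile


def latency_regressed(current_p95, baseline_p95, tolerance=0.10):
    """Flag a regression when p95 exceeds the baseline by more than the tolerance."""
    return current_p95 > baseline_p95 * (1 + tolerance)


def budget_alert(spend_usd, budget_usd, warn_at=0.8):
    """Return "ok", "warn" (at 80% of budget by default), or "over"."""
    if spend_usd >= budget_usd:
        return "over"
    if spend_usd >= warn_at * budget_usd:
        return "warn"
    return "ok"


def cost_per_resolution(total_cost_usd, resolved_jobs):
    """Cost per business unit of work; None when nothing was resolved."""
    if resolved_jobs == 0:
        return None
    return total_cost_usd / resolved_jobs
```

For example, `cost_per_resolution(12.0, 48)` gives 0.25: 12 USD of model spend across 48 resolved jobs is 25 cents per resolution, regardless of how many requests each job took.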

Regression testing

build the suite from production failures (start at 50, grow to 100s-1000s)
score automatically where possible (regex, structured, LLM-as-judge)
run before every prompt / model / pipeline change
refuse merges that regress without explicit override
new production failure -> new test example

This is what makes model upgrades safe.
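The steps above can be sketched as a pass-rate gate. Everything named here is hypothetical: the two suite entries, the check types, and the `generate` callable, which stands in for the prompt + model + pipeline under test. LLM-as-judge scoring is left out to keep the sketch self-contained.

```python
import re

# Each example comes from a (hypothetical) production failure.
SUITE = [
    {
        "id": "date-format",
        "input": "When is the deadline?",
        "check": ("regex", r"\d{4}-\d{2}-\d{2}"),  # must contain an ISO date
    },
    {
        "id": "refund-keyword",
        "input": "Can I get my money back?",
        "check": ("contains", "refund"),
    },
]


def passes(output: str, check) -> bool:
    """Score one output automatically against its check."""
    kind, arg = check
    if kind == "regex":
        return re.search(arg, output) is not None
    if kind == "contains":
        return arg.lower() in output.lower()
    raise ValueError(f"unknown check type: {kind}")


def pass_rate(generate, suite=SUITE) -> float:
    """Run every example through the candidate and return the fraction passing."""
    results = [passes(generate(ex["input"]), ex["check"]) for ex in suite]
    return sum(results) / len(results)


def gate(candidate_rate: float, baseline_rate: float, override: bool = False) -> bool:
    """Allow the merge only if the candidate does not regress, unless overridden."""
    return override or candidate_rate >= baseline_rate
```

In a CI job, a false result from `gate` would fail the build; each new production failure becomes one more entry appended to `SUITE`.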

Smallest practical first stack (days, not months)

Expand logging to 7-10 fields
Version the prompt (move out of app.py to a constant or registry)
Build a 50-example regression suite from early production traffic
Adopt an LLM-observability platform (LangSmith / Langfuse / Helicone / Arize Phoenix / Weights & Biases Prompts)
Sample 1-5% for in-production evaluation
Add budget + p95 latency regression alerts
Run a first A/B test on the next change
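For the A/B test in the last step, a minimal sketch of arm assignment and a per-arm summary might look like this. The function names, the experiment key, and the 50/50 split are assumptions. The summary reports means and sample sizes only; deciding whether a difference is real still needs a proper statistical test.

```python
import hashlib
from statistics import mean


def assign_arm(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Stable assignment: the same user always lands in the same arm."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") / 2**64
    return "B" if bucket < treatment_share else "A"


def summarize(results):
    """results: iterable of (arm, score) pairs taken from the request logs.

    Returns {arm: (mean_score or None, sample_count)}.
    """
    by_arm = {"A": [], "B": []}
    for arm, score in results:
        by_arm[arm].append(score)
    return {arm: (mean(s) if s else None, len(s)) for arm, s in by_arm.items()}
```

Including the experiment name in the hash key keeps assignments independent across experiments, so a user in arm B of one test is not systematically in arm B of the next.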

Tools (pick one; the right pick depends on your stack)

LangSmith (LangChain ecosystem)
Langfuse (open-source, self-hostable)
Helicone (proxy-based; minimal-code)
Arize Phoenix (open-source)
Weights & Biases Prompts
Gantry (an FSDL instructor’s company; production-LLM-focused)

Plus platform observability (Datadog, Honeycomb, Grafana) for the broader application.

The tools matter less than the discipline.

What this lesson does NOT cover

Incident-disclosure policy
Vendor-failure / SLA / liability questions
Compliance frameworks (SOC 2, ISO, sector-specific regimes)

These topics are real and important, but they need their own framing, in their own forum, with the right people. This lesson covers only the engineering discipline.

Words to use precisely

Observability: enough logs and traces to debug after the fact.
Evaluation in production: scoring live responses (not just offline test sets).
A/B testing: parallel deployment of two versions to compare causally.
Regression suite: held-out examples rerun on every change.
Cost-per-resolution: cost per business unit-of-work (vs per token).

Source

Full Stack Deep Learning, LLM Bootcamp (Spring 2023): LLMOps. fullstackdeeplearning.com/llm-bootcamp. Independent structural mirror in original prose; see references.