Cheatsheet: LLMOps
Five pillars of LLMOps
| Pillar | What it does |
|---|---|
| Observability | Log every request with enough fields to debug |
| Evaluation in production | Sample + score live; A/B test changes; surface poor responses for human labeling |
| Prompt versioning | Treat prompts as code (source control + version + registry); record in logs |
| Cost + latency monitoring | Dashboards + budget alerts + p50/p95 latency regression alerts |
| Regression testing | Suite of 100s-1000s of examples; run before every change |
Per-request log fields (7-10)
- prompt (or prompt id)
- model + parameters (temperature, max_tokens, ...)
- prompt version
- retrieved context ids + tool calls + their results
- response
- token counts (input + output)
- latency (TTFT + total)
- cost (computed from tokens × provider rate)
- user feedback signal (thumbs / edit / none)
- anonymized trace id
Without these, debugging and evaluation are guessing.
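A minimal sketch of one log record carrying the fields listed above, with cost derived from token counts. The class name `RequestLog`, the model name `example-model`, and the per-1K-token rates are all hypothetical placeholders, not real provider prices.

```python
from dataclasses import dataclass, field, asdict
import json
import uuid

# Hypothetical per-1K-token rates in USD; substitute your provider's real price list.
RATES_PER_1K = {"example-model": {"input": 0.0005, "output": 0.0015}}


@dataclass
class RequestLog:
    prompt_id: str
    prompt_version: str
    model: str
    params: dict          # temperature, max_tokens, ...
    context_ids: list     # ids of retrieved context
    tool_calls: list      # tool calls and their results
    response: str
    input_tokens: int
    output_tokens: int
    ttft_ms: float        # time to first token
    total_ms: float
    feedback: str = "none"  # "thumbs_up" / "thumbs_down" / "edit" / "none"
    # Random id, so the trace carries no user-identifying information.
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)

    @property
    def cost_usd(self) -> float:
        # cost = tokens x provider rate, summed over input and output
        rate = RATES_PER_1K[self.model]
        return (self.input_tokens * rate["input"]
                + self.output_tokens * rate["output"]) / 1000

    def to_json(self) -> str:
        # One JSON object per request, ready to write as a log line.
        record = asdict(self)
        record["cost_usd"] = round(self.cost_usd, 6)
        return json.dumps(record)
```

Emitting one JSON object per request keeps the log queryable by any of these fields later, for example by prompt version or model.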
Evaluation-in-production moves
| Move | What it gives you |
|---|---|
| Sample 1-5% of responses | A small, manageable stream of live examples to review |
| LLM-as-judge with rubric | Cheap automatic scoring per sample |
| Human review for low scores / negative feedback | Ground truth for the edge cases |
| Labels feed back into offline test set | Suite grows with the application |
| A/B test prompt/pipeline changes | Causal evidence that the change helped |
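The sampling, review-routing, and A/B moves in the table can be sketched with a salted hash, so the same request or user always lands in the same bucket. All function names, the 2% default rate, and the 1-5 judge scale with a threshold of 3.0 are assumptions for illustration. The judge itself is left out; only its score is consumed.

```python
import hashlib


def _bucket(key: str, salt: str) -> float:
    """Map a key to a stable number in [0, 1) using a salted hash."""
    digest = hashlib.sha256(f"{salt}:{key}".encode()).hexdigest()
    return int(digest[:8], 16) / 0x100000000


def should_sample(trace_id: str, rate: float = 0.02) -> bool:
    """Pick roughly `rate` of requests for in-production scoring."""
    return _bucket(trace_id, "eval-sample") < rate


def ab_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Assign a user to one arm; salting by experiment decorrelates experiments."""
    return "treatment" if _bucket(user_id, experiment) < treatment_share else "control"


def needs_human_review(judge_score: float, feedback: str, threshold: float = 3.0) -> bool:
    """Route low judge scores or negative user feedback to a human labeler."""
    return judge_score < threshold or feedback == "thumbs_down"
```

Hashing instead of calling a random generator makes assignment reproducible from the logs alone, which helps when analyzing an A/B test after the fact.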
Prompt versioning (lesson 3 + lesson 7)
- prompt = constant or registry entry (versioned)
- prompt_version = field in every log line
- deployment = staged (canary -> full), rollback ready
Cost + latency dashboards
| Metric | Breakdown |
|---|---|
| Per-request cost | by route, model, prompt version, user segment |
| Aggregate cost | daily / weekly / monthly with budget alerts |
| Latency p50 + p95 | regression alerts |
| Cost-per-resolution | when “per job” is the business metric |
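A rough sketch of the checks behind these dashboard rows, using only the standard library. The function names, the 10% regression tolerance, and the budget check are illustrative assumptions; a real setup would likely compute these in the observability platform rather than in application code.

```python
from statistics import quantiles


def percentile(values, pct):
    """Return the pct-th percentile (1-99) using the stdlib's inclusive method."""
    cuts = quantiles(values, n=100, method="inclusive")  # 99 cut points
    return cuts[pct - 1]


def latency_regressed(baseline_ms, current_ms, pct=95, tolerance=0.10):
    """Alert when the current percentile exceeds baseline by more than `tolerance`."""
    return percentile(current_ms, pct) > percentile(baseline_ms, pct) * (1 + tolerance)


def cost_per_resolution(total_cost_usd, resolved_jobs):
    """Cost per unit of business work; None when nothing was resolved."""
    if resolved_jobs == 0:
        return None
    return total_cost_usd / resolved_jobs


def over_budget(daily_costs_usd, monthly_budget_usd):
    """Budget alert: has the running monthly total passed the budget?"""
    return sum(daily_costs_usd) > monthly_budget_usd
```

Comparing against a stored baseline, rather than a fixed threshold, is what turns a latency metric into a regression alert.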
Regression testing
- build the suite from production failures (start at 50, grow to 100s-1000s)
- score automatically where possible (regex, structured, LLM-as-judge)
- run before every prompt / model / pipeline change
- refuse merges that regress without explicit override
- new production failure -> new test example
Makes model upgrades safe.
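One way to sketch the gate described above, using regex scoring only. `Example`, `pass_rate`, and `gate` are hypothetical names, and `generate` stands in for whatever function calls your prompt, model, and pipeline.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Example:
    input: str
    pattern: str  # regex the output must match to count as a pass


def pass_rate(generate: Callable[[str], str], suite: list[Example]) -> float:
    """Run every example through the candidate and return the fraction that pass."""
    passed = sum(1 for ex in suite if re.search(ex.pattern, generate(ex.input)))
    return passed / len(suite)


def gate(candidate_rate: float, baseline_rate: float, override: bool = False) -> bool:
    """Allow the merge only if the candidate does not regress, unless overridden."""
    return override or candidate_rate >= baseline_rate
```

In CI, a failing `gate` would exit non-zero to block the merge. Each new production failure then becomes one more `Example` in the suite.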
Smallest practical first stack (days, not months)
- Expand logging to 7-10 fields
- Version the prompt (move out of `app.py` to a constant or registry)
- Build a 50-example regression suite from early production traffic
- Adopt an LLM-observability platform (LangSmith / Langfuse / Helicone / Arize Phoenix / WandB Prompts)
- Sample 1-5% for in-production evaluation
- Add budget + p95 latency regression alerts
- Run a first A/B test on the next change
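The "version the prompt" step might look like the sketch below: the prompt moves out of application code into a small registry, and rendering returns the fields to record in every log line. The registry layout, the `support_reply` prompt, and its templates are invented for illustration. A real registry might live in a file, a database, or one of the platforms listed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptVersion:
    name: str
    version: str
    template: str


# Hypothetical in-code registry keyed by (name, version).
REGISTRY = {
    ("support_reply", "v1"): PromptVersion(
        "support_reply", "v1", "Answer the customer: {question}"),
    ("support_reply", "v2"): PromptVersion(
        "support_reply", "v2", "Answer briefly and politely: {question}"),
}

# Which version is live; rolling back means pointing this at "v1" again.
ACTIVE = {"support_reply": "v2"}


def render(name: str, **variables) -> tuple[str, dict]:
    """Return the filled-in prompt plus the fields to attach to the request log."""
    prompt = REGISTRY[(name, ACTIVE[name])]
    text = prompt.template.format(**variables)
    log_fields = {"prompt_id": prompt.name, "prompt_version": prompt.version}
    return text, log_fields
```

Keeping old versions in the registry is what makes rollback a one-line change instead of a redeploy of edited prompt text.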
Tools (pick one, picks differ by stack)
- LangSmith (LangChain ecosystem)
- Langfuse (open-source, self-hostable)
- Helicone (proxy-based; minimal-code)
- Arize Phoenix (open-source)
- Weights & Biases Prompts
- Gantry (an FSDL instructor's company; production-LLM-focused)
Plus platform observability (Datadog, Honeycomb, Grafana) for the broader application.
The tools matter less than the discipline.
What this lesson does NOT cover
- Incident-disclosure policy
- Vendor-failure / SLA / liability questions
- Compliance frameworks (SOC 2, ISO, sector-specific regimes)
These are real and important, but they need their own framing, in their own forum, with the right people. This lesson covers only the engineering discipline.
Words to use precisely
- Observability: enough logs and traces to debug after the fact.
- Evaluation in production: scoring live responses (not just offline test sets).
- A/B testing: parallel deployment of two versions to compare causally.
- Regression suite: held-out examples rerun on every change.
- Cost-per-resolution: cost per business unit-of-work (vs per token).
Source
- Full Stack Deep Learning, LLM Bootcamp (Spring 2023): LLMOps. fullstackdeeplearning.com/llm-bootcamp. Independent structural mirror in original prose; see references.