Cheatsheet: Training your own LLM
Decision: should you fine-tune?
Section titled “Decision: should you fine-tune?”Fine-tune ONLY when ALL THREE are true:
1. Prompting consistently fails on a specific recurring task at scale (the "where prompts run out" line from L3)2. Retrieval / tools (L4) do not fix it (failure is "wrong behavior," not "missing knowledge")3. Volume is high enough that inference savings justify upfront cost (L8 economics, L2 productive limits applied at lifetime scale)Most production apps never need to fine-tune. Train-from-scratch is almost never right for an app team (Track 15 territory).
The staged pipeline
Section titled “The staged pipeline”| Stage | What you do |
|---|---|
| 1. Pick an open checkpoint | Llama / Mistral / Phi family; size class fits the task |
| 2. Curate SFT data | Small + high-quality; format-matched (chat template); synthetic-then-filtered with held-out eval |
| 3. LoRA training | TRL or Axolotl; single capable GPU; hours to a day; managed provider (Together/Modal/Lambda/Anyscale) |
| 4. Optional: DPO | Preference tuning when you need ranked-output quality |
| 5. Evaluate + A/B test | Pass the held-out set; A/B test on real traffic; verify cost/latency moved (lessons 2 + 7) |
Tooling
Section titled “Tooling”| Tool | What it is |
|---|---|
| TRL | Hugging Face; SFTTrainer, DPOTrainer |
| Axolotl | Config-driven wrapper over TRL |
| PEFT / LoRA | Parameter-efficient fine-tuning; standard choice |
| Managed compute | Together, Modal, Lambda Labs, Anyscale |
Economics rule
Section titled “Economics rule”hosted_lifetime = hosted_cost_per_call × calls_per_period × periodsfine-tune_lifetime = train_cost + (serving_cost_per_call × calls × periods)
Fine-tune if: fine-tune_lifetime << hosted_lifetime AND quality ≥ hosted (on held-out eval)Worked example. 200K calls/month, 24 months, hosted $0.005/call, fine-tune $2,500, serving $0.0005/call:
- Hosted: 200K × 24 × 0.005 = $24,000
- Fine-tuned: $2,500 + (200K × 24 × 0.0005) = $4,900
- Saving: ~$19,000 (~80% reduction); crossover at ~3 months.
The mix architecture fit
Section titled “The mix architecture fit”Inner sub-tasks (high-volume, narrow): router, classifier, extractor, retriever-rewriter, eval-as-judge -> FINE-TUNED specialized models (smaller, cheaper to serve)
Outer user-facing synthesis (the main response): -> FRONTIER HOSTED modelMost production fine-tunes = one or two specific inner sub-tasks; the rest stays hosted.
When to consider what
Section titled “When to consider what”| Symptom | Reach for |
|---|---|
| Vague answers, format misses | Prompt engineering (L3) |
| Missing knowledge / current data needed | Retrieval / tools (L4) |
| Wrong behavior on a specific task at scale | Fine-tune (this lesson) |
| Need a foundation model nobody else has | Train from scratch (Track 15; almost never the right answer for app teams) |
What this lesson does NOT cover
Section titled “What this lesson does NOT cover”- Training-data policy
- Alignment debates / contested safety claims
- Sector-specific compliance for trained models
Real and important; require their own framing in their own forum with the right stakeholders. This lesson is the engineering (when / how / what it costs / what you get) discipline.
Words to use precisely
Section titled “Words to use precisely”- Fine-tuning: continued training from an existing checkpoint on a small task-specific dataset.
- SFT (supervised fine-tuning): training on
(instruction, response)pairs with the model’s chat template. - DPO (Direct Preference Optimization): simpler successor to RLHF; trains on
(prompt, preferred, dispreferred)pairs directly. - LoRA: parameter-efficient fine-tuning by training small added matrices instead of updating all weights.
- Crossover: the volume × lifetime point where fine-tuned-and-served cost equals hosted cost.
Source
Section titled “Source”- Full Stack Deep Learning, LLM Bootcamp (Spring 2023): How to train your own LLM (guest: Reza Shabani, Replit).
fullstackdeeplearning.com/llm-bootcamp. Independent structural mirror in original prose; see references.