Skip to content

Cheatsheet: Training your own LLM

Fine-tune ONLY when ALL THREE are true:

1. Prompting consistently fails on a specific recurring task at scale
(the "where prompts run out" line from L3)
2. Retrieval / tools (L4) do not fix it
(failure is "wrong behavior," not "missing knowledge")
3. Volume is high enough that inference savings justify upfront cost
(L8 economics, L2 productive limits applied at lifetime scale)

Most production apps never need to fine-tune. Train-from-scratch is almost never right for an app team (Track 15 territory).

StageWhat you do
1. Pick an open checkpointLlama / Mistral / Phi family; size class fits the task
2. Curate SFT dataSmall + high-quality; format-matched (chat template); synthetic-then-filtered with held-out eval
3. LoRA trainingTRL or Axolotl; single capable GPU; hours to a day; managed provider (Together/Modal/Lambda/Anyscale)
4. Optional: DPOPreference tuning when you need ranked-output quality
5. Evaluate + A/B testPass the held-out set; A/B test on real traffic; verify cost/latency moved (lessons 2 + 7)
ToolWhat it is
TRLHugging Face; SFTTrainer, DPOTrainer
AxolotlConfig-driven wrapper over TRL
PEFT / LoRAParameter-efficient fine-tuning; standard choice
Managed computeTogether, Modal, Lambda Labs, Anyscale
hosted_lifetime = hosted_cost_per_call × calls_per_period × periods
fine-tune_lifetime = train_cost + (serving_cost_per_call × calls × periods)
Fine-tune if: fine-tune_lifetime << hosted_lifetime AND quality ≥ hosted
(on held-out eval)

Worked example. 200K calls/month, 24 months, hosted $0.005/call, fine-tune $2,500, serving $0.0005/call:

  • Hosted: 200K × 24 × 0.005 = $24,000
  • Fine-tuned: $2,500 + (200K × 24 × 0.0005) = $4,900
  • Saving: ~$19,000 (~80% reduction); crossover at ~3 months.
Inner sub-tasks (high-volume, narrow):
router, classifier, extractor, retriever-rewriter, eval-as-judge
-> FINE-TUNED specialized models (smaller, cheaper to serve)
Outer user-facing synthesis (the main response):
-> FRONTIER HOSTED model

Most production fine-tunes = one or two specific inner sub-tasks; the rest stays hosted.

SymptomReach for
Vague answers, format missesPrompt engineering (L3)
Missing knowledge / current data neededRetrieval / tools (L4)
Wrong behavior on a specific task at scaleFine-tune (this lesson)
Need a foundation model nobody else hasTrain from scratch (Track 15; almost never the right answer for app teams)
  • Training-data policy
  • Alignment debates / contested safety claims
  • Sector-specific compliance for trained models

Real and important; require their own framing in their own forum with the right stakeholders. This lesson is the engineering (when / how / what it costs / what you get) discipline.

  • Fine-tuning: continued training from an existing checkpoint on a small task-specific dataset.
  • SFT (supervised fine-tuning): training on (instruction, response) pairs with the model’s chat template.
  • DPO (Direct Preference Optimization): simpler successor to RLHF; trains on (prompt, preferred, dispreferred) pairs directly.
  • LoRA: parameter-efficient fine-tuning by training small added matrices instead of updating all weights.
  • Crossover: the volume × lifetime point where fine-tuned-and-served cost equals hosted cost.
  • Full Stack Deep Learning, LLM Bootcamp (Spring 2023): How to train your own LLM (guest: Reza Shabani, Replit). fullstackdeeplearning.com/llm-bootcamp. Independent structural mirror in original prose; see references.