Summary: Training your own LLM

This lesson is the deep dive on the fine-tune point of lesson 8’s build-vs-buy spectrum. Most teams should not train their own model. Stay on hosted models, and fine-tune only when three things are true simultaneously: prompting consistently fails on a specific recurring task at scale; retrieval and tool use do not fix it; and the failing task has high enough volume that inference savings justify the upfront training cost. Training from scratch is almost never right for an application team (that is Track 15 territory).

When you do fine-tune, follow the staged pipeline: start from a strong open checkpoint (Llama, Mistral, or Phi); curate a small, high-quality SFT dataset (often LLM-generated, then filtered); run LoRA training via TRL or Axolotl on one GPU; optionally add DPO for ranked-output quality; evaluate against a held-out set; then A/B test in production per lesson 7. The economics rule compares per-task hosted cost × expected lifetime volume against fine-tune cost plus serving cost over the same period: fine-tune if that path is meaningfully cheaper and quality is at least equal. Fine-tuning fits the mix architecture from lesson 8: small specialized models for high-volume inner sub-tasks (router, classifier, extractor), and a frontier hosted model for the user-facing outer synthesis. The lesson is taught as a technical primer throughout: the mechanical “when and how,” with broader debates explicitly out of scope.
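The three-things-true-at-once test can be written down as a small gate. This is a minimal sketch, not part of the lesson material: the class name, field names, and return strings are all hypothetical, and each boolean stands in for evidence a team would have to gather itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneCase:
    """Evidence for one candidate task. All field names are hypothetical."""
    prompting_fails_consistently: bool  # on a specific recurring task, at scale
    retrieval_and_tools_fix_it: bool    # did retrieval or tool use close the gap?
    volume_justifies_training: bool     # do inference savings repay the upfront cost?


def should_fine_tune(case: FineTuneCase) -> bool:
    """Return True only when all three conditions hold at the same time."""
    return (
        case.prompting_fails_consistently
        and not case.retrieval_and_tools_fix_it
        and case.volume_justifies_training
    )


def recommendation(case: FineTuneCase) -> str:
    """Default to hosted; fine-tuning has to earn its place."""
    return "fine-tune" if should_fine_tune(case) else "stay on hosted"


# A task that prompting cannot fix, that retrieval does not rescue,
# and that runs at high volume passes the gate.
print(recommendation(FineTuneCase(True, False, True)))   # fine-tune
# If retrieval fixes the failure, the gate stays closed.
print(recommendation(FineTuneCase(True, True, True)))    # stay on hosted
```

The gate is deliberately a conjunction: one missing condition is enough to stay on hosted.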

Core ideas

Most teams should not train their own model. The three-things-true-at-once test: prompting fails consistently + retrieval/tools don’t fix it + volume justifies upfront cost.
Train-from-scratch is almost never right for an application team (Track 15 territory).
Staged pipeline: open checkpoint → curated small SFT data (often synthetic, filtered) → LoRA training (TRL/Axolotl, one GPU, hours-to-a-day) → optionally DPO → held-out eval + A/B test in production.
Data is the leverage. Quality > quantity; format-match to the task; synthetic-then-filtered is the cheapest practical source (with Track 15 lesson 12’s teacher-blind-spots caveat); hold out a real eval set.
Economics rule. Per-task hosted cost × expected lifetime volume vs (fine-tune cost + serving cost over same period). Fine-tune if the fine-tuned-and-served path is meaningfully cheaper AND quality is at least equal.
Mix architecture fit. Fine-tune the high-volume narrow inner sub-tasks; keep the user-facing outer synthesis on frontier hosted. Most production fine-tunes are one or two specific inner sub-tasks; the rest stays hosted.
Tooling. TRL (Hugging Face) provides SFTTrainer and DPOTrainer; Axolotl is a config-driven wrapper; LoRA (via the PEFT library) is the standard parameter-efficient approach; managed compute providers (Together, Modal, Lambda Labs, Anyscale) supply the GPUs for the actual training.
Out of scope. Training-data policy, alignment debates, contested safety claims. Same discipline as Track 14 lesson 10 and Track 15 lesson 13.
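The advice above to hold out a real eval set can be sketched as a deterministic split. This is one possible approach under stated assumptions, not the lesson's prescribed method: each example is assumed to be a dict with a stable "id" key, and the function names and the 10% default are illustrative. Hashing the id keeps an example in the same bucket across reruns, so eval examples do not drift into the training data when the dataset is regenerated.

```python
import hashlib
from typing import Dict, List, Sequence, Tuple


def is_held_out(example_id: str, eval_fraction: float = 0.1) -> bool:
    """Assign an example to the eval set based on a hash of its id."""
    digest = hashlib.sha256(example_id.encode("utf-8")).digest()
    # Roughly uniform bucket in 0..9999, stable across runs and machines.
    bucket = int.from_bytes(digest[:4], "big") % 10_000
    return bucket < eval_fraction * 10_000


def split_examples(
    examples: Sequence[Dict], eval_fraction: float = 0.1
) -> Tuple[List[Dict], List[Dict]]:
    """Return (train, held_out). Each example must carry a stable 'id' (assumption)."""
    train: List[Dict] = []
    held_out: List[Dict] = []
    for example in examples:
        target = held_out if is_held_out(example["id"], eval_fraction) else train
        target.append(example)
    return train, held_out


def accuracy(predictions: Sequence, labels: Sequence) -> float:
    """Exact-match accuracy, computed on the held-out set only."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Made-up records purely for illustration.
examples = [{"id": f"ticket-{i}", "label": "billing"} for i in range(1000)]
train, held_out = split_examples(examples)
print(len(train), len(held_out))
```

Scoring both the fine-tuned model and the hosted baseline with the same accuracy function on the same held-out examples gives the "quality is at least equal" comparison something concrete to stand on.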
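The economics rule in the list above is plain arithmetic, so a worked sketch may help. Everything here is an assumption made for illustration: the field names, the dollar figures, and in particular the 30% threshold standing in for "meaningfully cheaper", which the lesson does not quantify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostInputs:
    """Inputs to the comparison. Field names and units are assumptions."""
    hosted_cost_per_task: float    # dollars per call on the hosted model
    lifetime_tasks: int            # expected calls over the comparison period
    fine_tune_cost: float          # one-off dollars: data curation plus training
    serving_cost_per_month: float  # dollars per month to serve the fine-tuned model
    months: int                    # length of the comparison period


def hosted_total(c: CostInputs) -> float:
    """Per-task hosted cost times expected lifetime volume."""
    return c.hosted_cost_per_task * c.lifetime_tasks


def fine_tuned_total(c: CostInputs) -> float:
    """Fine-tune cost plus serving cost over the same period."""
    return c.fine_tune_cost + c.serving_cost_per_month * c.months


def fine_tune_pays_off(
    c: CostInputs,
    quality_at_least_equal: bool,
    min_savings_fraction: float = 0.3,  # illustrative stand-in for "meaningfully cheaper"
) -> bool:
    """Fine-tune only if quality holds AND savings clear the threshold."""
    if not quality_at_least_equal:
        return False
    hosted = hosted_total(c)
    if hosted <= 0:
        return False
    savings_fraction = (hosted - fine_tuned_total(c)) / hosted
    return savings_fraction >= min_savings_fraction


# Made-up figures: same per-task price and serving cost, different volume.
high_volume = CostInputs(0.002, 50_000_000, 2_000.0, 3_000.0, 12)
low_volume = CostInputs(0.002, 1_000_000, 2_000.0, 3_000.0, 12)

print(fine_tune_pays_off(high_volume, quality_at_least_equal=True))  # True
print(fine_tune_pays_off(low_volume, quality_at_least_equal=True))   # False
```

With these made-up numbers, only volume changes between the two cases: at 50 million calls the hosted bill is far larger than the fixed serving cost, while at 1 million calls the serving cost alone exceeds it. The quality flag is checked first because a cheaper model that is worse fails the rule regardless of savings.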

What changes for you

Knowing when not to fine-tune is at least as valuable as knowing how. Most production applications stay on hosted models forever, and that is correct; training too early, or training when the volume doesn’t justify it, spends real budget and real engineering time on the wrong thing. For the specific teams that meet the three-things-true-at-once test, this lesson lays out a well-trodden pipeline: pick an open base, curate data, train with LoRA, optionally add DPO, evaluate honestly, A/B test, and watch the cost and latency move. The decision becomes a build-economics calculation, not a research project. The next lesson takes the agents direction from lesson 8 deeper; the capstone closes Phase 3 with the industry-perspective synthesis.

Training your own model is a specific tool in the build-vs-buy spectrum, not a default. Most teams never need it; the teams that do have a specific high-volume sub-task that prompting cannot fix, and they follow a well-trodden pipeline. Know the criteria; reach for it deliberately.