Multimodal agents in production
What you’ll learn
This is lesson 9 of Track 24, in Phase 4 (Advanced multimodal directions). By the end you will be able to name the practical constraints multimodal models face inside shipping products and apply the engineering-informs-vs-settles distinction to separate engineering territory from product, business, and policy territory. The one capability to walk away with: given a production-engineering question, apply the operational test (what instruments would settle it?), then route questions that engineering can settle to engineering work and route product, business, and policy questions to the conversations where they belong.
The lesson maps to Karina Nguyen’s CS25 V5 guest lecture (April 8, 2025); full attribution is in this lesson’s references.
Where this fits
This lesson stays in production-multimodal territory but returns to consumer products after the science-application detour of lesson 8. Where lesson 4 stacked perception + reasoning + tools + alignment as the architecture of a multimodal reasoning system, this lesson asks what changes when that stack lives inside a shipping product. Lesson 10 closes the track by synthesizing cross-cutting threads and naming the frontiers we did not cover.
Before you start
Prerequisite: Lesson 4, Reasoning over multimodal inputs. You need the four-layer stack established there (perception + reasoning + tool use + alignment), because this lesson asks what changes when that stack has to ship. Familiarity with general RL concepts (reward models, fine-tuning) helps but is not strictly required; the lesson presents RLHF and RLAIF at intuition level.
By the end, you’ll be able to
- Distinguish production from research constraints
- Explain the benchmark-vs-usability gap and the co-design loop
- Describe RL co-design (RLHF, RLAIF) and asymmetric verification
- Identify multimodal-specific production challenges and standard responses
- Apply the engineering-informs-vs-settles distinction
Time and difficulty
- Read time: about 13 minutes
- Practice time: about 15 minutes (a settles-vs-informs classification on 6 production decisions, a constraint-to-response matching exercise on 4 multimodal-specific challenges, and flashcards)
- Difficulty: standard