
Multimodal agents in production

This is lesson 9 of Track 24, in Phase 4 (Advanced multimodal directions). By the end you will be able to name the practical constraints multimodal models face inside shipping products and use the engineering-informs-vs-settles distinction to separate engineering territory from product, business, and policy territory. The one capability to walk away with: given a production-engineering question, apply the operational test (what instruments would settle it?). If some measurement can settle the question, route it to engineering work; if engineering can only inform it, route it to the product, business, or policy conversation where it belongs.

The lesson maps to Karina Nguyen’s CS25 V5 guest lecture (April 8, 2025); full attribution is in this lesson’s references.

This lesson stays in production-multimodal territory but returns from the science-application detour of lesson 8 to consumer products. Where lesson 4 stacked perception + reasoning + tool use + alignment as the architecture of a multimodal reasoning system, this lesson asks what changes when that stack lives inside a shipping product. Lesson 10 closes the track by synthesizing cross-cutting threads and naming the frontiers we did not cover.

Prerequisite: Lesson 4, Reasoning over multimodal inputs. You need the four-layer stack established there (perception + reasoning + tool use + alignment), because this lesson asks what changes when that stack has to ship. Familiarity with general RL concepts (reward models, fine-tuning) helps but is not strictly required; the lesson presents RLHF and RLAIF at an intuitive level.

  • Distinguish production from research constraints
  • Explain the benchmark-vs-usability gap and the co-design loop
  • Describe RL co-design (RLHF, RLAIF) and asymmetric verification
  • Identify multimodal-specific production challenges and standard responses
  • Apply the engineering-informs-vs-settles distinction

  • Read time: about 13 minutes
  • Practice time: about 15 minutes (a settles-vs-informs classification on 6 production decisions, a constraint-to-response matching exercise on 4 multimodal-specific challenges, and flashcards)
  • Difficulty: standard