Cheatsheet: Multimodal agents in production

Research vs production (the constraint shift)

| Aspect | Research | Production |
| --- | --- | --- |
| Judged on | benchmark performance | latency + cost + reliability + safety + deployment evaluation |
| Inputs | clean test set | messy real users, ambiguous goals |
| Ground truth | known | absent at inference time |
| Variance tolerance | high | low (consistent behavior expected) |
| Evaluation | held-out test set | A/B testing on real users |
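
The Evaluation row swaps a held-out test set for A/B testing on real users. As a minimal sketch of how such a test might be read, the snippet below runs a two-proportion z-test on success counts from a control arm and a treatment arm. The function name, the "user accepted the answer" metric, and the counts are all hypothetical and only illustrate the arithmetic.

```python
import math

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for H0: both arms share one success rate."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled success rate under the null hypothesis.
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution, via erfc.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: sessions in which the user accepted the agent's answer.
z, p = two_proportion_z_test(success_a=480, n_a=1000, success_b=530, n_b=1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A positive `z` here means the treatment arm (B) had the higher observed success rate; whether that difference matters is a product decision, not something the test settles.
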

| Gap | Closure |
| --- | --- |
| Benchmarks ≠ real-world usefulness | tight CO-DESIGN LOOP: scientists prototype, users probe, signals shape next iteration |
| Outcome | evaluation metrics that measure real-world usability, not what traditional benchmarks happen to record |

| Item | Detail |
| --- | --- |
| Co-design framing | optimization target shaped by product context; reward from real product feedback |
| RLHF | reward model trained on human preferences; policy fine-tuned against it |
| RLAIF | reward from AI-generated feedback; faster to iterate, with its own tradeoffs |
| Asymmetric verification | checking is easier than generating; use a smaller verifier as the reward signal |
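
The asymmetric-verification row can be illustrated with a toy task where checking is far cheaper than generating. In this sketch, proposing a factorization stands in for an expensive generator, and a multiplication check stands in for the smaller verifier; the verifier's verdict is used directly as a binary reward. All function names are hypothetical, and a programmatic check replaces what would in practice be a smaller model.

```python
import random

def verify_factorization(n, factors):
    """Cheap check: every factor is a nontrivial divisor and they multiply to n."""
    product = 1
    for f in factors:
        if f <= 1 or f >= n:
            return False
        product *= f
    return product == n

def propose_factorization(n, rng):
    """Stand-in for an expensive generator (hypothetical): guesses one split of n."""
    a = rng.randint(2, n - 1)
    return [a, n // a]

def reward(n, candidate):
    """Binary reward taken directly from the verifier's verdict."""
    return 1.0 if verify_factorization(n, candidate) else 0.0

rng = random.Random(0)
n = 91  # 7 * 13
candidates = [propose_factorization(n, rng) for _ in range(200)]
rewards = [reward(n, c) for c in candidates]
verified = [c for c, r in zip(candidates, rewards) if r == 1.0]
print(f"{len(verified)} of {len(candidates)} candidates passed verification")
```

The generator never needs to know why a candidate failed; the reward signal comes entirely from the cheaper check.
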

| Challenge | Standard response |
| --- | --- |
| Variable input sizes (text / image / PDF / video) | chunking + retrieval; size-aware routing |
| Output streaming quirks (image meaningless until rendered) | latent diffusion + progressive rendering; partial-preview |
| Tool-use latency budgets (each call adds wait) | bounded budgets, short-circuit, parallel dispatch |
| Cross-modal quality variance | routing to stronger model on uncertainty; user warnings |
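
The tool-use latency row (bounded budgets, short-circuit, parallel dispatch) might look roughly like the following `asyncio` sketch. The tool names, the simulated latencies, and the `dispatch_with_budget` helper are hypothetical; `asyncio.sleep` stands in for real network or model calls.

```python
import asyncio

async def call_tool(name, delay_s):
    """Hypothetical tool call; sleep stands in for network or model latency."""
    await asyncio.sleep(delay_s)
    return f"{name}: ok"

async def dispatch_with_budget(calls, budget_s):
    """Run all tool calls concurrently; keep whatever finishes within budget_s."""
    tasks = {asyncio.create_task(call_tool(name, delay)): name
             for name, delay in calls.items()}
    done, pending = await asyncio.wait(list(tasks), timeout=budget_s)
    # Short-circuit: cancel stragglers instead of making the user wait for them.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    results = {tasks[t]: t.result() for t in done}
    timed_out = sorted(tasks[t] for t in pending)
    return results, timed_out

# Hypothetical per-tool latencies in seconds; the last one exceeds the budget.
calls = {"search": 0.01, "ocr": 0.02, "video_index": 5.0}
results, timed_out = asyncio.run(dispatch_with_budget(calls, budget_s=0.5))
print(results, timed_out)
```

Because the calls run in parallel, the total wait is capped by the budget rather than by the sum of the individual latencies; the caller then decides how to answer without the tools that timed out.
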

| Engineering SETTLES | Engineering INFORMS (but does not settle) |
| --- | --- |
| Latency budgets | Product strategy (what to ship) |
| Cost per query | Market positioning (who for) |
| A/B-test signals | Deployment policy (what to allow in edge cases) |
| Benchmark performance | Organizational priorities (engagement vs satisfaction tradeoffs) |
| Evaluation-harness design | Vendor-comparative ranking |

| If the question is settled by… | It is… |
| --- | --- |
| Latency profiling, A/B tests, cost measurement, benchmarks, evaluation harness | IN SCOPE (engineering) |
| Product strategy, market analysis, deployment-policy positions, organizational priorities, vendor-comparative ranking | OUT OF SCOPE (different conversations) |

| Pitfall | Reality |
| --- | --- |
| “Production = research, but bigger” | structurally different constraint sets |
| “RL is the answer” | RL has structural risks (engagement vs satisfaction); needs co-design loop to stay honest |
| “Multimodal = text + bigger” | variable inputs / streaming / tool-latency / quality calibration are real differences |
| “Engineering settles the product question” | engineering INFORMS; product/business/policy still required |