Cheatsheet: Multimodal agents in production
Research vs production (the constraint shift)
| Aspect | Research | Production |
|---|---|---|
| Judged on | benchmark performance | latency + cost + reliability + safety + deployment evaluation |
| Inputs | clean test set | messy real users, ambiguous goals |
| Ground truth | known | absent at inference time |
| Variance tolerance | high | low (consistent behavior expected) |
| Evaluation | held-out test set | A/B testing on real users |
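
The "A/B testing on real users" row can be made concrete with a significance check on task-success rates. The sketch below uses a pooled two-proportion z-test, which is one common choice and not something this cheatsheet prescribes. The function name `two_proportion_z_test` and the counts in the usage lines are hypothetical.

```python
import math


def two_proportion_z_test(success_a, total_a, success_b, total_b):
    """Return (z, two-sided p-value) for the success-rate difference B - A.

    Assumption: each user session is an independent success/failure trial,
    and both arms are large enough for the normal approximation to hold.
    """
    if total_a <= 0 or total_b <= 0:
        raise ValueError("both arms need at least one observation")
    rate_a = success_a / total_a
    rate_b = success_b / total_b
    # Pooled rate under the null hypothesis that both arms behave the same.
    pooled = (success_a + success_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0:
        # All successes or all failures in both arms: no measurable difference.
        return 0.0, 1.0
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value of a standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical counts: control arm A vs. candidate arm B.
    z, p = two_proportion_z_test(100, 1000, 150, 1000)
    print(f"z={z:.2f}, p={p:.4f}")
```

A low p-value only says the arms differ on the logged metric. Whether that metric reflects real usefulness is the gap the next section describes.
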
The benchmark-vs-usability gap
| Gap | Closure |
|---|---|
| Benchmarks ≠ real-world usefulness | tight CO-DESIGN LOOP: scientists prototype, users probe, signals shape next iteration |
| Outcome | evaluation metrics that measure real-world usability, not what traditional benchmarks happen to record |
RL as co-design
| Item | Detail |
|---|---|
| Co-design framing | optimization target shaped by product context; reward from real product feedback |
| RLHF | reward model from human preferences; policy fine-tuned against it |
| RLAIF | reward from AI-generated feedback; faster to iterate, but with its own tradeoffs |
| Asymmetric verification | checking an output is easier than generating it; use a smaller verifier as the reward signal |
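
The asymmetric-verification row can be illustrated with a toy task where checking is far cheaper than generating. In this minimal sketch the "generator" output is a list of proposed factor pairs and the "verifier" is a single multiplication. All names (`is_valid_factorization`, `verifier_rewards`, `best_of_n`) are hypothetical, and a real system would use a learned or programmatic verifier in place of the toy check.

```python
from typing import Callable, List, Sequence, Tuple

Candidate = Tuple[int, int]


def is_valid_factorization(n: int, candidate: Candidate) -> bool:
    """Cheap check: one multiplication, versus a search to find the factors."""
    p, q = candidate
    return p > 1 and q > 1 and p * q == n


def verifier_rewards(
    n: int,
    candidates: Sequence[Candidate],
    verify: Callable[[int, Candidate], bool],
) -> List[float]:
    """Turn verifier verdicts into scalar rewards (1.0 = verified, 0.0 = rejected)."""
    return [1.0 if verify(n, c) else 0.0 for c in candidates]


def best_of_n(
    n: int,
    candidates: Sequence[Candidate],
    verify: Callable[[int, Candidate], bool],
) -> Tuple[Candidate, List[float]]:
    """Pick the highest-reward candidate; ties go to the earliest one."""
    if not candidates:
        raise ValueError("need at least one candidate")
    rewards = verifier_rewards(n, candidates, verify)
    best_index = max(range(len(candidates)), key=lambda i: rewards[i])
    return candidates[best_index], rewards


if __name__ == "__main__":
    # Hypothetical generator output for the prompt "factor 91".
    proposals = [(3, 30), (7, 13), (9, 10)]
    best, rewards = best_of_n(91, proposals, is_valid_factorization)
    print(best, rewards)  # (7, 13) [0.0, 1.0, 0.0]
```

The same rewards could feed a policy update instead of a best-of-n selection. The sketch only shows where the reward signal comes from.
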
Multimodal-specific production challenges
| Challenge | Standard response |
|---|---|
| Variable input sizes (text / image / PDF / video) | chunking + retrieval; size-aware routing |
| Output streaming quirks (image meaningless until rendered) | latent diffusion + progressive rendering; partial-preview |
| Tool-use latency budgets (each call adds wait) | bounded budgets, short-circuit, parallel dispatch |
| Cross-modal quality variance | routing to stronger model on uncertainty; user warnings |
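
The tool-use row (bounded budgets, parallel dispatch) can be sketched with `asyncio`. The example below is a minimal illustration under assumptions: `call_tools_with_budget`, `fake_tool`, the tool names, and the 0.2 s budget are all hypothetical. It dispatches every tool call at once and drops whatever has not finished when the budget expires. Short-circuiting on the first sufficient result is not shown.

```python
import asyncio
from typing import Any, Coroutine, Dict, Optional


async def call_tools_with_budget(
    tool_calls: Dict[str, Coroutine[Any, Any, Any]],
    budget_s: float,
) -> Dict[str, Optional[Any]]:
    """Run all tool coroutines in parallel; keep only results inside the budget.

    Tools that time out or raise are reported as None so the caller can
    degrade gracefully instead of waiting.
    """
    if not tool_calls:
        return {}
    tasks = {name: asyncio.create_task(coro) for name, coro in tool_calls.items()}
    done, pending = await asyncio.wait(tasks.values(), timeout=budget_s)
    # Cancel stragglers and let the cancellations settle before returning.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)

    results: Dict[str, Optional[Any]] = {}
    for name, task in tasks.items():
        finished_ok = task in done and not task.cancelled() and task.exception() is None
        results[name] = task.result() if finished_ok else None
    return results


async def fake_tool(delay_s: float, value: str) -> str:
    """Stand-in for a real tool call (hypothetical)."""
    await asyncio.sleep(delay_s)
    return value


async def demo() -> Dict[str, Optional[Any]]:
    return await call_tools_with_budget(
        {"ocr": fake_tool(0.01, "text"), "video_caption": fake_tool(5.0, "caption")},
        budget_s=0.2,
    )


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Because the calls run concurrently, the wait is bounded by the budget rather than by the sum of the individual tool latencies.
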
Engineering: settles vs informs
| Engineering SETTLES | Engineering INFORMS (but does not settle) |
|---|---|
| Latency budgets | Product strategy (what to ship) |
| Cost per query | Market positioning (who for) |
| A/B-test signals | Deployment policy (what to allow in edge cases) |
| Benchmark performance | Organizational priorities (engagement vs satisfaction tradeoffs) |
| Evaluation-harness design | Vendor-comparative ranking |
The operational test (this watch zone)
| If the question is settled by… | It is… |
|---|---|
| Latency profiling, A/B tests, cost measurement, benchmarks, evaluation harness | IN SCOPE (engineering) |
| Product strategy, market analysis, deployment-policy positions, organizational priorities, vendor-comparative ranking | OUT OF SCOPE (different conversations) |
Common pitfalls
| Pitfall | Reality |
|---|---|
| “Production = research, but bigger” | structurally different constraint sets |
| “RL is the answer” | RL has structural risks (engagement vs satisfaction); needs co-design loop to stay honest |
| “Multimodal = text + bigger” | variable inputs / streaming / tool-latency / quality calibration are real differences |
| “Engineering settles the product question” | engineering INFORMS; product/business/policy still required |