Practice: Multimodal agents in production

Self-check

Seven short questions. Try to answer each one before opening the collapsible.

1. Name three constraints production faces that research benchmarks do not measure.

Show answer

Any three of: latency (how long the user waits), cost per query, reliability (consistent behavior across sessions), safety in the wild (long-tail attacks and edge cases), and evaluation in deployment (A/B signals from real users with messy inputs and ambiguous goals).

2. What is the “benchmark-vs-real-usability gap” and what closes it?

Show answer

Traditional benchmarks measure performance on a clean test set with known ground truth; real-world usability measures whether the model helps actual users with messy goals and no ground truth. The two are correlated but not identical. The tight co-design loop (scientists prototype, users probe, signals from real usage shape the next iteration) is what closes the gap by making the metric reflect what the product actually needs to do.

3. What does “RL as co-design” mean in production?

Show answer

The optimization targets are shaped by the product context, not just by abstract benchmark scores. The reward signals come from real product feedback; the iteration loop ties research to product needs directly. RL is no longer a separate stage applied to a base model but a co-design between research and product.

4. What are RLHF and RLAIF, and how do they differ?

Show answer

RLHF: Reinforcement Learning from Human Feedback. Train a reward model on human preferences over outputs, then fine-tune the policy against that reward. RLAIF: Reinforcement Learning from AI Feedback. Use AI-generated feedback instead of (or alongside) human feedback. RLAIF is faster to iterate on and more scalable, but it has its own tradeoffs (for example, the policy can inherit or exploit the AI judge's biases) that the co-design loop has to manage.
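To make "train a reward model on human preferences" concrete, here is a minimal sketch of a pairwise preference loss. It assumes the common Bradley-Terry style objective; the answer above does not name a specific loss, so the loss choice and the name `pairwise_preference_loss` are illustrative assumptions, not the lesson's prescribed method.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_chosen - reward_rejected)).

    Assumption: a Bradley-Terry style objective, one common way to fit a
    reward model to "output A was preferred over output B" labels.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss shrinks as the reward model ranks the preferred output higher.
agrees_with_label = pairwise_preference_loss(2.0, 0.0)
disagrees_with_label = pairwise_preference_loss(0.0, 2.0)
```

Under RLAIF the same loss could be reused unchanged; only the source of the "chosen" and "rejected" labels would differ.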

5. What is asymmetric verification, and why is it useful?

Show answer

Checking whether an answer is good is often easier than producing a good answer from scratch. That asymmetry is exploitable: a verifier model only needs to be good at the easier task, and its judgments can serve as the reward signal for training a stronger generator. This is a common pattern in modern post-training (RLAIF judges, tool-output checks, generate-then-verify loops).
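A toy generate-then-verify loop illustrates the asymmetry. Integer factoring stands in for the hard generation task here because checking a proposed factorization takes one multiplication. The task and all function names are hypothetical choices for illustration, not part of the lesson.

```python
import math
from typing import Callable, Iterable, Optional, Tuple

Candidate = Tuple[int, int]


def verify_factorization(n: int, candidate: Candidate) -> bool:
    """Cheap check: one multiplication decides whether the answer is good."""
    p, q = candidate
    return p > 1 and q > 1 and p * q == n


def propose_factor_pairs(n: int) -> Iterable[Candidate]:
    """Hypothetical generator: emits many guesses, most of them wrong."""
    for p in range(2, math.isqrt(n) + 1):
        yield (p, n // p)


def generate_then_verify(
    n: int,
    propose: Callable[[int], Iterable[Candidate]],
    verify: Callable[[int, Candidate], bool],
) -> Optional[Candidate]:
    """Return the first proposal the verifier accepts, or None."""
    for candidate in propose(n):
        if verify(n, candidate):
            return candidate
    return None


def verifier_reward(n: int, candidate: Candidate) -> float:
    """The same check reused as a binary reward signal for training."""
    return 1.0 if verify_factorization(n, candidate) else 0.0
```

In a training loop, `verifier_reward` would play the role that a reward model or AI judge plays in post-training: the generator improves against a signal that is cheaper to compute than the answer itself.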

6. Name two production challenges specific to multimodal vs text-only systems.

Show answer

Any two of: variable input sizes (a text prompt may be dozens of tokens, an image hundreds to thousands, a PDF or video much larger), output-streaming quirks (unlike text, a partially generated image means little until it is rendered), tool-use latency budgets (each tool call adds wait time), and cross-modal quality calibration (quality varies across input types, which may require routing).
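The variable-input-size point can be made concrete with rough arithmetic. The sketch below assumes a ViT-style encoder that cuts an image into square patches and emits one token per patch; the 14-pixel patch size and both function names are assumptions for illustration, and real systems often resize or pool images first, so treat the numbers as order-of-magnitude estimates.

```python
from typing import Iterable, Tuple


def image_patch_tokens(width_px: int, height_px: int, patch_px: int = 14) -> int:
    """Estimate tokens for one image, assuming one token per square patch."""
    if min(width_px, height_px, patch_px) <= 0:
        raise ValueError("dimensions and patch size must be positive")
    # Integer ceiling division: partial patches at the edges still count.
    cols = -(-width_px // patch_px)
    rows = -(-height_px // patch_px)
    return cols * rows


def estimate_request_tokens(text_tokens: int, images: Iterable[Tuple[int, int]]) -> int:
    """Sum the text tokens and the estimated patch tokens for every image."""
    return text_tokens + sum(image_patch_tokens(w, h) for w, h in images)


small_image = image_patch_tokens(224, 224)    # 16 x 16 patches
large_image = image_patch_tokens(1024, 1024)  # 74 x 74 patches
```

Under these assumptions a 40-token question about a single 1024 x 1024 image is dominated by the image: the request is two orders of magnitude larger than the text alone.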

7. State the engineering-informs-vs-settles distinction.

Show answer

Engineering settles the engineering-instrumented questions (latency, cost per query, A/B-test signals, benchmark performance, evaluation-harness design). Engineering informs but does not settle product strategy, deployment policy, organizational priorities, and questions of what the product should do in genuinely ambiguous cases. The operational test (what instruments settle the question?) separates the two.

Try it yourself: settles or informs?

For each production decision, label whether engineering settles it or engineering only informs it (and identify, for the latter, which non-engineering instrument is needed).

A. Whether a new multimodal feature can run within the 2-second
   latency budget the product team has committed to.
B. Whether the company should ship a new image-generation product
   for use in journalism.
C. Whether implementation A or implementation B has lower
   per-query cost.
D. What the model should do when it detects a prompt asking it to
   produce content the company's policy explicitly forbids.
E. Whether the new model's A/B-test performance exceeds the
   incumbent's on the team's defined success metric.
F. Whether to prioritize image generation, video generation, or
   audio generation in the next quarter's roadmap.

Show answer

A: engineering settles. Latency profiling on production-like hardware gives a clear yes/no.
B: engineering INFORMS only. The shipping decision requires product strategy, market analysis, sector-policy review (journalism standards), and organizational deliberation. Engineering can characterize feasibility; it cannot decide whether to ship.
C: engineering settles. Cost-per-query measurement is a defined engineering instrument.
D: engineering INFORMS only. What the model should do in that case is a values and policy decision; engineering can implement detection and enforcement once the policy exists, but the decision is made elsewhere (deliberative alignment work, organizational policy, sometimes legal review).
E: engineering settles. A/B-test outcomes are exactly what the engineering instrument measures.
F: engineering INFORMS only. Roadmap prioritization is product strategy informed by engineering feasibility but settled by business judgment, market positioning, and organizational priorities.

The procedural pattern: questions answered by engineering instruments (latency, cost, A/B tests, benchmarks, harness design) are settled engineering work. Questions requiring product, business, policy, or organizational instruments are informed by engineering data but not settled by it. Conflating the two overstates what an engineering team can decide on its own.
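Items A and C are the kind of question an instrument answers directly. The sketch below shows what such instruments might look like; the choice of the 95th percentile, the nearest-rank method, and all function names are assumptions for illustration, since the exercise only specifies a latency budget and a per-query cost comparison.

```python
from typing import Sequence


def percentile_nearest_rank(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    rank = -(-pct * len(ordered) // 100)  # integer ceiling of pct * n / 100
    return ordered[rank - 1]


def within_latency_budget(samples_s: Sequence[float], budget_s: float, pct: int = 95) -> bool:
    """Item A: does the chosen latency percentile fit inside the budget?"""
    return percentile_nearest_rank(samples_s, pct) <= budget_s


def cheaper_implementation(costs_a: Sequence[float], costs_b: Sequence[float]) -> str:
    """Item C: compare mean measured cost per query for two implementations."""
    mean_a = sum(costs_a) / len(costs_a)
    mean_b = sum(costs_b) / len(costs_b)
    if mean_a == mean_b:
        return "tie"
    return "A" if mean_a < mean_b else "B"
```

Note what the code cannot do: nothing here decides whether 2 seconds was the right budget or whether the cheaper implementation should ship. Those inputs arrive from outside the instrument, which is the settles-versus-informs distinction in miniature.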

Try it yourself: match the production constraint to the engineering response

Match each multimodal production constraint (left) to the standard engineering response (right).

Constraints:
A. Long PDF input (50,000 tokens' worth of patch and text tokens)
   within a 3-second latency budget
B. Image generation output that needs to feel responsive to the user
C. Multimodal model whose quality varies substantially across
   modality combinations
D. Reasoning-plus-tool-use loop where each tool call adds
   noticeable wait

Responses:
1. Routing: send uncertain queries to a stronger model; warn users
   on weakly-supported combinations
2. Chunking + retrieval; process relevant sections in parallel
3. Latent diffusion + progressive rendering; partial-image preview
   during generation
4. Bounded tool-use budgets; short-circuit conditions; parallel
   tool dispatch

Show answer

A -> 2: long input -> chunking + retrieval; process relevant sections in parallel
B -> 3: image streaming -> latent diffusion + progressive rendering; partial preview
C -> 1: cross-modal quality variance -> routing to stronger models on uncertainty;
        user warnings on weak combinations
D -> 4: tool-use latency -> bounded budgets, short-circuit, parallel tool dispatch

The pattern: each of the lesson’s four multimodal production challenges has a standard engineering response that the production-engineering literature has converged on, even if no specific vendor’s product implements it identically.
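As one illustration, response 4 (bounded budgets plus parallel dispatch) can be sketched with the standard library's `asyncio`. The tool functions, the `max_calls` cap, and the per-call timeout are hypothetical; a real agent loop would also need short-circuit conditions and error handling that this sketch leaves out.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional, Tuple

Tool = Callable[[str], Awaitable[str]]


async def call_with_deadline(tool: Tool, arg: str, timeout_s: float) -> Optional[str]:
    """Run one tool call; return None if it exceeds its time budget."""
    try:
        return await asyncio.wait_for(tool(arg), timeout=timeout_s)
    except asyncio.TimeoutError:
        return None


async def dispatch_tools(
    calls: List[Tuple[Tool, str]], max_calls: int, timeout_s: float
) -> List[Optional[str]]:
    """Run at most max_calls tool calls concurrently, each with a deadline."""
    bounded = calls[:max_calls]  # bounded budget: extra calls are dropped
    return await asyncio.gather(
        *(call_with_deadline(tool, arg, timeout_s) for tool, arg in bounded)
    )


# Hypothetical tools used only to exercise the sketch.
async def fast_lookup(query: str) -> str:
    await asyncio.sleep(0.01)
    return "fast:" + query


async def slow_lookup(query: str) -> str:
    await asyncio.sleep(1.0)
    return "slow:" + query
```

Because the calls run concurrently, total wait is bounded by the timeout rather than by the sum of the individual tool latencies. Calling `asyncio.run(dispatch_tools([(fast_lookup, "a"), (slow_lookup, "b")], max_calls=2, timeout_s=0.2))` returns the fast result and `None` for the call that missed its deadline.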

Flashcards

Ten cards.

Q. Name three constraints production faces beyond benchmark performance.

Any three of: latency, cost per query, reliability across sessions, safety in the wild (long-tail attacks), evaluation in deployment (A/B signals from real users).

Q. What is the benchmark-vs-real-usability gap?

Benchmarks measure on clean test sets with known ground truth; real-world usability measures whether the model helps actual users with messy goals. Correlated but not identical; the gap is where products succeed or fail.

Q. What is RL as co-design?

RL is co-designed with the product itself: optimization targets shaped by product context, reward signals from real product feedback, iteration loop tying research to product needs. Not a separate post-training stage.

Q. What is RLHF?

Reinforcement Learning from Human Feedback. Train a reward model on human preferences, then fine-tune the policy against that reward. A widely used post-training pattern.

Q. What is RLAIF?

Reinforcement Learning from AI Feedback. Use AI-generated feedback instead of (or alongside) human feedback. Faster to iterate, more scalable, with its own tradeoffs.

Q. What is asymmetric verification?

Checking whether an answer is good is often easier than producing a good answer from scratch. That asymmetry enables training-loop designs where a verifier, which only needs to be good at the easier task, supplies the reward signal for a stronger generator.

Q. Name two multimodal-specific production challenges.

Any two of: variable input sizes (text vs image vs video), output-streaming quirks (a partial image means little until rendered), tool-use latency budgets, cross-modal quality calibration (variance across input types may require routing).

Q. What does engineering settle?

The engineering-instrumented questions: latency budgets, cost per query, A/B-test signals between implementations, benchmark performance, evaluation-harness design. Anything decidable by measurement on defined instruments.

Q. What does engineering inform but NOT settle?

Product strategy (what to ship), market positioning (who for), deployment policy (what to allow in edge cases), organizational priorities (how to weigh engagement vs satisfaction). Engineering provides the data; product/business/policy judgment provides the answer.

Q. What is the operational test for this watch zone?

What instruments would you use to settle the question? If A/B testing, latency analysis, or engineering-harness measurement settles it, it is technique territory. If business judgment, market analysis, deployment-policy positions, or vendor-comparative ranking is required, it lives in a different conversation.