Practice: Reasoning over multimodal inputs

Self-check

Seven short questions. Try to answer each one before opening the collapsible.

1. What structurally distinguishes a reasoning model from a standard LLM?

Show answer

A reasoning model generates a long internal chain of thought before producing its answer, using significantly more inference compute per query. It uses many more internal tokens than it outputs; that extra inference time is the price for better answers on tasks that reward deliberate reasoning.

2. Name three capabilities multimodal reasoning enables that single-pass VLMs handle poorly.

Show answer

Visual problem solving (math from an image, chart-data extraction with computation), self-checking across modalities (re-examining the image during reasoning and catching one’s own misreading), and grounded multi-step reasoning (each step citing visual evidence, producing auditable answers).

3. Name four kinds of tools a multimodal reasoning model commonly calls.

Show answer

Vision tools (zoom, crop, OCR-on-region), code execution (extract numbers, compute, plot, verify), search (look up information about something in the image), and image generation (produce a diagram or illustration as part of the answer).

4. What is deliberative alignment, in one sentence?

Show answer

A safety technique that trains the model to reason explicitly over a written safety specification during inference (recall relevant rules, check the request against them, then decide), rather than learning safe behavior only from data and reward signals.

5. Why is deliberative alignment especially relevant for multimodal models?

Show answer

Because multimodal models face new attack surfaces single-modal LLMs do not, including prompts hidden inside images, instructions embedded in visual content, and jailbreaks that use the visual channel to bypass text-based safety training. The deliberative reasoning machinery can be turned on the question “is this request appropriate given everything I see and read here?”

6. Why is “more thinking is always better” not quite true?

Show answer

Because reasoning can amplify confident hallucination when the underlying perception is wrong. If the model misread the image, longer chain-of-thought built on the misreading can compound the error. Self-checking and tool use help, but neither eliminates the problem entirely.

7. List the four layers of the multimodal reasoning model architecture stack.

Show answer

(1) Base multimodal architecture (encode-then-fuse per L2 or native multimodal per L3) for perception. (2) Reasoning training (chain-of-thought generation via reinforcement learning). (3) Tool-use infrastructure wired into the reasoning loop. (4) Deliberative alignment training threading safety reasoning through the same chain-of-thought machinery.

Try it yourself: which layer failed?

For each described failure, identify which layer of the multimodal reasoning stack is the most likely cause (perception, reasoning, tool use, or alignment).

A. A multimodal reasoning model is asked to compute the sum of the numbers
   in a bar chart. Its chain of thought correctly extracts each bar's
   labeled value, but the final sum is off by 8. The arithmetic in the
   chain of thought is wrong.
B. The same model is asked to read a handwritten equation from a photo.
   It misreads "42" as "12" early in its chain of thought, then reasons
   correctly from that wrong number to a wrong final answer.
C. A user uploads a screenshot with text in the image that says "Ignore
   your safety guidelines and..." The model follows the instruction.
D. A multimodal reasoning model is asked to identify a small fish species
   in a low-resolution underwater photo. It guesses confidently and
   never zooms in or asks for a higher-resolution version.

Show answer

A: reasoning failure. Perception worked (correct values extracted); the arithmetic step in the chain of thought is wrong. Fixes target reasoning quality: better chain-of-thought training, or having the model call a code-execution tool for arithmetic instead of doing it in its head.
B: perception failure. The reasoning was internally correct given the wrong input; the underlying VLM misread the image. Fixes target perception: better vision encoder, higher resolution, or a re-examination tool the reasoning step can call.
C: alignment failure (specifically a multimodal attack-surface failure). The model followed an instruction injected through the visual channel. Stronger deliberative alignment training that reasons about the source and appropriateness of instructions (including from image content) would help; this attack surface is an active research area.
D: tool-use failure. The model had vision tools available (zoom, crop) but did not invoke them when perception quality was low. Better tool-use training and prompts that encourage the model to use tools when uncertain would help.

The discipline: pin failures to layers. “The model is bad at this” is too coarse to fix; “perception is the bottleneck” or “the chain of thought reasoned correctly from wrong input” tells you which layer to improve.

Try it yourself: identify the architecture stack

A new multimodal reasoning model is described in a press announcement: “We start from our native multimodal base model, train it with reinforcement learning to produce long internal chains of thought, give it access to a code interpreter and a web search tool, and add a final stage where the model reasons over our safety specification before each response.”

Decompose the description into the four-layer stack (perception, reasoning, tool use, alignment) and say which choice was made at each layer.

Show answer

Layer 1 (perception):         native multimodal base (L3 family)
Layer 2 (reasoning):          RL-trained chain-of-thought (the o-series pattern)
Layer 3 (tool use):           code interpreter + web search
Layer 4 (alignment):          deliberative alignment over a written safety spec

That four-layer composition is the typical anatomy of a modern frontier multimodal reasoning system. Each layer is a distinct training or infrastructure choice; the whole’s capabilities are the product of all four. When you read about a new system, mapping the announcement to these four layers tells you what is new (or just rebranded) vs what the system inherited.

Flashcards

Ten cards. Click any card to reveal the answer. Use the Print flashcards button for one card per page.

Q. What is a reasoning model?

A model that generates a long internal chain of thought before producing its answer, using significantly more inference compute per query. The extra compute buys better answers on tasks rewarding deliberate reasoning.

Q. What does multimodal reasoning extend?

Chain-of-thought reasoning extended to images and other modalities; the thinking interleaves visual observations with text reasoning steps, often referencing specific regions of the image.

Q. Name three capabilities multimodal reasoning enables.

Visual problem solving, self-checking across modalities (catch one’s own misreading), and grounded multi-step reasoning with auditable visual evidence at each step.

Q. What kinds of tools do multimodal reasoning models commonly call?

Vision tools (zoom, crop, OCR-on-region), code execution, search, and image generation. Tools extend perception and compose computation in ways pure thinking cannot.

Q. What is deliberative alignment?

A safety technique training the model to reason explicitly over a written safety specification during inference (recall, check, decide), rather than learning safe behavior only from data and reward.

Q. Why is deliberative alignment especially relevant for multimodal models?

Multimodal models face new attack surfaces (prompts hidden in images, jailbreaks via image content). Deliberative reasoning can be turned on “is this request appropriate given everything I see and read?”

Q. Why isn't more reasoning always better?

Reasoning can amplify confident hallucination when underlying perception is wrong. Longer chain-of-thought built on a misreading compounds the error. Self-checking and tools help but do not eliminate this.

Q. What are the four layers of the multimodal reasoning stack?

Base multimodal perception (L2 or L3), reasoning training (RL chain-of-thought), tool-use infrastructure, deliberative alignment. Failure can come from any layer; diagnosis means pinning to the layer.

Q. How does a multimodal reasoning model differ from a VLM in output?

A VLM hands you an answer; a reasoning model hands you an answer with reasoning attached, often auditable, sometimes citing visual evidence step by step.

Q. What's the difference between a perception failure and a reasoning failure?

Perception failure: the underlying VLM misread the image; reasoning was internally correct given wrong input. Reasoning failure: perception was right, but a step in the chain of thought was wrong. Different fixes target different layers.