
Summary: Reasoning over multimodal inputs

Modern multimodal reasoning models stack four layers: a multimodal perception base (L2 encode-then-fuse or L3 native), reasoning training that generates long chains of thought, tool-use infrastructure (vision tools, code, search, image generation), and deliberative alignment that reasons over a safety specification. The system’s capability is the product of all four, so one weak layer limits the whole. Failures can be traced to individual layers, and diagnosing which layer failed is the practical skill. This summary is the quick-scan version of the full lesson, which closes Phase 2.
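
One way to read "the product of all four" is as a simple multiplicative model. The sketch below is an illustration under that assumption, not a measured result: the class name `StackReliability`, its fields, and the example rates are all hypothetical, and it treats the layers as independent stages that must each succeed.

```python
from dataclasses import dataclass, asdict
from math import prod


@dataclass(frozen=True)
class StackReliability:
    """Hypothetical per-layer success rates in [0, 1]; illustrative only."""
    perception: float
    reasoning: float
    tool_use: float
    alignment: float

    def end_to_end(self) -> float:
        # Assumption: layers are independent and all four must succeed,
        # so the overall rate is the product of the per-layer rates.
        return prod(asdict(self).values())

    def weakest_layer(self) -> str:
        # The layer with the lowest rate caps the end-to-end result.
        rates = asdict(self)
        return min(rates, key=rates.get)


if __name__ == "__main__":
    # Made-up numbers: three perfect layers and one that works half the time.
    stack = StackReliability(perception=1.0, reasoning=1.0, tool_use=0.5, alignment=1.0)
    print(stack.end_to_end(), stack.weakest_layer())
```

Under this reading, the end-to-end rate can never exceed the weakest layer's rate, which is why improving an already strong layer does little when another layer is the bottleneck.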

  • Reasoning models generate a long internal chain of thought before answering, spending many more tokens on thinking than on the final output. The extra inference-time compute buys better answers on tasks that reward deliberate reasoning.
  • Multimodal reasoning extends this to images and other modalities. The chain of thought interleaves visual observations with text reasoning steps, references specific image regions, catches its own misreading via self-checking, and produces grounded multi-step answers.
  • Tool use makes reasoning much more powerful. Vision tools (zoom, crop, OCR-on-region), code execution, search, and image generation can be called during the chain of thought, extending perception and composing computation.
  • Deliberative alignment trains the model to reason explicitly over a written safety specification before acting (recall relevant rules, check the request, decide). For multimodal models it brings the same chain-of-thought machinery to bear on the new attack surfaces that images introduce (prompts hidden in images, jailbreaks via visual content).
  • The architecture stack: perception (L2 or L3) + reasoning training (RL chain-of-thought) + tool-use infrastructure + deliberative alignment. Each layer is a distinct training or infrastructure choice.
  • Failure diagnosis pins to layers: perception failure (misread the image), reasoning failure (wrong arithmetic from correct values), tool-use failure (didn’t call zoom when needed), or alignment failure (followed an instruction injected via the image). Each layer calls for a different fix.
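
The layer-by-layer diagnosis in the last bullet can be sketched as a small triage function. This is a minimal sketch, not a standard procedure: `Layer`, `TraceReview`, and `diagnose` are hypothetical names, the boolean fields stand in for a human reviewer's annotations of one failed response, and the order of the checks is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Layer(Enum):
    PERCEPTION = "perception"
    REASONING = "reasoning"
    TOOL_USE = "tool_use"
    ALIGNMENT = "alignment"


@dataclass
class TraceReview:
    """Hypothetical reviewer annotations for one failed response."""
    followed_injected_instruction: bool  # obeyed text embedded in the image
    needed_tool_not_called: bool         # e.g. never zoomed on small print
    values_read_correctly: bool          # numbers/text extracted from the image
    computation_correct: bool            # arithmetic/logic on the extracted values


def diagnose(review: TraceReview) -> Optional[Layer]:
    """Return the first layer that explains the failure, or None if none does.

    Assumed check order: alignment first, then tool use (a misread caused by
    skipping zoom is attributed to tool use), then perception, then reasoning.
    """
    if review.followed_injected_instruction:
        return Layer.ALIGNMENT
    if review.needed_tool_not_called:
        return Layer.TOOL_USE
    if not review.values_read_correctly:
        return Layer.PERCEPTION
    if not review.computation_correct:
        return Layer.REASONING
    return None


if __name__ == "__main__":
    # Correct values, wrong arithmetic: a reasoning failure.
    print(diagnose(TraceReview(False, False, True, False)))
```

Returning a single layer keeps the mapping to a fix simple. A real trace can have failures in several layers at once, in which case returning every matching layer would be the more honest design.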

When you upload a photo of a math problem and the model takes 30 seconds to think, when you give it a chart and it extracts numbers and computes the implication, when you show it broken code in a screenshot and it reasons about the bug, you are using a multimodal reasoning model. The latency you feel is buying capability. The four-layer stack is the lens worth carrying when you read about new systems: which layer is genuinely new, which is inherited from prior work, and what kinds of failures the system is most likely to make. The lesson also closes Phase 2 on building large multimodal models, with both the perceptual side (L2 encode-then-fuse, L3 native) and the reasoning side (this lesson) covered. The next phase turns to generation: how transformer-based architectures produce images and video as output.