Summary: Reasoning over multimodal inputs

Modern multimodal reasoning models stack four layers: a multimodal perception base (L2 encode-then-fuse or L3 native), reasoning training that generates long chains of thought, tool-use infrastructure (vision tools, code, search, image generation), and deliberative alignment that reasons over a safety specification. The whole’s capability is the product of all four. Each failure traces to a specific layer, and diagnosing which one is the practical skill. This summary is the skimmable version of the full lesson, which closes Phase 2.

Core ideas

Reasoning models generate a long internal chain of thought before answering, spending many more tokens on thinking than on the final output. The extra inference-time compute buys better answers on tasks that reward deliberate reasoning.
Multimodal reasoning extends this to images and other modalities. The chain of thought interleaves visual observations with text reasoning steps, references specific image regions, catches its own misreading via self-checking, and produces grounded multi-step answers.
Tool use makes reasoning much more powerful. Vision tools (zoom, crop, OCR-on-region), code execution, search, and image generation can be called during the chain of thought, extending perception and composing computation.
Deliberative alignment trains the model to reason explicitly over a written safety specification before acting (recall the relevant rules, check the request, decide). For multimodal models it brings the same chain-of-thought machinery to bear on the new attack surfaces that images introduce (prompts hidden in images, jailbreaks via visual content).
The architecture stack: perception (L2 or L3) + reasoning training (RL chain-of-thought) + tool-use infrastructure + deliberative alignment. Each layer is a distinct training or infrastructure choice.
Failure diagnosis pins to layers: a perception failure (misreading the image), a reasoning failure (wrong arithmetic from correctly read values), a tool-use failure (not calling zoom when needed), or an alignment failure (following an instruction injected via the image). Each layer calls for a different fix.
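
The layer-by-layer diagnosis above can be sketched as a small triage function. This is a minimal illustration, not a method from the lesson: the `Trace` record, its field names, and the order in which the checks run are all assumptions made for the example, and in practice a human reviewer (or an evaluation harness) would have to fill in each field.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Layer(Enum):
    PERCEPTION = "perception"
    REASONING = "reasoning"
    TOOL_USE = "tool use"
    ALIGNMENT = "alignment"


@dataclass
class Trace:
    """Hypothetical review notes for one model run (all fields are assumptions)."""
    followed_injected_instruction: bool  # obeyed text embedded in the image
    needed_tool: bool                    # e.g. detail too small to read without zoom
    called_tool: bool                    # did the model actually call the tool?
    values_read_correctly: bool          # do the extracted values match the image?
    answer_correct: bool                 # is the final answer right?


def diagnose(trace: Trace) -> Optional[Layer]:
    """Return the first layer that explains the failure, or None if nothing failed.

    The check order is an illustrative choice: alignment first, then the
    missing tool call, then the misread, then the reasoning itself.
    """
    if trace.followed_injected_instruction:
        return Layer.ALIGNMENT
    if trace.needed_tool and not trace.called_tool:
        return Layer.TOOL_USE
    if not trace.values_read_correctly:
        return Layer.PERCEPTION
    if not trace.answer_correct:
        return Layer.REASONING
    return None


# Example: the chart values were read correctly, but the arithmetic was wrong.
run = Trace(
    followed_injected_instruction=False,
    needed_tool=False,
    called_tool=False,
    values_read_correctly=True,
    answer_correct=False,
)
print(diagnose(run).value)  # prints: reasoning
```

Checking the tool-use layer before perception reflects one possible reading of the list: if a needed zoom was never called, the misread that follows is better attributed to the missing tool call than to the perception base itself.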

What changes for you

When you upload a math photo and the model takes 30 seconds to think, when you give it a chart and it extracts numbers and computes the implication, when you show it broken code in a screenshot and it reasons about the bug, you are using a multimodal reasoning model. The latency you feel is buying capability. The four-layer stack is the lens worth carrying when you read about new systems: which layer is genuinely new, which is inherited from prior work, and what kind of failures the system is most likely to make. The lesson also closes Phase 2 on building large multimodal models, with both the perceptual side (L2 encode-then-fuse, L3 native) and the reasoning side (this lesson) covered. The next phase turns to generation: how transformer-based architectures produce images and video as output.