Reasoning over multimodal inputs

This is lesson 4 of Track 24, the close of Phase 2 (Building large multimodal models). By the end you will be able to explain how a reasoning model uses images and diagrams within chain-of-thought, what tool use adds, and what deliberative alignment brings to the safety side. The one capability to walk away with: given a multimodal reasoning system, decompose it into its four-layer architecture stack (perception, reasoning, tools, alignment) and diagnose failures by pinning them to the responsible layer.

The lesson maps to Hongyu Ren’s CS25 V5 guest lecture. At drafting time the recording was not publicly available, so the lesson covers the three topics the lecture title names, grounded in the publicly available OpenAI deliberative-alignment paper and the documented o-series literature.

Phase 2 has covered the two perceptual architectures (L2 encode-then-fuse, L3 native multimodal) and now turns to the reasoning capability built on top of either. The four-layer stack this lesson assembles (perception, reasoning, tool use, alignment) is the anatomy of every modern frontier multimodal system; reading announcements through this lens is the literacy you carry forward. Phase 3 opens with the generative direction: how transformer-based architectures produce images and video as output.
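
As a rough illustration of what "diagnose failures by layer" could look like in practice, the sketch below models the four layers as an enum and pins a failed run to the first layer whose check did not pass. Everything here is hypothetical and is not taken from the lecture or any library: the `Observation` fields, the `diagnose` function, and the rule of blaming the earliest failing layer (on the assumption that later failures are often downstream effects of an earlier one).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Layer(Enum):
    """The four layers named in this lesson, in stack order."""
    PERCEPTION = "perception"
    REASONING = "reasoning"
    TOOLS = "tools"
    ALIGNMENT = "alignment"


@dataclass(frozen=True)
class Observation:
    """What a reviewer recorded about one run (hypothetical fields)."""
    input_read_correctly: bool   # did the model read the image/diagram right?
    reasoning_steps_valid: bool  # were the chain-of-thought steps sound?
    tool_calls_succeeded: bool   # did tool calls run and return usable output?
    policy_followed: bool        # did the response respect the safety policy?


def diagnose(obs: Observation) -> Optional[Layer]:
    """Return the earliest layer whose check failed, or None if all passed.

    Assumption: an earlier-layer failure is treated as the root cause,
    since later layers consume its output.
    """
    checks = [
        (Layer.PERCEPTION, obs.input_read_correctly),
        (Layer.REASONING, obs.reasoning_steps_valid),
        (Layer.TOOLS, obs.tool_calls_succeeded),
        (Layer.ALIGNMENT, obs.policy_followed),
    ]
    for layer, passed in checks:
        if not passed:
            return layer
    return None


if __name__ == "__main__":
    # Example: the model misread a chart axis, then reasoned from the wrong value.
    misread_chart = Observation(
        input_read_correctly=False,
        reasoning_steps_valid=False,
        tool_calls_succeeded=True,
        policy_followed=True,
    )
    result = diagnose(misread_chart)
    print(result.value if result else "no failure")  # prints "perception"
```

In the misread-chart example, the reasoning check also fails, but the sketch attributes the failure to perception because the reasoning was working from a wrong reading. A real diagnosis would need richer evidence than four booleans; the point is only the habit of asking which layer is responsible.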

Prerequisite: Lesson 3, Native multimodal intelligence (which itself has L2 as a prerequisite). You need both perceptual architectures in hand so the “reasoning is added on top of perception” framing lands. Familiarity with reasoning models more generally (the o-series and successors) from product use or other tracks is helpful but not strictly required; the lesson establishes the distinction.

  • Explain what structurally distinguishes a reasoning model
  • Describe how multimodal reasoning extends chain-of-thought to images
  • Name the common categories of tools and what each extends
  • Explain deliberative alignment and the multimodal attack surfaces it addresses
  • Decompose a multimodal reasoning system into the four-layer stack and diagnose failures by layer
  • Read time: about 13 minutes
  • Practice time: about 15 minutes (a which-layer-failed diagnostic exercise, an architecture-stack identification, and flashcards)
  • Difficulty: standard