Reasoning over multimodal inputs: brief

What you’ll learn

This is lesson 4 of Track 24, the close of Phase 2 (Building large multimodal models). By the end you will be able to explain how a reasoning model uses images and diagrams within chain-of-thought, what tool use adds, and what deliberative alignment brings to the safety side. The one capability to walk away with: given a multimodal reasoning system, decompose it into its four-layer architecture stack (perception, reasoning, tools, alignment) and diagnose failures by pinning them to the responsible layer.
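As a rough illustration of what "pinning a failure to the responsible layer" could look like in practice, here is a minimal Python sketch. The four layer names come from the stack above; the symptom tags, the SYMPTOM_TO_LAYER table, and the diagnose function are hypothetical names invented for this example, and the mapping is an assumption about typical cases, not a taxonomy taken from the lecture.

```python
from enum import Enum
from typing import Optional


class Layer(Enum):
    """The four layers of the stack named in this lesson."""
    PERCEPTION = "perception"
    REASONING = "reasoning"
    TOOLS = "tools"
    ALIGNMENT = "alignment"


# Hypothetical symptom tags (illustrative assumptions, not from the lecture),
# each mapped to the layer most plausibly responsible for it.
SYMPTOM_TO_LAYER = {
    "misread_chart_value": Layer.PERCEPTION,
    "invalid_inference_step": Layer.REASONING,
    "wrong_tool_arguments": Layer.TOOLS,
    "followed_instruction_embedded_in_image": Layer.ALIGNMENT,
}


def diagnose(symptom: str) -> Optional[Layer]:
    """Return the layer a symptom is pinned to, or None if it is unrecognised."""
    return SYMPTOM_TO_LAYER.get(symptom)


if __name__ == "__main__":
    for tag in SYMPTOM_TO_LAYER:
        print(tag, "->", diagnose(tag).value)
```

A lookup table is the simplest possible design; real failures often involve more than one layer, so treat a single-layer answer as a starting hypothesis to test rather than a verdict.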

The lesson maps to Hongyu Ren’s CS25 V5 guest lecture. At drafting time the recording was not publicly available; the lesson covers the three topics the lecture title names, grounded in the publicly available OpenAI deliberative-alignment paper and the documented o-series literature.

Where this fits

Phase 2 has covered the two perceptual architectures (L2 encode-then-fuse, L3 native multimodal) and now turns to the reasoning capability built on top of either. The four-layer stack this lesson assembles (perception, reasoning, tool use, alignment) is a working anatomy for modern frontier multimodal systems; reading announcements through this lens is the literacy you carry forward. Phase 3 opens with the generative direction: how transformer-based architectures produce images and video as output.

Before you start

Prerequisite: Lesson 3, Native multimodal intelligence (which itself requires L2). You need both perceptual architectures in hand so the “reasoning is added on top of perception” framing lands. Familiarity with reasoning models more generally (the o-series and successors) from product use or other tracks is helpful but not strictly required; the lesson establishes what distinguishes a reasoning model.

By the end, you’ll be able to

Explain what structurally distinguishes a reasoning model
Describe how multimodal reasoning extends chain-of-thought to images
Name the common categories of tools and what each extends
Explain deliberative alignment and the multimodal attack surfaces it addresses
Decompose a multimodal reasoning system into the four-layer stack and diagnose failures by layer

Time and difficulty

Read time: about 13 minutes
Practice time: about 15 minutes (a which-layer-failed diagnostic exercise, an architecture-stack identification, and flashcards)
Difficulty: standard