Native multimodal intelligence, in brief

What you’ll learn

This is lesson 3 of Track 24 and the second lesson of Phase 2 (Building large multimodal models). By the end you will be able to contrast bolt-on multimodal extensions (the encode-then-fuse family from L2) with natively multimodal architectures, and explain what joint co-evolution across modalities actually buys. The one capability to walk away with: given any multimodal system, decide whether it is encode-then-fuse or native, and predict the consequences for cross-modal grounding, generation, and latency.

This lesson maps to Victoria Lin’s CS25 V6 guest lecture (May 21, 2026). At drafting time the recording had not yet been published, so the lesson draws on the publicly documented native multimodal literature (Chameleon, GPT-4o, Gemini) and will cite the lecture specifically once the recording is posted.

Where this fits

This lesson is the natural counterpart to L2’s encode-then-fuse walkthrough. Together they cover the two dominant ways to build a multimodal model: separate-then-bridge (L2, the practical default) and joint-from-scratch (L3, the frontier direction). Phase 2 closes with lesson 4 on reasoning over multimodal inputs, which builds on top of whichever architectural family the underlying model belongs to. Phase 3 then turns to the generative side (image and video generation with transformers).

Before you start

Prerequisite: Lesson 2, From language models to large multimodal models. You need the encode-then-fuse family in hand so that the contrast with native multimodal architectures lands. Familiarity with transformer fundamentals (attention, token streams) from prior tracks (T11, T13, T20) helps with the tokenizer discussion, but no mathematics beyond the conceptual level is required.

By the end, you’ll be able to

State the architectural difference between native and encode-then-fuse in one sentence
Explain how text, image, audio, and video become tokens for a unified transformer
Identify the three capabilities that native multimodal enables
Name the four main costs of native multimodal and explain why encode-then-fuse still wins for many systems
Recognize Chameleon, GPT-4o, and Gemini as native multimodal examples

Time and difficulty

Read time: about 13 minutes
Practice time: about 15 minutes (a native-vs-encode-then-fuse classification, a tokenizer-bottleneck judgment, and flashcards)
Difficulty: standard