Native multimodal intelligence
What you’ll learn
This is lesson 3 of Track 24, the second lesson of Phase 2 (Building large multimodal models). By the end you will be able to contrast bolt-on multimodal extensions (the encode-then-fuse family from L2) with natively multimodal architectures, and explain what joint co-evolution across modalities actually buys. The one capability to walk away with: given any multimodal system, decide whether it is encode-then-fuse or native, and predict the consequences for cross-modal grounding, generation, and latency.
This lesson maps to Victoria Lin’s CS25 V6 guest lecture (May 21, 2026). At drafting time the recording was pending publication; the lesson covers native multimodal intelligence using the publicly documented literature on native multimodal models (Chameleon, GPT-4o, Gemini) until Lin’s recording is posted and can be cited specifically.
Where this fits
This lesson is the natural counterpart to L2’s encode-then-fuse walkthrough. Together they cover the two dominant ways to build a multimodal model: separate-then-bridge (L2, the practical default) and joint-from-scratch (L3, the frontier direction). Phase 2 closes with lesson 4 on reasoning over multimodal inputs, which builds on top of whichever architectural family the underlying model belongs to. Phase 3 then turns to the generative side (image and video generation with transformers).
Before you start
Prerequisite: Lesson 2, From language models to large multimodal models. You need the encode-then-fuse family in hand so the contrast with native multimodal lands. Familiarity with transformer fundamentals (attention, token streams) from prior tracks (T11, T13, T20) helps for the tokenizer discussion, but no mathematics beyond the conceptual level is required.
By the end, you’ll be able to
- State the architectural difference between native and encode-then-fuse in one sentence
- Explain how text, image, audio, and video become tokens for a unified transformer
- Identify the three capabilities native multimodal enables
- Name the four main costs of native multimodal and explain why encode-then-fuse still wins for many systems
- Recognize Chameleon, GPT-4o, and Gemini as native multimodal examples
Time and difficulty
- Read time: about 13 minutes
- Practice time: about 15 minutes (a native-vs-encode-then-fuse classification, a tokenizer-bottleneck judgment, and flashcards)
- Difficulty: standard