
References: Native multimodal intelligence

Source material:
• Stanford CS25 V6 (May 21, 2026):
"From Language Models to Native Multimodal Intelligence"
Speaker: Victoria X. Lin (Thinking Machines Lab)
Course page: https://web.stanford.edu/class/cs25/
YouTube: recording pending publication
License (when posted): as published on Stanford's public CS25 YouTube channel
(link-out only)
PENDING-RECORDING NOTE (transparent attribution): at the time of this lesson's
drafting (2026-05-25), Victoria Lin's V6 L8 lecture had been delivered only
four days earlier and the recording was not yet posted on the Stanford CS25
YouTube channel. Stanford publishes recordings approximately two weeks after
each talk. This lesson is structured around the topic of native multimodal
intelligence (the subject of Lin's lecture per its published title and abstract)
and the publicly documented native-multimodal systems (Chameleon, GPT-4o,
Gemini); specific claims attributed to Lin's lecture are deliberately avoided
until the recording is publicly available. The Lead wires the verified YouTube
URL into source_material.primary_url at promotion when the recording posts.
Clawdemy provides original notes, summaries, and quizzes derived from publicly
available material for educational purposes. All rights to Lin's lecture remain
with Stanford and the speaker.

This lesson is the structural-mirror counterpart to Lin’s V6 L8 lecture on native multimodal intelligence. The technical content (unified token stream, per-modality tokenizers, joint co-evolution, the encode-then-fuse vs native contrast, named systems like Chameleon / GPT-4o / Gemini, and the cost analysis) draws on the published native-multimodal literature rather than on Lin’s lecture itself, since the recording was pending at draft time. When the recording is published, the Lead will reconcile any claims that should be attributed specifically to Lin’s framing.

The architectural contrast (L2 encode-then-fuse vs L3 native) and the tokenizer-as-ceiling framing of “what limits visual quality” in practice are Clawdemy’s own connective tissue.

  • Reasoning over multimodal inputs (the next lesson). Built on top of either encode-then-fuse or native architectures, modern reasoning models use images and diagrams within chain-of-thought, deliberative alignment, and tool use.
  • Image and audio tokenizers. A research area in its own right (VQ-VAE, RQ-VAE, EnCodec, and successors); the tokenizer’s reconstruction quality sets the ceiling on any native multimodal model’s per-modality quality, because the model can only emit what the tokenizer can decode.
  • Mixture-of-experts in multimodal models. A scaling pattern several native multimodal systems have adopted; orthogonal to the encode-then-fuse vs native distinction but often combined with it.
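
The tokenizer-as-ceiling point in the list above can be illustrated with a toy vector quantizer. This is a minimal sketch under stated assumptions, not a VQ-VAE: the codebook is random rather than learned, there is no encoder or decoder network, and the helper names (quantize, reconstruct, mse) are hypothetical, introduced here for illustration only. It shows that once a patch is replaced by a token id, the best any downstream model can recover is the codebook entry, so the quantization error bounds the achievable quality.

```python
import random

def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry (its token id)."""
    tokens = []
    for v in vectors:
        dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens

def reconstruct(tokens, codebook):
    """Decode token ids back to vectors by codebook lookup."""
    return [codebook[t] for t in tokens]

def mse(xs, ys):
    """Mean squared error between two equally shaped lists of vectors."""
    total = sum((a - b) ** 2 for x, y in zip(xs, ys) for a, b in zip(x, y))
    return total / (len(xs) * len(xs[0]))

rng = random.Random(0)
# Assumption: 200 random 4-dimensional "patches" stand in for image or audio frames.
patches = [[rng.random() for _ in range(4)] for _ in range(200)]

# Assumption: a random 64-entry codebook; the small codebook is its first 4 entries,
# so every small-codebook entry is also available in the large one.
codebook_large = [[rng.random() for _ in range(4)] for _ in range(64)]
codebook_small = codebook_large[:4]

tokens_small = quantize(patches, codebook_small)
tokens_large = quantize(patches, codebook_large)

err_small = mse(patches, reconstruct(tokens_small, codebook_small))
err_large = mse(patches, reconstruct(tokens_large, codebook_large))

print(f"reconstruction MSE with  4 codes: {err_small:.4f}")
print(f"reconstruction MSE with 64 codes: {err_large:.4f}")
```

Because the small codebook is a subset of the large one, the larger codebook can never reconstruct worse here. A model that predicts these token ids perfectly still inherits whatever error the codebook lookup leaves behind, which is the sense in which the tokenizer caps per-modality quality.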

None selected for this lesson at draft time. The Distill / arXiv / official-announcement set above is the strongest public reading; if Lin’s recording or a canonical thread surfaces post-publication, it will be added at the next review.