References: What multimodal AI actually is
Source material
Source material (the track's primary curriculum):
- Stanford CS25: Transformers United
  - V4 (Spring 2024): https://web.stanford.edu/class/cs25/past/cs25-v4/
  - V5 (Spring 2025): https://web.stanford.edu/class/cs25/past/cs25-v5/
  - V6 (Spring 2026): https://web.stanford.edu/class/cs25/
  - Instructors: Steven Feng, Karan Singh, Michael C. Frank, Christopher Manning, with rotating guest speakers each edition
  - License: as published on Stanford's public CS25 YouTube channel (link-out only)
This lesson is the Clawdemy-authored orientation opener of T24. It draws on the CS25 series as a whole rather than mirroring a single lecture; each subsequent lesson in the track maps to a specific CS25 guest lecture and cites it as that lesson's primary source.
Clawdemy provides original notes, summaries, and quizzes derived from this material for educational purposes. All rights to the original lectures remain with Stanford and the speakers.
What this lesson draws from
This opener is a synthesis of the multimodal-AI threads that run across the three CS25 editions the track covers (V4, V5, V6). No single CS25 lecture corresponds to it; the framing (modalities, fusion challenge, encode-then-fuse vs tokenize-everything, the three operating modes) is Clawdemy’s own, set up so the technical lessons that follow have clean scaffolding to attach to.
Going deeper
- Stanford CS25: Transformers United. The current edition and links to past editions. Each subsequent lesson in this track cites the specific CS25 lecture it draws from; the series as a whole is the natural home for readers who want the full guest-lecture context.
- Stanford CS25 YouTube playlist. The official Stanford Online playlist of CS25 lectures. Each lesson's references in this track cite individual videos for further study.
Adjacent topics
- From language models to large multimodal models (the next lesson). The encode-then-fuse path in depth, walking through how CogVLM extended an existing LLM to handle images.
- Vision Transformers (ViT) and CLIP. The foundational architectures that made the encode-then-fuse pattern work. Not covered by CS25 V4-V6 directly (they predate the editions in scope), but worth knowing as the technical ancestors of the systems lesson 2 covers. Outside this track’s structural-mirror scope.
- Diffusion models. The generative half of modern multimodal AI. This track covers them through their transformer integration (lesson 5); deeper diffusion-specific tracks live elsewhere in Clawdemy.
Community discussion
None selected for this opener. The CS25 series itself is the canonical public discussion of these topics; later lessons may add specific external resources if they materially deepen a particular technical point.