References: What multimodal AI actually is

Source material

The track's primary curriculum:
• Stanford CS25: Transformers United
  V4 (Spring 2024): https://web.stanford.edu/class/cs25/past/cs25-v4/
  V5 (Spring 2025): https://web.stanford.edu/class/cs25/past/cs25-v5/
  V6 (Spring 2026): https://web.stanford.edu/class/cs25/
  Instructors: Steven Feng, Karan Singh, Michael C. Frank, Christopher Manning,
               with rotating guest speakers each edition
  License: as published on Stanford's public CS25 YouTube channel (link-out only)

This lesson is the Clawdemy-authored orientation opener of T24. It draws on the
CS25 series as a whole rather than mirroring a single lecture; each subsequent
lesson in the track maps to a specific CS25 guest lecture and cites it as that
lesson's primary source.

Clawdemy provides original notes, summaries, and quizzes derived from this
material for educational purposes. All rights to the original lectures remain
with Stanford and the speakers.

What this lesson draws from

This opener is a synthesis of the multimodal-AI threads that run across the three CS25 editions the track covers (V4, V5, V6). No single CS25 lecture corresponds to it; the framing (modalities, fusion challenge, encode-then-fuse vs tokenize-everything, the three operating modes) is Clawdemy’s own, set up so the technical lessons that follow have clean scaffolding to attach to.

Going deeper

• Stanford CS25: Transformers United. The current edition and links to past editions. Each subsequent lesson in this track cites the specific CS25 lecture it draws from; the series as a whole is the natural home for readers who want the full guest-lecture context.
• Stanford CS25 YouTube playlist. The official Stanford Online playlist of CS25 lectures. Each lesson’s references cite individual videos from it for further study.

Adjacent topics

• From language models to large multimodal models (the next lesson). The encode-then-fuse path in depth, walking through how CogVLM extended an existing LLM to handle images.
• Vision Transformers (ViT) and CLIP. The foundational architectures that made the encode-then-fuse pattern work. CS25 V4-V6 does not cover them directly (they predate the editions in scope), but they are worth knowing as the technical ancestors of the systems lesson 2 covers. Because this track mirrors the structure of the CS25 lectures, they fall outside its scope.
• Diffusion models. The generative half of modern multimodal AI. This track covers them through their transformer integration (lesson 5); deeper diffusion-specific tracks live elsewhere in Clawdemy.

Community discussion

None selected for this opener. The CS25 series itself is the canonical public discussion of these topics; later lessons may add specific external resources if they materially deepen a particular technical point.