
What multimodal AI actually is

This is lesson 1 of Track 24, the opener of Phase 1 (Orientation). By the end you will be able to look at any AI system and place it on the multimodal map: name the modalities it handles, decide whether it is genuinely multimodal or a multi-model pipeline, identify which of the three operating modes it occupies, and recognize whether it follows the encode-then-fuse path or the natively-multimodal one. That single map is what the rest of the track fills in lecture by lecture.

The track mirrors the multimodal-AI threads running through three editions of Stanford CS25 “Transformers United” (V4, V5, V6), a multi-instructor guest-lecture series, curated here into 10 Clawdemy lessons. Full attribution and the per-edition links are in this lesson’s references.

This is the orientation opener of a Stage D advanced standalone track. Lessons 2 through 9 each map to a specific CS25 guest lecture, walking the multimodal frontier in technical depth (large multimodal models, native multimodal architectures, multimodal reasoning, image and video generation, JEPA, world models, multimodal agents). Lesson 10 closes with a Clawdemy-authored synthesis of cross-cutting themes. This opener exists to set scope and vocabulary, so the technical lessons that follow have clean scaffolding to attach to.

There is no specific Clawdemy lesson prerequisite, but this is a Stage D advanced track and it assumes prior comfort with transformer fundamentals: attention, tokenization, and the broad shape of how LLMs work. If you have not seen those before, Tracks 11 (Intro to Deep Learning), 13 (Build Neural Networks from Scratch), and 20 (AI Agents and Tool Use) are the natural lead-ins; any equivalent background is fine.

  • Define a modality and name the main ones
  • Distinguish multimodal systems from multi-model pipelines
  • Explain the fusion challenge and the two dominant strategies
  • Name the three operating modes and place real systems on the map
  • Avoid the common confusions (multimodal vs multi-task, multimodal vs multi-model, and “the model sees”)

  • Read time: about 12 minutes
  • Practice time: about 15 minutes (a multimodal-or-multi-model identification, an operating-mode classification, and flashcards)
  • Difficulty: standard