
Summary: What multimodal AI actually is

A modality is the form of information (text, image, audio, video); multimodal AI is the family of systems that fuse multiple modalities inside one model, sharing internal representations so the model can reason across them jointly. Fusion is the central technical challenge, and the rest of the track walks through how the field has approached it. This summary is the quick-scan version of the opening lecture of T24.

  • Modality = form of information. Text, image, audio, video, structured signals. Historically each had its own model family, because the data shape and statistics were so different.
  • Multimodal AI fuses modalities inside one model. A pipeline of separate single-modal models passing text between them is multi-model, not multimodal; any detail the intermediate text (for example, a caption) does not carry is lost to the downstream model.
  • Fusion is the core technical challenge. Modalities differ in dimensionality, statistics, and length; getting them into a shared representational space is the work.
  • Two dominant strategies. Encode-then-fuse: vision encoder + language model + adapter or cross-attention (most vision-language models). Tokenize-everything: discrete tokens for every modality, single transformer trained on the mixed stream from the start (natively multimodal, the frontier).
  • Three operating modes. Multimodal input / single output (the common pattern: image plus text in, text out). Single input / multimodal output (generative: text in, image or video out). Multimodal in and out (the frontier).
  • A model does not “see.” It processes image tokens aligned with text tokens by training; the visual experience is yours, the representational alignment is the model’s.
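
The two fusion strategies in the list above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not how any particular model is implemented: the sizes (TEXT_DIM, TEXT_VOCAB), the adapter matrix, the feature values, and the function names are all hypothetical, and a real system would use learned weights and a trained vision encoder or image tokenizer.

```python
# Toy sketch of the two fusion strategies. All sizes, names, and values are
# illustrative assumptions, not taken from any real model.

TEXT_DIM = 4        # hypothetical language-model embedding width
TEXT_VOCAB = 1000   # hypothetical text vocabulary size


def project(features, weights):
    """Adapter: map each image feature vector into the text embedding width."""
    return [
        [sum(f * w for f, w in zip(vec, col)) for col in zip(*weights)]
        for vec in features
    ]


def encode_then_fuse(image_features, adapter_weights, text_embeddings):
    """Project the encoder output, then prepend it to the text embeddings."""
    return project(image_features, adapter_weights) + text_embeddings


def tokenize_everything(image_codes, text_ids):
    """Shift image codebook ids past the text vocabulary so both share one id space."""
    return [TEXT_VOCAB + code for code in image_codes] + text_ids


# Two image patches with 3 features each (stand-in for a vision encoder's output).
image_features = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
# 3x4 adapter matrix: image feature width -> text embedding width.
adapter_weights = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
# Three text tokens, already embedded at TEXT_DIM.
text_embeddings = [[0.5] * TEXT_DIM for _ in range(3)]

# Encode-then-fuse: one sequence of 5 vectors, all of width TEXT_DIM.
fused = encode_then_fuse(image_features, adapter_weights, text_embeddings)

# Tokenize-everything: one sequence of 5 integer ids in a shared vocabulary.
stream = tokenize_everything(image_codes=[7, 42], text_ids=[15, 3, 998])
```

In the first case the language model receives continuous vectors that an adapter has moved into its embedding space; in the second it receives a single stream of discrete ids and never needs a separate projection step. Either way, the downstream transformer sees one sequence, which is the sense in which the modalities end up in a shared representational space.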

You can now read multimodal product announcements and research papers with the right structural questions in hand: is this a real multimodal model or a multi-model pipeline? Does it fuse via encode-then-fuse or tokenize-everything (natively multimodal)? Which operating mode is it: multimodal in, multimodal out, or both? Those three questions cover most of what differs between systems the press lumps together under “multimodal AI.” The rest of the track unpacks the answers lecture by lecture: how LLMs got eyes via the encode-then-fuse path (Phase 2), how generative multimodal output works (Phase 3), and where the frontier is heading toward natively multimodal everything (Phase 4). When you are done, you will be able to look at a system and place it on this map without needing the marketing copy.