Lesson: What multimodal AI actually is

A moment ago you watched a short clip. You saw a face, you heard a voice, you read the caption underneath, and your brain did all of it as one thing. You did not switch between separate “perception modules”; the image, the sound, and the words landed as a single understanding. For most of artificial intelligence’s history, a model could only handle one of those at a time. A language model read text. A convolutional network looked at images. A speech model listened to audio. Each lived inside its own architecture, was trained on its own data, and represented its world in its own incompatible way.

Multimodal AI is the family of systems built to break that wall. This track is about how the wall broke, what now stands in its place, and where it is going. This first lesson sets the scope, defines the words, and lays out the operating modes the rest of the track will unpack lecture by lecture.

In AI, a modality is the form of information being processed: not the topic, the medium. Five matter most for our purposes:

  • Text: sequences of discrete tokens.
  • Images: grids of pixels, usually treated as a 2D field of patches.
  • Audio: continuous waveforms, often represented as spectrograms over time.
  • Video: sequences of images, optionally with audio, indexed by time.
  • Structured signals: tabular data, sensor readings, robotic state, and biological measurements, which round out the modern picture.

Historically each modality had its own model family. Text grew up with recurrent networks, then BERT and GPT. Images were the province of convolutional networks. Audio commonly used CTC-trained models. The boundaries were real because the data was so different: a sentence is typically 10 to 50 tokens; a single 224x224 image has 50,176 pixels; ten seconds of audio is 160,000 samples at 16 kHz and 441,000 at 44.1 kHz. The architectures and the math had to fit the shape of the input, which meant they could not be reused easily across modalities.
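The size gap is easy to check with a few lines of arithmetic. The sketch below is only illustrative: the 16x16 patch size and the two sample rates are assumptions chosen as common examples, not figures taken from any particular model.

```python
# Rough input sizes for the examples in the paragraph above.
# The patch size and the sample rates are illustrative assumptions.

def image_pixels(height: int, width: int) -> int:
    """Number of pixel positions in a height x width image."""
    return height * width

def image_patches(height: int, width: int, patch: int) -> int:
    """Number of non-overlapping patch x patch tiles (assumes exact division)."""
    return (height // patch) * (width // patch)

def audio_samples(seconds: float, sample_rate_hz: int) -> int:
    """Number of waveform samples in a mono clip."""
    return int(seconds * sample_rate_hz)

sizes = {
    "pixels in a 224x224 image": image_pixels(224, 224),
    "16x16 patches in that image": image_patches(224, 224, 16),
    "samples in 10 s of 16 kHz audio": audio_samples(10, 16_000),
    "samples in 10 s of 44.1 kHz audio": audio_samples(10, 44_100),
}

for name, value in sizes.items():
    print(f"{name}: {value:,}")
```

Grouping pixels into patches is what brings an image down from tens of thousands of positions to a sequence length closer to that of a paragraph of text.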

A multimodal AI system either takes inputs from more than one modality, produces outputs in more than one modality, or both. That sounds obvious until you ask what it means for the model to “understand” two modalities at once. A pipeline that runs a vision model and then pipes its caption into a language model is multi-model, not multimodal: the two models pass text strings between them and lose any structure the image had below the level of the caption.

Multimodal in the strong sense means the modalities meet inside the same model, sharing representations early enough to support joint reasoning. The image of a dog wagging its tail and the sentence “the dog is wagging its tail” land in compatible internal states, so the model can ground a phrase in the right region of an image, or answer questions whose answers depend on details that no caption would have preserved.

Getting modalities to share representations turns out to be the core technical challenge of the field, and it is what the rest of the track is about.

Modalities differ in dimensionality, statistics, and length. Putting them in the same representational space is not free. Two broad strategies have dominated, and the track follows both:

  • Encode separately, then fuse. Train (or borrow) a vision encoder that turns an image into a sequence of vectors. Train (or borrow) a language model that handles text as vectors. Connect the two with a small adapter or with cross-attention so the language model can attend to the image’s vectors. Most “vision-language models” built on top of an existing LLM use this pattern; the major closed systems you have probably interacted with, like GPT-4V, are widely understood to work this way. (Some frontier models, like Gemini, are instead natively multimodal, a distinction lesson 3 takes up.) Phase 2 of the track follows this path through CogVLM (lesson 2) and the rapidly evolving alternatives that built on it.

  • Tokenize everything, train jointly. Turn images, audio, and video into sequences of tokens (via patch embeddings, learned codecs, or similar tricks), then feed all of those tokens into a single transformer that was trained from the beginning on the mixed stream. This is the natively multimodal direction, and it is where the frontier is moving. Lesson 3 unpacks what “native” buys over the bolt-on approach.
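The first strategy, encode separately and then fuse, can be sketched in a few lines. This is a minimal toy under stated assumptions, not how any named system is implemented: the dimensions, the random "features," and the helper names (`random_matrix`, `project`) are all hypothetical, and a single linear adapter followed by concatenation is only one simple variant (cross-attention is the other option mentioned above).

```python
import random

random.seed(0)

VISION_DIM = 4   # width of the (hypothetical) vision encoder's output vectors
LLM_DIM = 6      # width of the (hypothetical) language model's embeddings

def random_matrix(rows, cols):
    """Placeholder for learned weights or features: uniform random values."""
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def project(vectors, weight):
    """Apply a linear map to each vector: len(weight) inputs -> len(weight[0]) outputs."""
    return [
        [sum(v[i] * weight[i][j] for i in range(len(v))) for j in range(len(weight[0]))]
        for v in vectors
    ]

# Stand-ins for the outputs of two separately trained components.
patch_features = random_matrix(9, VISION_DIM)   # 9 image patches from a vision encoder
text_embeddings = random_matrix(5, LLM_DIM)     # 5 text tokens embedded by the LLM

# The adapter is the only new piece: it maps vision space into LLM space.
adapter_weight = random_matrix(VISION_DIM, LLM_DIM)
image_as_llm_inputs = project(patch_features, adapter_weight)

# Fusion by concatenation: the LLM receives image and text positions in one sequence.
fused_sequence = image_as_llm_inputs + text_embeddings
print(len(fused_sequence), len(fused_sequence[0]))
```

The design point is that both large components can stay as they are; only the small adapter has to learn how to translate between the two vector spaces.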

The difference between the two is not just engineering. A natively multimodal model can, in principle, reason across modalities with the same kind of unified representation a single-modal transformer has for text. A bolt-on multimodal model can answer image questions, but the image reaches the language model only through the adapter or cross-attention interface, so cross-modal reasoning often shows the seams.
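The second strategy, tokenize everything, can be sketched just as briefly. Everything here is a toy assumption: the vocabulary sizes are made up, and `quantize` buckets mean brightness as a crude stand-in for a learned codec. The point is only the shape of the result, which is one stream of ids drawn from one shared vocabulary.

```python
TEXT_VOCAB_SIZE = 1000     # hypothetical text vocabulary: ids 0..999
IMAGE_CODEBOOK_SIZE = 4    # hypothetical image codebook: 4 brightness levels

def patchify(image, patch):
    """Split a 2D grid into non-overlapping patch x patch tiles, in row-major order."""
    tiles = []
    for top in range(0, len(image), patch):
        for left in range(0, len(image[0]), patch):
            tiles.append([row[left:left + patch] for row in image[top:top + patch]])
    return tiles

def quantize(tile):
    """Toy stand-in for a learned codec: bucket the tile's mean brightness (0..255)."""
    values = [v for row in tile for v in row]
    mean = sum(values) / len(values)
    return min(int(mean * IMAGE_CODEBOOK_SIZE / 256), IMAGE_CODEBOOK_SIZE - 1)

def image_to_tokens(image, patch):
    """Image token ids are placed after the text ids in one shared vocabulary."""
    return [TEXT_VOCAB_SIZE + quantize(tile) for tile in patchify(image, patch)]

image = [
    [0,   0,   255, 255],
    [0,   0,   255, 255],
    [100, 100, 200, 200],
    [100, 100, 200, 200],
]
text_tokens = [17, 42, 7]  # made-up ids for a short prompt

# One mixed stream: a single transformer would be trained on sequences like this.
stream = text_tokens + image_to_tokens(image, patch=2)
print(stream)  # [17, 42, 7, 1000, 1003, 1001, 1003]
```

Because text and image ids share one sequence, nothing downstream needs to know which positions came from which modality, which is what distinguishes this from the bolt-on pattern.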

It is useful to distinguish three patterns of multimodal model up front, because the rest of the track touches all three:

  • Multimodal input, single-modal output. Image plus text in, text out. This is the most common pattern in products today: ask a model to describe what is in a screenshot, transcribe a slide, or reason about a chart. Lessons 2, 3, and 4 are squarely in this regime.
  • Single-modal input, multimodal output. Text in, image or video out. This is the generative direction: Stable Diffusion, Sora, modern image and video generation systems. Lessons 5 and 6 cover the transformer architectures that power this.
  • Multimodal input and multimodal output. Any modality in, any modality out. The natively multimodal frontier. Lessons 3, 7, 8, and 10 all touch this direction in different ways.

Most systems are not yet fully multimodal in both directions; the path of the field is from the first pattern (vision plus language in, text out) toward the third (genuinely multimodal in both directions), with the second pattern (generative) developing in parallel.

Why fusion matters more than parallel models

A useful contrast keeps the central point sharp. Suppose you build a system by calling an image-captioning API, then feeding its caption into an LLM. That pipeline is two models in a row; the LLM only ever sees the caption, not the image. If the caption misses something (the dog is brown, not black; the cup is half-empty), the LLM has no recourse. Worse, fine-grained referring questions (“which side is the handle on?”) often cannot be answered from a caption at all.
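The information loss in that pipeline can be made concrete with a toy. Both functions below are hypothetical stand-ins, not real models or APIs: `toy_captioner` plays the captioning service and `toy_text_only_model` plays an LLM that receives nothing but the caption string.

```python
# A hypothetical scene, standing in for what the pixels actually contain.
scene = {
    "object": "cup",
    "color": "white",
    "fill": "half-empty",
    "handle_side": "left",
}

def toy_captioner(scene):
    """Stand-in for a captioning model: keeps the gist, drops the details."""
    return f"a {scene['color']} {scene['object']} on a table"

def toy_text_only_model(caption, question):
    """Stand-in for an LLM that sees only the caption string, never the scene."""
    if "handle" in question and "handle" not in caption:
        return "cannot tell from the caption"
    return f"based on the caption: {caption}"

caption = toy_captioner(scene)
answer = toy_text_only_model(caption, "which side is the handle on?")
print(caption)
print(answer)
```

The handle's position exists in `scene`, but it never crosses the string boundary between the two stages, so the second stage has nothing to recover it from.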

A multimodal model, with the image’s representation living inside it alongside the text’s, can attend to the relevant region of the image directly. That is the qualitative gap: shared representation enables grounded reasoning that a pipeline passing only captions between single-modal models cannot reach. It is the reason the field bothered to merge the modalities in the first place.
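"Attending to the relevant region" has a simple mechanical reading, sketched below with scaled dot-product attention. The region vectors and the query are hand-picked, hypothetical values meant to mimic representations that training has already aligned; a real model learns them.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys):
    """Scaled dot-product attention weights of one query over a list of keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)

# Hypothetical aligned vectors: one per image region, plus one text query.
regions = ["background", "cup body", "handle"]
region_vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.2],
    [0.0, 0.2, 1.0],
]
handle_query = [0.0, 0.1, 2.0]  # stand-in for the embedding of the word "handle"

weights = attend(handle_query, region_vectors)
best = regions[weights.index(max(weights))]
print(best, [round(w, 3) for w in weights])
```

Because the text query and the image regions live in the same vector space, the question about the handle can put most of its weight on the handle's region, which is exactly the step a caption-only pipeline has no way to perform.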

Almost every frontier system shipping in 2026 is multimodal. When you ask Claude or GPT to analyze a screenshot and write code that recreates it, when you upload a photo to identify a plant, when you describe a scene and a model generates an image of it, you are using the technology this track describes. The shift from “AI handles text well” to “AI can see, hear, and write across modalities” is the central capability change of the past three years. Understanding how that shift happened, and what its limits still are, is the literacy this track is built to give you.

  • Multimodal versus multi-task. A multi-task model does several text tasks (translation, classification, summarization). A multimodal model handles several modalities. They are different axes; a model can be either, neither, or both.
  • Multimodal versus multi-model. A pipeline that runs vision and language models in sequence is multi-model, not multimodal. Multimodal means joint internal representations, not a chain of calls.
  • The model does not “see.” It processes image tokens (small vector representations of image patches) that, through training, have been aligned with text tokens enough to support joint reasoning. The visual experience is yours; the representational alignment is the model’s.
  • “Vision-language model,” “large multimodal model,” “native multimodal model” mostly refer to overlapping things from different generations. The track will be precise where the distinctions matter (especially in lessons 2 and 3).

In summary:

  • A modality is the form of information (text, image, audio, video, structured signals), not its topic.
  • Multimodal AI fuses modalities inside one model, sharing internal representations early enough to support joint reasoning. A pipeline of single-modal models is multi-model, not multimodal.
  • Fusion is the central technical challenge. Two strategies dominate: encode-then-fuse (build on an existing LLM with a vision encoder) and tokenize-everything (native multimodal from the start).
  • Three operating modes: multimodal-in / single-out (vision-language models), single-in / multimodal-out (generative), and multimodal-both (the frontier).

The rest of the track unpacks these ideas in depth. The next lesson walks the first and most common path: taking an existing large language model and extending it to “see” by attaching a vision encoder. That is how the first wave of large multimodal models came to be, including the family from which the consumer products you have used descend.