Practice: What multimodal AI actually is

Self-check

Seven short questions. Try to answer each one before opening the collapsible.

1. In one sentence, what is a “modality” in AI?

Show answer

The form of information being processed (text, image, audio, video, structured signals), not its topic. Each modality has historically had its own model family because the data shape and statistics are so different.

2. What is the difference between a multimodal model and a multi-model pipeline?

Show answer

A multimodal model fuses modalities inside one model, sharing internal representations. A multi-model pipeline runs separate single-modal models in sequence, passing text or other intermediate outputs between them. Only the multimodal model has joint representations available for grounded cross-modal reasoning.

3. What is the central technical challenge of multimodal AI?

Show answer

Fusion: putting modalities with different dimensionalities, statistics, and lengths into a shared representational space so the model can reason across them jointly. A sentence is typically 10 to 50 tokens; even a small image is tens of thousands of pixels; the model has to land both in compatible internal states.
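To make the size mismatch concrete, here is a back-of-the-envelope sketch. The 224 x 224 resolution, the 16-pixel patch size, and the three color channels are illustrative assumptions (they are common in vision transformers), not figures from this lesson, and `image_patch_stats` is a hypothetical helper.

```python
def image_patch_stats(height, width, patch, channels=3):
    """Return (pixel_count, patch_count, values_per_patch) for a grid of square patches."""
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    pixel_count = height * width
    patch_count = (height // patch) * (width // patch)
    values_per_patch = patch * patch * channels
    return pixel_count, patch_count, values_per_patch


# Assumed example image: 224 x 224 RGB, cut into 16 x 16 patches.
pixels, patches, values = image_patch_stats(224, 224, 16)
print(pixels)   # 50176 pixels
print(patches)  # 196 patches, i.e. 196 image tokens after encoding
print(values)   # 768 raw numbers inside each patch
```

Under these assumptions one small image becomes roughly 200 image tokens, each summarizing hundreds of raw values, while the accompanying sentence is a few dozen text tokens. Fusion has to make those two sequences usable together.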

4. Name the two dominant strategies for building multimodal models.

Show answer

Encode-then-fuse: train (or borrow) a vision encoder and a language model, then connect them with an adapter or cross-attention. Tokenize-everything: turn images, audio, and video into discrete tokens and feed all of them into a single transformer trained on the mixed stream from the start (natively multimodal).
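A minimal sketch of the encode-then-fuse idea, in plain Python with toy numbers. The function names (`project`, `fuse`), the dimensions, and the adapter weights are all hypothetical; a real adapter is a learned layer, and placing image tokens before the text is only one common arrangement.

```python
def project(features, weights):
    """Linear adapter: map each vision feature vector to the language model's embedding size."""
    return [
        [sum(x * w for x, w in zip(vec, col)) for col in zip(*weights)]
        for vec in features
    ]


def fuse(image_features, text_embeddings, adapter_weights):
    """Project the image features, then place them in front of the text sequence."""
    image_tokens = project(image_features, adapter_weights)
    return image_tokens + text_embeddings


# Two image patch features of size 4 (stand-in for a vision encoder's output).
image_features = [[1.0, 0.0, 2.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
# Three text token embeddings of size 3 (stand-in for the language model's input).
text_embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]
# Hypothetical 4 x 3 adapter matrix: rows are vision dims, columns are language dims.
adapter_weights = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]

sequence = fuse(image_features, text_embeddings, adapter_weights)
print(len(sequence), len(sequence[0]))  # 5 3
```

After projection, image tokens and text tokens have the same width, so a single transformer can attend over both. In the tokenize-everything strategy there is no separate adapter step: every modality is turned into tokens for one model trained on the mixed stream from the start.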

5. Name the three operating modes of multimodal models.

Show answer

Multimodal input / single-modal output (most common today: GPT-4V style; image plus text in, text out). Single-modal input / multimodal output (generative: text in, image or video out). Multimodal input and output (the frontier: any modality in, any out).

6. Why does a multi-model pipeline lose information compared to a true multimodal model?

Show answer

Because it passes only the intermediate output (usually a caption) between stages, throwing away everything the earlier model saw but did not write down. A caption preserves only what it describes; questions about details below the level of the caption (which side the handle is on, fine-grained colors, spatial relationships) cannot be answered from the text alone.
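The loss can be illustrated with a toy pipeline. Everything here is a stand-in: the `scene` dictionary plays the role of what a vision model could observe, `caption` plays a captioning model, and `caption_mentions` plays a text-only model that can use nothing but the caption. None of these names come from a real library.

```python
# A toy "scene": everything a vision model could in principle observe.
scene = {
    "object": "mug",
    "color": "blue",
    "handle_side": "left",
    "on_top_of": "desk",
}


def caption(scene):
    """Stand-in captioner: writes down only part of what it saw."""
    return f"a {scene['color']} {scene['object']} on a {scene['on_top_of']}"


def caption_mentions(caption_text, detail):
    """Stand-in text-only stage: a detail is usable only if the caption contains it."""
    return detail in caption_text


text = caption(scene)
print(text)                            # a blue mug on a desk
print(caption_mentions(text, "blue"))  # True: the color made it into the caption
print(caption_mentions(text, "left"))  # False: the handle side was never written down
```

The handle side exists in the scene but never crosses the boundary, so no downstream text model can recover it, however capable it is.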

7. Does a model “see” an image, in the human sense?

Show answer

No. It processes image tokens (small vector representations of image patches) that, through training, have been aligned with text tokens enough to support joint reasoning. The visual experience is yours; the representational alignment is the model’s. Mistaking one for the other inflates expectations.

Try it yourself: multimodal or multi-model?

For each system, decide whether it is best described as multimodal (modalities fused inside one model) or multi-model (separate single-modal models in a pipeline). State the reason in one line.

A. A vision API returns a caption for an image; a separate LLM is then
   asked to summarize the caption.
B. A single transformer takes in both an image (as patch tokens) and a
   question (as text tokens) and produces a text answer.
C. A speech-to-text model transcribes audio, then a text-only LLM answers
   the user's question, then a text-to-speech model speaks the reply.
D. A model is trained from scratch on interleaved image, audio, and text
   tokens, and at inference accepts any combination of them.

Show answer

A: multi-model. The vision and language stages are separate models; only the caption text crosses the boundary. Detail below the caption is lost.
B: multimodal. The two modalities share one transformer, so the model can attend to image patches when answering the text question. Joint representation.
C: multi-model. Three single-modal models chained by text. The conversation feels multimodal to the user, but the system is a pipeline.
D: multimodal (natively multimodal). A single model with joint representations across all three modalities from the start.

The tell: ask where the modalities meet. If they meet only as text (or some other intermediate that flattens them), it is a pipeline. If they meet as shared internal representations inside one model, it is multimodal.
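The tell can be written down as a tiny rule. This is only a sketch: the labels `"shared_representations"`, `"text"`, and `"caption"` and the function `classify` are hypothetical encodings of the four systems above, not part of any tool.

```python
# Intermediates that flatten a modality before the next stage sees it.
FLATTENING_INTERMEDIATES = {"text", "caption"}


def classify(meeting_point):
    """Apply the tell: where do the modalities meet?"""
    if meeting_point == "shared_representations":
        return "multimodal"
    if meeting_point in FLATTENING_INTERMEDIATES:
        return "multi-model"
    raise ValueError(f"unknown meeting point: {meeting_point!r}")


# The four systems from the exercise, described only by where modalities meet.
systems = {
    "A": "caption",                 # vision API output handed to an LLM
    "B": "shared_representations",  # patch tokens and text tokens in one transformer
    "C": "text",                    # speech-to-text, LLM, text-to-speech
    "D": "shared_representations",  # trained on interleaved tokens from scratch
}

for name, meeting_point in systems.items():
    print(name, classify(meeting_point))
```

The rule deliberately ignores how the system feels to the user; only the meeting point matters.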

Try it yourself: name the operating mode

For each scenario, name the operating mode: multimodal-in / single-out, single-in / multimodal-out, or multimodal-in / multimodal-out.

1. You upload a screenshot and ask the model to explain what each button does.
2. You type a sentence and the model produces a 10-second video.
3. You give a model a podcast clip and ask it to summarize the speakers'
   positions in writing.
4. You hand a model an image and ask it to generate both a written caption
   and a soundtrack that fits the scene.

Show answer

1. Multimodal-in / single-out. Image plus text in, text out. The most common pattern.
2. Single-in / multimodal-out. Text in, video (image plus audio) out. The generative direction.
3. Multimodal-in / single-out. Audio plus text in, text out.
4. Multimodal-in / multimodal-out. Image in, text and audio out. The frontier direction; very few production systems do this today.

The track unpacks each pattern in depth. Phase 2 covers the first; Phase 3 covers the second; Phases 3 and 4 push toward the third.

Flashcards

Ten cards.

Q. What is a modality in AI?

The form of information processed (text, image, audio, video, structured signals), not its topic. Each modality historically had its own model family.

Q. What is multimodal AI?

Systems that fuse multiple modalities inside one model, sharing internal representations early enough to support joint cross-modal reasoning.

Q. Multimodal vs multi-model: what is the difference?

Multimodal fuses modalities inside one model. Multi-model is a pipeline of separate single-modal models passing intermediate outputs (usually text) between them, losing detail at every boundary.

Q. What is the central technical challenge of multimodal AI?

Fusion: putting modalities with very different dimensionalities, statistics, and lengths into a shared representational space so the model can reason across them jointly.

Q. What is the encode-then-fuse strategy?

Use (or train) a vision encoder and a language model separately, then connect them with an adapter or cross-attention. Most “vision-language models” follow this pattern.

Q. What is the tokenize-everything strategy?

Turn images, audio, and video into discrete tokens and feed all of them into a single transformer trained from the start on the mixed stream. The natively multimodal direction.

Q. What are the three operating modes?

Multimodal-in / single-out (vision-language models, most common). Single-in / multimodal-out (generative: text -> image/video). Multimodal-in / multimodal-out (the frontier: any modality in both directions).

Q. Why does a multi-model pipeline lose information?

It passes only intermediate outputs (usually captions) between stages, throwing away everything below the level of those outputs. Fine-grained questions that refer to specific details cannot be answered.

Q. Does a model “see” an image?

No. It processes image tokens that have been aligned with text tokens during training. The visual experience is yours; the representational alignment is the model’s.

Q. What is the difference between multimodal and multi-task?

Multi-task: one model doing several tasks, often within a single modality (for example, translation and classification on text). Multimodal: one model handling several modalities. These are different axes; a model can be either, neither, or both.