Cheatsheet: What multimodal AI actually is
Modalities
| Modality | Shape | Historical model family |
|---|---|---|
| Text | sequence of tokens | RNNs, BERT, GPT |
| Image | grid of pixels / patches | CNNs, ResNets |
| Audio | waveform / spectrogram | CTC models, speech transformers |
| Video | sequence of images, often + audio | I3D, video transformers |
| Structured | tabular, sensor, robotic state | gradient-boosted trees, MLPs |
Multimodal vs multi-model
| Aspect | Multimodal | Multi-model |
|---|---|---|
| How modalities meet | shared internal representations in one model | pipeline of separate models; only text (or another intermediate format) crosses between stages |
| Detail preserved | yes, including detail too fine for a caption | no, lost at every stage boundary |
| Cross-modal reasoning | grounded, joint | only as good as the intermediate text |
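
The "detail lost at every stage boundary" row can be made concrete with a toy sketch. Everything here is hypothetical: the "image" is a dict of facts rather than pixels, and `caption_stage`, `answer_from_text`, and `answer_from_image` are stand-ins invented for illustration, not real model APIs.

```python
# Toy stand-in: the "image" is a dict of facts rather than real pixels.
IMAGE = {
    "object": "mug",
    "color": "red",
    "printed_text": "OK-EST BOSS",
}

def caption_stage(image):
    """Hypothetical captioner: emits a coarse one-line description."""
    return f"a {image['color']} {image['object']} on a desk"

def answer_from_text(text, candidates):
    """Text-only stage: can only pick an answer that appears in its input."""
    for candidate in candidates:
        if candidate in text:
            return candidate
    return None

def answer_from_image(image, field):
    """Stand-in for joint reasoning: the full representation is available."""
    return image[field]

caption = caption_stage(IMAGE)

# Multi-model pipeline: only the caption crosses the stage boundary.
pipeline_color = answer_from_text(caption, ["red", "blue"])
pipeline_slogan = answer_from_text(caption, ["OK-EST BOSS", "BEST BOSS"])

# Multimodal stand-in: no stage boundary, so the detail is still there.
joint_slogan = answer_from_image(IMAGE, "printed_text")

print(pipeline_color)   # red
print(pipeline_slogan)  # None
print(joint_slogan)     # OK-EST BOSS
```

The color survives because the caption happens to mention it; the printed text does not, so no downstream text-only stage can recover it.
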
The fusion challenge
| Issue | Why it matters |
|---|---|
| Different dimensionalities | a sentence ~30 tokens vs an image ~50,000 pixels (e.g. 224×224) |
| Different statistics | discrete tokens vs continuous pixels / waveforms |
| Different lengths | a 10 s audio clip is ~160,000 samples at 16 kHz, and more at higher sample rates |
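
A quick back-of-the-envelope check of the scale mismatch in the table. The image size, patch size, and sample rate below are assumed, commonly used values, not figures taken from this cheatsheet; only the ~30-token sentence comes from the table.

```python
# Assumed, commonly used sizes (illustrative only).
IMAGE_SIDE = 224         # pixels per side
PATCH_SIDE = 16          # ViT-style square patch
SAMPLE_RATE_HZ = 16_000  # a common speech sampling rate
CLIP_SECONDS = 10
SENTENCE_TOKENS = 30     # the sentence length used in the table

pixels = IMAGE_SIDE * IMAGE_SIDE           # 50176
patches = (IMAGE_SIDE // PATCH_SIDE) ** 2  # 196
samples = SAMPLE_RATE_HZ * CLIP_SECONDS    # 160000

print(f"image:  {pixels} pixels, or {patches} patches")
print(f"audio:  {samples} samples")
print(f"text:   {SENTENCE_TOKENS} tokens")
```

Grouping pixels into patches brings the image from tens of thousands of elements down to a few hundred, which is much closer to the length of a text sequence.
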
Two strategies
| Strategy | How | Examples |
|---|---|---|
| Encode-then-fuse | vision encoder + language model + adapter/cross-attention | GPT-4V, CogVLM (lesson 2) |
| Tokenize-everything (native) | discrete tokens per modality, single transformer trained jointly | Gemini, Chameleon; natively multimodal frontier (lesson 3) |
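
The two strategies differ in where the modalities join. Below is a minimal sketch in plain Python, under loud assumptions: the feature vectors, the adapter weights, and `TEXT_VOCAB_SIZE` are made-up toy values, the adapter is a single linear projection (real systems may use cross-attention instead), and real models use learned encoders and image tokenizers.

```python
def project(features, weight):
    """Linear adapter: map each vision feature (in_dim) to the LM width (out_dim)."""
    return [
        [sum(f * w for f, w in zip(feature, column)) for column in zip(*weight)]
        for feature in features
    ]

def encode_then_fuse(patch_features, text_embeddings, adapter_weight):
    """Strategy 1: project continuous vision features, then join the text embeddings."""
    image_embeddings = project(patch_features, adapter_weight)
    return image_embeddings + text_embeddings  # one sequence of LM-width vectors

TEXT_VOCAB_SIZE = 1000  # hypothetical vocabulary size

def tokenize_everything(text_ids, image_codes):
    """Strategy 2: shift discrete image codes into a shared vocabulary of token ids."""
    image_ids = [TEXT_VOCAB_SIZE + code for code in image_codes]
    return image_ids + text_ids  # one sequence of integer ids

# Two patch features of width 2, adapter maps width 2 -> width 3.
fused = encode_then_fuse(
    patch_features=[[1.0, 0.0], [0.0, 2.0]],
    text_embeddings=[[0.5, 0.5, 0.5]],
    adapter_weight=[[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]],
)
mixed = tokenize_everything(text_ids=[5, 17], image_codes=[3, 0])

print(fused)  # [[1.0, 0.0, 1.0], [0.0, 2.0, 2.0], [0.5, 0.5, 0.5]]
print(mixed)  # [1003, 1000, 5, 17]
```

In the first case the language model receives continuous vectors from a separate encoder; in the second, every modality becomes integer ids that a single transformer can embed and predict like text.
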
Three operating modes
| Mode | Input | Output | Examples |
|---|---|---|---|
| MM-in / single-out | image + text | text | “describe this screenshot”; lessons 2-4 |
| Single-in / MM-out | text | image / video | Stable Diffusion, Sora; lessons 5-6 |
| MM-in / MM-out | any | any | natively multimodal frontier; lessons 3, 7, 10 |
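
One way to read the table is as a small classification rule. The sketch below assumes that "MM" on a side means "includes at least one non-text modality"; that reading, and the `operating_mode` helper itself, are interpretations for illustration rather than definitions from the lessons.

```python
def operating_mode(inputs, outputs):
    """Label a system by whether each side includes a non-text modality."""
    mm_in = any(modality != "text" for modality in inputs)
    mm_out = any(modality != "text" for modality in outputs)
    if mm_in and mm_out:
        return "MM-in / MM-out"
    if mm_in:
        return "MM-in / single-out"
    if mm_out:
        return "Single-in / MM-out"
    return "text-only"

print(operating_mode(["image", "text"], ["text"]))           # MM-in / single-out
print(operating_mode(["text"], ["image"]))                   # Single-in / MM-out
print(operating_mode(["audio", "text"], ["text", "image"]))  # MM-in / MM-out
print(operating_mode(["text"], ["text"]))                    # text-only
```
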
Common confusions
| Often confused | Reality |
|---|---|
| Multimodal vs multi-task | different axes: variety of modalities vs variety of tasks (a text-only model can be multi-task) |
| Multimodal vs multi-model | multimodal = one model; multi-model = pipeline |
| “The model sees” | it processes aligned image tokens; no visual experience |
| “VLM” vs “LMM” vs “native MM” | overlapping generations; precise distinctions in lessons 2-3 |