Cheatsheet: What multimodal AI actually is

| Modality | Shape | Historical model family |
| --- | --- | --- |
| Text | sequence of tokens | RNNs, BERT, GPT |
| Image | grid of pixels / patches | CNNs, ResNets |
| Audio | waveform / spectrogram | CTC models, speech transformers |
| Video | sequence of images, often + audio | I3D, video transformers |
| Structured | tabular, sensor, robotic state | gradient-boosted trees, MLPs |
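
The "grid of pixels / patches" shape can be made concrete with a small sketch that turns an image grid into a sequence of flattened patches, so it looks more like a token sequence. The function name `patchify`, the 224 × 224 × 3 image, and the 16-pixel patch size are illustrative assumptions, not something the cheatsheet prescribes.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into a sequence of flattened patches.

    Returns an array of shape (num_patches, patch * patch * C).
    Hypothetical helper for illustration; assumes H and W divide by `patch`.
    """
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image height and width must be divisible by patch size")
    # (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C) -> (N, p*p*C)
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)
    return grid.reshape(-1, patch * patch * c)

image = np.zeros((224, 224, 3), dtype=np.float32)  # illustrative image size
patches = patchify(image, patch=16)
print(patches.shape)  # (196, 768)
```

Each of the 196 rows is one patch in row-major order, which is roughly the form in which a patch-based vision model would receive the image before any learned projection.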

| Aspect | Multimodal | Multi-model |
| --- | --- | --- |
| How modalities meet | shared internal representations in one model | pipeline; only text (or another intermediate format) crosses stage boundaries |
| Detail preserved | yes, below caption level | no, lost at every stage boundary |
| Cross-modal reasoning | grounded, joint | only as good as the intermediate text |

| Issue | Why it matters |
| --- | --- |
| Different dimensionalities | a sentence is ~30 tokens vs an image of ~50,000 pixels (224 × 224) |
| Different statistics | discrete tokens vs continuous pixels / waveforms |
| Different lengths | a 10 s audio clip is hundreds of thousands of samples (160,000 at 16 kHz) |
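
The size mismatch in the table can be checked with a few lines of arithmetic. The specific numbers below (a 224 × 224 image, 16 × 16 patches, 16 kHz audio) are assumptions chosen for illustration; other resolutions and sample rates give different but similarly lopsided counts.

```python
# Back-of-the-envelope sizes for text, image, and audio inputs.
# All concrete numbers are illustrative assumptions, not values taken
# from any specific model.

sentence_tokens = 30                       # "a sentence ~30 tokens"

height, width = 224, 224
image_pixels = height * width              # pixel positions in the grid

patch = 16
image_patches = (height // patch) * (width // patch)  # patches after patchifying

sample_rate_hz = 16_000
clip_seconds = 10
audio_samples = sample_rate_hz * clip_seconds          # raw waveform samples

print(f"text tokens:    {sentence_tokens}")
print(f"image pixels:   {image_pixels}")    # 50176
print(f"image patches:  {image_patches}")   # 196
print(f"audio samples:  {audio_samples}")   # 160000
```

Grouping pixels into patches, or waveforms into spectrogram frames, is one common way to bring these counts closer to the length of a text sequence.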

| Strategy | How | Examples |
| --- | --- | --- |
| Encode-then-fuse | vision encoder + language model + adapter/cross-attention | GPT-4V, CogVLM (lesson 2) |
| Tokenize-everything (native) | discrete tokens per modality, single transformer trained jointly | Gemini, Chameleon; natively multimodal frontier (lesson 3) |
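
The two strategies differ mainly in what enters the transformer. The sketch below illustrates the data flow only, with random arrays standing in for real encoders; every size, vocabulary, and id is hypothetical, and the single linear projection is just one possible adapter (real systems may use MLPs, resamplers, or cross-attention instead).

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Encode-then-fuse -------------------------------------------------
# Hypothetical sizes: 196 image patches with 1024-dim vision features,
# and a language model with 4096-dim embeddings.
vision_features = rng.normal(size=(196, 1024))  # stand-in for vision encoder output
text_embeddings = rng.normal(size=(12, 4096))   # stand-in for 12 embedded text tokens

# Adapter: one linear projection into the language model's embedding space
# (an assumption made for this sketch).
adapter = rng.normal(size=(1024, 4096)) * 0.02
image_embeddings = vision_features @ adapter    # (196, 4096)

# The language model then sees one sequence: image positions, then text.
fused = np.concatenate([image_embeddings, text_embeddings], axis=0)
print(fused.shape)  # (208, 4096)

# --- Tokenize-everything ----------------------------------------------
# Hypothetical vocabularies: 32,000 text tokens and 8,192 image codes.
TEXT_VOCAB = 32_000
IMAGE_VOCAB = 8_192

text_ids = np.array([17, 940, 3051])       # made-up text token ids
image_codes = np.array([5, 8191, 42, 7])   # made-up image codebook indices

# Offset image codes so both modalities share one discrete id space.
image_ids = image_codes + TEXT_VOCAB
sequence = np.concatenate([text_ids, image_ids])
print(sequence.tolist())  # [17, 940, 3051, 32005, 40191, 32042, 32007]
```

In the first half the image arrives as continuous vectors projected into the text embedding space; in the second, image and text are both plain integer ids that a single jointly trained transformer could embed and predict.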

| Mode | Input | Output | Examples |
| --- | --- | --- | --- |
| MM-in / single-out | image + text | text | “describe this screenshot”; lessons 2-4 |
| Single-in / MM-out | text | image / video | Stable Diffusion, Sora; lessons 5-6 |
| MM-in / MM-out | any | any | natively multimodal frontier; lessons 3, 7, 10 |

| Often confused | Reality |
| --- | --- |
| Multimodal vs multi-task | different axes: variety of modalities vs variety of (text) tasks |
| Multimodal vs multi-model | multimodal = one model; multi-model = pipeline |
| “The model sees” | it processes aligned image tokens; no visual experience |
| “VLM” vs “LMM” vs “native MM” | overlapping generations; precise distinctions in lessons 2-3 |