Cheatsheet: Native multimodal intelligence

Encode-then-fuse vs native (the core contrast)

| Aspect | Encode-then-fuse (L2) | Native multimodal (L3) |
| --- | --- | --- |
| Architecture | separate encoder + bridge + LLM | one transformer over all modalities |
| Training | 3+ runs stitched | one joint training run |
| Cross-modal interaction | bolted on at bridge | every layer, every block |
| Reuses pretrained LLM | yes | no |
| Generation across modalities | mostly text-only output | first-class any modality |


| Modality | Tokenization |
| --- | --- |
| Text | BPE / SentencePiece (standard) |
| Image | learned image tokenizer (VQ-VAE, modern variants); image -> sequence of discrete codes |
| Audio | neural audio codec (Encodec-style); audio -> token stream at a fixed frame rate |
| Video | frame tokens + temporal positioning; sometimes motion-aware compression |

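
The "image -> sequence of discrete codes" step can be sketched as a nearest-neighbour lookup in a codebook, followed by shifting the codes into a vocabulary shared with text. This is a minimal sketch, not any particular model's tokenizer: the vocabulary sizes, `IMAGE_OFFSET`, and the function names `quantize` and `to_unified_ids` are assumptions made for illustration, and a real VQ-VAE learns both the encoder and the codebook.

```python
from typing import List, Sequence

# Hypothetical vocabulary layout; all sizes are illustrative assumptions.
TEXT_VOCAB_SIZE = 32_000
IMAGE_CODEBOOK_SIZE = 8_192
IMAGE_OFFSET = TEXT_VOCAB_SIZE  # image codes are placed right after the text ids


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each latent vector to the index of its nearest codebook entry."""
    codes = []
    for vec in latents:
        best = min(range(len(codebook)),
                   key=lambda i: squared_distance(vec, codebook[i]))
        codes.append(best)
    return codes


def to_unified_ids(image_codes: Sequence[int]) -> List[int]:
    """Shift image codes so they do not collide with text token ids."""
    return [IMAGE_OFFSET + c for c in image_codes]


if __name__ == "__main__":
    toy_codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
    toy_latents = [[0.1, -0.1], [4.0, 6.0], [0.9, 1.2]]
    codes = quantize(toy_latents, toy_codebook)
    print(codes, to_unified_ids(codes))
```

Audio codec tokens could be given their own offset range in the same way, so that every modality ends up in one id space.
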
A training example:
[ <img_tokens>, "Describe what's happening here:", <text_tokens>,
<img_tokens>, "Now the next frame:", <img_tokens>, ... ]
One transformer. One objective: predict the next token.
No "image side" or "text side" anywhere in the model.
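
To make the "one objective" point concrete, here is a small sketch of how an interleaved example turns into next-token training pairs. The token ids and the helper names `interleave` and `next_token_pairs` are hypothetical; the only thing taken from the text above is that image and text tokens share one sequence and one prediction target.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[str, Sequence[int]]  # (modality label, token ids)


def interleave(segments: Sequence[Segment]) -> Tuple[List[int], List[str]]:
    """Flatten modality segments into one token stream plus per-token labels."""
    tokens: List[int] = []
    modalities: List[str] = []
    for modality, ids in segments:
        tokens.extend(ids)
        modalities.extend([modality] * len(ids))
    return tokens, modalities


def next_token_pairs(tokens: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Inputs are tokens[:-1]; targets are the same stream shifted left by one."""
    return list(tokens[:-1]), list(tokens[1:])


if __name__ == "__main__":
    # Illustrative ids only: ids >= 32000 stand in for image codes.
    example = [
        ("image", [32000, 32005]),
        ("text", [17, 42, 7]),
        ("image", [32001]),
    ]
    tokens, modalities = interleave(example)
    inputs, targets = next_token_pairs(tokens)
    for x, y, m in zip(inputs, targets, modalities[1:]):
        print(f"input={x:>6} -> target={y:>6} ({m})")
```

Note that the modality labels are used only for printing. The training pairs themselves carry no marker of which "side" a token came from.
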

| Capability | Why native enables it |
| --- | --- |
| Any-modality generation | output is next-token prediction; one machinery for all |
| Low-latency voice | no speech-to-text intermediary; audio tokens in, audio tokens out |
| Deep cross-modal grounding | alignment at every layer, not just at a bridge |
| Any-to-any input/output | unified token stream supports any combination |

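
One way to picture "any-to-any" output is a generated stream whose ids fall into per-modality ranges, with each run handed to the matching detokenizer. The sketch below assumes a hypothetical id layout (text first, then image codes, then audio codes). The range boundaries and the names `modality_of` and `split_runs` are illustrative, not taken from any of the systems listed here.

```python
from itertools import groupby
from typing import List, Sequence, Tuple

# Hypothetical id layout; sizes are illustrative assumptions.
TEXT_VOCAB_SIZE = 32_000
IMAGE_CODEBOOK_SIZE = 8_192
AUDIO_CODEBOOK_SIZE = 1_024
IMAGE_OFFSET = TEXT_VOCAB_SIZE
AUDIO_OFFSET = IMAGE_OFFSET + IMAGE_CODEBOOK_SIZE
VOCAB_END = AUDIO_OFFSET + AUDIO_CODEBOOK_SIZE


def modality_of(token_id: int) -> str:
    """Classify a token id by the range it falls into."""
    if 0 <= token_id < IMAGE_OFFSET:
        return "text"
    if token_id < AUDIO_OFFSET:
        return "image"
    if token_id < VOCAB_END:
        return "audio"
    raise ValueError(f"token id {token_id} is outside the vocabulary")


def local_id(token_id: int) -> int:
    """Convert a unified id back to an index within its own codebook."""
    offsets = {"text": 0, "image": IMAGE_OFFSET, "audio": AUDIO_OFFSET}
    return token_id - offsets[modality_of(token_id)]


def split_runs(stream: Sequence[int]) -> List[Tuple[str, List[int]]]:
    """Group consecutive tokens of the same modality into decodable runs."""
    return [(modality, [local_id(t) for t in run])
            for modality, run in groupby(stream, key=modality_of)]


if __name__ == "__main__":
    generated = [5, 9, 32000, 32001, 40192, 7]
    for modality, ids in split_runs(generated):
        print(modality, ids)
```

Each run would then go to a text detokenizer, an image decoder, or an audio codec decoder. The model itself only ever emits ids.
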

| Cost | Why it hurts |
| --- | --- |
| Tokenizer design | tokenizer reconstruction is a hard ceiling on quality |
| Data scale | cannot reuse pretrained LLMs; learn everything from scratch |
| Compute | joint training is expensive from step 1 |
| Slow non-text output | image generation can require thousands of token predictions |

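
A rough back-of-the-envelope calculation shows why non-text output is slow when every code is one autoregressive step. All numbers below (resolution, downsampling factor, frame rate, codebook count, decoding speed) are illustrative assumptions rather than measurements of any specific model, and the function names are hypothetical.

```python
def image_tokens(height: int, width: int, downsample: int) -> int:
    """Tokens for one image if the tokenizer emits one code per downsampled cell."""
    return (height // downsample) * (width // downsample)


def audio_tokens(seconds: float, frame_rate_hz: float, codebooks: int) -> int:
    """Tokens for an audio clip: frames per second times codes per frame."""
    return round(seconds * frame_rate_hz) * codebooks


def generation_seconds(n_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock time if each token costs one sequential decoding step."""
    return n_tokens / tokens_per_second


if __name__ == "__main__":
    img = image_tokens(height=512, width=512, downsample=16)
    aud = audio_tokens(seconds=10, frame_rate_hz=75, codebooks=4)
    speed = 50.0  # assumed decoding speed in tokens per second
    print(f"image: {img} tokens, ~{generation_seconds(img, speed):.1f} s")
    print(f"audio: {aud} tokens, ~{generation_seconds(aud, speed):.1f} s")
```

Under these assumptions a single image already costs about a thousand sequential predictions. Halving the downsampling factor would quadruple that count.
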

| System | Org | Modalities |
| --- | --- | --- |
| Chameleon | Meta | text + image (academic reference design) |
| GPT-4o | OpenAI | text + image + audio (the “omni” model) |
| Gemini | Google | text + image + audio + video (multimodal from start) |