Cheatsheet: Native multimodal intelligence

Encode-then-fuse vs native (the core contrast)

| Aspect | Encode-then-fuse (L2) | Native multimodal (L3) |
|---|---|---|
| Architecture | separate encoder + bridge + LLM | one transformer over all modalities |
| Training | 3+ separate runs, stitched together | one joint training run |
| Cross-modal interaction | bolted on at the bridge | in every layer, every block |
| Reuses pretrained LLM | yes | no |
| Generation across modalities | mostly text-only output | any modality, first-class |

Tokens per modality

| Modality | Tokenization |
|---|---|
| Text | BPE / SentencePiece (standard) |
| Image | learned image tokenizer (VQ-VAE, modern variants); image -> sequence of discrete codes |
| Audio | neural audio codec (Encodec-style); audio -> token stream at a fixed frame rate |
| Video | frame tokens + temporal positioning; sometimes motion-aware compression |
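
Because every tokenizer in the table emits discrete ids, one way to feed them to a single transformer is to give each modality its own slice of one shared vocabulary. The sketch below is a minimal illustration under that assumption; the vocabulary sizes and the `build_offsets`, `to_global`, and `interleave` helpers are hypothetical, not taken from any named system.

```python
# Hypothetical vocabulary sizes; real tokenizers choose their own.
VOCAB_SIZES = {"text": 32000, "image": 8192, "audio": 2048}


def build_offsets(vocab_sizes):
    """Assign each modality a contiguous id range in one shared vocabulary."""
    offsets, start = {}, 0
    for modality, size in vocab_sizes.items():
        offsets[modality] = start
        start += size
    return offsets, start  # start now equals the total vocabulary size


OFFSETS, TOTAL_VOCAB = build_offsets(VOCAB_SIZES)


def to_global(modality, local_ids):
    """Map a tokenizer's local ids into the shared id space."""
    size = VOCAB_SIZES[modality]
    for i in local_ids:
        if not 0 <= i < size:
            raise ValueError(f"{modality} id {i} out of range")
    return [OFFSETS[modality] + i for i in local_ids]


def interleave(segments):
    """Flatten (modality, local_ids) segments into one token stream."""
    stream = []
    for modality, local_ids in segments:
        stream.extend(to_global(modality, local_ids))
    return stream


stream = interleave([("image", [5, 17]), ("text", [101, 7]), ("audio", [3])])
print(stream)  # [32005, 32017, 101, 7, 40195]
```

With ranges laid out this way, the model sees only integers; which modality a token belongs to is implied by its id, not by a separate input path.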

The training picture

A training example:

    [ <img_tokens>, "Describe what's happening here:", <text_tokens>,
      <img_tokens>, "Now the next frame:", <img_tokens>, ... ]

One transformer. One objective: predict the next token.
No "image side" or "text side" anywhere in the model.

What native buys

| Capability | Why native enables it |
|---|---|
| Any-modality generation | output is next-token prediction; one machinery for all |
| Low-latency voice | no speech-to-text intermediary; audio tokens in, audio tokens out |
| Deep cross-modal grounding | alignment at every layer, not just at a bridge |
| Any-to-any input/output | unified token stream supports any combination |
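
As a rough illustration of "one machinery for all", the sketch below splits a generated token stream back into per-modality runs by id range, so each run could be handed to the matching decoder (text detokenizer, image decoder, audio codec). The id ranges and the `modality_of` and `split_by_modality` helpers are assumptions made for illustration only.

```python
from itertools import groupby

# Hypothetical [start, end) id ranges in a shared vocabulary.
RANGES = {"text": (0, 32000), "image": (32000, 40192), "audio": (40192, 42240)}


def modality_of(token_id):
    """Return the modality whose id range contains token_id."""
    for modality, (start, end) in RANGES.items():
        if start <= token_id < end:
            return modality
    raise ValueError(f"token id {token_id} is outside the shared vocabulary")


def split_by_modality(stream):
    """Group consecutive same-modality tokens and restore their local ids."""
    runs = []
    for modality, group in groupby(stream, key=modality_of):
        start = RANGES[modality][0]
        runs.append((modality, [t - start for t in group]))
    return runs


generated = [101, 7, 32005, 32017, 40195, 40196, 9]
runs = split_by_modality(generated)
print(runs)
# [('text', [101, 7]), ('image', [5, 17]), ('audio', [3, 4]), ('text', [9])]
```

The model itself never branches on modality here; only this post-processing step does, which is what lets any input combination map to any output combination.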

The costs

| Cost | Why it hurts |
|---|---|
| Tokenizer design | the tokenizer's reconstruction fidelity is a hard ceiling on output quality |
| Data scale | cannot reuse a pretrained LLM; everything is learned from scratch |
| Compute | joint training is expensive from step 1 |
| Slow non-text output | image generation can require thousands of token predictions |
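
To make the last row concrete, here is a back-of-the-envelope sketch. A grid-based image tokenizer that shrinks each side by a factor `downsample` emits (H / downsample) x (W / downsample) tokens, and autoregressive generation needs one prediction step per token. The downsampling factor of 16 and the decoding speed of 50 tokens per second are assumed values for illustration, not measurements of any system.

```python
def image_token_count(height, width, downsample):
    """Tokens emitted by a grid tokenizer that shrinks each side by `downsample`."""
    if height % downsample or width % downsample:
        raise ValueError("image sides must be divisible by the downsampling factor")
    return (height // downsample) * (width // downsample)


def sequential_decode_seconds(num_tokens, tokens_per_second):
    """Time to emit num_tokens one at a time at a fixed decoding speed."""
    return num_tokens / tokens_per_second


for side in (256, 512, 1024):
    n = image_token_count(side, side, downsample=16)              # assumed factor
    secs = sequential_decode_seconds(n, tokens_per_second=50.0)   # assumed speed
    print(f"{side}x{side}: {n} tokens, ~{secs:.1f} s")
# 256x256: 256 tokens, ~5.1 s
# 512x512: 1024 tokens, ~20.5 s
# 1024x1024: 4096 tokens, ~81.9 s
```

Token count grows with image area, so doubling the side length quadruples the number of sequential predictions under these assumptions.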

Named production examples

| System | Org | Modalities |
|---|---|---|
| Chameleon | Meta | text + image (academic reference design) |
| GPT-4o | OpenAI | text + image + audio (the “omni” model) |
| Gemini | Google | text + image + audio + video (multimodal from start) |