
Summary: Native multimodal intelligence

Native multimodal training runs one transformer over a mixed stream of text, image, audio, and video tokens from the very first training step, in contrast to encode-then-fuse, which trains its components separately and then bridges them. The payoff is representations that co-evolve across modalities, deeper grounding at every layer, first-class generation of any modality, and the low-latency cross-modal interaction that distinguishes systems like GPT-4o from earlier pipelines. This summary is the scan version of the full lesson.

  • The architectural shift. Encode-then-fuse: pretrained vision encoder + bridge + pretrained LLM, three separately trained pieces stitched together. Native: one transformer, one training run, mixed-modality tokens from step 1.
  • Every modality becomes tokens. Text via BPE; images via a learned VQ-VAE-style image tokenizer (an image becomes a sequence of “visual words”); audio via neural codecs; video as frame tokens plus temporal positioning.
  • One training objective. Next-token prediction over the interleaved stream. The model has no inherent modality boundaries; alignment emerges from data, layer by layer.
  • What native buys. Generation of any modality is first-class (same machinery as text), cross-modal interaction is low-latency (no speech-to-text intermediary), and joint grounding is finer-grained than the bridge in encode-then-fuse can match.
  • The tokenizer is the floor and ceiling. A poor image tokenizer caps visual quality before the transformer attends to anything; tokenizer research is itself a major axis of native-multimodal work.
  • Named examples. Chameleon (Meta), GPT-4o (OpenAI), and Gemini (Google) are the canonical production-scale natively multimodal systems.
  • The costs. Tokenizer design, large multimodal data requirements, joint-training compute, and slow non-text output (images and audio are decoded token by token). These are why encode-then-fuse remains the practical choice for many systems.
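
To make the "every modality becomes tokens" and "one training objective" points concrete, here is a minimal sketch in Python. It is illustrative only: the vocabulary sizes, the begin/end-of-image marker ids, the toy 2-D codebook, and the function names are all assumptions made for this example, not the layout of Chameleon, GPT-4o, or Gemini. It quantizes image patch vectors to their nearest codebook entry (the VQ-VAE-style step), shifts those ids into a shared vocabulary, interleaves them with text ids, and builds next-token prediction pairs over the whole stream.

```python
import math

# Hypothetical vocabulary layout (an assumption for illustration):
# text ids occupy [0, TEXT_VOCAB); image codebook ids are shifted to follow them;
# two special ids mark where an image begins and ends.
TEXT_VOCAB = 1000
IMAGE_CODEBOOK_SIZE = 4
IMAGE_OFFSET = TEXT_VOCAB
BOI = IMAGE_OFFSET + IMAGE_CODEBOOK_SIZE  # begin-of-image marker
EOI = BOI + 1                             # end-of-image marker

# Toy 2-D codebook standing in for a learned VQ-VAE codebook.
CODEBOOK = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]


def quantize(patch_vectors):
    """Map each patch vector to the index of its nearest codebook entry."""
    ids = []
    for v in patch_vectors:
        distances = [math.dist(v, c) for c in CODEBOOK]
        ids.append(distances.index(min(distances)))
    return ids


def image_to_tokens(patch_vectors):
    """Wrap quantized patch ids in image markers, shifted into the shared vocabulary."""
    return [BOI] + [IMAGE_OFFSET + i for i in quantize(patch_vectors)] + [EOI]


def next_token_pairs(stream):
    """Build (context, target) pairs: every position predicts the next token,
    regardless of which modality that token belongs to."""
    return [(stream[:t], stream[t]) for t in range(1, len(stream))]


text_before = [12, 47]                           # stand-ins for BPE ids
patches = [(0.1, 0.9), (0.8, 0.2), (0.9, 0.95)]  # stand-ins for patch embeddings
text_after = [305]

stream = text_before + image_to_tokens(patches) + text_after
pairs = next_token_pairs(stream)

print(stream)    # one flat sequence of ids from a single shared vocabulary
print(pairs[1])  # text context whose target is the begin-of-image marker
```

Note that `next_token_pairs` never checks which modality a target belongs to. That is the sense in which the model has no inherent modality boundaries: a text context can have an image token as its target, and vice versa. The sketch also shows why the tokenizer sets the ceiling: whatever detail `quantize` discards is gone before the transformer sees the stream.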

When you talk to GPT-4o by voice and it responds with sub-second conversational latency, using the same model that handles your text and images, you are using a natively multimodal system. The smoothness you feel comes from the architecture, not just clever engineering. As natively multimodal training scales, more capabilities move from bolted-on to built-in: first-class image and video generation, fine-grained spatial reasoning, and any-to-any low-latency interaction become the default for the most demanding products. The pattern to carry: when you read about a new multimodal model, ask whether it bolts modalities together or trains them jointly. That single question often predicts what it will and will not do well. The next lesson goes deeper on one specific capability built on these architectures: reasoning over multimodal inputs (images and diagrams in chain-of-thought, tool use, deliberative alignment).