Practice: Native multimodal intelligence

Self-check

Seven short questions. Try to answer each one before opening the collapsible.

1. State the one-sentence difference between native multimodal and encode-then-fuse.

Show answer

Encode-then-fuse trains a vision encoder and an LLM separately and bridges them afterward. Native multimodal trains one transformer on mixed-modality tokens from the very first training step, so the modalities co-evolve.

2. How does a native multimodal model handle an image?

Show answer

A learned image tokenizer (usually a VQ-VAE or similar) breaks the image into patches and assigns each patch a discrete code from a fixed vocabulary. The image becomes a sequence of “visual words” that the transformer attends to like any other tokens. There is no separate vision encoder.
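The patch-to-code step can be sketched in a few lines. This is a simplified illustration, not a real tokenizer: it quantizes raw pixel patches against a random codebook, whereas a VQ-VAE quantizes learned encoder features against a learned codebook. The function name `tokenize_image`, the patch size, and the codebook size are assumptions made for the example.

```python
import numpy as np

def tokenize_image(image, codebook, patch=4):
    """Map each non-overlapping patch to the index of its nearest codebook vector."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must divide evenly into patches"
    # Cut the image into patches and flatten each patch into one vector.
    patches = (image.reshape(h // patch, patch, w // patch, patch, c)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(-1, patch * patch * c))
    # Squared Euclidean distance from every patch to every code.
    d2 = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    # One integer token per patch, in raster order.
    return d2.argmin(axis=1)

rng = np.random.default_rng(0)
image = rng.random((16, 16, 3))          # stand-in image, not real data
codebook = rng.random((32, 4 * 4 * 3))   # hypothetical 32-entry codebook (real ones are learned)
tokens = tokenize_image(image, codebook)
print(tokens.shape)  # one token per 4x4 patch
```

The output is a flat sequence of integers, which is all the transformer ever sees of the image.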

3. What is the training objective for a native multimodal model?

Show answer

Next-token prediction across an interleaved stream of all modalities. The model has no inherent notion of which token is “an image” or “text”; it predicts the next token in a long mixed sequence, and the alignment emerges from training data.
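As a rough sketch of what "one interleaved stream" means in practice, the snippet below merges text ids and image codes into a single shared vocabulary and builds next-token training pairs. The vocabulary sizes, the id offset scheme, and the `BOI`/`EOI` boundary markers are all assumptions for illustration; real systems choose their own layouts.

```python
TEXT_VOCAB = 1000                 # hypothetical text vocabulary size
IMAGE_VOCAB = 512                 # hypothetical image codebook size
BOI = TEXT_VOCAB + IMAGE_VOCAB    # hypothetical begin-of-image marker id
EOI = BOI + 1                     # hypothetical end-of-image marker id

def interleave(segments):
    """Flatten (modality, ids) segments into one stream of shared-vocabulary ids."""
    stream = []
    for modality, ids in segments:
        if modality == "text":
            stream.extend(ids)
        elif modality == "image":
            stream.append(BOI)
            # Shift image codes past the text ids so the two ranges never collide.
            stream.extend(TEXT_VOCAB + i for i in ids)
            stream.append(EOI)
        else:
            raise ValueError(f"unknown modality: {modality}")
    return stream

def next_token_pairs(stream):
    """Inputs and targets for next-token prediction: targets are inputs shifted by one."""
    return stream[:-1], stream[1:]

stream = interleave([("text", [5, 17, 42]), ("image", [3, 400, 7]), ("text", [9])])
inputs, targets = next_token_pairs(stream)
print(stream)
```

Nothing in `next_token_pairs` knows which ids are text and which are image codes; the same shift-by-one objective applies to every position.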

4. Name three capabilities that native multimodal makes easier than encode-then-fuse does.

Show answer

Any three of: first-class generation of any modality (image, audio, text from the same machinery), low-latency cross-modal interaction (e.g., direct audio-to-audio with no speech-to-text intermediary), fine-grained joint reasoning grounded at every layer, and any-to-any modality input/output.

5. Why is tokenizer design especially important for native multimodal?

Show answer

Because the image tokenizer caps how much visual information can flow through the model. A poor tokenizer (one that compresses badly or loses detail) limits the model’s visual quality before the transformer even attends to the tokens. Encode-then-fuse models avoid this by reusing off-the-shelf vision encoders; native multimodal cannot.

6. Name three costs of the native multimodal approach.

Show answer

Any three of: harder tokenizer design (cannot borrow off-the-shelf encoders), more multimodal training data needed (cannot lean on pretrained text-only models), much more compute (joint training from scratch), and slow non-text output (image generation can require thousands of token predictions in sequence).
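The "slow non-text output" cost is easy to see with a little arithmetic. The sketch below assumes a tokenizer that emits one code per square cell of the image; the image sizes and downsampling factors are hypothetical, chosen only to show how quickly the count grows.

```python
def image_token_count(height, width, downsample):
    """Tokens for one image if the tokenizer emits one code per downsample x downsample cell."""
    if height % downsample or width % downsample:
        raise ValueError("image size must be a multiple of the downsample factor")
    return (height // downsample) * (width // downsample)

# Hypothetical settings, for illustration only.
for size, factor in [(256, 16), (512, 16), (512, 8)]:
    n = image_token_count(size, size, factor)
    print(f"{size}x{size} image, factor {factor}: {n} sequential token predictions")
```

Halving the downsampling factor quadruples the token count, so finer visual detail is paid for directly in generation steps.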

7. Why do GPT-4o voice conversations feel low-latency in a way pipeline systems cannot match?

Show answer

Because the audio path does not pass through a speech-to-text intermediate. The model attends directly to audio tokens and produces audio tokens. That structural fact (native architecture, not pipeline) is what makes the exchange feel like a conversation rather than a chain of calls.

Try it yourself: encode-then-fuse or native?

For each described system, label it as encode-then-fuse or native multimodal, and give one defining feature you used to decide.

A. A pretrained ViT produces patch embeddings; a small MLP projects them
   into a pretrained Llama's embedding space; the system is fine-tuned on
   image-text dialogue.
B. A single transformer is pretrained from scratch on interleaved text
   tokens and discrete image tokens produced by a VQ-VAE, with one
   next-token-prediction objective throughout.
C. A model accepts audio input by first transcribing the audio with
   Whisper, then feeding the transcript into a text-only LLM.
D. A model takes microphone input directly, attends to audio tokens
   inside its own transformer, and produces audio tokens as output
   without any intermediate text representation.

Show answer

A: encode-then-fuse. Separate pretrained vision encoder and LLM, bridged by a projector and fine-tuned together. Defining feature: the components are trained separately and bridged after the fact.
B: native multimodal. One transformer, one training run, mixed-modality tokens from step 1. Defining feature: joint co-evolution with a unified token stream.
C: not multimodal at all. It is a pipeline: multi-model, not multimodal, in the terms from L1. Whisper and the LLM are separate systems; only the transcript text crosses the boundary. Defining feature: the LLM never sees the audio, only text.
D: native multimodal. Audio-to-audio with no intermediate text; audio is a first-class citizen in the model. Defining feature: audio is handled end to end inside one model, with no hand-off to a separate system.

C is the trap question, a useful one because press coverage often calls these systems “multimodal” when they are pipelines. The architectural test from L1 holds: where do the modalities meet?
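The decision procedure used in the answers can be written down as a two-question check. This is only a sketch of the reasoning, with hypothetical field names; real systems do not always fit neatly into three boxes.

```python
from dataclasses import dataclass

@dataclass
class SystemDescription:
    # Hypothetical fields, introduced for this illustration only.
    modalities_meet_inside_model: bool       # does non-text data reach the main model at all?
    separately_pretrained_parts_bridged: bool  # were encoder and LLM trained apart, then joined?

def classify(system):
    """Apply the 'where do the modalities meet?' test."""
    if not system.modalities_meet_inside_model:
        return "pipeline (multi-model)"
    if system.separately_pretrained_parts_bridged:
        return "encode-then-fuse"
    return "native multimodal"

systems = {
    "A": SystemDescription(True, True),    # ViT + projector + pretrained Llama
    "B": SystemDescription(True, False),   # one transformer, VQ-VAE tokens, one objective
    "C": SystemDescription(False, False),  # Whisper transcript fed to a text-only LLM
    "D": SystemDescription(True, False),   # audio tokens in, audio tokens out
}
labels = {name: classify(s) for name, s in systems.items()}
print(labels)
```

The first question catches pipelines like C before the encode-then-fuse versus native distinction is even considered.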

Try it yourself: what limits visual quality?

A team trains a native multimodal model and notices that, although text generation is sharp, the model’s image-generation outputs look noticeably blurry on fine details (text inside images, small faces). Where is the most likely structural bottleneck, and what design choice would they target first?

Show answer

The most likely bottleneck is the image tokenizer. Because every image (input or output) flows through the discrete codebook the tokenizer uses, the tokenizer’s reconstruction quality is a hard ceiling on the model’s image quality. If the tokenizer cannot reconstruct fine text or small faces from its discrete codes, the transformer attending to those codes cannot generate sharper outputs than the tokenizer can represent.

The first thing to target is the tokenizer design: a higher-resolution tokenizer, a larger codebook, finer patch granularity, or a more capable VQ-VAE successor. A bigger transformer alone will not fix it, because the bottleneck sits before the transformer sees the tokens (on input) or after the transformer emits them (on output reconstruction).
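A toy experiment makes the ceiling concrete. The sketch below snaps random vectors (a stand-in for image patches) to their nearest code and measures the reconstruction error for a small codebook and a larger one that contains it. The data and both codebooks are random and purely illustrative; real codebooks are learned.

```python
import numpy as np

def reconstruction_error(patches, codebook):
    """Mean squared error after snapping each patch to its nearest code."""
    d2 = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    nearest = codebook[d2.argmin(axis=1)]
    return float(((patches - nearest) ** 2).mean())

rng = np.random.default_rng(0)
patches = rng.random((500, 12))   # stand-in patch vectors, not real image data
codes = rng.random((256, 12))     # stand-in codebook entries

small = codes[:16]   # hypothetical 16-entry codebook
large = codes        # hypothetical 256-entry codebook that includes the small one

err_small = reconstruction_error(patches, small)
err_large = reconstruction_error(patches, large)
print(f"16 codes: {err_small:.4f}   256 codes: {err_large:.4f}")
```

No transformer appears anywhere in this measurement: the error is fixed by the codebook alone, which is why scaling the transformer cannot recover detail the codes cannot represent.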

This is the practical reason tokenizer research is so active in native-multimodal teams: the tokenizer is the floor and ceiling at the same time.

Flashcards

Ten cards.

Q. What is native multimodal AI?

One transformer trained on a mixed stream of text, image, audio, and video tokens from the very first training step, so the modalities co-evolve rather than being bridged after the fact.

Q. How does it differ from encode-then-fuse?

Encode-then-fuse trains modality-specific models separately and bridges them with a projector. Native multimodal trains one transformer on all modalities together. Cross-modal interaction lives at every layer, not just at a bridge.

Q. How are images turned into tokens in native multimodal?

A learned image tokenizer (usually a VQ-VAE or descendant) breaks the image into patches and assigns each patch a discrete code from a fixed vocabulary. The image becomes a sequence of “visual words.”

Q. How is audio tokenized?

A neural audio codec (EnCodec-style and successors) discretizes audio into a stream of tokens at a fixed frame rate, far lower than the raw audio sample rate, so the transformer can attend to audio like any other token sequence.

Q. What is the training objective?

Next-token prediction across the interleaved mixed-modality stream. The model has no inherent modality boundaries; it learns alignment from data.

Q. Name three capabilities native multimodal enables.

Any three of: first-class generation of any modality, low-latency cross-modal interaction (e.g. direct audio-to-audio), fine-grained joint grounding, and unified any-to-any input/output.

Q. Why is tokenizer design crucial?

Because the tokenizer’s reconstruction quality is a hard ceiling on the model’s modality quality. A poor image tokenizer makes the whole system blurry, no matter how good the transformer is.

Q. Name three costs of native multimodal.

Any three of: harder tokenizer design, more data needed (cannot reuse pretrained text-only LLMs), more compute (joint training from scratch), slow non-text output (image/audio generation is token-expensive).

Q. Why does GPT-4o voice feel low-latency?

The audio path does not go through a speech-to-text intermediate; the model attends directly to audio tokens and produces audio tokens. That structural fact is the source of the smoothness you feel.

Q. Name three production native multimodal systems.

Chameleon (Meta), GPT-4o (OpenAI), and Gemini (Google) are the canonical examples. All are presented by their developers as handling text, images, and (in some cases) audio through a unified transformer trained jointly on mixed modalities.