Practice: Where multimodal AI is going

Self-check

Seven short questions. Try to answer each one before opening the collapsible.

1. State the unifying architectural pattern in one sentence.

Show answer

Tokenize what you want to model, put the tokens through one transformer, and train at scale. Every major frontier multimodal system shipping in 2026 is a variation on this template.
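A minimal sketch of the "tokenize, then one sequence" half of this template, under loud assumptions: the vocabulary sizes, the byte-level text tokenizer, and the pixel-bucketing image tokenizer are all hypothetical stand-ins, and the transformer itself is left out. It shows only that every modality ends up as integer IDs in one shared vocabulary, so a single model can consume the interleaved sequence.

```python
# Hypothetical shared-vocabulary layout; sizes are illustrative only.
TEXT_VOCAB = 256           # toy text tokens: one per UTF-8 byte
IMAGE_VOCAB = 16           # toy image codes: 16 brightness buckets
IMAGE_OFFSET = TEXT_VOCAB  # image codes are placed after the text IDs


def tokenize_text(text: str) -> list[int]:
    """Toy text tokenizer: one token per UTF-8 byte (IDs 0..255)."""
    return list(text.encode("utf-8"))


def tokenize_image(pixels: list[int]) -> list[int]:
    """Toy image tokenizer: bucket each 0..255 pixel, then shift into the image ID range."""
    return [IMAGE_OFFSET + (p * IMAGE_VOCAB) // 256 for p in pixels]


TOKENIZERS = {"text": tokenize_text, "image": tokenize_image}


def build_sequence(segments: list[tuple[str, object]]) -> list[int]:
    """Interleave modalities into the single token sequence one transformer would read."""
    sequence: list[int] = []
    for modality, payload in segments:
        sequence.extend(TOKENIZERS[modality](payload))
    return sequence


sequence = build_sequence([("text", "hi"), ("image", [0, 255])])
print(sequence)  # [104, 105, 256, 271]
```

Real systems use learned tokenizers and add modality boundary markers; the point here is only the shared ID space.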

2. Describe the fusion-gets-pushed-earlier trajectory.

Show answer

L2 encode-then-fuse bolted modalities together after pretraining; L3 native multimodal fused them from training step one; L5/L6 MM-DiT fuses text and image tokens inside the generative transformer; L9 production RL co-design fuses research and product feedback at the loss level. Every generation moves fusion deeper and earlier.

3. Why is the tokenizer described as “the floor and ceiling” of system quality?

Show answer

In any system operating on discrete codes (native multimodal, image and video diffusion, scientific applications), the tokenizer’s reconstruction quality bounds what the transformer can express. A bad tokenizer caps system quality before the transformer attends to anything; a bigger transformer alone does not fix it.
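The bound can be seen in a toy setting. The sketch below assumes a uniform scalar quantizer as a stand-in "tokenizer" (real tokenizers are learned autoencoders) and measures the error of a perfect encode/decode round trip. Even a model that predicted every code correctly could not do better than this error, which is why only a finer tokenizer lowers it.

```python
def quantize(x: float, levels: int) -> int:
    """Map x in [0, 1] to one of `levels` discrete codes."""
    return min(int(x * levels), levels - 1)


def dequantize(code: int, levels: int) -> float:
    """Decode a code to the midpoint of its bucket."""
    return (code + 0.5) / levels


def reconstruction_error(signal: list[float], levels: int) -> float:
    """Mean absolute error after an encode/decode round trip.

    This is the error a downstream model inherits even if it
    predicts every code perfectly.
    """
    recon = [dequantize(quantize(x, levels), levels) for x in signal]
    return sum(abs(a - b) for a, b in zip(signal, recon)) / len(signal)


signal = [i / 99 for i in range(100)]          # toy 1-D "image"
coarse = reconstruction_error(signal, levels=4)   # poor tokenizer
fine = reconstruction_error(signal, levels=64)    # better tokenizer
print(f"coarse={coarse:.4f} fine={fine:.4f}")
```

The coarse tokenizer's error stays well above the fine tokenizer's no matter what sits downstream of it.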

4. What does JEPA offer as an alternative to generative pretraining, and where does it sit in 2026?

Show answer

JEPA predicts in embedding space (target-representation prediction) rather than raw output space. The bet is that capacity not spent rendering surface detail does more semantic work, especially for representation learning and world modeling. As of 2026 it is research-strong but generative pretraining still dominates production; the paradigm tension is live.
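The contrast between the two loss spaces can be sketched with a toy example. Everything here is an assumption for illustration: `encode` is a hand-written summary function standing in for a learned encoder, and the function names are hypothetical. Real JEPA systems learn both the encoder and the predictor.

```python
def encode(patch: list[float]) -> list[float]:
    """Hypothetical stand-in encoder: summarize a patch as (mean, spread)."""
    mean = sum(patch) / len(patch)
    spread = max(patch) - min(patch)
    return [mean, spread]


def mse(a: list[float], b: list[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def generative_loss(predicted_patch: list[float], target_patch: list[float]) -> float:
    """Raw output space: every pixel of the target must be matched."""
    return mse(predicted_patch, target_patch)


def jepa_style_loss(predicted_embedding: list[float], target_patch: list[float]) -> float:
    """Embedding space: only the target's representation must be matched."""
    return mse(predicted_embedding, encode(target_patch))


target = [0.2, 0.8, 0.2, 0.8]
guess = [0.8, 0.2, 0.8, 0.2]   # same summary statistics, different surface detail

pixel_loss = generative_loss(guess, target)
embed_loss = jepa_style_loss(encode(guess), target)
print(pixel_loss, embed_loss)
```

The pixel-space loss penalizes the guess heavily for surface detail, while the embedding-space loss treats it as essentially correct; that gap is the capacity JEPA bets can be spent elsewhere.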

5. Name the five layers of a modern multimodal capability stack.

Show answer

Perception (encode-then-fuse or native multimodal), reasoning (chain-of-thought, often as inference-time compute), tool use (vision tools, code, search, image generation), alignment (e.g. deliberative alignment), and production engineering (RL co-design, latency budgets, evaluation in deployment).

6. State the operational scope test that recurs across the track.

Show answer

What instruments would you use to settle the question? If engineering instruments settle it (benchmarks, A/B tests, FVD, latency, evaluation harness), the question is one of technique. If different instruments are required (legal precedent, clinical trials, sectoral policy, philosophical argument, business judgment), the question lives in a different conversation evaluated by different methods.
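The test reads as a small routing rule. The sketch below is one hypothetical encoding of it: the instrument names come from the answer above, while the function name, the string labels, and the choice that a single non-engineering instrument is enough to move the question elsewhere are assumptions.

```python
# Instruments the answer above lists as engineering instruments.
ENGINEERING_INSTRUMENTS = frozenset(
    {"benchmark", "a/b test", "fvd", "latency", "evaluation harness"}
)


def scope_test(instruments_needed: set[str]) -> str:
    """Route a question by the instruments required to settle it."""
    needed = {name.lower() for name in instruments_needed}
    if not needed:
        raise ValueError("name at least one instrument")
    # Assumption: the question is technique only if every required
    # instrument is an engineering instrument.
    if needed <= ENGINEERING_INSTRUMENTS:
        return "technique"
    return "different conversation"


print(scope_test({"FVD", "latency"}))               # technique
print(scope_test({"benchmark", "clinical trials"}))  # different conversation
```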

7. Name two trajectories the field is moving in (2026 onward).

Show answer

Any two of: truly native everything (text + image + audio + video as first-class citizens in one model, fully multimodal in and out); more efficient training paradigms (JEPA, mixture-of-experts, sparsity); production and product co-design (RL co-design, evaluation in deployment, and the discipline of separating what engineering informs from what it settles).

Try it yourself: identify the threads

For each system description, name which of the six cross-cutting threads it most clearly illustrates. Some descriptions illustrate more than one; name all that apply.

A. A new model trains on interleaved text and image tokens from the
   first training step, with the same transformer handling both
   modalities throughout pretraining.
B. A research group reports that their image-generation model's
   resolution improved substantially after they replaced the latent
   autoencoder with a higher-quality version, without changing the
   transformer.
C. A team deploys a new multimodal capability and notes that the
   feature's success or failure was decided by A/B testing on real
   users, but the decision to invest in the feature in the first place
   was a product-strategy call informed by but not settled by the
   engineering data.
D. A frontier system handles voice input by attending directly to
   audio tokens and producing audio tokens as output, with no
   speech-to-text intermediate step.
E. A research team trains a vision model with a masked-prediction
   objective in embedding space, contrasted with the standard pixel-
   reconstruction approach used by most baselines.

Show answer

A: Threads 1 + 2 + 4. Thread 1 (tokenize-everything + one transformer): explicit. Thread 2 (fusion pushed earlier): from training step one is the deepest fusion. Thread 4 (generative pretraining dominates): “from the first training step” implies generative pretraining as default.
B: Thread 3 (tokenizer is the floor and ceiling). The improvement came from the autoencoder/tokenizer, not from the transformer. Canonical example of tokenizer-as-bottleneck.
C: Threads 5 + 6. Thread 5 (capability stack including production engineering); Thread 6 (scope-line discipline: A/B testing settles engineering questions; product strategy is informed but not settled by engineering).
D: Threads 1 + 2. Thread 1 (one transformer, tokenized audio); Thread 2 (fusion at the audio-to-output level, no intermediate text-fusion step). Hallmark of native multimodal.
E: Thread 4 (JEPA-style alternative to generative pretraining). Embedding-space prediction vs pixel reconstruction is the JEPA contrast in miniature.

The discriminating procedure: read the description for the architectural move, the training paradigm, and the deployment context, and route to the thread that organizes that move.

Try it yourself: in / out / next?

For each topic, identify whether it was covered in this track, deliberately deferred to other forums (a §6-watch-zone topic), or listed in “what we did not cover” (a content gap the track explicitly named).

A. The U-Net to DiT shift in image-generation diffusion.
B. How to evaluate a new image-generation system's adherence to
   journalistic disclosure standards.
C. The architecture of natively multimodal models like Chameleon
   and GPT-4o.
D. 3D scene generation with Gaussian splatting.
E. The clinical-trial-grade evidence required to claim a drug
   identified by a multimodal model works in humans.
F. Multimodal alignment safety techniques beyond deliberative
   alignment.

Show answer

A: COVERED IN THIS TRACK. Lesson 5 squarely. Architecture territory.
B: DELIBERATELY DEFERRED (sector-specific policy from L5/L6 §6 watch zone). Journalism has its own institutions and disclosure standards; the track defers there.
C: COVERED IN THIS TRACK. Lesson 3 squarely. Native multimodal architecture.
D: LISTED IN “WHAT WE DID NOT COVER”. 3D and 4D generation are a whole sub-field the closer explicitly names as a gap with its own dedicated literature.
E: DELIBERATELY DEFERRED (medical-AI from L8 §6 watch zone, instruments: clinical trials, regulatory review). The “ML benchmark vs clinical utility” gap is precisely the L8 named pitfall.
F: LISTED IN “WHAT WE DID NOT COVER”. L4 introduced deliberative alignment; broader multimodal alignment safety is named as an active research area the track did not cover.

The pattern: COVERED means the lesson teaches it; DEFERRED means the lesson names it as someone else’s conversation with explicit instruments; NOT COVERED means the track explicitly says “this is a real and important topic the track did not address; here’s where to look.”

Flashcards

Ten cards.

Q. State the unifying architectural pattern of modern multimodal AI.

Tokenize what you want to model, put the tokens through one transformer, train at scale. Every major frontier system in 2026 is a variation on this template.

Q. Describe the fusion-gets-pushed-earlier trajectory.

L2 bolt-on after the fact, L3 native from training step one, L5/L6 MM-DiT in the generative transformer, L9 RL co-design at the loss level. Every generation moves fusion deeper and earlier.

Q. Why is the tokenizer the floor and ceiling?

In any system operating on discrete codes, the tokenizer’s reconstruction quality bounds the system’s quality. A bigger transformer alone does not fix a poor tokenizer.

Q. What is JEPA’s bet against generative pretraining?

That predicting in embedding space (semantic-state prediction) does more useful work per unit of capacity than rendering surface detail. Research-strong, not production-dominant as of 2026.

Q. Name the five layers of a modern multimodal capability stack.

Perception, reasoning (chain-of-thought), tool use, alignment (e.g. deliberative alignment), and production engineering (RL co-design + latency + evaluation in deployment).

Q. State the operational scope test.

What instruments would you use to settle the question? Engineering instruments (benchmarks, A/B tests, FVD, latency) = technique territory. Different instruments (legal precedent, clinical trials, sectoral policy, philosophical argument) = different conversation.

Q. Name three trajectories multimodal AI is moving in.

Truly native everything (any-to-any, with every modality a first-class citizen), more efficient training paradigms (JEPA, MoE, sparsity), and production and product co-design (the L9 themes).

Q. Name three topics the track explicitly did NOT cover.

Any three of: embodied AI/robotics, 3D/4D generation, multimodal alignment safety beyond deliberative alignment, specific frontier-model technical reports, the economic/market story, long-context multimodal.

Q. What is the difference between COVERED, DEFERRED, and NOT COVERED in this track?

COVERED = the lesson teaches it. DEFERRED = a §6 watch-zone topic the lesson names as someone else’s conversation with explicit instruments. NOT COVERED = a gap the closer explicitly names, pointing to where the topic lives.

Q. What is the right next step from this track?

Depends on what pulled you: CS25 continues each year for the architectural side; JEPA + world-model papers for the training-paradigm side; production-engineering writing for the deployment side; adjacent Clawdemy tracks (T11, T13, T20) for foundational depth.