Joint embedding predictive architectures (JEPA) and world modeling

This is lesson 7 of Track 24, the opener of Phase 4 (Advanced multimodal directions). By the end you will be able to contrast generative pretraining (the dominant training paradigm in Phases 2 and 3) with JEPA’s embedding-space alternative, walk through the I-JEPA and V-JEPA recipes, and connect the framework to world modeling. The one capability to walk away with: given a training setup, identify whether the loss compares predictions with targets as raw outputs or as representations, and predict which kind of capability the model is biased toward.

The lesson maps to Hazel Nam and Lucas Maes’s CS25 V6 guest lecture (April 9, 2026); full attribution is in this lesson’s references.

This lesson opens Phase 4 by surfacing the most fully articulated alternative to generative pretraining in modern multimodal AI. Phases 2 and 3 both leaned on generative pretraining (predict the next token, the next denoising step, or the next spacetime patch); this lesson asks what changes if the prediction is moved into embedding space. The world-modeling extension sets up the rest of Phase 4: lesson 8 takes a specific multimodal world-model application (drug discovery) into scientific territory; lesson 9 covers multimodal agents in production; lesson 10 closes the track with synthesis.
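
To make "moving the prediction into embedding space" concrete, here is a toy sketch of the two kinds of loss. Everything in it is an illustrative assumption rather than part of any published recipe: `toy_target_encoder` is a hand-written stand-in for a learned encoder, and the patches are four-number lists rather than real image data.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def toy_target_encoder(patch):
    """Hypothetical encoder: keeps only the mean and the value range of a patch."""
    return [sum(patch) / len(patch), max(patch) - min(patch)]

def generative_loss(predicted_pixels, target_pixels):
    """Generative family: compare the prediction with the raw target."""
    return mse(predicted_pixels, target_pixels)

def embedding_loss(predicted_embedding, target_pixels):
    """JEPA-style: compare the prediction with the encoded target."""
    return mse(predicted_embedding, toy_target_encoder(target_pixels))

# Two targets with the same coarse statistics but different surface detail.
target_a = [0.25, 0.5, 0.75, 1.0]
target_b = [1.0, 0.75, 0.5, 0.25]

predicted_pixels = [0.25, 0.5, 0.75, 1.0]   # a raw-output prediction
predicted_embedding = [0.625, 0.75]         # an embedding-space prediction

raw_a = generative_loss(predicted_pixels, target_a)
raw_b = generative_loss(predicted_pixels, target_b)
emb_a = embedding_loss(predicted_embedding, target_a)
emb_b = embedding_loss(predicted_embedding, target_b)
print(raw_a, raw_b, emb_a, emb_b)
```

The raw-output loss separates the two targets (0.0 against 0.3125), so the model is pushed to reproduce surface detail. The embedding-space loss is 0.0 for both, because this toy encoder discards the detail that differs between them. With a learned encoder, what gets discarded is decided by training rather than by hand.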

Prerequisite: Lesson 3, Native multimodal intelligence. You need the generative-pretraining paradigm established there (one transformer, mixed-modality tokens, next-token prediction across the stream) because this lesson is the principled contrast to it. The diffusion-side generative work from lessons 5 and 6 also helps but is not strictly required; the contrast is with generative pretraining as a family, not with any single model.

  • State the surface-reproduction-tax objection to generative pretraining
  • Walk the JEPA training loop (mask, encode context and target, predict in embedding space)
  • Distinguish I-JEPA from V-JEPA
  • Explain the JEPA-style world-modeling connection
  • Apply the operational scope test to JEPA + world-modeling claims
  • Read time: about 13 minutes
  • Practice time: about 15 minutes (a generative-or-JEPA classification, an operational scope-test exercise, and flashcards)
  • Difficulty: standard
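
As a preview of the training loop named in the objectives (mask, encode context and target, predict in embedding space), here is a minimal forward-pass sketch in plain Python. It is a sketch under stated assumptions, not the I-JEPA or V-JEPA implementation: the linear `encode`, the mean-pooling `predict`, the `position_scale` parameter, and all values are hypothetical. The `ema_update` step reflects the published I-JEPA practice of keeping the target encoder as an exponential moving average of the context encoder; this overview does not state that detail, so treat it as an assumption here.

```python
def split_patches(patches, target_indices):
    """Mask step: hide the target patches and keep the rest as context."""
    hidden = set(target_indices)
    context = [(i, p) for i, p in enumerate(patches) if i not in hidden]
    targets = [(i, p) for i, p in enumerate(patches) if i in hidden]
    return context, targets

def encode(patch, weights):
    """Hypothetical linear encoder: one embedding dimension per weight row."""
    return [sum(w * x for w, x in zip(row, patch)) for row in weights]

def predict(context_embeddings, target_index, position_scale):
    """Hypothetical predictor: mean context embedding plus a position term."""
    count = len(context_embeddings)
    dim = len(context_embeddings[0])
    pooled = [sum(e[d] for e in context_embeddings) / count for d in range(dim)]
    return [v + position_scale * target_index for v in pooled]

def jepa_loss(patches, target_indices, context_weights, target_weights,
              position_scale):
    """Average squared error between predicted and actual target embeddings."""
    context, targets = split_patches(patches, target_indices)
    context_embeddings = [encode(p, context_weights) for _, p in context]
    total = 0.0
    for index, patch in targets:
        predicted = predict(context_embeddings, index, position_scale)
        actual = encode(patch, target_weights)  # raw pixels never enter the loss
        total += sum((a - b) ** 2 for a, b in zip(predicted, actual)) / len(actual)
    return total / len(targets)

def ema_update(target_weights, context_weights, momentum):
    """Assumed target-encoder update: moving average of the context encoder."""
    return [
        [momentum * t + (1.0 - momentum) * c for t, c in zip(t_row, c_row)]
        for t_row, c_row in zip(target_weights, context_weights)
    ]

# Four two-pixel "patches"; patch 2 is masked out as the prediction target.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
identity = [[1.0, 0.0], [0.0, 1.0]]

loss = jepa_loss(patches, [2], identity, identity, position_scale=0.0)
new_target = ema_update([[0.0, 0.0], [0.0, 0.0]], identity, momentum=0.5)
print(loss, new_target)
```

A real training step would backpropagate this loss into the context encoder and predictor only, then refresh the target encoder. The sketch stops at the forward pass so that the three steps in the objective stay visible.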