Joint embedding predictive architectures (JEPA) and world modeling
What you’ll learn
This is lesson 7 of Track 24, the opener of Phase 4 (Advanced multimodal directions). By the end you will be able to contrast generative pretraining (the dominant training paradigm in Phases 2 and 3) with JEPA’s embedding-space alternative, walk through the I-JEPA and V-JEPA recipes, and connect the framework to world modeling. The one capability to walk away with: given a training setup, identify whether the loss compares predicted-to-actual raw outputs or predicted-to-actual representations, and predict which kind of capability the model is biased toward.
The lesson maps to Hazel Nam and Lucas Maes’s CS25 V6 guest lecture (April 9, 2026); full attribution is in this lesson’s references.
Where this fits
This lesson opens Phase 4 by surfacing the most fully articulated alternative to generative pretraining in modern multimodal AI. Phases 2 and 3 both leaned on generative pretraining (predict the next token, the next denoising step, or the next spacetime patch); this lesson asks what changes when the prediction is moved into embedding space. The world-modeling extension sets up the rest of Phase 4: lesson 8 takes a specific multimodal-world-model application (drug discovery) into scientific territory; lesson 9 covers multimodal agents in production; lesson 10 closes the track with synthesis.
Before you start
Prerequisite: Lesson 3, Native multimodal intelligence. You need the generative-pretraining paradigm established there (one transformer, mixed-modality tokens, next-token prediction across the stream) because this lesson is the principled contrast to it. The diffusion-side generative work from lessons 5 and 6 also helps but is not strictly required; the contrast is with generative pretraining as a family, not with any single model.
By the end, you’ll be able to
- State the surface-reproduction-tax objection to generative pretraining
- Walk the JEPA training loop (mask, encode context and target, predict in embedding space)
- Distinguish I-JEPA from V-JEPA
- Explain the JEPA-style world-modeling connection
- Apply the operational scope test to JEPA + world-modeling claims
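As a preview of the training-loop objective above, here is a minimal sketch of the three steps (mask, encode context and target, predict in embedding space), with a generative-style loss alongside for contrast. Everything in it is an illustrative assumption rather than the I-JEPA or V-JEPA implementation: the toy sizes, the single-layer `encode` function, the mean-pooled predictor, and the choice to initialize the target encoder as a copy of the context encoder are all hypothetical stand-ins, and no gradient step is performed.

```python
import math
import random

random.seed(0)

def linear(vec, weights):
    """Multiply a vector by a weight matrix stored as a list of rows."""
    return [sum(v * w for v, w in zip(vec, col)) for col in zip(*weights)]

def random_matrix(rows, cols, scale=0.1):
    """Small random weights; a stand-in for learned parameters."""
    return [[random.gauss(0, scale) for _ in range(cols)] for _ in range(rows)]

def encode(patch, weights):
    """Toy encoder (hypothetical): one linear map followed by tanh."""
    return [math.tanh(x) for x in linear(patch, weights)]

def mean_vector(vectors):
    """Average a list of equal-length vectors, dimension by dimension."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def mse(pred, actual):
    """Mean squared error over two equally shaped lists of vectors."""
    diffs = [(p - a) ** 2 for pv, av in zip(pred, actual) for p, a in zip(pv, av)]
    return sum(diffs) / len(diffs)

# Assumed toy sizes: 8 patches of 16 raw values each, 4-dim embeddings.
NUM_PATCHES, PATCH_DIM, EMBED_DIM = 8, 16, 4
patches = [[random.gauss(0, 1) for _ in range(PATCH_DIM)] for _ in range(NUM_PATCHES)]

# Step 1: mask. The target patches are hidden from the context encoder.
target_idx = [2, 5]
context_idx = [i for i in range(NUM_PATCHES) if i not in target_idx]

# Step 2: encode context and target with separate encoders.
context_weights = random_matrix(PATCH_DIM, EMBED_DIM)
target_weights = [row[:] for row in context_weights]  # assumption: starts as a copy
context_emb = [encode(patches[i], context_weights) for i in context_idx]
target_emb = [encode(patches[i], target_weights) for i in target_idx]

# Step 3: predict the target embeddings from the context embeddings.
predictor_weights = random_matrix(EMBED_DIM, EMBED_DIM)
pooled = mean_vector(context_emb)
predicted_emb = [linear(pooled, predictor_weights) for _ in target_idx]

# JEPA-style loss: predicted representations vs actual representations.
jepa_loss = mse(predicted_emb, target_emb)

# Generative-style loss, for contrast: predicted raw values vs actual raw values.
decoder_weights = random_matrix(EMBED_DIM, PATCH_DIM)
reconstructed = [linear(pooled, decoder_weights) for _ in target_idx]
generative_loss = mse(reconstructed, [patches[i] for i in target_idx])

print(f"JEPA-style loss (embedding space): {jepa_loss:.4f}")
print(f"Generative-style loss (raw patch space): {generative_loss:.4f}")
```

The line to look at in any training setup is the one that computes the loss: here `jepa_loss` compares vectors of size `EMBED_DIM`, while `generative_loss` compares vectors of size `PATCH_DIM`, which is the raw-output versus representation distinction this lesson asks you to identify.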
Time and difficulty
- Read time: about 13 minutes
- Practice time: about 15 minutes (a generative-or-JEPA classification, an operational scope-test exercise, and flashcards)
- Difficulty: standard