References: Joint embedding predictive architectures (JEPA) and world modeling

Source material

• Stanford CS25 V6 (April 9, 2026):
  "From Representation Learning to World Modeling through Joint Embedding
   Predictive Architectures"
  Speakers: Hazel Nam and Lucas Maes (Brown University)
  YouTube: https://www.youtube.com/watch?v=GBd7iuJkW08
  Course site: https://web.stanford.edu/class/cs25/
  License (lecture video): as published on Stanford's public CS25 YouTube
                           channel (link-out only)

Clawdemy provides original notes, summaries, and quizzes derived from this
material for educational purposes. All rights to the original lecture remain
with Stanford and the speakers.

What this lesson draws from each source

Nam and Maes’s CS25 V6 lecture anchors the topic and supplies the bridge from representation learning (I-JEPA / V-JEPA) to world modeling. This lesson mirrors the lecture’s structural arc, “from representation learning to world modeling.”
Clawdemy’s own connective tissue covers the rest: the explicit recap of generative pretraining as the dominant objective in Phases 2-3, the “surface-reproduction tax” articulation, the side-by-side generative-vs-JEPA comparison table, and the operational scope test applied to JEPA and world-modeling territory.

Going deeper

“Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture” (Assran et al., I-JEPA, 2023). The I-JEPA paper and the reference account of the recipe described here. Section 3 walks through the architecture and training loop in detail.
“Revisiting Feature Prediction for Learning Visual Representations from Video” (Bardes et al., V-JEPA, 2024). The V-JEPA paper, extending the I-JEPA recipe to spacetime patches of video.
“A Path Towards Autonomous Machine Intelligence” (LeCun, 2022). LeCun’s white paper laying out the world-modeling thesis that JEPA instantiates. It is a position paper rather than an experimental one, but it is the strongest single account of why he argues this direction matters.

Adjacent topics

Self-supervised learning more broadly. Masked autoencoding (MAE), contrastive learning (SimCLR family), and JEPA all approach the question “learn good representations without labels” from different angles. Reading them as a family clarifies the tradeoffs.
World models in reinforcement learning. The Dreamer family of model-based RL systems is one of the longest-running lines of work on world models for planning. Comparing its generative frame-prediction approach with a JEPA-style alternative is an active research frontier.
Multimodal world models for science (the next lesson). Takes the world-modeling idea into a specific scientific application (drug discovery) where multimodal data streams need to be fused, and shows how the framing pays off in practice.
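
As a rough illustration of how the three self-supervised families named above (MAE, SimCLR-style contrastive learning, JEPA) differ, the sketch below writes each objective as a toy loss over plain Python lists. It is a simplification under stated assumptions, not the papers' implementations: all function names are hypothetical, mean squared error stands in for each paper's exact distance, the contrastive loss is a bare InfoNCE form, and the moving-average target update reflects the target-encoder scheme described in the I-JEPA paper only in outline.

```python
import math
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def pixel_reconstruction_loss(predicted_pixels, true_pixels) -> float:
    """MAE-style objective: the error is measured in input (pixel) space."""
    return mse(predicted_pixels, true_pixels)


def latent_prediction_loss(predicted_embedding, target_embedding) -> float:
    """JEPA-style objective: the error is measured in embedding space.

    The target embedding is assumed to come from a separate target encoder
    and to be treated as a constant (no gradient flows through it).
    """
    return mse(predicted_embedding, target_embedding)


def info_nce_loss(similarities, positive_index: int, temperature: float = 0.1) -> float:
    """Contrastive objective: cross-entropy for picking the positive pair
    out of a list of candidate similarities (simplified InfoNCE)."""
    logits = [s / temperature for s in similarities]
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum_exp - logits[positive_index]


def ema_update(target_weights, online_weights, momentum: float = 0.99):
    """Move target-encoder weights slowly toward the online encoder's weights."""
    return [momentum * t + (1.0 - momentum) * o
            for t, o in zip(target_weights, online_weights)]


if __name__ == "__main__":
    print(pixel_reconstruction_loss([0.2, 0.8], [0.0, 1.0]))
    print(latent_prediction_loss([0.5, -0.5], [0.4, -0.6]))
    print(info_nce_loss([0.9, 0.1, 0.2], positive_index=0))
    print(ema_update([1.0, 1.0], [0.0, 2.0], momentum=0.9))
```

The two regression losses are deliberately identical in form. What changes between the MAE-style and JEPA-style objectives is the space the targets live in, which is the tradeoff the reading list above explores.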

Community discussion

None selected for this lesson yet. The I-JEPA and V-JEPA papers, together with LeCun’s position paper, are the strongest public account of the direction. If a canonical secondary discussion surfaces, it will be added at the next review.