Cheatsheet: Joint embedding predictive architectures (JEPA) and world modeling
Two training paradigms
| Paradigm | Predicts | Loss compares | Capacity goes to |
|---|---|---|---|
| Generative pretraining | raw outputs (next token, next pixel, next noise step) | predicted vs actual raw output | rendering surface detail + representation |
| JEPA | embeddings (target representation vector) | predicted vs actual embedding | representation only |
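
To make the "Loss compares" column concrete, the sketch below scores the same guess in both spaces. `toy_encoder` is a hypothetical stand-in that keeps only a patch's mean and range, and the pixel values are made up for illustration. A real JEPA predictor outputs the embedding directly instead of encoding a pixel guess; the guess is encoded here only to show what an embedding-space loss ignores.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def toy_encoder(pixels):
    """Hypothetical encoder: summarizes a patch as [mean, range]."""
    return [sum(pixels) / len(pixels), max(pixels) - min(pixels)]


actual_patch = [0.25, 0.5, 0.75, 1.0]
# Same coarse content (mean and range), different fine detail.
guess_patch = [0.25, 0.75, 0.5, 1.0]

# Generative-style loss: compares raw values, so the swapped detail is penalized.
generative_loss = mse(guess_patch, actual_patch)

# JEPA-style loss: compares embeddings, so detail the encoder discards costs nothing.
jepa_style_loss = mse(toy_encoder(guess_patch), toy_encoder(actual_patch))

print(generative_loss, jepa_style_loss)  # 0.03125 0.0
```

The raw-space loss is non-zero because two pixels are swapped, while the embedding-space loss is zero because this toy encoder does not represent that detail.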
The JEPA training loop
| Step | Action |
|---|---|
| 1 | take input; mask several regions |
| 2 | encode visible portion with CONTEXT encoder -> context vector |
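| 3 | encode masked portion with TARGET encoder -> target vectors (in I-JEPA the target encoder is an exponential-moving-average copy of the context encoder and receives no gradients) |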
| 4 | train PREDICTOR to map context vector to target vectors |
| 5 | loss is in embedding space (vector distance), not raw pixels |
| Note | self-supervised: the targets come from the input itself, so no labels are needed |
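
The steps above can be sketched as a single update function. This is a deliberately tiny toy under stated assumptions, not the I-JEPA implementation: each "network" is one hypothetical scalar weight (`w_ctx`, `w_tgt`, `w_pred`), embeddings are scalars, gradients are derived by hand, and the predictor is not conditioned on target position. The exponential-moving-average update of the target encoder follows the I-JEPA recipe; the learning rate and EMA factor are arbitrary.

```python
def jepa_step(x, masked_idx, params, lr=0.05, ema=0.99):
    """One toy JEPA update on a 1-D input; returns the loss before the update."""
    # Step 1: split the input into visible and masked parts.
    visible = [v for i, v in enumerate(x) if i not in masked_idx]
    masked = [x[i] for i in sorted(masked_idx)]

    # Step 2: context encoder -> context vector (a scalar in this toy).
    vis_mean = sum(visible) / len(visible)
    context = params["w_ctx"] * vis_mean

    # Step 3: target encoder -> target vectors, treated as constants.
    targets = [params["w_tgt"] * v for v in masked]

    # Step 4: predictor maps the context vector to a predicted target.
    pred = params["w_pred"] * context

    # Step 5: squared distance in embedding space, averaged over targets.
    errors = [pred - t for t in targets]
    loss = sum(e * e for e in errors) / len(errors)

    # Hand-derived gradients of the loss w.r.t. predictor and context encoder.
    dloss_dpred = 2 * sum(errors) / len(errors)
    grad_w_pred = dloss_dpred * context
    grad_w_ctx = dloss_dpred * params["w_pred"] * vis_mean
    params["w_pred"] -= lr * grad_w_pred
    params["w_ctx"] -= lr * grad_w_ctx

    # The target encoder trails the context encoder (no gradient step).
    params["w_tgt"] = ema * params["w_tgt"] + (1 - ema) * params["w_ctx"]
    return loss


params = {"w_ctx": 1.0, "w_tgt": 1.0, "w_pred": 0.5}
x = [0.5, 1.5, 1.0, 1.0]          # four "patches"; the last two are masked
first_loss = jepa_step(x, {2, 3}, params)
for _ in range(200):
    last_loss = jepa_step(x, {2, 3}, params)
```

No label appears anywhere in the function: the training signal is the target encoder's own output on the masked patches.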
I-JEPA vs V-JEPA
| Variant | Domain | Patches |
|---|---|---|
| I-JEPA | images | 2D image patches |
| V-JEPA | video | 3D spacetime patches |
Both come from Meta AI, from the group around Yann LeCun. The recipe is the same; only the patchification differs.
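
The difference in patchification shows up directly in the number of tokens the encoders see. The helper names and the sizes below (224x224 input, 16-pixel patches, 16 frames, 2-frame tubelets) are illustrative assumptions, not values taken from this cheatsheet.

```python
def num_patches_2d(height, width, patch):
    """Count non-overlapping patch x patch tiles in one image."""
    return (height // patch) * (width // patch)


def num_patches_3d(frames, height, width, tubelet, patch):
    """Count non-overlapping tubelet x patch x patch blocks in a video clip."""
    return (frames // tubelet) * num_patches_2d(height, width, patch)


image_tokens = num_patches_2d(224, 224, patch=16)
video_tokens = num_patches_3d(16, 224, 224, tubelet=2, patch=16)
print(image_tokens, video_tokens)  # 196 1568
```

Masking then operates on these units: 2D blocks of patches for images, and blocks that also extend through time for video.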
What JEPA buys (the bet)
| Claim | What it means |
|---|---|
| Representation quality | competitive or better downstream-task performance at equal compute |
| Sample efficiency | capacity not spent rendering means more semantic learning per parameter |
| Scalable abstraction | embedding-space prediction works at any timescale (next second / minute / planning step) without changing the loss family |
World-modeling connection
| Model type | Predicts | Capacity goes to | Match to planning |
|---|---|---|---|
| Generative world model | future raw frames | rendering plausible-looking futures | mismatched (planning does not need pixels) |
| JEPA-style world model | future embeddings | semantic structure of future states | matched (planning uses the same abstraction) |
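
The "matched" entry can be illustrated with a one-step planner that never touches pixels. Everything here is a hypothetical stand-in: `ACTION_EFFECTS` plays the role of a learned latent dynamics model, and the two-dimensional embeddings are made up. The point is only that the planner compares predicted and goal states in the same embedding space the model predicts in.

```python
# Hypothetical latent dynamics: each action shifts the state embedding.
ACTION_EFFECTS = {
    "left": [-1.0, 0.0],
    "right": [1.0, 0.0],
    "up": [0.0, 1.0],
    "down": [0.0, -1.0],
}


def sq_dist(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict_next(state_emb, action):
    """Predict the next state embedding for an action (no frames rendered)."""
    return [s + d for s, d in zip(state_emb, ACTION_EFFECTS[action])]


def plan_one_step(state_emb, goal_emb):
    """Pick the action whose predicted embedding lands closest to the goal."""
    return min(
        ACTION_EFFECTS,
        key=lambda a: sq_dist(predict_next(state_emb, a), goal_emb),
    )


best = plan_one_step([0.0, 0.0], [3.0, 1.0])
print(best)  # right
```

A generative world model would have to render a frame for each candidate action and then re-encode or score it, which is the mismatch the table refers to.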
Where JEPA sits in 2026
| Stratum | Status |
|---|---|
| Representation benchmarks | competitive with generative pretraining; sometimes ahead |
| Production multimodal systems | generative pretraining still dominant |
| Open question | will JEPA-style displace generative pretraining? Watch the next 2-3 years |
Common confusions
| Confusion | Reality |
|---|---|
| “JEPA replaces transformers” | no - JEPA is a training paradigm; the encoders are typically transformers |
| “JEPA solved world modeling” | no - it is an architectural proposal, not a solved problem |
| “Embedding prediction is always better” | no - generation tasks need raw-output prediction; a JEPA predicts embeddings and does not by itself produce raw output |
| “Generative pretraining is wasteful, full stop” | the surface-reproduction tax is real, but the paradigm still underlies every production system today |
The operational scope test (carry-forward from Phase 3)
| If the question is settled by… | It is… |
|---|---|
| benchmarks, planning-task performance, interpretability tools | IN scope (technical) |
| autonomy philosophy, accountability/legal frameworks, institutional governance | OUT of scope (different conversation) |