Cheatsheet: Joint embedding predictive architectures (JEPA) and world modeling

| Paradigm | Predicts | Loss compares | Capacity goes to |
|---|---|---|---|
| Generative pretraining | raw outputs (next token, next pixel, next noise step) | predicted vs actual raw output | rendering surface detail + representation |
| JEPA | embeddings (target representation vector) | predicted vs actual embedding | representation only |
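
The difference in the "Loss compares" column can be made concrete with a toy example. The sketch below is illustrative only: `toy_encoder` is a hypothetical, hand-written stand-in (block averaging) for a learned encoder, and the 16-pixel "scene" is invented. It shows that a raw-output loss charges for per-pixel texture, while a loss computed on embeddings that discard that texture is smaller.

```python
import random

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def toy_encoder(pixels, block=4):
    """Hypothetical stand-in for a learned encoder: average each block of pixels.
    Averaging throws away fine detail, which is the property being illustrated."""
    return [sum(pixels[i:i + block]) / block for i in range(0, len(pixels), block)]

rng = random.Random(0)

# An invented 1D "scene" with coarse structure: four flat regions of 4 pixels.
clean = [0.0] * 4 + [1.0] * 4 + [0.0] * 4 + [1.0] * 4
# The same scene with per-pixel surface texture added.
noisy = [p + rng.uniform(-0.3, 0.3) for p in clean]

# Generative-style comparison: predicted vs actual raw output.
raw_loss = mse(noisy, clean)
# JEPA-style comparison: predicted vs actual embedding.
emb_loss = mse(toy_encoder(noisy), toy_encoder(clean))

print(f"raw-output loss: {raw_loss:.4f}")
print(f"embedding loss:  {emb_loss:.4f}")
```

With block averaging, the embedding loss can never exceed the raw loss (the squared mean of the noise in a block is at most the mean of its squares), so the gap here is structural rather than a lucky seed. A learned encoder gives no such guarantee; the example only illustrates where the loss is measured.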

| Step | Action |
|---|---|
| 1 | take input; mask several regions |
| 2 | encode visible portion with CONTEXT encoder -> context vector |
| 3 | encode masked portion with TARGET encoder -> target vectors |
| 4 | train PREDICTOR to map context vector to target vectors |
| 5 | loss is in embedding space (vector distance), not raw pixels |

Self-supervised: no labels needed.
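
The five steps can be sketched in plain Python. This is a minimal sketch under stated assumptions, not a reference implementation: the encoders and predictor are tiny linear maps, the target encoder is a fixed copy of the context encoder, the per-position offsets in `pos` are a hypothetical way of telling the predictor which masked region it is predicting, and only the predictor is updated. All names and sizes are invented for illustration.

```python
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rand_matrix(rows, cols, rng, scale=0.5):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def mean_vectors(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def embedding_loss(preds, targets):
    """Step 5: mean squared distance, measured in embedding space."""
    total = sum((p - t) ** 2 for pv, tv in zip(preds, targets) for p, t in zip(pv, tv))
    return total / (len(preds) * len(preds[0]))

def jepa_forward(patches, masked_idx, W_ctx, W_tgt, P, pos):
    # Step 1: split the input into visible and masked regions.
    masked = set(masked_idx)
    visible = [p for i, p in enumerate(patches) if i not in masked]
    # Step 2: CONTEXT encoder on the visible portion -> one context vector.
    ctx = mean_vectors([matvec(W_ctx, p) for p in visible])
    # Step 3: TARGET encoder on the masked portion -> target vectors
    # (treated as constants; no update reaches W_tgt in this sketch).
    targets = [matvec(W_tgt, patches[i]) for i in masked_idx]
    # Step 4: PREDICTOR maps the context vector to one prediction per target,
    # using a hypothetical per-position offset to distinguish the targets.
    base = matvec(P, ctx)
    preds = [[b + e for b, e in zip(base, pos[i])] for i in masked_idx]
    return ctx, preds, targets, embedding_loss(preds, targets)

def predictor_sgd_step(P, ctx, preds, targets, lr):
    """One manual gradient step on the predictor matrix for the MSE loss."""
    n, d = len(preds), len(preds[0])
    new_P = []
    for r in range(d):
        err_r = sum(pv[r] - tv[r] for pv, tv in zip(preds, targets))
        grad_scale = 2.0 * err_r / (n * d)
        new_P.append([w - lr * grad_scale * c for w, c in zip(P[r], ctx)])
    return new_P

rng = random.Random(0)
patch_dim, emb_dim, n_patches = 6, 4, 8
patches = [[rng.uniform(-1, 1) for _ in range(patch_dim)] for _ in range(n_patches)]
masked_idx = [2, 5, 6]

W_ctx = rand_matrix(emb_dim, patch_dim, rng)
W_tgt = [row[:] for row in W_ctx]          # assumption: fixed copy of the context encoder
P = rand_matrix(emb_dim, emb_dim, rng)
pos = [[rng.uniform(-0.1, 0.1) for _ in range(emb_dim)] for _ in range(n_patches)]

ctx, preds, targets, loss_before = jepa_forward(patches, masked_idx, W_ctx, W_tgt, P, pos)
P = predictor_sgd_step(P, ctx, preds, targets, lr=0.1)
_, _, _, loss_after = jepa_forward(patches, masked_idx, W_ctx, W_tgt, P, pos)

print(f"loss before: {loss_before:.4f}, after one predictor step: {loss_after:.4f}")
```

No labels appear anywhere: the targets are produced from the input itself, which is what makes the procedure self-supervised. How the context and target encoders are trained or kept in sync is deliberately left out of the sketch.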

| Variant | Domain | Patches |
|---|---|---|
| I-JEPA | images | 2D image patches |
| V-JEPA | video | 3D spacetime patches |
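
To illustrate the "Patches" column, the sketch below enumerates non-overlapping patch positions for an image and for a video clip. The function names and the sizes (224x224 frames, 16x16 patches, 16 frames, a temporal extent of 2 frames per patch) are assumptions chosen for illustration, not values taken from this cheatsheet.

```python
def image_patch_grid(height, width, patch):
    """Top-left (row, col) corner of each non-overlapping 2D patch."""
    return [
        (r, c)
        for r in range(0, height - patch + 1, patch)
        for c in range(0, width - patch + 1, patch)
    ]

def video_patch_grid(frames, height, width, tube, patch):
    """(frame, row, col) corner of each non-overlapping 3D spacetime patch.
    Each patch spans `tube` frames and a `patch` x `patch` spatial window."""
    return [
        (t, r, c)
        for t in range(0, frames - tube + 1, tube)
        for (r, c) in image_patch_grid(height, width, patch)
    ]

# Illustrative sizes only.
image_patches = image_patch_grid(224, 224, patch=16)
video_patches = video_patch_grid(16, 224, 224, tube=2, patch=16)

print(len(image_patches))  # 196  (14 x 14)
print(len(video_patches))  # 1568 (8 x 14 x 14)
```

Everything downstream of patchification (masking, encoding, predicting, embedding-space loss) can operate on the resulting list of patches without knowing whether they came from an image or a clip.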

Both come from Meta AI, out of Yann LeCun's research program. Same recipe, different patchification.

| Claim | What it means |
|---|---|
| Representation quality | competitive or better downstream-task performance at equal compute |
| Sample efficiency | capacity not spent rendering means more semantic learning per parameter |
| Scalable abstraction | embedding-space prediction works at any timescale (next second / minute / planning step) without changing the loss family |

| Model type | Predicts | Capacity goes to | Match to planning |
|---|---|---|---|
| Generative world model | future raw frames | rendering plausible-looking futures | mismatched (planning does not need pixels) |
| JEPA-style world model | future embeddings | semantic structure of future states | matched (planning uses the same abstraction) |
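
The "matched" entry can be illustrated with a toy planner that never leaves embedding space. This is a sketch under heavy assumptions: `step_latent` is a hand-written stand-in for a learned predictor, the 2D embeddings and the four actions are invented, and exhaustive search stands in for whatever planner would be used in practice.

```python
from itertools import product

# Hypothetical action set: each action shifts the embedding by a fixed offset.
ACTIONS = {
    "left": (-1.0, 0.0),
    "right": (1.0, 0.0),
    "up": (0.0, 1.0),
    "down": (0.0, -1.0),
}

def step_latent(z, action_name):
    """Stand-in for a learned predictor: next embedding given current one and an action."""
    dx, dy = ACTIONS[action_name]
    return (z[0] + dx, z[1] + dy)

def sq_dist(a, b):
    """Squared distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def plan(z_start, z_goal, horizon):
    """Score every action sequence by rolling it out in embedding space only.
    No future frame is ever rendered; the cost is a distance between embeddings."""
    best_seq, best_cost = None, float("inf")
    for seq in product(ACTIONS, repeat=horizon):
        z = z_start
        for name in seq:
            z = step_latent(z, name)
        cost = sq_dist(z, z_goal)
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq, best_cost

best_seq, best_cost = plan(z_start=(0.0, 0.0), z_goal=(2.0, 1.0), horizon=3)
print(best_seq, best_cost)  # ('right', 'right', 'up') 0.0
```

The rollout loop is also where the timescale lives: a longer `horizon`, or a predictor that jumps further per step, changes what is being predicted without changing the distance used to score it.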

| Stratum | Status |
|---|---|
| Representation benchmarks | competitive with generative pretraining; sometimes ahead |
| Production multimodal systems | generative pretraining still dominant |
| Open question | will JEPA-style pretraining displace generative pretraining? Watch the next 2-3 years |

| Confusion | Reality |
|---|---|
| “JEPA replaces transformers” | No. JEPA is a training paradigm; the encoders are typically transformers |
| “JEPA solved world modeling” | No. It is an architectural proposal, not a solved problem |
| “Embedding prediction is always better” | No. Generation tasks need raw-output prediction; JEPA alone cannot generate the output you want |
| “Generative pretraining is wasteful, full stop” | The surface-reproduction tax is real, but generative pretraining still underpins every production system today |

The operational scope test (carry-forward from Phase 3)


| If the question is settled by… | It is… |
|---|---|
| benchmarks, planning-task performance, interpretability tools | IN scope (technical) |
| autonomy philosophy, accountability/legal frameworks, institutional governance | OUT of scope (different conversation) |