Over the past couple of weeks, several new papers on JEPA (Joint Embedding Predictive Architecture), like V-JEPA 2.1, LeWorldModel, and ThinkJEPA, have been released – and they turned out to be not just incremental but foundational.

This made us realize that we’re missing a clear understanding of JEPA’s step-by-step evolution. By reconstructing this full picture, we can trace a distinct trajectory of AI models moving from static perception to dynamic world modeling.

So we invite you to walk this path with us through 14 of the most important and influential JEPA variants:

P.S.: If you’re new to JEPA, it is a self-supervised learning framework proposed by Yann LeCun. Instead of reconstructing the original input signal, it learns representations by predicting the target embeddings of masked or future inputs in a latent space, conditioned on context embeddings. The core idea is to learn predictive, abstract representations of the world that support reasoning and planning, moving toward more human-like AI. → All the basics you need to know about JEPA.
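To make the wiring concrete, here is a minimal numpy sketch of the latent-prediction idea: a context encoder sees the masked input, a target encoder (updated as an EMA of the context encoder) sees the full input, and a predictor maps context embeddings to target embeddings, with the loss computed entirely in latent space. The linear encoders and all names (`W_ctx`, `W_tgt`, `jepa_loss`, the dimensions) are our own toy choices – real JEPA models use Vision Transformers – but the structure of the objective is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions, purely illustrative.
D_IN, D_LATENT = 16, 8

# Context encoder, predictor, and target encoder as plain linear maps.
W_ctx = rng.normal(size=(D_IN, D_LATENT)) * 0.1
W_pred = np.eye(D_LATENT)
W_tgt = W_ctx.copy()  # target encoder starts as a copy of the context encoder

def jepa_loss(x, mask):
    """Predict the latent of the full input from the latent of the visible part."""
    context = np.where(mask, 0.0, x)   # masked-out view for the context encoder
    z_ctx = context @ W_ctx            # context embedding
    z_tgt = x @ W_tgt                  # target embedding (no gradient in practice)
    z_hat = z_ctx @ W_pred             # predicted target embedding
    return float(np.mean((z_hat - z_tgt) ** 2))  # loss lives in latent space

x = rng.normal(size=D_IN)
mask = rng.random(D_IN) < 0.5          # mask roughly half of the input
loss = jepa_loss(x, mask)

# After each gradient step on W_ctx / W_pred, the target encoder
# is updated as an exponential moving average of the context encoder:
tau = 0.99
W_tgt = tau * W_tgt + (1 - tau) * W_ctx
```

Note what is absent: no pixel reconstruction and no contrastive negatives – the only training signal is agreement between predicted and actual embeddings.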

  1. JEPA / H-JEPA
    This is the conceptual root and a starting point. Yann LeCun’s framework defines JEPA as prediction in representation space, and H-JEPA adds the crucial idea of hierarchical, multi-timescale world modeling and planning. → Read more

  2. I-JEPA

    The first major concrete success. I-JEPA showed that JEPA could learn semantic image representations without hand-crafted augmentations, and that the approach scaled well with Vision Transformers and large datasets. This is the point where JEPA became a serious practical recipe, competing with masked modeling and contrastive SSL approaches. → Read more

  3. MC-JEPA
    MC-JEPA is more of an exploratory step than a core milestone. It attempts to jointly learn motion and content features in a shared encoder, helping illustrate early efforts to extend JEPA from static images toward dynamic understanding. → Read more

  4. V-JEPA

    One of the central pillars of the JEPA story: the leap from images to video-based latent prediction. V-JEPA showed that predictive feature learning can scale to large video datasets and learn strong motion and appearance representations without relying on reconstruction or contrastive objectives. → Read more

  5. Audio-JEPA

    Audio-JEPA proved that JEPA is not only for vision but modality-general. It extends the approach to audio spectrograms using latent prediction and time-frequency-aware masking, showing strong performance on audio and speech tasks. → Read more

  6. Point-JEPA
    Point-JEPA is one of the key 3D branches. It adapts JEPA specifically to point cloud data, avoids raw-space reconstruction, and shows that JEPA can work efficiently on geometric representations. → Read more

  7. 3D-JEPA
    Broadens the 3D story beyond point clouds into more general 3D representation learning. JEPA becomes a framework for full 3D semantics. → Read more

  8. ACT-JEPA

    The clearest bridge from JEPA to action and policy learning. It jointly predicts action sequences and latent observation sequences, showing improved world-model quality and better task performance. This is where JEPA starts to look like a full control architecture. → Read more

  9. V-JEPA 2

    It is the point where JEPA becomes an explicit world model for understanding, prediction, and planning. It demonstrates zero-shot robotic planning with visual subgoals in unseen environments. This is a major milestone in the family. → Read more

  10. LeJEPA

    This is the theory-and-training cleanup layer. LeJEPA introduces a simpler and more stable objective (SIGReg), argues for isotropic embedding structure, removes many heuristics, and emphasizes scalability and efficiency. It helps make JEPA more principled and easier to train. → Read more

  11. Causal-JEPA
    A conceptual extension that pushes JEPA toward object-centric and causal reasoning. By introducing object-level masking, it encourages learning more structured and causally meaningful representations, with improvements in reasoning and planning efficiency. → Read more

  12. V-JEPA 2.1
    If V-JEPA 2 is the world-model milestone, this is the representation-quality upgrade. V-JEPA 2.1 extends the V-JEPA 2 line with dense predictive losses, improved self-supervision, and better feature quality across images and videos, while also improving robotics and dense understanding benchmarks. → Read more

  13. LeWorldModel
    Presents a clean, end-to-end JEPA-style world model trained from raw pixels with a minimal objective, reducing training complexity and enabling faster planning compared to heavier foundation-model-based pipelines. → Read more

  14. ThinkJEPA
    Represents a forward-looking direction: combining JEPA world models with a semantic “thinking” pathway derived from vision-language models. It targets long-horizon reasoning and planning, going beyond local prediction. → Read more
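To ground the isotropic-embedding idea that LeJEPA (item 10) emphasizes: "isotropic" means every direction in latent space carries equal variance, which rules out the collapsed or degenerate solutions that JEPA training otherwise guards against with heuristics (stop-gradients, EMA schedules, and so on). The snippet below is not SIGReg itself – for the exact objective, see the LeJEPA paper – just a simple stand-in penalty of our own construction that measures how far a batch of embeddings is from that isotropic ideal.

```python
import numpy as np

def isotropy_penalty(z):
    """Frobenius distance between the batch embedding covariance and identity.

    NOT the SIGReg objective from LeJEPA; a toy illustration of the kind
    of constraint 'isotropic embeddings' implies.
    """
    z = z - z.mean(axis=0)              # center the batch
    cov = (z.T @ z) / (len(z) - 1)      # empirical covariance matrix
    eye = np.eye(cov.shape[0])
    return float(np.sum((cov - eye) ** 2))

rng = np.random.default_rng(0)
z_iso = rng.normal(size=(4096, 8))      # ~isotropic embeddings: low penalty
z_flat = z_iso.copy()
z_flat[:, 0] *= 5.0                     # one dominant direction: high penalty

p_iso = isotropy_penalty(z_iso)
p_flat = isotropy_penalty(z_flat)
```

Minimizing such a penalty alongside the latent-prediction loss pushes the encoder toward embeddings where no single direction dominates.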

Also, subscribe to our X, Threads, and BlueSky to get unique content on every platform.
