Introduction

Current AI architectures like Transformers are powerful and have achieved impressive results, including generalizing to previously unseen data and showing emergent abilities as the models are scaled. At the same time, they remain constrained compared to humans and animals, who don’t need to see millions of data points before drawing the right conclusions or learning new skills like speaking. Examples include crows that can solve puzzles as well as five-year-old children, orcas’ sophisticated hunting methods that often involve brutal, coordinated attacks, and elephants’ cooperative abilities.

Moravec’s paradox highlights that tasks that are difficult for humans, such as computation, are simple for computers because they can be described and modeled easily. Perception and sensory processing, which come naturally to humans, are much harder for machines to master. Simply scaling the model and feeding it more data might not be a viable solution: some argue that this approach will not produce the qualitatively different results needed for AI models to reach a new level of reasoning or world perception. If so, alternative methods must be explored for AI to attain human-level intelligence. Yann LeCun (one of the “Godfathers of AI”) insists that JEPA is the first step.

In today’s episode, we will cover:

  • What are the limitations of LLMs?

  • So what’s the potential solution?

  • How does JEPA work?

  • What can one build on JEPA?

  • I-JEPA – JEPA for Images

  • MC-JEPA – Multitasking JEPA

  • V-JEPA – JEPA for Video

  • Generalizing JEPA

  • Bonus: All resources in one place

Yann LeCun and JEPA: The Vision

Yann LeCun has consistently taken a rational view of the latest models: he has exposed their limitations and explained to the public that fears of AGI and of AI taking over from humans are unreasonable. In February 2022, he proposed his vision for achieving human-level reasoning in AI, with the Joint Embedding Predictive Architecture (JEPA) at its core. Let’s figure out what it is!

What Is JEPA? LeCun’s Answer to LLM Limitations

Yann LeCun also gave several talks presenting his vision for objective-driven AI: talk 1 (March 28, 2024), and talk 2 (September 9, 2023). There, he extensively discussed the limitations of large language models (LLMs):

  • LLMs have no common sense: LLMs have limited knowledge of the underlying reality and make strange mistakes called “hallucinations.” One study showed that LLMs are good at formal linguistic competence – knowledge of linguistic rules and patterns – while their performance on functional linguistic competence – understanding and using language in the world – remains unstable.

  • LLMs have no memory and can’t plan their answers: results on the PlanBench benchmark support this.

So what’s the potential solution?

To propose new ideas, it helps to return to the roots and the fundamental disciplines. For the task of building intelligent AI, that means revisiting cognitive science, psychology, and neuroscience along with the engineering sciences. This is the strategy that the creators of AI took in the 1960s. Professor LeCun did the same in the 2020s and identified the key ingredients for success that we discuss below.

World models

The fundamental part of LeCun’s vision is the concept of "world models," which are internal representations of how the world functions. He argues that giving the model a context of the world around it could improve its results.

“The idea that humans, animals, and intelligent systems use world models goes back many decades in psychology and in fields of engineering such as control and robotics.”

Yann LeCun

Self-supervised learning

Another important aspect is self-supervised learning (SSL), akin to how babies learn about the world by observing it. Models like GPT, BERT, LLaMA, and other foundation models are based on SSL and have changed the way we use machine learning.

Abstract representations

Apart from SSL, the model also needs to know which parts of its sensory input are worth capturing and which are not. In other words, it needs to separate relevant from irrelevant information in each state. The human eye is wired for exactly that: what may seem to be a limitation in fact allows us to extract the essence.

The invisible gorilla study published in 1999 is the most famous example of a phenomenon called “inattentional blindness.” When we pay close attention to one thing, we often fail to notice other things – even if they are obvious. This is just one example of how our eyes function. Scientists have also shown that our eyes need some time to refocus, just like the camera on your smartphone.

Using this analogy, Yann LeCun proposed that a model should use abstract representations* of images rather than comparing the pixels.

*Abstract representations simplify complex information into a form that is more manageable and meaningful for specific tasks or analyses. By focusing on the essential aspects and ignoring the less important details, these representations help systems (whether human or machine) to process information more efficiently and effectively.

Architecture – Objective-Driven AI

LeCun proposes a modular, configurable architecture for autonomous intelligence, emphasizing the development of self-supervised learning methods to enable AI to learn these world models without extensive labeled data.

Here’s a detailed view of the components of the system architecture for autonomous intelligence:

  • Configurator: Acts as the executive control center of the AI system by dynamically configuring other components of the system based on the specific task or context. For instance, it adjusts the parameters of the perception, world model, and actor modules to optimize performance for the given task.

  • Perception module: Captures and interprets sensory data from various sensors to estimate the current state of the world. This component is a basis for all higher-level processing and decision-making.

  • World model module: Predicts future states of the environment and fills in missing information. It acts as a simulator, using current and past data to forecast future conditions and possible scenarios. This component is the key for AI to perform hypothetical reasoning and planning, essential for navigating complex, dynamic environments.

  • Cost module: Evaluates the potential consequences of actions in terms of predefined costs associated with a given state or action. It has two submodules:

    • Intrinsic cost: Hard-wired, calculating immediate discomfort or risk

    • Critic: Trainable, estimating future costs based on current actions

  • Actor module: Decides and proposes specific actions based on the predictions and evaluations provided by other components of the architecture. It computes optimal action sequences that minimize the predicted costs, often using methods akin to those in optimal control theory.

  • Short-term memory: Keeps track of the immediate history of the system’s interactions with the environment. It stores recent data on the world state, actions taken, and the associated costs, allowing the system to reference this information in real-time decision-making.
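To make the division of labor concrete, here is a minimal sketch of how these modules could be wired together on a toy one-dimensional world. Everything in it is an illustrative assumption: the `ToyAgent` class, the goal position, the three candidate actions, and the fixed (untrained) critic are hypothetical, and the configurator is left out for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class ToyAgent:
    """Hypothetical wiring of the modules on a 1-D line world."""
    goal: float = 5.0
    memory: list = field(default_factory=list)  # short-term memory

    def perceive(self, observation: float) -> float:
        # Perception module: estimate the current state from sensor data.
        return float(observation)

    def world_model(self, state: float, action: float) -> float:
        # World model: predict the next state for a candidate action.
        return state + action

    def intrinsic_cost(self, state: float) -> float:
        # Intrinsic cost: hard-wired "discomfort", here distance to the goal.
        return abs(self.goal - state)

    def critic(self, state: float) -> float:
        # Critic: trainable in the proposal; a fixed placeholder here.
        return 0.5 * abs(self.goal - state)

    def act(self, observation: float, actions=(-1.0, 0.0, 1.0)) -> float:
        state = self.perceive(observation)

        def predicted_cost(action: float) -> float:
            next_state = self.world_model(state, action)
            return self.intrinsic_cost(next_state) + self.critic(next_state)

        # Actor: choose the action whose predicted outcome costs the least.
        best = min(actions, key=predicted_cost)
        # Short-term memory: record state, action, and predicted cost.
        self.memory.append((state, best, predicted_cost(best)))
        return best


agent = ToyAgent()
position = 0.0
for _ in range(7):
    position += agent.act(position)
print(position, len(agent.memory))
```

The point of the sketch is that the actor never tries actions in the real world to see what happens: it asks the world model for a prediction and lets the cost module score it.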

How JEPA works: Architecture explained

Joint Embedding Predictive Architecture (JEPA) is a central element in the pursuit of developing AI that can understand and interact with the world as humans do. It encapsulates the key elements we mentioned above. JEPA allows the system to handle uncertainty and ignore irrelevant details while maintaining essential information for making predictions.

It works based on these elements:

  • Inputs: JEPA takes pairs of related inputs, for example sequential frames of a video (x could be the current frame, and y the next frame).

  • Encoders: They transform the inputs, x and y, into abstract representations (sx and sy) which capture only essential features of the inputs and omit irrelevant details.

  • Predictor module: It is trained to predict the abstract representation of the next frame, sy, based on the abstract representation of the current frame, sx.
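The three elements above can be sketched in a few lines. This is a toy illustration, not the actual JEPA implementation: the linear `encode` and `predict` functions, the hand-picked weights, and the use of one shared encoder for x and y are all assumptions made for the example.

```python
def encode(frame, weights):
    # Encoder: a linear map from a raw input to a smaller representation.
    return [sum(w * v for w, v in zip(row, frame)) for row in weights]


def predict(s_x, weights):
    # Predictor: maps the representation of x to a guess for that of y.
    return [sum(w * v for w, v in zip(row, s_x)) for row in weights]


def jepa_loss(x, y, encoder_weights, predictor_weights):
    s_x = encode(x, encoder_weights)
    s_y = encode(y, encoder_weights)
    s_y_hat = predict(s_x, predictor_weights)
    # The error is measured between representations, not raw inputs.
    return sum((a - b) ** 2 for a, b in zip(s_y_hat, s_y)) / len(s_y)


# The last input dimension plays the role of unpredictable noise.
x = [1.0, 2.0, 3.0, 0.7]
y = [2.0, 4.0, 6.0, -0.3]

# Hypothetical weights: the encoder ignores the noisy dimension entirely.
encoder_weights = [[1.0, 0.0, 0.0, 0.0],
                   [0.0, 1.0, 1.0, 0.0]]
predictor_weights = [[2.0, 0.0],
                     [0.0, 2.0]]

loss = jepa_loss(x, y, encoder_weights, predictor_weights)
print(loss)
```

Because the encoder drops the noisy fourth dimension, the predictor can match sy exactly even though that part of y is unpredictable from x. A model forced to reconstruct y itself would be penalized for the noise.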

JEPA handles uncertainty in predictions in one of two ways:

  • During the encoding phase, when the encoder drops irrelevant information. For example, the encoder checks which features of the input data are too uncertain or noisy and decides not to include these in the abstract representation.

  • After the encoding, through the latent variable (z). The latent variable z represents information present in sy but not observable in sx. To handle uncertainty, z is varied across a predefined set of values, each representing a different hypothetical scenario or aspect of the future state y that might not be directly observable from x. By altering z, the predictive model can simulate how small changes in unseen factors could influence the upcoming state.
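The second mechanism can be illustrated with a toy predictor that takes z as an extra argument. The scenario is hypothetical: sx is a two-dimensional position, the first coordinate always advances by one, and z encodes a sideways move that cannot be inferred from sx alone. The function name and the set of z values are assumptions for the example.

```python
def predict_with_latent(s_x, z):
    # Hypothetical predictor: forward motion is predictable from s_x,
    # the sideways component depends on the unobserved factor z.
    return [s_x[0] + 1.0, s_x[1] + z]


def squared_error(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


s_x = [0.0, 0.0]
z_values = [-1.0, 0.0, 1.0]  # predefined set of hypothetical scenarios

# Varying z yields one candidate future representation per scenario.
candidates = {z: predict_with_latent(s_x, z) for z in z_values}

# Once the actual s_y is known, the best-fitting scenario can be picked.
s_y = [1.0, -1.0]
best_z = min(z_values, key=lambda z: squared_error(candidates[z], s_y))
print(candidates, best_z)
```

Each value of z gives a different plausible future, so a single deterministic predictor can still represent a situation with several possible outcomes.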

Interestingly, several JEPAs could be combined into a multistep/recurrent JEPA or stacked into a Hierarchical JEPA that could be used to perform predictions at several levels of abstraction and several time scales.
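As a rough illustration of the multistep and hierarchical ideas, the sketch below applies a predictor repeatedly in representation space and compares it with a coarser predictor that jumps several steps at once. The state layout (position, velocity) and both predictor functions are hypothetical stand-ins for learned modules.

```python
def fine_step(state):
    # Low-level predictor: advance the representation by one time step.
    position, velocity = state
    return [position + velocity, velocity]


def coarse_step(state, horizon=3):
    # Higher-level predictor: one prediction covering a longer time scale.
    position, velocity = state
    return [position + horizon * velocity, velocity]


def rollout(state, predictor, steps):
    # Multistep JEPA: feed each prediction back in as the next input.
    trajectory = [state]
    for _ in range(steps):
        state = predictor(state)
        trajectory.append(state)
    return trajectory


start = [0.0, 2.0]
fine_trajectory = rollout(start, fine_step, steps=3)
coarse_prediction = coarse_step(start)
print(fine_trajectory, coarse_prediction)
```

In this toy case the two levels agree exactly. With learned predictors the coarse level would typically trade detail for a longer horizon.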

JEPA vs Transformers: What Makes It Different

Following the proposed JEPA architecture, Meta AI researchers along with Yann LeCun as a co-author published several specialized models. What are they?

I-JEPA: Image-based Joint-Embedding Predictive Architecture

I-JEPA, released by Meta in June 2023 (the paper first appeared on arXiv in January 2023), was the first model based on JEPA.

I-JEPA is a non-generative, self-supervised learning framework designed for processing images. It works by masking parts of the images and then trying to predict those masked parts:

  • Masking: The image is divided into numerous patches. Some of these patches, referred to as "target blocks," are masked (hidden) so that the model doesn’t have information about them.

  • Context sampling: A portion of the image, called the "context block," is left unmasked. This part is used by the context encoder to understand the visible aspects of the image.

  • Prediction: The predictor then tries to predict the hidden parts (target blocks) based only on what it can see in the context block.

  • Iteration: The model's parameters are updated to reduce the difference between the predicted and the actual representations of the target patches.
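The masking and context-sampling steps can be sketched on a grid of patch indices. The grid size, block sizes, and number of target blocks below are illustrative assumptions rather than the values used by I-JEPA, and `sample_block` and `sample_masks` are hypothetical helper names.

```python
import random


def sample_block(grid, height, width, rng):
    # Pick a random rectangle of patches; return its (row, col) indices.
    top = rng.randrange(grid - height + 1)
    left = rng.randrange(grid - width + 1)
    return {(r, c)
            for r in range(top, top + height)
            for c in range(left, left + width)}


def sample_masks(grid=8, num_targets=4, seed=0):
    rng = random.Random(seed)
    # Target blocks: patches whose representations must be predicted.
    targets = [sample_block(grid, 3, 3, rng) for _ in range(num_targets)]
    # Context block: a large visible region of the image.
    context = sample_block(grid, 7, 7, rng)
    # Remove every target patch from the context so the predictor
    # cannot simply look up what it is asked to predict.
    hidden = set().union(*targets)
    context -= hidden
    return context, targets


context, targets = sample_masks()
print(len(context), [len(t) for t in targets])
```

The context encoder would then see only the patches in `context`, while the predictor is asked for the representation of each block in `targets`.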

I-JEPA consists of three parts, each of which is a Vision Transformer (ViT):

  • Context encoder: Processes the parts of the image that are visible, known as the "context block."

  • Predictor: Uses the output from the context encoder to predict the representations of the masked (hidden) parts of the image.

  • Target encoder: Generates representations from the target blocks (hidden parts) that the model uses to learn and make predictions about hidden parts of the image.

The overall goal of I-JEPA is to train the predictor to accurately predict the representations of the hidden image parts from the visible context. This self-supervised learning process allows the model to learn powerful image representations without relying on explicit labels.
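The article does not say how the target encoder gets its weights. In the I-JEPA paper, they are an exponential moving average (EMA) of the context encoder's weights rather than being trained by gradient descent. A minimal sketch of that update follows, with weights reduced to a flat list of floats and a fixed momentum value chosen for illustration.

```python
def ema_update(target_weights, context_weights, momentum=0.996):
    """Move each target weight a small step toward the context weight."""
    return [momentum * t + (1.0 - momentum) * c
            for t, c in zip(target_weights, context_weights)]


context_weights = [1.0, -2.0]  # would be updated by gradient descent
target_weights = [0.0, 0.0]    # never receives gradients directly

# With the context weights held fixed, the target slowly catches up.
for _ in range(1000):
    target_weights = ema_update(target_weights, context_weights)
print(target_weights)
```

The slow-moving target gives the predictor a stable set of representations to aim at while the context encoder keeps changing.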

MC-JEPA: Motion-Content Joint-Embedding Predictive Architecture

MC-JEPA is another JEPA variation. It is designed to interpret two aspects of video data at the same time, dynamic elements (motion) and static details (content), using a shared encoder. It was proposed in July 2023, shortly after I-JEPA.

MC-JEPA aims to provide a more comprehensive and robust visual representation model for real-world computer vision applications such as autonomous driving, video surveillance, and activity recognition.

V-JEPA: Video-based Joint-Embedding Predictive Architecture

V-JEPA is designed to enhance AI's understanding of video content, which was marked as an important future direction after the initial I-JEPA publication.

V-JEPA consists of two main components:

  • Encoder: Transforms input video frames into a high-dimensional space where similar features are closer together. The encoder captures essential visual cues from the video.

  • Predictor: Takes the encoded features of one part of the video and predicts the features of another part. This prediction is based on learning the temporal and spatial transformations within the video, aiding in understanding motion and changes over time.

V-JEPA's design allows it to learn from videos in a way that mimics some aspects of human learning – observing and predicting the visual world without needing explicit annotations. The model's ability to generalize from unsupervised video data to diverse visual tasks makes it a powerful tool for advancing how machines understand and interact with dynamic visual environments.
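A rough sketch of what "predicting one part of the video from another" can look like: a spatial block is hidden in every frame, and the prediction error is measured between feature vectors. The tube-shaped mask and the L1 error follow our reading of the V-JEPA paper. The grid sizes, helper names, and feature values are illustrative assumptions.

```python
def tube_mask(frames, top, left, size):
    # Hide the same spatial block of patches in every frame, so the
    # model cannot copy the answer from a neighboring frame.
    return {(t, r, c)
            for t in range(frames)
            for r in range(top, top + size)
            for c in range(left, left + size)}


def l1_loss(predicted, target):
    # Error between predicted and actual features, not pixels.
    return sum(abs(p - q) for p, q in zip(predicted, target)) / len(target)


frames, grid = 4, 6
all_patches = {(t, r, c)
               for t in range(frames)
               for r in range(grid)
               for c in range(grid)}
masked = tube_mask(frames, top=1, left=2, size=3)
visible = all_patches - masked

# Made-up feature vectors for one masked patch.
loss = l1_loss([1.0, 2.0], [0.0, 4.0])
print(len(masked), len(visible), loss)
```

The encoder would process only the `visible` patches, and the predictor would be scored on the features of the `masked` ones.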

JEPA in Robotics & Physical AI

A paper published in March 2024, "Learning and Leveraging World Models in Visual Representation Learning," introduces the concept of Image World Models (IWM) and explores how the JEPA architecture can be generalized beyond masking to a broader set of corruptions, that is, changes in input images such as color jitter and blur.

The study explores two types of world models:

  • Invariant models: Recognize and maintain stable, unchanged features across different scenarios

  • Equivariant models: Adapt to changes in the input data, preserving the relationships and transformations that occur
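The distinction can be shown with a toy "image" of three pixel values and a brightness shift as the transformation. The encoders and the predictor below are hand-written stand-ins for learned networks, and all function names are hypothetical.

```python
def brighten(pixels, delta):
    # The transformation applied to the input image.
    return [p + delta for p in pixels]


def invariant_encoder(pixels):
    # Subtracting the mean removes brightness: the shift leaves no trace.
    mean = sum(pixels) / len(pixels)
    return [p - mean for p in pixels]


def equivariant_encoder(pixels):
    # Keeping the mean as a feature preserves the effect of the shift.
    mean = sum(pixels) / len(pixels)
    return [mean] + [p - mean for p in pixels]


def predictor(representation, delta):
    # World model: applies the transformation in representation space,
    # given the transformation parameter.
    return [representation[0] + delta] + representation[1:]


image = [1.0, 2.0, 3.0]
shifted = brighten(image, 2.0)

same = invariant_encoder(image) == invariant_encoder(shifted)
predicted = predictor(equivariant_encoder(image), 2.0)
actual = equivariant_encoder(shifted)
print(same, predicted, actual)
```

The invariant encoder gives the same output before and after the change, so there is nothing left to predict. The equivariant one keeps the change visible, and a predictor that knows the transformation parameter can model it.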

The research found that models can predict and adjust to visual changes more accurately when they use these world models, which leads to more resilient and adaptable systems. The method challenges traditional approaches and offers a new way to improve machine learning models without direct supervision.

Bonus: All resources in one place

JEPA-inspired models

We also compiled a list of related models inspired by the JEPA architecture, grouped by application domain:

Audio and Speech Applications

  1. A-JEPA: Focused on audio data using masked-modeling principles for improving contextual semantic understanding in audio and speech classification tasks.

  2. Investigating Design Choices in Joint-Embedding Predictive Architectures for General Audio Representation Learning: Analyzes masking strategies and sample durations in self-supervised audio representation learning.

Visual and Spatial Data Applications

  1. S-JEA: Enhances visual representation learning through hierarchical semantic representations in stacked joint embedding architectures.

  2. DMT-JEPA: Targets image modeling with a focus on local semantic understanding, applicable to classification, object detection, and segmentation.

  3. JEP-KD: Aligns visual speech recognition models with audio features, improving performance in visual speech recognition.

  4. Point-JEPA: Applied to point cloud data, enhancing efficiency and representation learning in spatial datasets.

  5. Signal-JEPA: Focuses on EEG signal processing, improving cross-dataset transfer and classification in EEG analysis.

Graph and Dynamic Data Applications

  1. Graph-JEPA: First joint-embedding architecture for graphs, using hyperbolic coordinate prediction for subgraph representation.

  2. ST-JEMA: Enhances learning of dynamic functional connectivity from fMRI data, focusing on high-level semantic representations.

Time-Series and Remote Sensing Applications

  1. LaT-PFN: Combines time-series forecasting with joint embedding architecture, leveraging related series for robust in-context learning.

  2. Time-Series JEPA: Optimizes remote control over limited-capacity networks through spatio-temporal correlations in sensor data.

  3. Predicting Gradient is Better: Utilizes self-supervised learning for SAR ATR, leveraging gradient features for automatic target recognition.

Evaluation and Methodological Studies

  1. LiDAR: Sensing Linear Probing Performance in Joint Embedding SSL Architectures: Introduces a metric for evaluating representations in joint-embedding self-supervised learning architectures, focusing on linear probing performance.

FAQ

What is JEPA in AI? 

JEPA, or Joint Embedding Predictive Architecture, is an AI architecture proposed by Yann LeCun in his 2022 position paper A Path Towards Autonomous Machine Intelligence. Instead of predicting raw pixels or tokens, JEPA learns by predicting abstract representations of inputs in a latent space. The goal is to build AI systems that form internal models of how the world works – closer, in LeCun's framing, to how humans and animals learn – rather than memorizing surface patterns.

How is JEPA different from transformers and LLMs? 

LLMs built on the transformer architecture predict the next token in a sequence, operating directly in token (text) space. Pixel-level autoregressive and diffusion models do the analogous thing for images. JEPA breaks with both: it predicts in an abstract representation space, which lets the model discard details that don't matter for understanding. LeCun argues this makes JEPA a more promising substrate for world models and common-sense reasoning – though whether it actually delivers on that promise remains an open empirical question.

What is the difference between I-JEPA, MC-JEPA, and V-JEPA? 

All three come out of Meta and share the core JEPA principle of predicting in latent space, but they target different modalities and were developed in that order. I-JEPA (Assran et al., 2023) applies the architecture to images by predicting the representations of masked image patches. MC-JEPA (Bardes, Ponce, LeCun, July 2023) jointly learns optical flow and content features within a shared encoder, treating self-supervised optical flow estimation as a pretext task alongside content learning. V-JEPA (Bardes et al., February 2024) extends JEPA to video, learning spatio-temporal representations by predicting masked regions of video clips in feature space.

What is LeJEPA?

LeJEPA (Learning Joint-Embedding Predictive Architecture) is a self-supervised AI architecture developed by Yann LeCun and Randall Balestriero. It is designed to help AI systems learn structured representations of the world by predicting abstract embeddings instead of generating the next token or pixel.

LeJEPA builds on the earlier JEPA framework and introduces a mathematically grounded method called SIGReg (Sketched Isotropic Gaussian Regularization) to prevent representation collapse during training. Unlike many traditional generative AI models, LeJEPA focuses on learning stable world models and meaningful latent representations that can support reasoning, planning, robotics, and embodied AI systems.

One of LeJEPA’s main advantages is that it removes many complex training heuristics used in earlier self-supervised learning systems while remaining scalable across different architectures and datasets.

Does JEPA use self-supervised learning? 

Yes. JEPA is fundamentally a self-supervised framework – it learns from unlabeled data by predicting one part of the input from another, with no human-annotated labels required. This is structurally similar to how GPT and BERT are trained, but the prediction target is different: JEPA predicts abstract representations rather than raw tokens, which in principle reduces sensitivity to noise and irrelevant detail.

Can JEPA be used for robotics and physical AI? 

Yes – and this is where the architecture is currently being pushed hardest. LeCun has positioned JEPA from the start as a foundation for systems that can plan, reason about consequences, and act in the physical world. The most direct evidence is V-JEPA 2 (Assran, Bardes, Fan, Garrido et al., June 2025), which extends V-JEPA by combining roughly a million hours of internet-scale video with a small amount of robot trajectory data, and reports results on understanding, prediction, and planning in physical environments. V-JEPA 2.1, released in March 2026, sharpens the recipe with more temporally consistent dense features. An earlier paper, Image World Models (Garrido et al., March 2024), generalized JEPA's prediction task from masking to global photometric transformations – useful as a methodological extension, but more about image-domain representation learning than robotics proper.
