What is Joint Embedding Predictive Architecture (JEPA)?

Current AI architectures like Transformers are powerful and have achieved impressive results, including generalizing to previously unseen data and showing emergent abilities as models are scaled. At the same time, they remain constrained compared to humans and animals, who don’t need to see millions of data points before drawing the right conclusions or learning new skills like speaking. Examples include crows that can solve puzzles about as well as five-year-old children, orcas’ sophisticated hunting methods that often rely on brutal, coordinated attacks, and elephants' cooperative abilities.

Moravec's paradox highlights that tasks that are difficult for humans, such as calculation, are simple for computers because they can be described and modeled easily. However, perception and sensory processing, which come naturally to humans, are hard for machines to master. Simply scaling the model and feeding it more data might not be a viable solution. Some argue that this approach will not lead to qualitatively different results that let AI models reach a new level of reasoning or world perception. If so, alternative methods must be explored for AI to attain human-level intelligence. Yann LeCun (one of the "godfathers of AI") insists that JEPA is the first step.

In today’s episode, we will cover:

What are the limitations of LLMs?
So what’s the potential solution?
How does JEPA work?
What can one build on JEPA?
I-JEPA – JEPA for Images
MC-JEPA - Multitasking JEPA
V-JEPA – JEPA for Video
VL-JEPA: Vision-Language JEPA Explained
JEPA in Robotics & Physical AI
Generalizing JEPA
Bonus: All resources in one place

❝

JEPA, or Joint Embedding Predictive Architecture, is a self-supervised AI architecture that learns by predicting abstract representations of inputs instead of reconstructing raw pixels or generating tokens. The main idea is to build world models that capture what matters for understanding, planning, and reasoning while ignoring irrelevant details.

Yann LeCun and JEPA: The Vision Behind the Architecture

Yann LeCun has consistently taken a sober view of the latest models: he has pointed out their limitations and argued publicly that fears of AGI and of AI taking over from humans are unreasonable. In February 2022, he proposed his vision for achieving human-level reasoning in AI, with Joint Embedding Predictive Architecture (JEPA) at its core. Let’s figure out what it is!

What Is JEPA? LeCun’s Answer to LLM Limitations

Yann LeCun also gave several talks presenting his vision for objective-driven AI: talk 1 (March 28, 2024) and talk 2 (September 29, 2023). There, he discussed the limitations of large language models (LLMs) at length:

LLMs have no common sense: LLMs have limited knowledge of the underlying reality and make strange mistakes called “hallucinations.” One study showed that LLMs are good at formal linguistic competence (knowledge of linguistic rules and patterns), while their performance on functional linguistic competence (understanding and using language in the world) remains unstable.
LLMs have no memory and can’t plan their answers: results on the PlanBench benchmark point to weak planning abilities.

World Models: The Foundation of LeCun's Vision

When proposing new ideas, it helps to return to the roots and fundamental disciplines. For the task of building intelligent AI, that means revisiting cognitive science, psychology, and neuroscience alongside the engineering sciences. This is the strategy the pioneers of AI took in the 1960s. Professor LeCun did the same in the 2020s and identified the components we discuss below.

What Are World Models in AI?

The fundamental part of LeCun’s vision is the concept of "world models," which are internal representations of how the world functions. He argues that giving the model a context of the world around it could improve its results.

❝

“The idea that humans, animals, and intelligent systems use world models goes back many decades in psychology and in fields of engineering such as control and robotics.”

Yann LeCun

Self-Supervised Learning in JEPA

Another important aspect is self-supervised learning (SSL), akin to how babies learn about the world by observing it. Models like GPT, BERT, LLaMA, and other foundation models are based on SSL and have changed the way we use machine learning.

Abstract Representations: How JEPA Ignores Irrelevant Detail

Apart from SSL, the model also needs to decide what its sensors should capture and what they should not. In other words, it needs to separate relevant from irrelevant information in each state. The human eye, for example, is wired for exactly that. What may seem to be a limitation in fact allows us to extract the essence.

The invisible gorilla study, published in 1999, is the most famous example of a phenomenon called “inattentional blindness.” When we pay close attention to one thing, we often fail to notice other things, even if they are obvious. This is just one example of how our vision works. Scientists have also shown that our eyes need some time to refocus, just like the camera on your smartphone.

Using this analogy, Yann LeCun proposed that a model should use abstract representations* of images rather than comparing the pixels.

*Abstract representations simplify complex information into a form that is more manageable and meaningful for specific tasks or analyses. By focusing on the essential aspects and ignoring the less important details, these representations help systems (whether human or machine) to process information more efficiently and effectively.

JEPA Architecture: Components Explained

LeCun proposes a modular, configurable architecture for autonomous intelligence, emphasizing the development of self-supervised learning methods to enable AI to learn these world models without extensive labeled data.

Image Credit: Meta AI blog post, Yann LeCun on a vision to make AI systems learn and reason like animals and humans

Here’s a detailed view of the components of the system architecture for autonomous intelligence:

Configurator: Acts as the executive control center of the AI system by dynamically configuring other components of the system based on the specific task or context. For instance, it adjusts the parameters of the perception, world model, and actor modules to optimize performance for the given task.
Perception module: Captures and interprets sensory data from various sensors to estimate the current state of the world. This component is a basis for all higher-level processing and decision-making.
World model module: Predicts future states of the environment and fills in missing information. It acts as a simulator, using current and past data to forecast future conditions and possible scenarios. This component is the key for AI to perform hypothetical reasoning and planning, essential for navigating complex, dynamic environments.
Cost module: Evaluates the potential consequences of actions in terms of predefined costs associated with a given state or action. It has two submodules:
- Intrinsic cost: Hard-wired, calculating immediate discomfort or risk
- Critic: Trainable, estimating future costs based on current actions
Actor module: Decides and proposes specific actions based on the predictions and evaluations provided by other components of the architecture. It computes optimal action sequences that minimize the predicted costs, often using methods akin to those in optimal control theory.
Short-term memory: Keeps track of the immediate history of the system’s interactions with the environment. It stores recent data on the world state, actions taken, and the associated costs, allowing the system to reference this information in real-time decision-making.

How JEPA Works Step by Step

Joint Embedding Predictive Architecture (JEPA) is a central element in the pursuit of developing AI that can understand and interact with the world as humans do. It encapsulates the key elements we mentioned above. JEPA allows the system to handle uncertainty and ignore irrelevant details while maintaining essential information for making predictions.

Image Credit: Yann LeCun’s Harvard presentation (March 28, 2024)

It works based on these elements:

Inputs: JEPA takes pairs of related inputs, for example sequential frames of a video (x could be the current frame and y the next frame).
Encoders: They transform the inputs, x and y, into abstract representations (s_x and s_y), also called embeddings, which capture only essential features of the inputs and omit irrelevant details.
Predictor module: It is trained to predict the abstract representation of the next frame, s_y, based on the abstract representation of the current frame, s_x.
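
The three elements above can be sketched in a few lines of plain Python. This is only a toy illustration: the random linear maps standing in for the encoders and the predictor, the dimensions, and all names are assumptions made for the example, whereas real JEPA models use deep networks trained on data.

```python
import random

def linear(weights, vec):
    """Multiply a weight matrix (a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def random_matrix(rows, cols, rng):
    """Build a rows x cols matrix with entries drawn uniformly from [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
INPUT_DIM, EMBED_DIM = 8, 3  # illustrative sizes, not taken from any paper

# Hypothetical stand-ins for the trained modules.
encoder_x = random_matrix(EMBED_DIM, INPUT_DIM, rng)  # x   -> s_x
encoder_y = random_matrix(EMBED_DIM, INPUT_DIM, rng)  # y   -> s_y
predictor = random_matrix(EMBED_DIM, EMBED_DIM, rng)  # s_x -> predicted s_y

x = [rng.random() for _ in range(INPUT_DIM)]  # e.g. the current frame
y = [rng.random() for _ in range(INPUT_DIM)]  # e.g. the next frame

s_x = linear(encoder_x, x)
s_y = linear(encoder_y, y)
s_y_hat = linear(predictor, s_x)

# The comparison happens between embeddings, never between raw inputs.
prediction_error = sum((p - t) ** 2 for p, t in zip(s_y_hat, s_y))
```

Note that the error is computed in the 3-dimensional embedding space rather than the 8-dimensional input space, which is the point of the design: the model is never asked to reproduce every detail of y.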

JEPA handles uncertainty in predictions in one of two ways:

During the encoding phase, when the encoder drops irrelevant information. For example, the encoder can leave features of the input that are too uncertain or noisy out of the abstract representation.
After the encoding, based on the latent variable z. The latent variable z represents elements present in s_y but not observable in s_x. To handle uncertainty, z is varied across a predefined set of values, each representing a different hypothetical scenario or aspect of the future state y that might not be directly observable from x. By altering z, the predictive model can simulate how small changes in unseen factors could influence the upcoming state.
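
The second mechanism, varying z over a predefined set, might look like the sketch below. The additive predictor, the three z values, and the rule that picks the z closest to the observed target are all assumptions chosen to keep the example small; they are not taken from LeCun's papers.

```python
def predict(s_x, z):
    """Hypothetical predictor: shifts the context embedding by the latent z."""
    return [value + z for value in s_x]

def squared_distance(a, b):
    """Squared L2 distance between two embeddings."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

s_x = [0.2, -0.5, 1.0]       # embedding of the observed input x
z_values = [-1.0, 0.0, 1.0]  # predefined set of latent hypotheses

# One candidate future embedding per hypothesis about the unseen factors.
candidates = {z: predict(s_x, z) for z in z_values}

# Once the real target embedding is available (e.g. during training),
# one can ask which hypothesis explains it best.
s_y = [1.2, 0.5, 2.0]
best_z = min(z_values, key=lambda z: squared_distance(candidates[z], s_y))
```

Each value of z yields a different plausible future, so a single input x maps to a set of predictions instead of one blurry average.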

Image Credit: Yann LeCun’s Munich presentation (September 29, 2023)

Interestingly, several JEPAs could be combined into a multistep/recurrent JEPA or stacked into a Hierarchical JEPA that could be used to perform predictions at several levels of abstraction and several time scales.

Image Credit: Yann LeCun’s Munich presentation (September 29, 2023)

JEPA Loss Function and Training Objective

The central training objective in JEPA is to make the predicted representation as close as possible to the real target representation. It can be done with an L2 loss, where the distance between the predicted embedding and the target embedding is minimized. If the predictor outputs ŝ_y, the objective can be written as minimizing ||ŝ_y − s_y||². This loss function pushes the model to learn the abstract structure that connects x and y, while ignoring unpredictable details that do not matter.

But if the model maps everything to the same embedding, the loss can collapse into a useless solution. So JEPA-style models also need an anti-collapse mechanism.
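
The objective and the collapse problem described above can be illustrated with a small sketch. The constant encoder and identity predictor are deliberately degenerate, hypothetical modules, and the variance check at the end is just one simple diagnostic, not the specific anti-collapse mechanism of any particular JEPA model.

```python
from statistics import pvariance

def l2_loss(predicted, target):
    """Squared L2 distance between a predicted and a target embedding."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target))

def collapsed_encoder(_raw_input):
    """Degenerate encoder: maps every input to the same embedding."""
    return [0.0, 0.0, 0.0]

def identity_predictor(embedding):
    """Trivial predictor: passes the context embedding through unchanged."""
    return list(embedding)

# Toy (x, y) pairs; the values are arbitrary.
pairs = [([1.0, 2.0], [2.0, 1.0]), ([5.0, -3.0], [-3.0, 5.0]), ([0.1, 0.9], [0.9, 0.1])]

# The loss is perfectly minimized although nothing was learned.
losses = [
    l2_loss(identity_predictor(collapsed_encoder(x)), collapsed_encoder(y))
    for x, y in pairs
]

# Diagnostic: embeddings that do not vary across a batch signal collapse.
embeddings = [collapsed_encoder(x) for x, _ in pairs]
variance_per_dimension = [pvariance(dim) for dim in zip(*embeddings)]
is_collapsed = all(v < 1e-8 for v in variance_per_dimension)
```

A zero loss together with zero variance across the batch is the signature of the useless solution, which is why the prediction loss alone is not a sufficient training signal.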

Some JEPAs, such as the vision-language variant VL-JEPA (explained further below), use a contrastive objective, the InfoNCE loss. It pulls the predicted embedding and its correct target embedding (for example, an image-text or context-target pair) closer together and pushes wrong matches in the batch farther apart:

L = −log( exp(sim(ŝ_y, s_y)) / Σ_j exp(sim(ŝ_y, s_j)) )

Here, ŝ_y is the predicted text embedding, s_y is the correct text embedding, and s_j ranges over all candidate text embeddings in the batch. sim is a similarity function, usually cosine similarity. Implementations commonly divide the similarities by a temperature before the exponential.
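
A minimal sketch of the contrastive objective for a single prediction is shown below. The temperature parameter and its default value are assumptions (a common convention in contrastive learning, not something stated above), and the 2-dimensional embeddings are made up for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def info_nce(predicted, candidates, positive_index, temperature=0.1):
    """InfoNCE loss for one predicted embedding against a batch of candidates.

    candidates[positive_index] is the correct target; every other entry
    acts as a negative. The temperature is an assumed hyperparameter.
    """
    logits = [cosine_similarity(predicted, c) / temperature for c in candidates]
    # Numerically stable log-sum-exp over all candidates.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[positive_index]

predicted = [1.0, 0.0]
candidates = [[1.0, 0.1], [0.0, 1.0], [-1.0, 0.0]]

loss_when_match_is_close = info_nce(predicted, candidates, positive_index=0)
loss_when_match_is_far = info_nce(predicted, candidates, positive_index=1)
```

The loss is small when the prediction points toward its correct target and large when another candidate in the batch is a better match, which is exactly the pull-together, push-apart behavior described above.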

Overall, L2 loss is useful when the goal is direct latent prediction, and InfoNCE is useful when the goal is alignment between representations, such as matching a video, image, or context with the right text embedding.

JEPA vs Transformers: What Makes It Different

Following the proposed JEPA architecture, Meta AI researchers along with Yann LeCun as a co-author published several specialized models. What are they?

For a full overview of all JEPA variants, see All JEPA Models: 14 Milestones.

JEPA Models: I-JEPA, V-JEPA, MC-JEPA Explained

I-JEPA: Image-based Joint-Embedding Predictive Architecture

I-JEPA, proposed in June 2023, was the first model based on JEPA.

Image Credit: Meta AI blog post, The first AI model based on Yann LeCun’s vision for more human-like AI

I-JEPA is a non-generative, self-supervised learning framework designed for processing images. It works by masking parts of the images and then trying to predict those masked parts:

Masking: The image is divided into numerous patches. Some of these patches, referred to as "target blocks," are masked (hidden) so that the model doesn’t have information about them
Context sampling: A portion of the image, called the "context block," is left unmasked. This part is used by the context encoder to understand the visible aspects of the image.
Prediction: The predictor then tries to predict the hidden parts (target blocks) based only on what it can see in the context block.
Iteration: The model's parameters are updated to reduce the difference between the predicted and actual representations of the target blocks, and the process repeats.
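
The masking and context-sampling steps above can be sketched on a grid of patch coordinates. The grid size, the block sizes, the number of target blocks, and the choice to remove target patches that overlap the context are illustrative assumptions, not the exact settings of the I-JEPA implementation.

```python
import random

def sample_block(grid_size, block_height, block_width, rng):
    """Sample a rectangular block of (row, col) patch coordinates."""
    top = rng.randrange(grid_size - block_height + 1)
    left = rng.randrange(grid_size - block_width + 1)
    return {
        (row, col)
        for row in range(top, top + block_height)
        for col in range(left, left + block_width)
    }

rng = random.Random(0)
GRID = 14  # e.g. a 224x224 image cut into 16x16 patches (assumed sizes)

# Masking: several target blocks whose representations must be predicted.
target_blocks = [sample_block(GRID, 4, 4, rng) for _ in range(4)]

# Context sampling: one large block of patches the context encoder may see.
context_block = sample_block(GRID, 12, 12, rng)

# Hide every target patch from the context so the task is not trivial.
hidden_patches = set().union(*target_blocks)
context_patches = context_block - hidden_patches
```

The predictor would then receive the encoded `context_patches` plus the positions of one target block and output a representation for each hidden patch in that block.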

I-JEPA consists of three parts, each of which is a Vision Transformer (ViT):

Context encoder: Processes parts of the image that are visible, known as the "context block"
Predictor: Uses the output from the context encoder to predict what the masked (hidden) parts of the image look like
Target encoder: Generates representations from the target blocks (hidden parts) that the model uses to learn and make predictions about hidden parts of the image.

The overall goal of I-JEPA is to train the predictor to accurately predict the representations of the hidden image parts from the visible context. This self-supervised learning process allows the model to learn powerful image representations without relying on explicit labels.
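
One detail not spelled out above: in the I-JEPA paper, the target encoder is not trained by gradient descent. Its weights instead track an exponential moving average (EMA) of the context encoder's weights, which helps against the collapse problem. A minimal sketch, with flat weight lists and a momentum value chosen only for illustration:

```python
def ema_update(target_weights, context_weights, momentum):
    """Move each target weight a small step toward the context weight."""
    return [
        momentum * t + (1.0 - momentum) * c
        for t, c in zip(target_weights, context_weights)
    ]

# Hypothetical flat parameter vectors; real encoders have millions of weights.
context = [1.0, 2.0]
target = [0.0, 2.0]

# In real training the context weights change every step via gradient
# descent; they are held fixed here to make the drift easy to follow.
for _ in range(3):
    target = ema_update(target, context, momentum=0.9)
```

Because the target encoder changes slowly, the prediction targets stay stable from one step to the next while still improving as the context encoder learns.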

MC-JEPA: Motion-Content Joint-Embedding Predictive Architecture

MC-JEPA is another JEPA variation, designed to interpret two aspects of video data at once using a shared encoder: dynamic elements (motion) and static details (content). It was proposed just a month after I-JEPA, in July 2023.

Image Credit: MC-JEPA: A Joint-Embedding Predictive Architecture for Self-Supervised Learning of Motion and Content Features

MC-JEPA is a more comprehensive and robust visual representation model that can be used in real-world applications in computer vision like autonomous driving, video surveillance, and activity recognition.

V-JEPA: Video-based Joint-Embedding Predictive Architecture

V-JEPA is designed to enhance AI's understanding of video content, which was marked as an important future direction after the initial I-JEPA publication.

Image Credit: Meta AI blog post, V-JEPA: The next step toward advanced machine intelligence

V-JEPA consists of two main components:

Encoder: Transforms input video frames into a high-dimensional space where similar features are closer together. The encoder captures essential visual cues from the video.
Predictor: Takes the encoded features of one part of the video and predicts the features of another part. This prediction is based on learning the temporal and spatial transformations within the video, aiding in understanding motion and changes over time.

V-JEPA's design allows it to learn from videos in a way that mimics some aspects of human learning – observing and predicting the visual world without needing explicit annotations. The model's ability to generalize from unsupervised video data to diverse visual tasks makes it a powerful tool for advancing how machines understand and interact with dynamic visual environments.

VL-JEPA: Vision-Language JEPA Explained

VL-JEPA is a special way to connect vision and language: it predicts meaning first, and only then decodes words. This is a very different approach from classic VLMs, which are trained to look at an image or video, read a question, and write the answer token by token. VL-JEPA is trained to understand what the sentence should mean, predicting the answer in embedding space. The architecture has four parts:

X-Encoder: turns images or video frames into visual embeddings.
Predictor: combines visual embeddings with the text query and predicts the target answer embedding.
Y-Encoder: turns the real target answer into an embedding during training.
Y-Decoder: a small decoder that turns the predicted embedding into text at inference time, and only when human-readable output is needed.

Image Credit: VL-JEPA original paper

This architecture reduces decoding operations by about 2.85× while keeping similar or even stronger performance than token-generating VLMs, with about 50% fewer trainable parameters, so everything becomes cheaper and faster. That’s why VL-JEPA may be especially useful for real-time systems like robots, wearables, and action tracking, where the model should constantly understand what is happening but should not waste compute generating text.
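
To see how decoding only when needed can save work on a stream of frames, consider the sketch below. The rule (decode again only when the predicted embedding drifts away from the last decoded one), the similarity threshold, and the toy decoder are all assumptions for illustration and should not be read as the exact procedure of the VL-JEPA paper.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def selective_decode(embedding_stream, decode, threshold=0.9):
    """Call the decoder only when the predicted meaning has changed enough."""
    outputs, last_decoded, decode_calls = [], None, 0
    for embedding in embedding_stream:
        changed = (
            last_decoded is None
            or cosine_similarity(embedding, last_decoded) < threshold
        )
        if changed:
            outputs.append(decode(embedding))
            last_decoded = embedding
            decode_calls += 1
        else:
            outputs.append(outputs[-1])  # reuse the previous text
    return outputs, decode_calls

def toy_decoder(embedding):
    """Hypothetical decoder: names whichever concept dominates the embedding."""
    return "mug" if embedding[0] > embedding[1] else "glass"

# Predicted answer embeddings for five consecutive video frames (made up).
stream = [[1.0, 0.0], [0.99, 0.05], [0.98, 0.1], [0.0, 1.0], [0.05, 0.99]]
texts, calls = selective_decode(stream, toy_decoder)
```

Here five frames trigger only two decoder calls, because the predicted meaning changes just once in the stream.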

JEPA in Robotics & Physical AI

A paper published in March 2024, "Learning and Leveraging World Models in Visual Representation Learning," introduces the concept of Image World Models (IWM) and explores how the JEPA architecture can be generalized beyond masking to a broader set of corruptions, meaning changes to input images such as color jitter and blur.

Image Credit: Learning and Leveraging World Models in Visual Representation Learning

The study explores two types of world models:

Invariant models: Recognize and maintain stable, unchanged features across different scenarios
Equivariant models: Adapt to changes in the input data, preserving the relationships and transformations that occur
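
The distinction can be made concrete with a toy transformation, a brightness shift applied to three pixel values. Both encoders and the predictor below are hand-written, hypothetical functions chosen to make the two behaviors visible; in the paper they are learned networks.

```python
import math

def brighten(pixels, amount):
    """The transformation applied to the input image."""
    return [p + amount for p in pixels]

def invariant_encoder(pixels):
    """Discards brightness: the embedding is the pixels minus their mean."""
    mean = sum(pixels) / len(pixels)
    return [p - mean for p in pixels]

def equivariant_encoder(pixels):
    """Keeps brightness: the embedding is the mean pixel value."""
    return sum(pixels) / len(pixels)

def latent_predictor(embedding, amount):
    """Applies the known transformation directly in embedding space."""
    return embedding + amount

image = [0.25, 0.5, 0.75]
shifted = brighten(image, 0.125)

# Invariant: the embedding is the same before and after the transformation.
invariant_before = invariant_encoder(image)
invariant_after = invariant_encoder(shifted)

# Equivariant: the embedding changes, and the predictor can follow the change.
predicted_after = latent_predictor(equivariant_encoder(image), 0.125)
actual_after = equivariant_encoder(shifted)
```

In the invariant case the predictor has nothing to do, since the transformation left no trace in the embedding. In the equivariant case the predictor acts as a small world model of the transformation.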

The study found that these world models let a system predict and adjust to visual changes more accurately, which in turn yields more robust and adaptable representations. The method challenges traditional approaches and offers a new way to improve machine learning models without direct supervision.

How JEPA Powers Embodied AI: Step by Step

Imagine a household robot trying to pick up a mug from a cluttered kitchen table without knocking over a glass.

The robot observes the scene. Its camera captures the table, the mug, the glass, the edge of the counter, and the robot’s own arm position.
JEPA converts raw perception into abstract representations. Instead of modeling every pixel, the system encodes what matters: object positions, shapes, likely contact points, and spatial relationships.
The world model predicts what could happen next. If the robot moves its gripper left, the model predicts whether the mug will stay stable, slide, or collide with nearby objects.
The system ignores irrelevant details. The table color or background pattern may not matter for the action, so JEPA-style prediction can focus on task-relevant structure.
The robot compares possible actions. It can simulate several short action sequences in latent space: approach from the side, lift vertically, rotate the wrist, or move the glass first.
The actor chooses the lowest-cost action. The robot selects the movement that is most likely to grasp the mug while avoiding collision, instability, or wasted motion.
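
The last three steps (predict, compare, choose) can be sketched as a brute-force planner over a toy one-dimensional state. The integer "latent state", the hand-written world model, the cost terms, and the object positions are all assumptions for illustration; a real system would use learned modules and a far smarter search.

```python
from itertools import product

ACTIONS = {"left": -1, "stay": 0, "right": 1}
MUG_POSITION, GLASS_POSITION = 3, -1  # hypothetical scene layout

def world_model(state, action):
    """Predict the next abstract state (here: the gripper position)."""
    return state + ACTIONS[action]

def cost(state):
    """Distance to the mug plus a large penalty for hitting the glass."""
    collision_penalty = 10.0 if state == GLASS_POSITION else 0.0
    return abs(MUG_POSITION - state) + collision_penalty

def plan(start_state, horizon=3):
    """Simulate every action sequence and return the cheapest one."""
    best_sequence, best_cost = None, float("inf")
    for sequence in product(ACTIONS, repeat=horizon):
        state, total = start_state, 0.0
        for action in sequence:
            state = world_model(state, action)  # imagined, not executed
            total += cost(state)
        if total < best_cost:
            best_sequence, best_cost = sequence, total
    return best_sequence, best_cost

chosen_actions, predicted_cost = plan(start_state=0)
```

All 27 candidate sequences are evaluated purely in imagination; only the winning one would be sent to the robot's motors.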

To sum up, JEPA is important for embodied AI, because it gives physical systems a way to learn predictive world models without generating every detail of the world.

Current Limitations of JEPA

JEPA is very promising as an architecture for world models that power robotics and other advanced systems that rely on a visual understanding of reality. But it is not a solved architecture for general intelligence or robotics. Several limitations of JEPA still matter:

JEPA depends heavily on representation quality: If the encoder learns the wrong abstractions, the predictor may ignore details that are actually important for planning or physical interaction.
JEPA does not automatically solve long-horizon reasoning: Multi-step planning across minutes, goals, and changing environments remains difficult.
JEPA can struggle when uncertainty is too high: When JEPA doesn't work well, it is often because the future state depends on hidden variables the model cannot infer from context.
JEPA still needs strong evaluation standards: It is not always clear whether a JEPA-style model has learned a useful world model or only a representation that performs well on narrow benchmarks.
Robotics requires grounding beyond visual prediction: Physical AI needs action, force, feedback, failure recovery, and real-world safety. JEPA can support this, but it is only one part of the full embodied intelligence stack.

Bonus: All resources in one place

Original models

Yann LeCun talks:

JEPA-inspired models

We also created for you a list of related models inspired by JEPA architecture. They are grouped based on their application domains:

Model	Modality / Domain	What it does
A-JEPA	Audio and speech	Applies JEPA-style masked modeling to audio data to improve contextual semantic understanding in audio and speech classification
General Audio JEPA	Audio representation learning	Studies how masking strategies and sample duration affect self-supervised audio representation learning in JEPA-style systems
S-JEA	Visual representation learning	Enhances visual representation learning through hierarchical semantic representations in stacked joint embedding architectures
DMT-JEPA	Images and dense vision tasks	Targets image modeling with a focus on local semantic understanding, applicable to classification, object detection, and segmentation
JEP-KD	Visual speech recognition	Aligns visual speech recognition models with audio features, improving performance in visual speech recognition
Point-JEPA	Point clouds / 3D spatial data	Learns efficient representations for point cloud data and spatial understanding tasks
Signal-JEPA	EEG	Improves cross-dataset transfer and classification for EEG signal processing
Graph-JEPA	Graph data	First joint-embedding architecture for graphs, using hyperbolic coordinate prediction for subgraph representation
ST-JEMA	fMRI / dynamic connectivity	Learns high-level semantic representations from dynamic functional connectivity in brain imaging data
LaT-PFN	Time-series forecasting	Combines time-series forecasting with joint embedding architecture, leveraging related series for robust in-context learning
Time-Series JEPA	Sensor networks / remote control	Optimizes remote control over limited-capacity networks through spatio-temporal correlations in sensor data
Predicting Gradient is Better	SAR / remote sensing	Utilizes self-supervised learning for SAR ATR, leveraging gradient features for automatic target recognition
LiDAR: Sensing Linear Probing Performance in Joint Embedding SSL Architectures	Self-supervised representation evaluation	Introduces a metric for evaluating representations in joint-embedding self-supervised learning architectures, focusing on linear probing performance

FAQ

What is JEPA in AI?

JEPA, or Joint Embedding Predictive Architecture, is an AI architecture proposed by Yann LeCun in his 2022 position paper A Path Towards Autonomous Machine Intelligence. Instead of predicting raw pixels or tokens, JEPA learns by predicting abstract representations of inputs in a latent space. The goal is to build AI systems that form internal models of how the world works – closer, in LeCun's framing, to how humans and animals learn – rather than memorizing surface patterns.

How is JEPA different from transformers and LLMs?

LLMs built on the transformer architecture predict the next token in a sequence, operating directly in token (text) space. Pixel-level autoregressive and diffusion models do the analogous thing for images. JEPA breaks with both: it predicts in an abstract representation space, which lets the model discard details that don't matter for understanding. LeCun argues this makes JEPA a more promising substrate for world models and common-sense reasoning – though whether it actually delivers on that promise remains an open empirical question.

What is the difference between I-JEPA, MC-JEPA, and V-JEPA?

All three come out of Meta and share the core JEPA principle of predicting in latent space, but they target different modalities and were developed in that order. I-JEPA (Assran et al., 2023) applies the architecture to images by predicting the representations of masked image patches. MC-JEPA (Bardes, Ponce, LeCun, July 2023) jointly learns optical flow and content features within a shared encoder, treating self-supervised optical flow estimation as a pretext task alongside content learning. V-JEPA (Bardes et al., February 2024) extends JEPA to video, learning spatio-temporal representations by predicting masked regions of video clips in feature space.

What is LeJEPA?

LeJEPA (Learning Joint-Embedding Predictive Architecture) is a self-supervised AI architecture developed by Yann LeCun and Randall Balestriero. It is designed to help AI systems learn structured representations of the world by predicting abstract embeddings instead of generating the next token or pixel.

LeJEPA builds on the earlier JEPA framework and introduces a mathematically grounded method called SIGReg (Sketched Isotropic Gaussian Regularization) to prevent representation collapse during training. Unlike many traditional generative AI models, LeJEPA focuses on learning stable world models and meaningful latent representations that can support reasoning, planning, robotics, and embodied AI systems.

One of LeJEPA’s main advantages is that it removes many complex training heuristics used in earlier self-supervised learning systems while remaining scalable across different architectures and datasets.

Does JEPA use self-supervised learning?

Yes. JEPA is fundamentally a self-supervised framework – it learns from unlabeled data by predicting one part of the input from another, with no human-annotated labels required. This is structurally similar to how GPT and BERT are trained, but the prediction target is different: JEPA predicts abstract representations rather than raw tokens, which in principle reduces sensitivity to noise and irrelevant detail.

Can JEPA be used for robotics and physical AI?

Yes – and this is where the architecture is currently being pushed hardest. LeCun has positioned JEPA from the start as a foundation for systems that can plan, reason about consequences, and act in the physical world. The most direct evidence is V-JEPA 2 (Assran, Bardes, Fan, Garrido et al., June 2025), which extends V-JEPA by combining roughly a million hours of internet-scale video with a small amount of robot trajectory data, and reports results on understanding, prediction, and planning in physical environments. V-JEPA 2.1, released in March 2026, sharpens the recipe with more temporally consistent dense features. An earlier paper, Image World Models (Garrido et al., March 2024), generalized JEPA's prediction task from masking to global photometric transformations – useful as a methodological extension, but more about image-domain representation learning than robotics proper.
