
AI 101: What is PAN? How to Build a Better World Model?

Explore how rethinking world model building patterns can turn our vision upside down and lead to a new Physical, Agentic, and Nested (PAN) system

We have devoted a couple of articles to exploring the bottomless world of world models. Firstly, we looked at an entire ecosystem – NVIDIA’s Cosmos World Foundation Model platform that supports the path toward Physical AI, and then we dived deeper into what current world models look like and how they function to bring simulated worlds into the service of AI agents. But there is so much more to say about the field that is still in its formative stage.

Today, let’s fundamentally rethink how we build world models. Maybe the way we usually approach their architectures is not the right one? While the field is still taking shape, there is no definitively right or wrong way, only a space full of open questions to investigate. And since there’s still disagreement about what a world model even is, figuring out how to build one is harder still.

Researchers from Carnegie Mellon University, MBZUAI, and UC San Diego have taken a critical path through world modeling. Starting with the imaginative worlds of Dune and the inspiration from the concept of “hypothetical thinking” in psychology, they identify the main mission of world models: “Simulating all actionable possibilities of the real world for purposeful reasoning and acting.” They argue that a new architecture should be:

  • Hierarchical and multi-level

  • With mixed continuous and discrete representations

  • Generative and self-supervised

Such an approach, as the researchers see it, could lead to a PAN world model system, meaning Physical, Agentic, and Nested, that might form the foundation of AGI. So let’s follow their critique to explore the deeper sides of effective world modeling that could shape our future.

In today’s episode, we will cover:

  • Why do we need world models?

  • The pitfalls of current world models

  • Inspiration and new ideas come in

  • How to build a better world model

  • How does the PAN world model work?

  • The main advantages of PAN

  • Not without limitations

  • Conclusion

  • Sources and further reading

Why do we need world models?

We will start with a quick reminder of what a world model is (or at least, how most people see it) and why we really need this kind of tool.

A world model comes into play when we think about agents – systems that act on their own to achieve goals. An agent’s environment is everything around it: the physical world, the social world, or even the universe itself. Generally, an optimal agent works like this:

  • It sees the current state of the world.

  • Based on that, it picks an action following a certain strategy or rule.

  • The environment responds, producing a new state.

  • The agent also receives rewards that measure how well it’s doing toward its goal. Its aim is to pick actions that maximize the expected total reward.
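The loop above can be sketched in a few lines of Python. The environment and policy here are toy stand-ins invented for illustration (a one-dimensional grid and a hard-coded rule), not any specific benchmark or library API:

```python
class GridWorld:
    """Toy 1-D environment: the agent starts at position 0 and wants to reach GOAL."""
    GOAL = 5

    def __init__(self):
        self.state = 0

    def step(self, action):
        # The environment responds to the action, producing a new state...
        self.state = max(0, self.state + action)  # action is -1 or +1
        # ...and a reward measuring progress toward the goal.
        reward = 1.0 if self.state == self.GOAL else 0.0
        done = self.state == self.GOAL
        return self.state, reward, done

def policy(state):
    """A trivial strategy: always move toward the goal."""
    return +1

env = GridWorld()
total_reward, state, done = 0.0, env.state, False
while not done:
    action = policy(state)                   # 1) see the state, pick an action
    state, reward, done = env.step(action)   # 2) the environment responds
    total_reward += reward                   # 3) accumulate reward

print(total_reward)  # prints 1.0: the agent reaches the goal
```

A real agent would, of course, learn its policy rather than follow a fixed rule, but the observe, act, reward skeleton is the same.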

In theory, an optimal agent would need access to the true state of the world in order to make perfect decisions, but in practice, it has imperfect information with noise and incompleteness. World models can help with this by giving agents a kind of “imagination”.

A world model functions like a generative simulator of futures, allowing the agent to do “thought experiments” before acting in the real world.

With a world model, things become easier: the agent no longer needs access to the true state. Instead, we observe the following process:

Image Credit: Critiques of World Models original paper

  • The agent builds its own belief state (ŝₜ), based on what it observes (oₜ) through its sensors.

  • An encoder processes these observations into that internal belief state.

  • Then the agent considers a possible action (a′ₜ).

  • The world model predicts what the next belief state (ŝₜ₊₁) of the world will look like not deterministically but probabilistically, as there are usually many possible futures and uncertainties.

  • The agent repeats this prediction–action cycle over multiple steps into the future.

With this ability, an AI agent can run thought experiments, like “If I do action A, what’s likely to happen?”, “What if I try B instead?”, “What if I do nothing?”, etc. This allows the agent to choose actions that are most likely to lead to its goals.
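The prediction–action cycle can be sketched as follows. Every component here is a deliberately simple assumption for illustration (a rounding “encoder”, a world model where actions succeed 80% of the time), not the paper’s actual modules:

```python
import random

def encoder(observation):
    """Compress a noisy observation into a belief state (here: just a number)."""
    return round(observation)

def world_model(belief, action, rng):
    """Predict the next belief probabilistically: the action usually works,
    but sometimes the world responds differently (uncertainty)."""
    if rng.random() < 0.8:       # the likely future
        return belief + action
    return belief                # an alternative future: nothing changes

def imagined_return(belief, action, goal, rng, horizon=5, rollouts=100):
    """Thought experiment: 'If I keep doing this action, how close to the
    goal do I end up, on average, across many imagined futures?'"""
    total = 0.0
    for _ in range(rollouts):
        b = belief
        for _ in range(horizon):
            b = world_model(b, action, rng)   # repeat prediction into the future
        total += -abs(goal - b)               # closer to the goal = higher score
    return total / rollouts

rng = random.Random(0)
belief = encoder(0.2)  # observe, form a belief state
# Compare candidate actions in imagination: "What if I do A? B? Nothing?"
scores = {a: imagined_return(belief, a, goal=4, rng=rng) for a in (-1, 0, +1)}
best_action = max(scores, key=scores.get)
print(best_action)  # prints 1: moving toward the goal wins the thought experiment
```

The key point is that no real action is taken while the candidates are compared; the agent acts in the real world only after imagination has ranked the options.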

A general-purpose world model can cover many domains, such as:

  • Physical dynamics (how objects move, how water pours, etc.)

  • Embodied experiences like balance, posture, or how it feels to be hot or dizzy.

  • Emotional states

  • Social situations

  • Mental world: abstract reasoning like planning, strategizing, or problem-solving in multi-agent situations.

  • Counterfactuals – exploring “what if” scenarios.

  • Evolutionary dynamics like changes over generations, such as inheritance and adaptation.

The strength of a world model is that it allows simulative reasoning: running through many imagined futures, comparing them, and picking the best plan. Moreover, world models make it possible for AI to transfer knowledge from one setting to another, like humans do with different skills (for example, a gamer’s skill at controlling virtual characters helps with drone piloting). This is a way to create machines that could achieve an impressive level of “zero-shot generalization” in new situations without explicit training.

And what connects all world models is the bold idea: If models can predict the next word, we can make them predict the next “world”, like every possible way the future might unfold. And the main question here is how to do this.

The pitfalls of current world models

The first thing that catches the eye is that most of today’s world models share a strong emphasis on video and image generation. That’s why they often look more like video generators than true reasoning engines. A closer look at the world models we work with today reveals more specific issues:

  • Gaming world models like Genie 2 (Google DeepMind) and Muse (Microsoft) are good for short visual simulations lasting 1-2 minutes, but they struggle with long-term coherence and remain domain-specific, tied to console-style inputs or Minecraft-like worlds.

  • 3D scene world models like those built by World Labs push realism in space with the focus on stylized 3D environments and navigation, but they don’t yet support agent reasoning or causal simulation and lack dynamics, physics, or rich interactivity.

  • Physical world models, such as NVIDIA Cosmos and Wayve GAIA-2, designed for embodied physical tasks, specialize in modeling physics and sensory-motor responses. However, they don’t generalize to open-ended, multi-agent, or social reasoning tasks, staying capable only in narrow areas.

  • Video generation models like OpenAI Sora and Google DeepMind Veo, which create high-quality video sequences from prompts or prior frames, only generate fixed video trajectories and have no explicit state, action, or object-level understanding.

  • And finally, Joint Embedding Predictive Models, JEPA family with V-JEPA, DINO-WM and others, are the most promising direction with the ability to, for example, perform real robotic arm manipulation, but they are still far from handling complex, long-term tasks.

So there is still no clearly winning architecture for world models, one that offers everything we want from such systems. Maybe it is time to reconsider how we approach them?

Inspiration and new ideas come in

With the aim of creating a new architecture that might overcome many of these limitations, the researchers from Carnegie Mellon University, MBZUAI, and UC San Diego turned for inspiration to two concepts.

The first one is what we can see in futuristic stories.
