AI 101: An Insightful Guide to VLA/VLA+ models

Long ago, robots started to work in structured settings, mostly in industry, where they took on the role of ambassadors of automation, following a clear program of action with a defined starting point and outcome. There was no need for flexibility, because everything had to follow strict rules. Now, robots are entering the real world, full of chaos and constant change. In this new age, they are starting to learn how to perceive, think, and act continuously, and that's why everyone is talking about Physical AI now – it's what makes this robotic shift possible.

And today, we're witnessing the next evolution: VLA+ models.

Just this morning, Microsoft Research released Rho-alpha (ρα) – their first robotics model, built on the Phi family, and it's doing something fundamentally different. While most Vision-Language-Action (VLA) models stop at vision and language, Rho-alpha adds tactile sensing to feel objects during manipulation, and online learning that lets it improve from human corrections in real time, even after deployment. Microsoft is calling this a VLA+ model, positioning it as an extension beyond what current VLA systems support. But to understand why this "plus" matters, we need to understand what came before.

Vision-Language-Action models (VLAs) are the core of Physical AI. They bridge the real world and robots by translating unstructured human language and sensory input into executable robotic actions, keeping the explicit vision → language → action connection intact. How did we get to VLA+?

The VLA journey reveals strikingly different visions. Google bet on massive scale and embodied reasoning (Gemini Robotics). Physical Intelligence built general policies through flow-based action modeling (π0). Hugging Face went the opposite direction, making VLAs tiny enough for consumer GPUs (SmolVLA). Figure gave humanoid robots a dual-system brain inspired by human cognition (Helix). Chinese researchers used mixture-of-experts to prevent knowledge loss (ChatVLA-2). One team moved reasoning into action space itself (ACoT-VLA). And NVIDIA proved you could reach state-of-the-art by treating actions as simple text (VLA-0).

Each approach reveals a different insight about connecting vision, language, and action. Together, they created the foundation that makes VLA+ possible. Today, we're taking you through this entire landscape – so you'll understand not just what these models do, but how to build them and where robotics is heading next.

In today’s episode, we will cover:

What are Vision-Language-Action models?
Gemini Robotics: Embodied reasoning is the key
π0: Building one general robot policy
SmolVLA: A VLA accessible on any device
Helix system for humanoid robots
Mixture-of-Experts as a VLA backbone
Action Chain-of-Thought: Moving reasoning to the action space
VLA-0: How to create a VLA by doing less
Microsoft's Rho-alpha (ρα): From VLA to VLA+
Conclusion
Sources and further reading

What are Vision-Language-Action models?

Before we dive into the competing approaches that led to VLA+, let's first understand what makes a Vision-Language-Action model work in the first place.

The idea of VLA is to help robots interpret what they see, understand instructions, and act in the physical world. To do this, VLAs combine perception, language understanding, and control in a single system. They push robot learning toward foundation-model-style control, where just one model can handle many tasks by leveraging pretrained multimodal knowledge.

Most VLA models are built around three core components:

Vision-Language backbone: VLAs typically start from a large Vision-Language Model (VLM) pretrained on image–text data. VLMs already know how to recognize objects, understand text, reason spatially, and even solve math problems.
Action interface: On top of the VLM, VLAs add a mechanism to produce robot actions. Depending on the design, this can be direct action prediction (continuous control), action chunks or trajectories, or structured action representations learned from demonstrations.
Multimodal inputs: VLAs usually condition on camera images, natural language instructions, and often robot state like joint positions, gripper state, and others.
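
To make the three components concrete, here is a minimal toy sketch in Python of how they fit together. It is not the architecture of any model named in this guide: `Observation`, `toy_backbone`, `toy_action_head`, and `vla_policy` are hypothetical names, the "backbone" is a hash-based stand-in for a pretrained VLM, and the numbers it produces are arbitrary placeholders. The only point is the data flow: image + instruction + robot state go in, a short chunk of actions comes out.

```python
from dataclasses import dataclass
from typing import List
import hashlib


@dataclass
class Observation:
    """Multimodal inputs: camera image, instruction, and robot state."""
    image: List[List[float]]       # toy grayscale image, values in [0, 1]
    instruction: str               # natural language command
    joint_positions: List[float]   # current joint angles
    gripper_open: bool             # current gripper state


def toy_backbone(image: List[List[float]], instruction: str, dim: int = 8) -> List[float]:
    """Stand-in for a pretrained VLM: fuses image and text into one embedding.

    Assumption: a real backbone would be a large neural network; here the
    mean pixel value and a hash of the text are used so the sketch runs anywhere.
    """
    pixels = [p for row in image for p in row]
    mean_pixel = sum(pixels) / len(pixels)
    digest = hashlib.sha256(instruction.encode("utf-8")).digest()
    text_features = [b / 255.0 for b in digest[: dim - 1]]
    return [mean_pixel] + text_features


def toy_action_head(embedding: List[float], obs: Observation, horizon: int = 4) -> List[List[float]]:
    """Action interface: turns the embedding plus robot state into an action chunk.

    Each action is a list of per-joint deltas followed by one gripper command.
    """
    chunk = []
    for t in range(horizon):
        deltas = []
        for j, q in enumerate(obs.joint_positions):
            # Placeholder "target" joint angle read from the embedding.
            target = embedding[(j + t) % len(embedding)] - 0.5
            deltas.append(0.1 * (target - q))  # small step toward the target
        # Placeholder rule: keep the current gripper state.
        gripper_cmd = 1.0 if obs.gripper_open else 0.0
        chunk.append(deltas + [gripper_cmd])
    return chunk


def vla_policy(obs: Observation, horizon: int = 4) -> List[List[float]]:
    """Full pipeline: multimodal inputs -> backbone embedding -> action chunk."""
    embedding = toy_backbone(obs.image, obs.instruction)
    return toy_action_head(embedding, obs, horizon)


obs = Observation(
    image=[[0.2, 0.8], [0.6, 0.4]],
    instruction="pick up the red block",
    joint_positions=[0.0] * 6,
    gripper_open=True,
)
actions = vla_policy(obs)
print(len(actions), len(actions[0]))  # 4 steps, each with 6 joint deltas + 1 gripper value
```

The design choice worth noticing is the separation: the backbone knows nothing about the robot, and the action head is the only part that touches robot state. The approaches covered in this guide differ mostly in what they put in place of the action head.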

A good VLA model should accomplish two missions well: preserve open-world reasoning from the VLM, and correctly turn that reasoning – what a robot sees and is told – into actions.

Now let's see the most illustrative ways teams have solved this challenge, each demonstrating a different architectural insight →


Turing Post