AI 101: An Insightful Guide to VLA/VLA+ models

How Microsoft's Rho-alpha, Google's Gemini Robotics, and six other breakthroughs built the bridge from VLA to VLA+ – the complete snapshot of where we are today

For a long time, robots worked in structured settings, mostly in industry, where they served as ambassadors of automation, following fixed programs with a defined starting point and outcome. There was no need for flexibility, because everything followed strict rules. Now robots are entering the real world, full of chaos and constant change, and they are learning to perceive, think, and act continuously. That's why everyone is talking about Physical AI – it's what makes this robotic shift possible.

And today, we're witnessing the next evolution: VLA+ models.

Just this morning, Microsoft Research released Rho-alpha (ρα) – their first robotics model, built on the Phi family – and it's doing something fundamentally different. While most Vision-Language-Action (VLA) models perceive the world only through vision and language, Rho-alpha adds tactile sensing to feel objects during manipulation, and online learning that lets it improve from human corrections in real time, even after deployment. Microsoft is calling this a VLA+ model, positioning it as an extension beyond what current VLA systems support. But to understand why this "plus" matters, we need to understand what came before.

Vision-Language-Action models (VLAs) are the core of Physical AI: they are the bridges between the real world and robots, translating unstructured human language and sensory input into executable robotic actions while keeping the explicit vision → language → action connection constant. So how did we get to VLA+?

The VLA journey reveals strikingly different visions. Google bet on massive scale and embodied reasoning (Gemini Robotics). Physical Intelligence built general policies through flow-based action modeling (π0). Hugging Face went the opposite direction, making VLAs tiny enough for consumer GPUs (SmolVLA). Figure gave humanoid robots a dual-system brain inspired by human cognition (Helix). Chinese researchers used mixture-of-experts to prevent knowledge loss (ChatVLA-2). One team moved reasoning into action space itself (ACoT-VLA). And NVIDIA proved you could reach state-of-the-art by treating actions as simple text (VLA-0).

Each approach reveals a different insight into connecting vision, language, and action. Together, they created the foundation that makes VLA+ possible. Today, we're taking you through this entire landscape – so you'll understand not just what these models do, but how to build them and where robotics is heading next.

In today’s episode, we will cover:

  • What are Vision-Language-Action models?

  • Gemini Robotics: Embodied reasoning is the key

  • π0: Building one general robot policy

  • SmolVLA: A VLA accessible on any device

  • Helix system for humanoid robots

  • Mixture-of-Experts as a VLA backbone

  • Action Chain-of-Thought: Moving reasoning to the action space

  • VLA-0: How to create a VLA by doing less

  • Microsoft's Rho-alpha (ρα): From VLA to VLA+

  • Conclusion

  • Sources and further reading

What are Vision-Language-Action models?

Before we dive into the competing approaches that led to VLA+, let's first understand what makes a Vision-Language-Action model work in the first place.

The idea behind VLAs is to help robots interpret what they see, understand instructions, and act in the physical world. To do this, VLAs combine perception, language understanding, and control in a single system. They push robot learning toward foundation-model-style control, where one model can handle many tasks by leveraging pretrained multimodal knowledge.

Most VLA models are built around three core components – sketched in code right after this list:

  • Vision-Language backbone: VLAs typically start from a large Vision-Language Model (VLM) pretrained on image–text data. VLMs already know how to recognize objects, understand text, reason spatially, and even solve math problems.

  • Action interface: On top of the VLM, VLAs add a mechanism to produce robot actions. Depending on the design, this can be direct action prediction (continuous control), action chunks or trajectories, or structured action representations learned from demonstrations.

  • Multimodal inputs: VLAs usually condition on camera images, natural language instructions, and often robot state, such as joint positions and gripper status.
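
To make these pieces concrete, here is a minimal, hypothetical PyTorch-style sketch of how the three components are typically wired together. Every class and module name below is an illustrative placeholder, not the API of any model covered in this post, and a real VLA would start from a pretrained VLM rather than the tiny stand-in encoders shown here.

```python
import torch
import torch.nn as nn

class TinyVLAPolicy(nn.Module):
    """Illustrative VLA skeleton: a vision-language backbone plus an action head.

    All module names are hypothetical placeholders; a real VLA would reuse a
    pretrained VLM instead of these tiny stand-in encoders.
    """

    def __init__(self, embed_dim=256, vocab_size=10_000, state_dim=8, action_dim=7):
        super().__init__()
        # 1) Vision-Language backbone (stand-in for a pretrained VLM)
        self.vision_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, embed_dim),
        )
        self.text_encoder = nn.EmbeddingBag(vocab_size, embed_dim)  # token ids -> one embedding
        # 3) Multimodal inputs: proprioceptive robot state (joint positions, gripper)
        self.state_encoder = nn.Linear(state_dim, embed_dim)
        # 2) Action interface: a simple continuous head regressing one action vector
        self.action_head = nn.Sequential(
            nn.Linear(embed_dim * 3, 256), nn.ReLU(), nn.Linear(256, action_dim),
        )

    def forward(self, image, instruction_tokens, robot_state):
        fused = torch.cat([
            self.vision_encoder(image),
            self.text_encoder(instruction_tokens),
            self.state_encoder(robot_state),
        ], dim=-1)
        return self.action_head(fused)  # e.g. a 6-DoF end-effector delta plus a gripper command

# Forward pass with random tensors, just to show the expected shapes
policy = TinyVLAPolicy()
action = policy(
    image=torch.rand(1, 3, 224, 224),                      # camera frame
    instruction_tokens=torch.randint(0, 10_000, (1, 12)),  # e.g. "pick up the red block"
    robot_state=torch.rand(1, 8),                          # joint positions + gripper
)
print(action.shape)  # torch.Size([1, 7])
```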

A good VLA model should accomplish two missions well: preserve open-world reasoning from the VLM, and correctly turn that reasoning – what a robot sees and is told – into actions.
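
For the second mission, one common action-interface choice – used by approaches that treat actions as text, like VLA-0 – is to discretize each continuous action dimension into bins so the language backbone can emit actions as ordinary tokens. The sketch below illustrates the idea; the bin count and value range are assumptions for illustration, not any particular model's settings.

```python
import numpy as np

# Discretize each continuous action dimension into N bins so a language model
# can output actions as tokens. Bin count and range are illustrative only.
N_BINS = 256
LOW, HIGH = -1.0, 1.0  # assume actions are normalized to [-1, 1] per dimension

def action_to_tokens(action: np.ndarray) -> np.ndarray:
    """Map a continuous action vector to integer bin indices (one token per dimension)."""
    clipped = np.clip(action, LOW, HIGH)
    return np.round((clipped - LOW) / (HIGH - LOW) * (N_BINS - 1)).astype(int)

def tokens_to_action(tokens: np.ndarray) -> np.ndarray:
    """Invert the discretization (up to quantization error)."""
    return tokens / (N_BINS - 1) * (HIGH - LOW) + LOW

a = np.array([0.03, -0.41, 0.0, 0.77, -1.0, 0.12, 1.0])  # e.g. 6-DoF delta + gripper
toks = action_to_tokens(a)
print(toks)                    # seven integers in [0, 255] the backbone can emit as text
print(tokens_to_action(toks))  # close to the original action
```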

Now let's see the most illustrative ways teams have solved this challenge, each demonstrating a different architectural insight →

Don’t settle for shallow articles. Learn the basics and go deeper with us. Truly understanding things is deeply satisfying →

Join Premium members from top companies like Microsoft, Nvidia, Google, Hugging Face, OpenAI, and a16z, plus research labs and universities such as Ai2, MIT, and Berkeley, government agencies, and thousands of others to really understand what's going on in AI.
