This moment has finally come – we’re dedicating an entire article to robotics and to robots, the embodied entities of physical AI that exist in the real world. They move AI from digital space into physical environments, where machines can perceive, move, and act in real time. This means one key thing: robots now interact with the world on an entirely new level. And the big players – NVIDIA, Tesla, and even OpenAI (whose Sora 2 app is a big step forward) – are racing to master physical AI.
We’ve avoided this topic for a while because the technology stack behind robotics mostly includes vision-language-action (VLA) models, computer vision, world models, and mechanical systems – areas we’ve covered in other articles. Plus, robotics often overlaps with our Agentic AI series. Still, we can’t ignore that robotics is becoming its own field – not just about agentic behavior, vision, or physics, but everything combined. All these puzzle pieces come together to shape robots into what may soon become our everyday companions. How long will it take before living alongside robots feels normal?
We don’t know yet – but recent progress is clearly bringing that future closer. From Figure 03 and 1X’s NEO to Unitree’s quadrupeds and NVIDIA’s latest innovations, robotics is advancing faster than ever. In this article, we’ll explore the key technologies driving this shift, from the essential concepts behind modern robotic systems to illustrative technical details. It’s a must-read for everyone!
In today’s episode, we will cover:
Basic Terms to Become Fluent with Robotics
Robot Types
Four Main Pillars of Robotics
How Robots Learn Today
Reinforcement Learning in Robotics
Behavioral Cloning
Your potential home companions: Figure 03 and 1X’s NEO
Unitree Robots
NVIDIA Freshest Updates: Powering Robotics
Conclusion
Sources and further reading
Basic Terms to Become Fluent with Robotics
Robot Types
Terms are the base of every domain. If you’ve never dived deep into robotics before, this section may help you understand the subject. Today we can identify several main types of robots:
Humanoid robots – designed to look and move like humans (Figure 03, Tesla Optimus, Agility Digit, Boston Dynamics Atlas, 1X NEO)

Image Credit: Atlas, Boston Dynamics
Quadruped robots – four-legged robots inspired by animals, often dogs (Boston Dynamics Spot, Unitree Go2, ANYmal)

Manipulator robots, or robotic arms – used in factories and on manufacturing lines (where they are called industrial robots), in labs, and in hobby projects. Soon to be our surgeons as well.

Image Credit: Hugging Face LeRobot GitHub
Autonomous vehicles, like self-driving cars (Waymo, Tesla Autopilot), drones, or delivery robots (Serve Robotics, for example).

Image Credit: Serve Robotics
Exoskeletons – wearable robotic systems that assist or enhance human movement (Sarcos Guardian XO, Ekso Bionics).

Image Credit: The Robot Report
Today, machine learning (ML) and AI enrich these machines with new perception, understanding, and action capabilities. Whether a robot is a humanoid, a drone, or a manipulator, it needs four main pillars to function meaningfully.
Four Main Pillars of Robotics
Perception
Perception helps the robot see, hear, and sense its environment. Its main components include:
Computer vision: Cameras + neural networks detecting objects, people, and scenes.
Depth sensing: LiDAR (Light Detection and Ranging, a sensor that measures distance by firing laser pulses and timing how long they take to reflect back, producing a 3D point cloud), radar, or stereo vision to create 3D maps.
Touch and force sensing: Tactile sensors or joint torque sensors that estimate contact and grip strength.
Audio processing: Microphones with speech recognition models and sound localization.
AI models here help interpret the robot’s “senses.” They convert raw sensor data into a symbolic or numerical understanding of the world and handle object recognition, segmentation, motion detection, and similar tasks.
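To make the LiDAR description above concrete: the distance to a surface follows from the round-trip time of a laser pulse, d = c · t / 2. The snippet below is a minimal sketch of that arithmetic only. The function name and the sample timing are illustrative assumptions and do not come from any sensor SDK.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0  # speed of light in vacuum


def tof_distance_m(round_trip_time_s: float) -> float:
    """Distance to a surface from the round-trip time of one laser pulse.

    The pulse travels to the target and back, so the one-way distance
    is half of the total path length.
    """
    if round_trip_time_s < 0:
        raise ValueError("round-trip time cannot be negative")
    return SPEED_OF_LIGHT_M_PER_S * round_trip_time_s / 2.0


# Hypothetical reading: a pulse that returns after 100 nanoseconds.
distance = tof_distance_m(100e-9)
print(f"{distance:.2f} m")
```

A 100-nanosecond round trip corresponds to roughly 15 meters, which shows why LiDAR timing electronics need to resolve very small time differences.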
Localization and mapping (SLAM)
SLAM (Simultaneous Localization and Mapping) does two things at once:
Determines the robot’s position within a map (localization).
Builds or updates a map of surroundings in real time (mapping).
It uses algorithms such as Extended Kalman Filters, Graph-SLAM, or particle filters to align sensor readings over time and minimize positional drift.
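As a rough illustration of the filtering idea, here is a minimal sketch of a one-dimensional linear Kalman filter. It covers only the localization half of SLAM for a robot moving along a line, and it is a simplification: real systems estimate the map as well and use nonlinear models (hence the "Extended" in Extended Kalman Filter). All numbers and names below are hypothetical.

```python
def kalman_step(mean, var, motion, motion_var, measurement, meas_var):
    """One predict + update cycle of a 1D Kalman filter.

    mean, var        : current position estimate and its variance
    motion           : commanded displacement for this step
    motion_var       : variance added by imperfect motion
    measurement      : noisy position reading from a sensor
    meas_var         : variance of that sensor reading
    """
    # Predict: apply the commanded motion; uncertainty grows.
    mean_pred = mean + motion
    var_pred = var + motion_var

    # Update: blend prediction and measurement, weighted by certainty.
    gain = var_pred / (var_pred + meas_var)
    mean_new = mean_pred + gain * (measurement - mean_pred)
    var_new = (1.0 - gain) * var_pred
    return mean_new, var_new


# Hypothetical run: the robot is told to move 1 m per step and reads a
# noisy position sensor after each move.
mean, var = 0.0, 1.0
for motion, reading in [(1.0, 1.2), (1.0, 1.9), (1.0, 3.1)]:
    mean, var = kalman_step(mean, var, motion, 0.1, reading, 0.5)
    print(f"estimate={mean:.3f}  variance={var:.3f}")
```

Each update pulls the estimate toward the measurement and shrinks the variance, which is the mechanism that keeps positional drift in check.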
Motion
This covers how a robot acts physically, including locomotion, manipulation and whole-body control.

Image Credit: Robot Learning: A Tutorial
Locomotion means how a robot moves through its environment. It combines mechanics (hardware) and AI-based control (software).
The mechanical part provides the movement system – wheels, legs, tracks, wings, etc. Traditional motion relied on fixed programming with predictable, repetitive paths.
AI-powered locomotion adds adaptivity and autonomy. The AI part plans and adjusts movement: it balances the robot, reacts to unexpected obstacles, adapts to terrain (we discussed this in the interview with Spencer Huang from NVIDIA, stay tuned), learns optimal movement patterns through RL, coordinates complex motions like walking or climbing stairs, and plans paths in real time.
Manipulation is about how a robot interacts with and moves objects. AI systems here use visual and tactile feedback for adaptive grasping, support language-guided manipulation (like “pick up the right cup”), and predict object dynamics (for example, how items move or fall when pushed).
Mobile manipulation combines both and covers more complex cases, such as a robot moving through a room and picking up objects.
Whole-body control coordinates the movement of all of a robot’s parts and joints.
In general, there are two broad types of methods for generating robot motion:
Explicit (dynamics-based) models: These rely on mathematical equations to describe exactly how a robot’s parts move and interact with the environment. They come from physics, including things like forces, torques, and rigid-body dynamics.
Implicit (learning-based) models: They learn patterns directly from data by observing how robots move and respond in different situations.
Many modern robotic systems combine both ideas.
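The contrast between the two families can be sketched on a toy system. Below, the explicit model is the textbook inverse dynamics of a point-mass pendulum, written directly from physics. The "implicit" model recovers the same relationship purely from recorded samples with a least-squares fit. This is an illustrative assumption-heavy example: the pendulum parameters are made up, and real learning-based models typically use neural networks rather than two hand-picked features.

```python
import math
import random

MASS, LENGTH, GRAVITY = 1.0, 0.5, 9.81  # assumed pendulum parameters


def explicit_torque(q, qdd):
    """Explicit model: torque from rigid-body physics (angle q, accel qdd)."""
    return MASS * LENGTH**2 * qdd + MASS * GRAVITY * LENGTH * math.sin(q)


def fit_implicit_model(samples):
    """Implicit model: fit tau ~ a*qdd + b*sin(q) from (q, qdd, tau) data.

    Solves the 2x2 normal equations of ordinary least squares by hand,
    so no physical parameters are given to the fit.
    """
    sxx = sxy = syy = sxt = syt = 0.0
    for q, qdd, tau in samples:
        x, y = qdd, math.sin(q)
        sxx += x * x
        sxy += x * y
        syy += y * y
        sxt += x * tau
        syt += y * tau
    det = sxx * syy - sxy * sxy
    a = (sxt * syy - syt * sxy) / det
    b = (syt * sxx - sxt * sxy) / det
    return a, b


# Hypothetical "recorded" data: here it is generated from the explicit
# model, standing in for measurements taken on a real joint.
rng = random.Random(0)
data = []
for _ in range(200):
    q = rng.uniform(-math.pi, math.pi)
    qdd = rng.uniform(-5.0, 5.0)
    data.append((q, qdd, explicit_torque(q, qdd)))

a, b = fit_implicit_model(data)
print(f"learned a={a:.4f} (physics: {MASS * LENGTH**2:.4f})")
print(f"learned b={b:.4f} (physics: {MASS * GRAVITY * LENGTH:.4f})")
```

The fitted coefficients match the physical ones because the data is noise-free. With real sensor data the fit would only approximate them, which is one reason hybrid approaches are attractive.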
Planning and decision-making
This is the robot’s “thinking system.” Depending on the robot’s purpose, different AI models (RL-based models, world models, transformers, neural-symbolic planners) can be used to choose actions, search for and plan paths, and react to new information. These models use search algorithms, learn from feedback, and combine symbolic reasoning with neural decision policies to plan and execute high-level tasks while balancing goals, safety, and ethical constraints.
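As one small example of the search algorithms mentioned above, here is a minimal sketch of path planning with breadth-first search on an occupancy grid. The grid, the cell encoding ("#" for obstacles), and the function name are assumptions made for illustration. Production planners usually work in continuous space and use methods such as A* or sampling-based planning.

```python
from collections import deque


def shortest_path(grid, start, goal):
    """Breadth-first search on a 4-connected grid; '#' cells are obstacles.

    Returns the list of (row, col) cells from start to goal, or None
    if the goal cannot be reached.
    """
    rows, cols = len(grid), len(grid[0])
    parents = {start: None}          # also serves as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:  # walk back through parents
                path.append(cell)
                cell = parents[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != "#" and nxt not in parents:
                parents[nxt] = cell
                queue.append(nxt)
    return None


# Hypothetical map: a wall in column 2 forces a detour through the bottom row.
GRID = [
    "..#.",
    "..#.",
    "....",
]
path = shortest_path(GRID, (0, 0), (0, 3))
print(path)
```

Because breadth-first search expands cells in order of distance, the first time it reaches the goal it has found a path with the fewest moves.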
Optional but common add-ons may include:
Communication / Language understanding if the robot responds to human commands.
Learning and adaptation to improve over time.
Energy management for long operation.
One of the main shifts today is happening in robot learning. Hugging Face recently published “Robot Learning: A Tutorial,” which shows how the field has moved from traditional dynamics-based models (mathematical equations describing how forces and motions behave) to ML methods that have started to transform how robots plan and act. Let’s take a closer look at the general trends in robot learning.
How Robots Learn Today
Learning-based robotics has become very promising, and here are the main reasons why:
Traditional model-based control depends on precise physics and equations, custom planners, costly simulators and modular pipelines that process sensing, planning, and control separately. Learning-based methods allow robots to learn directly from experience, without relying on this heavy stack.
They combine perception and control in a single pipeline.
Adaptation to new robots and tasks happens with less manual tuning.
Robots can process raw, high-dimensional inputs (like camera images, proprioception, audio) and improve automatically as more data becomes available.
Learning-based robotics simplifies this stack by learning a single perception-to-action policy: observation → action (o → a). This is a visuomotor policy where a neural network maps sensor input directly to control actions.

Image Credit: Robot Learning: A Tutorial
These policies can be trained through two main paradigms:
Reinforcement Learning (RL): learn by trial and error, optimizing reward.
Behavioral Cloning (BC): learn by imitating demonstrations.

Image Credit: Robot Learning: A Tutorial
We hope that almost everyone is already familiar with reinforcement learning, but here are a few words about it in the context of robotics.
Reinforcement Learning in Robotics
RL trains an agent to learn a control policy that maximizes expected cumulative reward within a Markov Decision Process (MDP). In robotics, the following RL algorithms are widely used to learn control policies for tasks such as grasping or locomotion directly from interaction data:
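TRPO (Trust Region Policy Optimization): restricts each policy update to a “trust region” so that the new policy stays close to the old one.
PPO (Proximal Policy Optimization): a simpler successor to TRPO that limits the size of each update with a clipped objective.
SAC (Soft Actor-Critic): an off-policy RL method that learns a stochastic policy maximizing both expected reward and action entropy.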
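To show what "restricting the update" looks like in practice, here is a minimal sketch of the clipped surrogate term that PPO uses for a single sample. The clipping value of 0.2 is a commonly used default and is an assumption here. A full implementation would average this term over a batch and combine it with value and entropy losses.

```python
def ppo_clipped_term(ratio, advantage, clip_eps=0.2):
    """PPO clipped surrogate for one sample (to be maximized).

    ratio     : new_policy_prob / old_policy_prob for the taken action
    advantage : how much better the action was than expected
    clip_eps  : how far the ratio may move from 1 before gains are cut off
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum makes the objective pessimistic: the policy gets
    # no extra credit for moving further than the clip range allows.
    return min(ratio * advantage, clipped_ratio * advantage)


# Hypothetical samples: (probability ratio, advantage)
for ratio, adv in [(1.5, 1.0), (0.5, -1.0), (1.5, -1.0)]:
    print(ratio, adv, ppo_clipped_term(ratio, adv))
```

In the first two cases the clip caps how much the objective can improve, so there is no incentive to push the policy further. In the third case the policy has moved in the wrong direction, and the unclipped penalty is kept.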
Training usually begins in simulation to avoid hardware risk, and the reality gap (differences between simulated and real physics) is mitigated using Domain Randomization (DR), which randomizes parameters like friction or mass.
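Domain Randomization can be sketched in a few lines: before each simulated episode, physical parameters are resampled from ranges so the policy never overfits to one exact simulator setting. The parameter names and ranges below are hypothetical, and the step that would hand them to a simulator is left as a comment because it depends on the simulator in use.

```python
import random

# Hypothetical parameter ranges; real values depend on the robot and simulator.
PARAM_RANGES = {
    "friction": (0.5, 1.25),
    "payload_mass_kg": (0.0, 2.0),
    "motor_strength_scale": (0.8, 1.2),
}


def sample_sim_params(rng):
    """Draw one random set of physics parameters for a training episode."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in PARAM_RANGES.items()}


rng = random.Random(42)  # seeded so that runs are reproducible
for episode in range(3):
    params = sample_sim_params(rng)
    # A real pipeline would apply `params` to the simulator here,
    # then collect a rollout with the current policy.
    print(episode, {name: round(value, 3) for name, value in params.items()})
```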
Real-world RL is also used, and its efficiency is improved through off-policy learning, data reuse, and combining demonstrations with live rollouts. In real life, environments are never perfectly known or static: obstacles may move, sensors may be noisy, and models may be slightly wrong. To handle this, robots rely on feedback loops: they constantly compare what they expect to happen with what they actually observe, and adjust accordingly.
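A feedback loop of this kind can be illustrated with a simple proportional controller. In the sketch below, the robot's model assumes its actuator delivers exactly what is commanded, while the (hypothetical) real actuator delivers only 80% of it. Acting once on the model alone misses the target, whereas repeatedly measuring the error and correcting converges. All values are made up for illustration.

```python
def open_loop(target, actuator_scale=0.8):
    """Act once on the model's prediction, never checking the result."""
    command = target                  # the model believes this is enough
    return actuator_scale * command   # the real actuator under-delivers


def closed_loop(target, steps=50, gain=0.5, actuator_scale=0.8):
    """Repeatedly compare the observed position with the target and correct."""
    position = 0.0
    for _ in range(steps):
        error = target - position              # expected vs. observed
        command = gain * error                 # correct in proportion to error
        position += actuator_scale * command   # same imperfect actuator
    return position


print("open loop  :", open_loop(1.0))
print("closed loop:", round(closed_loop(1.0), 6))
```

With these values the remaining error shrinks by a factor of 0.6 on every step, so the closed loop reaches the target even though the model of the actuator is wrong.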
In RL for robotics, human feedback also plays a key role: if the robot makes a mistake, a human can step in and provide a correction, and the system stores that correction for future updates. Human-in-the-loop methods like HIL-SERL (Human-in-the-Loop Sample-Efficient Reinforcement Learning) achieve 99%+ success on complex real-world manipulation tasks within 1–2 hours of training on low-cost robots.
Behavioral Cloning
Behavioral Cloning (BC) lets robots learn directly from human demonstrations, imitating how humans act in recorded examples. Unlike RL, it doesn’t depend on exploration or on defining rewards, so it avoids the problem of designing a reward for every new task.
BC treats control as a supervised learning problem: it learns the mapping from observation to expert action (o → a). A robot’s performance improves as more human trajectories across different robots and tasks become available. However, BC can only be as good as the demonstrator.
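The supervised-learning view of BC can be shown on a toy problem. The sketch below fits a one-dimensional linear policy to (observation, action) pairs with closed-form least squares. The "expert" is a made-up function standing in for human demonstrations, and a real visuomotor policy would be a neural network trained on images and joint states rather than a line fit.

```python
def expert(observation):
    """Hypothetical demonstrator: the behavior we want to imitate."""
    return -2.0 * observation + 1.0


def fit_linear_policy(demos):
    """Fit action = w * observation + b to demonstrations by least squares."""
    n = len(demos)
    mean_o = sum(o for o, _ in demos) / n
    mean_a = sum(a for _, a in demos) / n
    cov = sum((o - mean_o) * (a - mean_a) for o, a in demos)
    var = sum((o - mean_o) ** 2 for o, _ in demos)
    w = cov / var
    b = mean_a - w * mean_o
    return lambda observation: w * observation + b


# Collect demonstrations: pairs of (observation, expert action).
demos = [(i / 10, expert(i / 10)) for i in range(-10, 11)]
policy = fit_linear_policy(demos)

print("expert action:", expert(0.3))
print("cloned action:", round(policy(0.3), 6))
```

No reward function appears anywhere in this code, which is the practical appeal of BC. The limitation is equally visible: the cloned policy can only reproduce what the demonstrations contain.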
There are other issues too. Basic BC simply imitates expert actions, so it can’t handle multimodal data (for example, several valid ways to solve one task). It also accumulates small errors over time in sequential control and doesn’t generalize well. That’s why BC needs a kind of “extension”: software architectures that shape how a robot’s policy (the decision-making model) is trained and run.
Here’s how it works:
Training happens off-robot
Robots (or humans) collect demonstration data, like camera feeds, joint angles, actions, etc.
This data is used to train models like VAEs, ACT, or Diffusion Policies on powerful GPUs (in the cloud or lab).
The trained model learns how to map observations → actions in smarter, more stable ways.
Then deployment happens on the robot
Once trained, the model is uploaded to the robot’s onboard computer.
During operation, the robot runs the model in real time:
Cameras or sensors provide observations.
The trained policy outputs actions (for example, move a joint or grasp an object).
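The train-off-robot, deploy-on-robot split described above can be sketched as follows. Everything here is a stand-in: the "training" is a one-parameter fit instead of GPU training, the exported policy is a small JSON string instead of a model checkpoint, and the sensor and actuator are plain Python callables. The function names are hypothetical.

```python
import json


def train_off_robot(demos):
    """Stand-in for training: fit action = w * obs through the origin."""
    num = sum(obs * act for obs, act in demos)
    den = sum(obs * obs for obs, _ in demos)
    return {"w": num / den}


def export_policy(params):
    """Serialize the trained parameters so they can be copied to the robot."""
    return json.dumps(params)


def load_policy(blob):
    """On the robot: rebuild a callable policy from the exported parameters."""
    params = json.loads(blob)
    return lambda obs: params["w"] * obs


def control_loop(policy, read_sensor, send_action, steps):
    """On the robot: observe, run the policy, act, and repeat."""
    for _ in range(steps):
        send_action(policy(read_sensor()))


# Off-robot: hypothetical demonstrations of (observation, action) pairs.
blob = export_policy(train_off_robot([(1.0, 0.5), (2.0, 1.0), (3.0, 1.5)]))

# On-robot: fake sensor readings and a list that records sent actions.
readings = iter([2.0, 4.0])
sent = []
control_loop(load_policy(blob), lambda: next(readings), sent.append, steps=2)
print(sent)
```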
The main generative models that are used for this process include:
Variational Autoencoders (VAEs) → model the distribution of expert behaviors, rather than only copying the main ones. They add stochasticity and capture multiple valid behaviors (like left vs. right grasp).

Image Credit: Robot Learning: A Tutorial
Diffusion models → generate smooth, realistic multimodal trajectories by progressively denoising noisy action samples into expert-like trajectories.
Action Chunking with Transformers (ACT) → combines Conditional VAEs (CVAEs) with Transformers to learn chunks of multiple consecutive actions. This helps to generate coherent action sequences and reduces error accumulation in long-horizon tasks.

Image Credit: Robot Learning: A Tutorial
Vision-Language-Action Models (VLAs) → these introduce a shift to foundation models for robotics. VLAs like π₀ and SmolVLA integrate language and vision with action prediction, often using transformer-based or flow-matching architectures. They use pretrained Vision-Language Models (VLMs) for semantic understanding and scale BC to many tasks and embodiments using vision and language grounding.

Image Credit: Robot Learning: A Tutorial
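The idea of action chunking from the list above can be illustrated without any neural network. In this minimal sketch, a stand-in policy returns several consecutive actions per call, the robot executes the whole chunk, and only then queries the policy again. It shows the execution pattern only, not the CVAE or Transformer that produce the chunks in ACT. The toy policy, the chunk size, and the one-dimensional "environment" are assumptions for illustration.

```python
CHUNK_SIZE = 4   # hypothetical number of actions predicted per policy call
TARGET = 10.0    # hypothetical goal position on a 1D track


def toy_predict_chunk(position):
    """Stand-in policy: plan CHUNK_SIZE equal steps toward TARGET."""
    step = (TARGET - position) / CHUNK_SIZE
    step = max(-0.5, min(0.5, step))      # limit the size of each action
    return [step] * CHUNK_SIZE


def toy_step_env(position, action):
    """Stand-in environment: the action is a displacement."""
    return position + action


def run_with_action_chunks(predict_chunk, step_env, obs, total_steps):
    """Execute whole chunks between policy queries."""
    executed, policy_calls = [], 0
    while len(executed) < total_steps:
        chunk = predict_chunk(obs)        # one query yields several actions
        policy_calls += 1
        for action in chunk:
            if len(executed) == total_steps:
                break
            obs = step_env(obs, action)
            executed.append(action)
    return obs, executed, policy_calls


final_pos, actions, calls = run_with_action_chunks(
    toy_predict_chunk, toy_step_env, obs=0.0, total_steps=12
)
print(f"position={final_pos}, actions={len(actions)}, policy calls={calls}")
```

Twelve actions are executed with only three policy calls. Fewer decision points means fewer opportunities for small errors to compound, which is the intuition behind chunking for long-horizon tasks.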
That covers the essentials of how robots are trained and how they function in general. Now it’s time to look at what today’s leading robots use under the hood. Let’s start with some impressive humanoids.
Your potential home companions: Figure 03 and 1X’s NEO
Figure 03
Figure AI has one specific goal – to build general-purpose humanoid robots that can handle any task a human can. Just recently they moved closer to this goal with Figure 03. This humanoid robot is designed for mass production, not just lab demos. It can wash dishes, clean floors, and do other chores all on its own, controlled through voice commands. What brings this to reality?
At the core of Figure’s robots is the Helix neural network, which acts as the robot’s “brain.”

Image Credit: Helix: A Vision-Language-Action Model for Generalist Humanoid Control
Helix uses two connected systems that think and act at different speeds:
System 2 (S2): A slower, smarter Vision-Language Model (VLM) that understands the scene and the user’s instructions. It runs about 7–9 times per second and handles big-picture reasoning.
System 1 (S1): A fast, reactive visuomotor model that turns S2’s understanding into smooth, continuous movements at 200 times per second. In other words, this part is responsible for “reflexes.”
So S2 runs in the background, considering goals and context and planning what to do, while S1 executes the plan in real time, controlling the robot’s body at high speed and adjusting on the fly.
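The two-speed arrangement can be sketched as a single loop in which a slow planner refreshes a shared "intent" every few ticks while a fast controller acts on every tick. This is only an illustration of the timing: the 8 Hz value is an assumed point inside the 7–9 Hz range quoted above, the planner and controller are stubs, and a real system would likely run the two models concurrently rather than in one thread.

```python
S1_HZ = 200                            # fast control rate from the text
S2_HZ = 8                              # assumed value within the 7-9 Hz range
TICKS_PER_S2_UPDATE = S1_HZ // S2_HZ   # 25 fast ticks per slow update


def run_dual_rate(seconds, slow_plan, fast_act):
    """Run a fast control loop that reuses the latest slow-planner output."""
    latent = None
    s2_updates = 0
    actions = []
    for tick in range(int(seconds * S1_HZ)):
        if tick % TICKS_PER_S2_UPDATE == 0:
            latent = slow_plan(tick)           # slow: scene + instruction
            s2_updates += 1
        actions.append(fast_act(latent, tick))  # fast: act on latest intent
    return s2_updates, actions


# Stub models: the planner returns a label, the controller pairs it
# with the current tick to stand in for a motor command.
s2_updates, actions = run_dual_rate(
    seconds=1,
    slow_plan=lambda tick: f"intent@{tick}",
    fast_act=lambda latent, tick: (latent, tick),
)
print(s2_updates, len(actions), actions[30])
```

Over one simulated second the planner runs 8 times and the controller 200 times, so every planner output is reused for 25 consecutive control steps.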
Helix was trained on about 500 hours of robot demonstrations. These were recorded by human operators using multiple robots. Then, an AI system automatically generated text instructions (like “place the cup in the drawer”) based on what the robot did in each video.
The model was trained end-to-end, directly mapping images and text commands to physical actions, without any manual fine-tuning or separate stages.
Thanks to the Helix “thinking engine,” Figure robots gain the following features:
Precise full-body motion: Helix smoothly controls 35 joints at once, from fingers to torso, to handle complex tasks like grasping and reaching.
Two-robot teamwork: Two identical robots can collaborate using only shared language prompts, with no role-specific programming.
Generalization: Helix can understand and act on abstract concepts like “Pick up the desert item,” recognizing that a toy cactus fits that description.
Another interesting thing is how Figure 03 robots are trained. Engineers, called “pilots,” move around a fake kitchen wearing VR headsets, and act out everyday chores like folding laundry, washing dishes, and folding towels. The robots watch and learn from these recordings.
So far, Helix learned how to fold towels using just 80 hours of video, and the company plans to scale that to millions of hours. Figure is building entire simulated homes and factory mock-ups inside its offices to record more data.
As for the other features, Figure 03 includes major mechanical and design improvements over its predecessor:
Smaller, stronger joints (actuators)
Slimmer hands with tactile pads and a palm camera
90% cheaper parts than before, to make mass production viable
Safer batteries and a lighter overall frame (making it less intimidating)
And memory: the robot can remember object locations, like where it placed the keys.
The previous version, Figure 02, has already been used in industry at BMW’s Spartanburg factory, moving parts for assembly of the BMW X3. At home, though, the robots still struggle with the most mundane things, such as folding shirts or picking up dropped laundry. This is Moravec’s Paradox in action: tasks that humans find trivial are hard for robots because of unpredictable shapes and textures. Figure’s CEO Brett Adcock says full home autonomy could come by 2026, but admits there’s “a big push” left to get there. Hype aside, most of Figure 03’s demos still rely on earlier robots, and real-world testing has just begun.
NEO by 1X
Another impressive humanoid for the home is NEO by 1X, which can be ordered now (the company announced it just yesterday).

Image Credit: NEO, 1X
But here, in our AI 101 series, we’re more interested in what’s inside NEO. It relies on two advanced AI systems that blend physical realism with simulated learning.
The 1X World Model acts as a virtual environment – a physics-based simulator that predicts what will happen before the robot takes any real-world action. This “hallucination” process lets the robot test ideas safely and quickly, helping engineers measure how well its AI model performs. Because everything is modeled in data, training and improving robot behavior happens far faster than through real-world trial and error.
The Redwood AI system gives NEO both movement and reasoning skills. Using BC and stereo vision, Redwood enables the robot to walk, run, kneel, or climb stairs with fluid control, all within one unified controller. At the same time, it’s also a vision-language transformer – it can understand what it sees and respond in context, whether folding laundry, answering the door, or navigating a home. Each new experience NEO gains strengthens Redwood’s models, making the robot more capable, adaptable, and natural in human environments.
These two developments show that humanoid robots are getting closer to entering our homes. Robots are becoming part of everyday life not just for professionals or industry, but also for ordinary people who simply want comfort (and their laundry folded!).
Unitree Robots
While the previous two robots specialize in home tasks, Unitree robots are built for broader use. You’ve probably seen those awesome videos of robots performing all kinds of tricks – yes, they are from Unitree.
One of them is a 1.3-meter-tall humanoid G1. It is built for research and development. It boasts “extra large joint movement space, 23-43 joint motors,” and features force-position hybrid control for its hands for precise operations. It learns from both imitation and RL and is powered by a UnifoLM (Unitree Robot Unified Large Model) world model.

Image Credit: Unitree G1
Another notable member of the Unitree robots family is Go2 – a bionic quadruped powered by advanced physical sensing and AI. At its heart lies a 360° × 90° ultra-wide 4D LiDAR (“L1”) with a minimum detection distance of just 0.05 m, enabling near-complete terrain awareness and all-ground navigation. On the actuation side, Go2 features a 15 kg aluminium-alloy + high-strength-engineering-plastic body, joint torque reaching ~45 N·m, and max speed up to ~5 m/s under lab conditions. This robot is trained via RL.

Image Credit: Unitree Go2
NVIDIA Freshest Updates: Powering Robotics
Maybe this is the part you’ve been waiting for: the latest updates straight from NVIDIA GTC. But there were no actual robot announcements. This time, NVIDIA’s innovations focus on building the backbone of physical AI and powering its functionality. Let’s unpack them one by one and see what they mean for the future of robotics.

Image Credit: Jensen Huang’s Keynote
In the keynote, Jensen Huang noted that physical AI needs three computers, much like how training a language model uses two (one for training and one for inference).
The Training Computer – The Grace Blackwell GB200 trains and evaluates large AI models that form the “brains” of robots.
The Simulation Computer – Built on Omniverse DSX, this system creates digital twins of robots, factories, and other environments where AI can safely learn through simulation. It handles generative AI, computer graphics, ray tracing, and sensor simulation, all essential for realistic virtual learning.
The Robotic Computer – This is the Jetson Thor platform, a compact system that goes inside real robots or self-driving vehicles. It runs the trained model, allowing robots to operate, move, and make real-time decisions in the physical world. We hope to get one next week and tell you more about it.

Image Credit: Jensen Huang Keynote
All three computers run on NVIDIA’s CUDA architecture, forming a complete pipeline where AI is trained, simulated, and deployed seamlessly. Together, they make it possible to build AI that understands physics, causality, and permanence, in other words, AI that truly grasps how the physical world works.
NVIDIA’s vision of Physical AI ties to the reindustrialization of America. In Houston, Foxconn is building a fully robotic facility for manufacturing NVIDIA’s AI infrastructure systems – a factory that’s born digital. Using Omniverse and Siemens’ digital twin technology, engineers can design, test, and optimize every system (mechanical, electrical, and plumbing) virtually before construction.
Inside these factories, robots trained in Isaac Sim assemble AI hardware, while fleets of autonomous machines coordinate tasks through Omniverse-based sensor simulations. AI agents from Metropolis and Cosmos oversee operations, detect anomalies, and even assist with worker onboarding through interactive coaching systems.
The result is a new kind of factory – a robot orchestrating other robots, designed, trained, and managed entirely through digital twins. Quite impressive.
Jensen Huang also highlighted how robotaxis are reaching a major inflection point. He announced a partnership with Uber to connect vehicles built on the NVIDIA Drive Hyperion platform into a global network. Drive Hyperion is NVIDIA’s end-to-end architecture for self-driving cars: it combines a comprehensive sensor suite (surround cameras, radar, and LiDAR) with redundant AI compute built for real-time perception, mapping, and decision-making. In the near future, passengers will be able to hail one of these AI-powered cars directly through Uber. We wonder what Tesla thinks about it…

Image Credit: Jensen Huang Keynote
Another interesting innovation is NVIDIA IGX Thor – a next-gen processor built to give robots and machines real-time intelligence right at the edge without cloud delay. It’s powered by NVIDIA’s Blackwell architecture and packs an integrated and discrete GPU combo that delivers up to eight times more AI power than before. In practice, that means a robot can process camera, LiDAR, and sensor data instantly, understand its surroundings, and make split-second decisions safely. Machines will move, see, and react more like humans, enabling smarter factory automation, safer medical robots, and faster, more adaptive physical AI systems.
Conclusion
Robotics is still in its early stage, and a few big leaps are required before robots can make high-quality physical AI a reality. Robots are moving into homes and industry, and the crucial step now is designing how all their systems work together – then training them in both simulated and real worlds, just as we’ve trained language and vision models. To get there, developers need to overcome the following challenges:
Reality gap – how different the simulated environment is from the real one.
Performance gap – how differently a robot behaves or performs in the real world compared to how it did in simulation.
There is a lot of research and work ahead.
Robotics is a symbiosis of everything – AI, mechanics, electronics, design – making it one of the most exciting and sought-after fields to succeed in.
Sources and further reading
LeRobot GitHub (Hugging Face)