Embodied AI (Robotics)
AI research and product area focused on robots that can sense, plan, and act in the physical world. As of mid-2026, the field is undergoing a foundation-model moment — VLA (Vision-Language-Action) models are doing for robotics what LLMs did for language: training one model to handle diverse tasks across multiple embodiments.
Research & Commercial Dimensions
- Locomotion: Whole-body movement such as walking, running, jumping, and crawling. Addressed by GR00T's (nvidia-groot) GEAR-SONIC module (November 2025), which provides cross-embodiment motion tracking.
- Manipulation: Object interaction, grasping, tool use. The primary focus of physical-intelligence's π series and GR00T N1.
- Mobility: Navigation and transportation in unstructured environments. Still an active research gap in foundation models — most current VLA models are manipulation-heavy and need architectural extensions for mobility.
Key Models & Platforms (as of mid-2026)
| Model/Platform | Developer | Approach | Status |
|---|---|---|---|
| GR00T N1 → N1.7 (nvidia-groot) | NVIDIA | Per-embodiment VLA + WBC + SONIC locomotion | |
| π0 / π0.5 (physical-intelligence) | Physical Intelligence | Unified VLA across embodiments | |
| RT series | Google DeepMind | Transformer-based robotics | Research |
| Cosmos World Model | NVIDIA | World simulation for training | Active |
Architectural Divide: Unified vs Per-Embodiment
The defining debate in embodied AI foundation models:
- Unified approach (physical-intelligence): Train one VLA model that generalizes across many robot embodiments. Higher risk but potentially unlimited scalability — analogous to training one LLM for all language tasks.
- Per-embodiment approach (nvidia-groot, GR00T): Train or fine-tune a foundation model for each specific robot platform. More reliable per robot, but requires more training runs and data per embodiment.
VLA Architecture Building Blocks
Modern VLA models are converging on architectures borrowed from multimodal LLMs:
- ViT + MLP + LLM pipeline: Vision Transformer encodes images → MLP projector aligns to text space → LLM backbone generates actions as tokens. This LLaVA-style pattern is widely adopted.
- Action Expert / Flow Matching: Specialized modules for generating continuous action trajectories (vs discrete text tokens).
- Action Chunking: Predicting sequences of actions rather than single steps, improving smoothness and consistency.
- Cross-embodiment data: Training on data from multiple robot types, using embodiment tokens to condition behavior.
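The flow-matching and action-chunking ideas above can be illustrated with a minimal sketch. It is not taken from any of the models named here: `sample_action_chunk`, `toy_velocity`, and `target_chunk` are hypothetical names, and the velocity field is a closed-form toy standing in for a learned action expert. The sketch starts from Gaussian noise and integrates the velocity field with Euler steps to produce a whole chunk of continuous actions at once.

```python
import numpy as np


def sample_action_chunk(velocity_fn, horizon, action_dim, num_steps=10, seed=0):
    """Integrate a velocity field from noise (t=0) to an action chunk (t=1).

    velocity_fn(x, t) stands in for a learned action expert; here it is
    any callable returning an array with the same shape as x.
    """
    rng = np.random.default_rng(seed)
    # Start from noise with one row per future time step (the "chunk").
    x = rng.standard_normal((horizon, action_dim))
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)  # forward Euler step
    return x


# Toy target: an 8-step, 2-DoF trajectory ramping from 0 to 1 (assumed data).
target_chunk = np.linspace(0.0, 1.0, 8)[:, None] * np.ones((1, 2))


def toy_velocity(x, t):
    # Exact velocity of the straight-line path x_t = (1 - t) * x_0 + t * x_1.
    # A trained model would approximate this from observations instead.
    return (target_chunk - x) / (1.0 - t)


chunk = sample_action_chunk(toy_velocity, horizon=8, action_dim=2)
print(chunk.shape)
```

Because the toy field follows straight-line paths, a handful of Euler steps is enough to land on the target chunk. A learned field is only approximately straight, so the number of integration steps becomes a latency versus accuracy trade-off.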
Open Questions
- Can mobility be cleanly added to manipulation-first VLA architectures?
- Will the unified or per-embodiment approach win at scale?
- How much real-world robot data is needed vs simulation? (GR00T's BONES-SEED: 142K+ human motions / ~288 hours of data provides one data point.)
- World model evaluation vs sim2real: Is using world models to judge robot policies a more scalable alternative to sim2real transfer metrics? Physical Intelligence's June 18, 2026 release argues yes: its self-consistency check (reverse dynamics → detect physics violations → early termination) provides an evaluation signal that conventional success-rate metrics miss [4].
World Model Evaluation (2026)
A new evaluation paradigm emerged in mid-2026: using world models as robot policy judges rather than just as training simulators.
Physical Intelligence (June 18, 2026) released a methodology that benchmarks 7 VLA models against world-model simulation criteria and compares the results with existing frameworks (Isaac Lab-Arena, WorldEval, dWorldEval) [4].
The key technical innovation is reverse-dynamics self-consistency detection:
- The world model simulates the robot's action sequence forward
- Inverse dynamics are computed on the resulting state
- If the reconstructed scene deviates from physical reality (e.g., objects passing through walls, impossible joint angles), evaluation terminates early
- This catches physically implausible behavior that traditional success-rate metrics miss
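One way to read the steps above is as a loop that rolls the world model forward, recovers the action implied by each transition, and stops as soon as the two disagree. The sketch below is an interpretation under stated assumptions, not the released methodology: the 1-D point-mass state, the names `evaluate_with_consistency`, `good_model`, `glitchy_model`, and `inverse_dynamics`, and the tolerance are all hypothetical.

```python
from typing import Callable, Sequence, Tuple

DT = 0.1  # assumed simulation time step


def evaluate_with_consistency(
    state: float,
    actions: Sequence[float],
    world_model: Callable[[float, float], float],
    inverse_dynamics: Callable[[float, float], float],
    tol: float = 1e-3,
) -> Tuple[bool, int]:
    """Return (consistent, steps_completed), terminating early on a violation."""
    for step, action in enumerate(actions):
        next_state = world_model(state, action)          # forward simulation
        recovered = inverse_dynamics(state, next_state)  # action implied by the transition
        if abs(recovered - action) > tol:
            return False, step  # physically implausible transition: stop here
        state = next_state
    return True, len(actions)


def good_model(s: float, a: float) -> float:
    # Toy world model: position advances by velocity * DT.
    return s + a * DT


def glitchy_model(s: float, a: float) -> float:
    # Toy failure mode: the object "teleports" once it passes 0.25.
    nxt = s + a * DT
    return nxt + 1.0 if nxt > 0.25 else nxt


def inverse_dynamics(s: float, s_next: float) -> float:
    # Velocity that would explain the observed transition.
    return (s_next - s) / DT


actions = [1.0] * 5
ok_result = evaluate_with_consistency(0.0, actions, good_model, inverse_dynamics)
bad_result = evaluate_with_consistency(0.0, actions, glitchy_model, inverse_dynamics)
print(ok_result, bad_result)
```

In this toy setup the glitchy rollout is cut off at the step where the jump occurs, which is the kind of signal a final success or failure label alone would not expose.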
The intended role is analogous to what MMLU and GSM8K did for language models: a standardized, objective benchmark that the entire field can compete on. The approach works alongside existing frameworks:
- WorldEval / dWorldEval: Community benchmarks for world model quality
- Isaac Lab-Arena: NVIDIA's simulation-based robotics evaluation platform
- Sim2real transfer: Traditional approach measuring real-world performance after sim training
NVIDIA Cosmos (world model) and Physical Intelligence's π (VLA policy) represent the two sides of this evaluation coin: Cosmos as the simulator and judge, π as the policy being judged [4].
Related Entities
- physical-intelligence — Unified VLA foundation model company
- nvidia-groot — NVIDIA's open humanoid robot platform
- deepseek — Model research background
- cerebras — Hardware considerations for robot inference