SpatialVLA: a spatial embodied general manipulation model jointly launched by Shanghai AI Lab, China Telecom Artificial Intelligence Research Institute, and ShanghaiTech University
AI Product Observation


  • SpatialVLA
  • Robotics
  • 3D spatial understanding
  • Generalization
  • Adaptation
  • Spatial embedding
  • Pre-training
  • Fine-tuning
  • GitHub repository
  • HuggingFace model library

By Tina

March 27, 2025

What is SpatialVLA

SpatialVLA is a new spatial embodied general manipulation model jointly launched by Shanghai AI Lab, the China Telecom Artificial Intelligence Research Institute, and ShanghaiTech University. Pre-trained on millions of real robot data samples, it gives robots a general 3D spatial understanding capability. SpatialVLA fuses 3D spatial information with semantic features through Ego3D position encoding, discretizes continuous actions with adaptive action grids, and achieves generalized control across robot platforms. Pre-trained on large-scale real robot data, it shows strong zero-shot generalization and spatial understanding, and performs well in complex environments and multi-task scenarios. Its open-source code and flexible fine-tuning mechanism provide a new technical path for research and applications in robotics.

Main functions of SpatialVLA

Zero-shot generalization control: Perform manipulation directly in unseen tasks and environments, with no additional training required.

Efficient adaptation to new scenarios: Fine-tune with a small amount of data to quickly adapt to new robot platforms or tasks.

Powerful spatial understanding: Understand complex 3D spatial layouts and perform precise manipulation tasks such as object positioning, grasping, and placement.

Versatility across robot platforms: Support multiple robot embodiments and configurations with a shared manipulation policy.

Fast inference and efficient action generation: The discretized action space speeds up model inference, making the model suitable for real-time robot control.

Technical principles of SpatialVLA

Ego3D position encoding: Combine depth information with 2D semantic features to build a robot-centric 3D coordinate system. This eliminates the need for robot-specific camera calibration and lets the model perceive 3D scene structure, so it adapts to different robot platforms.
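To make the idea concrete, here is a minimal sketch in Python. It assumes a pinhole camera with known intrinsics (fx, fy, cx, cy); the function names and the sinusoidal encoding details are illustrative, not SpatialVLA's exact implementation.

```python
import numpy as np

def backproject_depth(depth, fx, fy, cx, cy):
    """Lift a depth map (H, W) to egocentric 3D points (H, W, 3) with a
    pinhole model. The frame is camera/robot-centric, so no robot-camera
    extrinsic calibration is needed."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)

def ego3d_position_encoding(points, dim=96):
    """Sinusoidal encoding of 3D coordinates: dim // 6 frequencies per axis,
    with sin and cos per frequency, analogous to transformer position codes."""
    n_freq = dim // 6
    freqs = 2.0 ** np.arange(n_freq)            # geometric frequency ladder
    scaled = points[..., None] * freqs          # (H, W, 3, n_freq)
    enc = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1)
    return enc.reshape(*points.shape[:2], -1)   # (H, W, dim)

# Fusion with the vision backbone's 2D semantic features (illustrative):
# fused = semantic_features + ego3d_position_encoding(points, dim=semantic_features.shape[-1])
```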

Adaptive action grid: Discretize continuous robot actions into adaptive grids, partitioning the action space according to the data distribution. Aligning different robots' actions to the grids enables cross-platform action generalization and transfer.
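One simple way to realize such a grid is per-dimension quantile binning, which places finer bins where actions concentrate. The sketch below illustrates that idea in NumPy; the bin count and the exact partitioning scheme are assumptions, not the paper's precise recipe.

```python
import numpy as np

def fit_adaptive_grid(actions, n_bins=256):
    """Fit per-dimension bin edges from the empirical action distribution,
    so each discrete bin covers roughly equal probability mass.
    `actions` is (N, D) with D continuous action dimensions."""
    qs = np.linspace(0.0, 1.0, n_bins + 1)
    return np.stack([np.quantile(actions[:, d], qs) for d in range(actions.shape[1])])

def discretize(action, edges):
    """Map a continuous action (D,) to discrete token indices (D,)."""
    return np.array([
        np.clip(np.searchsorted(edges[d], action[d]) - 1, 0, edges.shape[1] - 2)
        for d in range(len(action))
    ])

def decode(tokens, edges):
    """Recover a continuous action as the midpoint of each selected bin."""
    return np.array([(edges[d, t] + edges[d, t + 1]) / 2 for d, t in enumerate(tokens)])
```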

Spatial embedding adaptation: In the fine-tuning stage, re-partition the grid and adjust the spatial embeddings according to the new robot's action distribution. This provides a flexible and efficient post-training method that accelerates the model's adaptation to new environments.
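A hedged sketch of what this adaptation could look like, reusing fit_adaptive_grid from the previous snippet: re-fit the grid on the new robot's actions, then warm-start each new bin's embedding from the old bin with the nearest center. The embedding layout (one table per action dimension) is assumed for illustration.

```python
import numpy as np

def adapt_action_space(new_actions, old_edges, embed_table, n_bins=256):
    """Sketch of spatial-embedding adaptation at fine-tuning time:
    1) refit the quantile grid on the new robot's action distribution;
    2) warm-start each new bin's embedding from the old bin whose center
       is nearest, so pre-trained action semantics carry over."""
    new_edges = fit_adaptive_grid(new_actions, n_bins)
    old_centers = (old_edges[:, :-1] + old_edges[:, 1:]) / 2
    new_centers = (new_edges[:, :-1] + new_edges[:, 1:]) / 2
    new_embed = np.empty_like(embed_table)          # (D, n_bins, dim)
    for d in range(new_centers.shape[0]):
        nearest = np.abs(new_centers[d][:, None] - old_centers[d][None, :]).argmin(axis=1)
        new_embed[d] = embed_table[d, nearest]
    return new_edges, new_embed
```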

Pre-training and fine-tuning: Pre-train on large-scale real robot data to learn general manipulation strategies, then fine-tune on new tasks or robot platforms to further optimize performance.
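For orientation, the snippet below shows a loading-and-inference sketch based on the common Hugging Face remote-code pattern. The checkpoint id, prompt, and predict_action helper are assumptions modeled on typical VLA model cards; consult the GitHub repository and HuggingFace links below for the authoritative interface.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Hypothetical checkpoint id; see the HuggingFace link below for real ones.
model_id = "IPEC-COMMUNITY/spatialvla-4b-224-pt"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.bfloat16
).eval()

image = Image.open("observation.png").convert("RGB")
prompt = "What action should the robot take to pick up the cup?"
inputs = processor(images=[image], text=prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.predict_action(inputs)  # assumed helper exposed by the remote code
```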

SpatialVLA project address

Project website: https://spatialvla.github.io/

GitHub repository: https://github.com/SpatialVLA/SpatialVLA

HuggingFace model library: https://huggingface.co/IPEC-COMMUNITY/foundation-vision-language-action-model

arXiv technical paper: https://arxiv.org/pdf/2501.15830


