We present a cost-effective pretraining paradigm for VLA models using only synthetic data, achieving direct sim-to-real transfer and strong zero-shot generalizability for robotic grasping. Key contributions include:
- SynGrasp-1B: a billion-frame synthetic grasping dataset spanning 240 object categories and 10,000+ objects.
- GraspVLA: a VLA model pretrained on SynGrasp-1B that achieves zero-shot generalization to real-world grasping without fine-tuning.
- Unified CoT Framework: GraspVLA integrates autoregressive perception and flow-matching-based action generation into a single reasoning process, enabling joint training on synthetic action data and internet-scale semantic data for open-vocabulary grasping (see the sketch after this list).
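
To make the unified CoT idea concrete, below is a minimal sketch (not the released GraspVLA code) of how a flow-matching action head could be conditioned on perception tokens produced by an autoregressive backbone. All module names, dimensions, and the simple MLP/Euler-sampler design are illustrative assumptions, not the actual architecture.

```python
# Hypothetical sketch of flow-matching action generation conditioned on
# perception context; module names and sizes are illustrative only.
import torch
import torch.nn as nn


class FlowMatchingActionHead(nn.Module):
    """Predicts the flow-matching velocity field v(a_t, t | context)."""

    def __init__(self, action_dim=7, horizon=16, ctx_dim=512, hidden=512):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        self.net = nn.Sequential(
            nn.Linear(horizon * action_dim + ctx_dim + 1, hidden),
            nn.GELU(),
            nn.Linear(hidden, hidden),
            nn.GELU(),
            nn.Linear(hidden, horizon * action_dim),
        )

    def forward(self, noisy_actions, t, context):
        # noisy_actions: (B, horizon, action_dim), t: (B,), context: (B, ctx_dim)
        x = torch.cat([noisy_actions.flatten(1), context, t.unsqueeze(-1)], dim=-1)
        return self.net(x).view(-1, self.horizon, self.action_dim)

    @torch.no_grad()
    def sample(self, context, steps=10):
        # Euler integration of the learned velocity field from noise (t=0)
        # to an action chunk (t=1), rectified-flow style.
        b, device = context.shape[0], context.device
        a = torch.randn(b, self.horizon, self.action_dim, device=device)
        dt = 1.0 / steps
        for i in range(steps):
            t = torch.full((b,), i * dt, device=device)
            a = a + dt * self.forward(a, t, context)
        return a


def flow_matching_loss(head, actions, context):
    # Conditional flow matching: interpolate between noise and ground-truth
    # actions, and regress the constant velocity (actions - noise).
    noise = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], device=actions.device)
    a_t = (1 - t.view(-1, 1, 1)) * noise + t.view(-1, 1, 1) * actions
    v_pred = head(a_t, t, context)
    return ((v_pred - (actions - noise)) ** 2).mean()
```

In this kind of setup, only the action head needs action-labeled (synthetic) data, while the perception tokens that form its context can also be supervised on internet-scale semantic data, which is one plausible reading of how the joint training described above can work.
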
TODO List:
- Release the supplementary material
- Release model weights
- Release SynGrasp-1B dataset