Autonomous grasping has long been a central topic in robotics, yet its deployment in small and medium-sized enterprises (SMEs) is still hindered by the need for low-level robot programming and the lack of natural language interaction. Recent Vision-Language-Action models (VLAs) allow robots to interpret natural language commands for intuitive interaction and control, but they still exhibit output uncertainty and are not yet well suited to directly generating reliable, precise actions in safety-critical industrial contexts. To address this gap, we present VL-GRiP3, a hierarchical Vision-Language model (VLM)-enabled pipeline for autonomous 3D robotic grasping that bridges natural language interaction and accurate, reliable manipulation in SME settings. The framework separates language understanding, perception, and action planning into distinct modules within a transparent architecture, improving flexibility and interpretability. Within this architecture, a single VLM backbone handles natural language interpretation, target perception, and high-level action planning. CAD-augmented point cloud registration then mitigates occlusions in single RGB-D views while keeping hardware cost low, and an M2T2-based grasp planner predicts accurate 3D grasp poses from the augmented point cloud, explicitly accounting for complex object geometry and enabling reliable manipulation of irregular industrial parts. Experiments show that our fine-tuned VLM modules achieve segmentation performance comparable to YOLOv8n, and VL-GRiP3 attains a 94.67% success rate over 150 randomized grasping trials. A comparative evaluation against state-of-the-art end-to-end VLAs further indicates that our modular, CAD-augmented design with explicit 3D grasp pose prediction yields more reliable and controllable behavior for SME manufacturing applications.
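To make the stated decomposition concrete, the sketch below outlines how such a modular pipeline could be wired together: language interpretation and target segmentation by the VLM backbone, CAD-based completion of the occluded single-view point cloud, and grasp pose prediction from the augmented cloud. This is a minimal illustrative sketch only; every function and data-structure name here (interpret_command, segment_target, register_cad, plan_grasp, GraspPose) is a hypothetical placeholder and does not correspond to the authors' code or to the real M2T2 API.

```python
# Illustrative sketch of a VL-GRiP3-style modular pipeline.
# All names below are hypothetical placeholders, not the paper's actual API.
from dataclasses import dataclass
import numpy as np


@dataclass
class GraspPose:
    """A 6-DoF grasp: rotation (3x3), translation (3,) in the camera frame, and a score."""
    rotation: np.ndarray
    translation: np.ndarray
    score: float


def interpret_command(vlm, instruction: str) -> dict:
    """Stage 1 (placeholder): the VLM backbone parses a natural-language
    instruction into a target object label and a high-level action plan."""
    return {"target": "bracket", "plan": ["locate", "grasp", "place"]}


def segment_target(vlm, rgb: np.ndarray, target: str) -> np.ndarray:
    """Stage 2 (placeholder): the same VLM backbone localizes the target in
    the RGB image and returns a boolean segmentation mask."""
    return np.zeros(rgb.shape[:2], dtype=bool)


def register_cad(partial_cloud: np.ndarray, cad_cloud: np.ndarray) -> np.ndarray:
    """Stage 3 (placeholder): register the object's CAD point cloud to the
    occluded single-view cloud and return the completed (augmented) cloud."""
    return np.concatenate([partial_cloud, cad_cloud], axis=0)


def plan_grasp(augmented_cloud: np.ndarray) -> GraspPose:
    """Stage 4 (placeholder): an M2T2-style planner would predict 6-DoF grasp
    poses from the augmented cloud; here we return a dummy pose."""
    return GraspPose(np.eye(3), augmented_cloud.mean(axis=0), score=1.0)


def run_pipeline(vlm, instruction, rgb, depth_cloud, cad_library) -> GraspPose:
    """End-to-end flow: language -> perception -> CAD augmentation -> grasp.
    depth_cloud is assumed to be an (H*W, 3) array aligned with the RGB image."""
    parsed = interpret_command(vlm, instruction)
    mask = segment_target(vlm, rgb, parsed["target"])
    partial = depth_cloud[mask.reshape(-1)]  # keep only points on the target object
    augmented = register_cad(partial, cad_library[parsed["target"]])
    return plan_grasp(augmented)
```

The design intent this sketch is meant to convey is that each stage exposes an inspectable intermediate result (parsed command, segmentation mask, augmented cloud, grasp pose), which is the source of the interpretability and controllability claims relative to end-to-end VLAs.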
