3D mapping is vital for a broad range of applications that rely on a consistent and accurate representation of the environment. Change is an ever-present force in our world, and as a scene evolves its 3D map becomes outdated. Thus, a mapping framework that can adapt and refine 3D maps as the scene changes is necessary. In this letter, we propose a lifelong mapping framework where map maintenance serves two objectives: preserving static structures and refining the 3D map. To preserve only the static structures, we classify each object’s state and remove both dynamic objects and quasi-static objects, i.e., objects which temporarily appear static. For classifying the state of objects, we propose a discrete probabilistic solution utilizing a factor graph. Using this classification, we generate static maps from multiple sessions, which are used for map refinement. The refinement is based on change detection and map update, leveraging semantic and geometric information. For the evaluation, we collect a multi-campus lifelong dataset as an extension of the MCD datasets from the KTH and NTU campuses. The proposed approach accurately detects quasi-static objects even in highly dynamic environments, and our system demonstrates state-of-the-art performance in large-scale environments. Furthermore, our approach can handle both SLAM-generated and survey-grade maps.
ProbPer-LiLo: Probabilistic Persistency Modeling for Life-Long Mapping. Waqas Ali; Yixi Cai; Patric Jensfelt; Thien-Minh Nguyen. IEEE Robotics and Automation Letters, vol. 11, no. 3, pp. 2530–2537. Pub Date: 2026-01-12 | DOI: 10.1109/LRA.2026.3653311
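The factor-graph state classification described in the abstract above can be illustrated in miniature: a chain-structured factor graph over one object's discrete state reduces to a recursive Bayes filter. The sketch below is purely illustrative, not the paper's model; the three states are taken from the abstract, but the transition table and observation likelihoods are hypothetical values chosen for the example.

```python
# Illustrative sketch: a chain factor graph over a discrete object state,
# evaluated as a recursive Bayes filter across mapping sessions.
STATES = ["static", "quasi-static", "dynamic"]

# Hypothetical transition model: objects mostly keep their state between sessions.
TRANS = {
    "static":       {"static": 0.90, "quasi-static": 0.08, "dynamic": 0.02},
    "quasi-static": {"static": 0.15, "quasi-static": 0.70, "dynamic": 0.15},
    "dynamic":      {"static": 0.05, "quasi-static": 0.15, "dynamic": 0.80},
}

# Hypothetical observation likelihoods p(observed_moving | state).
OBS = {
    "static":       {True: 0.05, False: 0.95},
    "quasi-static": {True: 0.40, False: 0.60},
    "dynamic":      {True: 0.90, False: 0.10},
}

def update(belief, moved):
    """One predict + correct step for a boolean 'object seen moving' observation."""
    predicted = {s: sum(belief[t] * TRANS[t][s] for t in STATES) for s in STATES}
    unnorm = {s: predicted[s] * OBS[s][moved] for s in STATES}
    z = sum(unnorm.values())
    return {s: v / z for s, v in unnorm.items()}

belief = {s: 1.0 / len(STATES) for s in STATES}  # uniform prior
for moved in [False, False, True, False]:        # observations across sessions
    belief = update(belief, moved)
```

An object that never appears to move drives the belief toward "static", while a single "moved" observation shifts mass toward the quasi-static and dynamic states.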
Pub Date: 2026-01-12 | DOI: 10.1109/LRA.2026.3652073
Yixiang Chen;Yan Huang;Keji He;Peiyan Li;Liang Wang
When performing 3D manipulation tasks, robots have to execute action planning based on perception from multiple fixed cameras. The multi-camera setup introduces substantial redundancy and irrelevant information, which increases computational costs and forces the model to spend extra training time extracting crucial task-relevant details. To filter out redundant information and accurately extract task-relevant features, we propose the VERM (Virtual Eye for Robotic Manipulation) method, leveraging the knowledge in foundation models to imagine a virtual task-adaptive view from the constructed 3D point cloud, which efficiently captures necessary information and mitigates occlusion. To facilitate 3D action planning and fine-grained manipulation, we further design a depth-aware module and a dynamic coarse-to-fine procedure. Extensive experimental results on both the simulation benchmark RLBench and real-world evaluations demonstrate the effectiveness of our method, surpassing previous state-of-the-art methods while achieving a 1.89× speedup in training and a 1.54× speedup in inference.
VERM: Leveraging Foundation Models to Create a Virtual Eye for Efficient 3D Robotic Manipulation. IEEE Robotics and Automation Letters, vol. 11, no. 3, pp. 2482–2489.
Path planning in large-scale, complex 3D environments is fundamentally constrained by a trade-off between path quality and computational speed. This paper presents RUSH (Recursive and Scalable 3D Coarse To Fine Path Planning), a hierarchical framework that resolves this trade-off. RUSH decomposes the long-range planning task into a coarse plan followed by fine-grained, independent subproblems that can be solved in parallel. These subproblems are addressed by a unified, diffusion-based network that refines an initial path estimate by learning its residual to an optimal path. This approach allows RUSH to leverage rich geometric information directly from 3D voxel maps without being bottlenecked by the full map’s complexity. We validate our method on large-scale outdoor (KITTI, MulRan) and indoor (HM3D) datasets, each spanning a 200 m × 200 m × 6 m map. Experimental results demonstrate that RUSH generates feasible, high-quality paths with remarkable efficiency, achieving up to a 12.59× speedup over a hierarchically accelerated A* baseline, while maintaining a path cost within 24% of the optimal solution. This performance gain positions RUSH as a powerful and practical solution for applications requiring rapid global path planning in large-scale 3D maps.
RUSH: Recursive and Scalable 3D Coarse to Fine Path Planning. Hwajung Lee; Daegeol Ko; Jaehyuk Hur; Junwon Lee; Seongbo Ha; Jong Hwan Ko; Hyeonwoo Yu. IEEE Robotics and Automation Letters, vol. 11, no. 2, pp. 2346–2353. Pub Date: 2026-01-12 | DOI: 10.1109/LRA.2026.3653375
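The coarse-to-fine decomposition described above can be sketched on a toy 2D occupancy grid: plan on a conservatively downsampled grid, then refine each coarse leg as an independent subproblem (the paper solves these with a diffusion-based network, in parallel and in 3D; here a plain BFS stands in for both planner and refiner, and every name is illustrative).

```python
# Illustrative coarse-to-fine planning sketch (BFS stands in for the
# paper's diffusion-based subproblem refiner).
from collections import deque

def bfs(grid, start, goal):
    """Shortest 4-connected path on a 0/1 occupancy grid (1 = blocked)."""
    h, w = len(grid), len(grid[0])
    prev, q = {start: None}, deque([start])
    while q:
        node = q.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        x, y = node
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (x + dx, y + dy)
            if (0 <= nxt[0] < h and 0 <= nxt[1] < w
                    and grid[nxt[0]][nxt[1]] == 0 and nxt not in prev):
                prev[nxt] = node
                q.append(nxt)
    return None

def coarse_to_fine(grid, start, goal, factor=2):
    """Plan on a downsampled grid, then refine each coarse leg on the
    fine grid; the legs are independent and could run in parallel."""
    ch, cw = len(grid) // factor, len(grid[0]) // factor
    # A coarse cell is free only if every fine cell inside it is free.
    coarse = [[int(any(grid[i * factor + a][j * factor + b]
                       for a in range(factor) for b in range(factor)))
               for j in range(cw)] for i in range(ch)]
    cpath = bfs(coarse, (start[0] // factor, start[1] // factor),
                (goal[0] // factor, goal[1] // factor))
    wps = [start] + [(i * factor, j * factor) for i, j in cpath[1:-1]] + [goal]
    path = [start]
    for a, b in zip(wps, wps[1:]):
        path += bfs(grid, a, b)[1:]  # assumes each leg is feasible
    return path
```

Each fine BFS only sees the small region between two waypoints, which mirrors how the framework avoids being bottlenecked by the full map's complexity.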
Pub Date: 2026-01-12 | DOI: 10.1109/LRA.2026.3653306
Serhat İşcan;H. Iṣıl Bozma
This paper introduces a reliable and fast method for scene representation from a single RGB frame, even with human occlusion. Our goal is to enhance vision-based spatial reasoning in dynamic environments where human presence varies over time. Once humans are detected, the method addresses two key challenges: estimating the level of visual obstruction and generating a scene descriptor with humans removed. The first is handled via a novel visual obstruction measure that prevents descriptor generation under high occlusion. The second is addressed by adapting the previously presented bubble descriptor so that surface regions corresponding to detected humans are deformed using a modified spherical interpolation method—eliminating the need for inpainting or reconstruction and enabling rapid computation. We validate our approach through extensive comparisons across multiple datasets, including two new datasets collected using both stationary and mobile robots. Results show comparable representation quality with a 14–44× reduction in computation time.
Reliable and Fast Humans Removed Visual Scene Representation. IEEE Robotics and Automation Letters, vol. 11, no. 3, pp. 2730–2737.
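A minimal stand-in for the obstruction gate described above: the paper defines its measure on the bubble surface, but the idea of refusing descriptor generation under high occlusion can be sketched in image space. The threshold, the 0/1 mask format, and the mean-intensity "descriptor" are all hypothetical placeholders.

```python
# Illustrative sketch of an occlusion-gated, humans-removed descriptor.
def obstruction_ratio(human_mask):
    """human_mask: 2D list of 0/1, where 1 marks a detected-human pixel."""
    total = sum(len(row) for row in human_mask)
    covered = sum(sum(row) for row in human_mask)
    return covered / total

def maybe_descriptor(image, human_mask, max_obstruction=0.4):
    """Return a humans-removed descriptor, or None under heavy occlusion."""
    if obstruction_ratio(human_mask) > max_obstruction:
        return None  # too obstructed: any descriptor would be unreliable
    # Stand-in "descriptor": mean intensity over non-human pixels (the
    # paper instead deforms a bubble surface by spherical interpolation).
    vals = [p for i, row in enumerate(image)
              for j, p in enumerate(row) if not human_mask[i][j]]
    return sum(vals) / len(vals)
```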
Pub Date: 2026-01-12 | DOI: 10.1109/LRA.2026.3653334
Pablo E. Tortós-Vinocour;Shota Kokubu;Fuko Matsunaga;Yuxi Lu;Zhongchaou Zhou;Naoki Kamijo;María Cordero-Alvarado;Jose Gomez-Tames;Wenwei Yu
Soft actuators are safer than rigid robots for hand rehabilitation, yet their performance can be significantly affected by interactions that occur in multi-joint systems. Actuator–finger and actuator–actuator interactions can impact bending output and make actuator performance dependent on the actuation pattern. To address this, we developed and validated a finite element model of a three-joint modular actuator system attached to a dummy finger. The simulation revealed that the displacement between actuators and the contact area between the fingers and the actuators are key factors influencing actuator performance. We proposed a novel attachment method to enhance contact area and reduce actuator displacement and compared it against five existing designs across two actuator types and three actuation patterns. Our results demonstrate improved bending and reduced dependence of actuator performance on actuation pattern. This study makes a dual contribution to the area of soft robotics for hand rehabilitation by proposing a novel FEM framework for modeling soft actuators attached to multi-joint systems as well as providing insights on attachment method design for soft actuators in hand rehabilitation, emphasizing the importance of actuator–actuator and actuator–finger interactions.
Accounting for the Interaction Between a Dummy Finger and Joint Modular Soft Actuators for Multi-Joint Support Using a Novel FEM-Based Approach. IEEE Robotics and Automation Letters, vol. 11, no. 3, pp. 2578–2585. Open access: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11345949
Pub Date: 2026-01-12 | DOI: 10.1109/LRA.2026.3653374
Jiawei Zhang;Chengchao Bai;Wei Pan;Jifeng Guo
Multiple Peg-in-Hole (MPiH) assembly is one of the fundamental tasks in robotic assembly. In the MPiH tasks for large-size parts, it is challenging for a single manipulator to simultaneously align multiple distant pegs and holes, necessitating tightly coupled multi-manipulator systems. For such MPiH tasks using tightly coupled multiple manipulators, we propose a collaborative visual servo control framework that uses only the monocular in-hand cameras of each manipulator to reduce positioning errors. Initially, we train a state classification neural network and a positioning neural network. The former divides the states of the peg and hole in the image into three categories: obscured, separated, and overlapped, while the latter determines the position of the peg and hole in the image. Building on these networks, we propose a method to integrate the visual features of multiple manipulators using virtual forces, which can naturally combine with the cooperative controller of the multi-manipulator system. To generalize our approach to holes of different appearances, we varied the appearance of the holes during the dataset generation process. The results confirm that by considering the appearance of the holes, classification accuracy and positioning precision can be improved. Finally, the results show that our method achieves a 100% success rate in dual-manipulator dual peg-in-hole tasks with a clearance of 0.2 mm, while remaining robust to camera calibration errors.
Virtual-Force Based Visual Servo for Multiple Peg-in-Hole Assembly With Tightly Coupled Multi-Manipulator. IEEE Robotics and Automation Letters, vol. 11, no. 3, pp. 2586–2593.
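The virtual-force integration of per-camera visual features can be sketched as follows. This is a toy version with a hypothetical scalar spring gain: the paper's forces feed a cooperative controller of a tightly coupled multi-manipulator system, whereas this sketch only sums image-space forces and skips views the state classifier labels "obscured".

```python
# Illustrative sketch: peg-hole image offsets combined as virtual spring forces.
def virtual_force(peg_px, hole_px, gain=0.5):
    """2D virtual force pulling the peg toward the hole in image space."""
    return (gain * (hole_px[0] - peg_px[0]), gain * (hole_px[1] - peg_px[1]))

def combined_command(detections, gain=0.5):
    """Sum virtual forces over all in-hand cameras whose peg/hole pair is
    visible ('separated' or 'overlapped'), skipping 'obscured' views."""
    fx = fy = 0.0
    for state, peg, hole in detections:
        if state == "obscured":
            continue
        f = virtual_force(peg, hole, gain)
        fx, fy = fx + f[0], fy + f[1]
    return fx, fy
```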
High-definition (HD) map learning serves as an essential component of autonomous driving scene understanding, providing structured priors for planning and prediction. Recent transformer-based methods regress vectorized map elements via deformable attention over Bird’s-Eye View (BEV) features. They typically employ a single-pass paradigm, starting from a set of initial queries. However, these queries struggle to precisely localize map elements within the large-scale BEV space. This difficulty is severely amplified when using lightweight backbones that produce less distinctive features. To address this, we propose RefDiffMap, which recasts map construction as a progressive refinement process driven by a diffusion model. We introduce a novel denoising query generator that, at each step, leverages the intermediate noisy geometry to sample relevant features from adaptive BEV RoIs. These features are distilled into context-aware queries that guide the decoder’s next refinement. This creates a powerful geometry-feature co-evolution loop, allowing the model to iteratively correct localization errors. Comprehensive experiments show that RefDiffMap achieves competitive performance on the nuScenes and Argoverse 2 datasets. Notably, its robustness is highlighted with a ResNet-18 backbone, where it improves mAP by a significant 11.3% over our baseline MapTRv2. Further ablation studies validate the effectiveness of our approach.
RefDiffMap: Diffusion-Guided Progressive Refinement for Vectorized HD Map Construction. Wenjie Gao; Entao Chang; Jiawei Fu; Ziyu Zhu; Shitao Chen; Nanning Zheng. IEEE Robotics and Automation Letters, vol. 11, no. 3, pp. 2554–2561. Pub Date: 2026-01-12 | DOI: 10.1109/LRA.2026.3653402
Pub Date: 2026-01-12 | DOI: 10.1109/LRA.2026.3653335
Sanjeev Ramkumar Sudha;Marija Popović;Erlend M. Coates
Mobile robot platforms are increasingly being used to automate information gathering tasks such as environmental monitoring. Efficient target tracking in dynamic environments is critical for applications such as search and rescue and pollutant cleanups. In this letter, we study active mapping of floating targets that drift due to environmental disturbances such as wind and currents. This is a challenging problem as it involves predicting both spatial and temporal variations in the map due to changing conditions. We introduce an integrated framework combining dynamic occupancy grid mapping and an informative planning approach to actively map and track freely drifting targets with an autonomous surface vehicle. A key component of our adaptive planning approach is a spatiotemporal prediction network that predicts target position distributions over time. We further propose a planning objective for target tracking that leverages these predictions. Simulation experiments show that this planning objective improves target tracking performance compared to existing methods that consider only entropy reduction as the planning objective. Finally, we validate our approach in field tests, showcasing its ability to track targets in real-world monitoring scenarios.
An Informative Planning Framework for Target Tracking and Active Mapping in Dynamic Environments With ASVs. IEEE Robotics and Automation Letters, vol. 11, no. 3, pp. 2690–2697.
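The spatiotemporal prediction step described above can be caricatured on a toy occupancy grid. The paper uses a learned network to predict target position distributions; this sketch instead shifts probabilities by a known integer drift and decays them to model growing uncertainty, then picks the most probable predicted cell as a stand-in tracking objective. All parameters are hypothetical.

```python
# Illustrative sketch: constant-drift prediction of a target-occupancy grid.
def predict_grid(grid, drift, decay=0.9):
    """Shift target-occupancy probabilities by an integer drift (dx, dy)
    and decay them to reflect growing uncertainty over time."""
    h, w = len(grid), len(grid[0])
    out = [[0.0] * w for _ in range(h)]
    dx, dy = drift
    for x in range(h):
        for y in range(w):
            nx, ny = x + dx, y + dy
            if 0 <= nx < h and 0 <= ny < w:
                out[nx][ny] = decay * grid[x][y]
    return out

def best_cell(pred):
    """Toy tracking objective: steer toward the most probable predicted cell."""
    return max(((pred[x][y], (x, y))
                for x in range(len(pred)) for y in range(len(pred[0]))))[1]
```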
Pub Date: 2026-01-12 | DOI: 10.1109/LRA.2026.3653331
Ginga Kennis;Shogo Arai
Autonomous robotic handling requires accurate 3-D scene measurement followed by grasp planning. Conventional systems struggle with transparent or specular objects. Additionally, in hand–eye setups, moving through multiple viewpoints increases handling execution time. In this paper, we propose HEAPGrasp—Hand-Eye Active Perception to Grasp objects with diverse optical properties. To measure such objects, we focus on the ability to segment objects regardless of their optical properties in RGB images. We employ Shape from Silhouette based on the segmented images for 3-D measurement. To shorten the time required for multi-view capture with a hand-eye camera, we plan its trajectory using a cost function that balances 3-D measurement accuracy against its trajectory length. Real-robot experiments achieve a 96.0% grasp success rate on transparent, specular, and opaque objects, while reducing the hand-eye camera’s trajectory length by 52% and handling execution time by 19% relative to a baseline that circles around the scene for 3-D measurement.
HEAPGrasp: Hand-Eye Active Perception to Grasp Objects With Diverse Optical Properties. IEEE Robotics and Automation Letters, vol. 11, no. 3, pp. 3206–3213. Open access: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11345713
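A trajectory cost that balances measurement accuracy against path length, as in the abstract above, can be sketched generically. The weighting, the candidate viewpoint sequences, and the accuracy model below are all hypothetical stand-ins for whatever the system actually uses.

```python
# Illustrative sketch: choose a viewpoint sequence by trading off an
# accuracy score against the hand-eye camera's trajectory length.
import math

def trajectory_length(views):
    """Total Euclidean length of a sequence of viewpoint positions."""
    return sum(math.dist(a, b) for a, b in zip(views, views[1:]))

def plan_views(candidates, accuracy_of, weight=1.0):
    """Pick the candidate minimizing J = -accuracy + weight * length,
    a balance-style cost (accuracy_of is a caller-supplied model)."""
    return min(candidates,
               key=lambda vs: -accuracy_of(vs) + weight * trajectory_length(vs))
```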
Transparent objects are ubiquitous in laboratory automation settings, as liquids need to be visually controlled regularly. Automating laboratory processes would make the creation of small-batch medication feasible, thus making more personalized and better-targeted treatments more accessible. However, transparent objects present a major challenge for robust vision systems, in turn compromising their manipulation. Their appearance varies depending on the environment, and depth sensors fail to capture their measurements. These objects therefore break central assumptions made by depth-based as well as render-and-compare pose refinement strategies. To ensure reliable pose estimation, we propose Silhouette-based object pose Refinement (SilRef), a novel pose refinement approach leveraging object silhouette detection and geometric cues, circumventing the need for depth maps or realistic rendering, which makes it robust to environmental change. Our proposed formulation directly optimizes the poses by gradient descent based on rendering of 3D models and benefits from a large convergence basin. SilRef is evaluated on the Keypose dataset and the newly collected Tracebot In-Gripper dataset. Results show an improvement of 2.8x and 2.7x in Average Distance of Model Points-Symmetric (ADD-S@0.01 m) when the object is standing on a surface and when the object is already grasped, respectively, compared to Megapose6D and ICP (Iterative Closest Point).
{"title":"SilRef: Joint Visual Silhouette and Tactile Pose Optimization for Transparent Object Manipulation","authors":"Jean-Baptiste Weibel;Clemence Dubois;Negar Layegh Khavidaki;Saifeddine Aloui;Mathieu Grossard;Markus Vincze;Andreas Holzinger","doi":"10.1109/LRA.2026.3653340","DOIUrl":"https://doi.org/10.1109/LRA.2026.3653340","url":null,"abstract":"Transparent objects are ubiquitous in laboratory automation settings, as liquids need to be visually controlled regularly. Automating laboratory processes would make the creation of small-batch medication feasible, thus making more personalized and better-targeted treatments more accessible. However, transparent objects present a major challenge for robust vision systems, in turn compromising their manipulation. Their appearance varies depending on the environment and depth sensors fail to capture their measurements. These objects therefore break central assumptions made by depth-based as well as render-and-compare pose refinement strategies. To ensure reliable pose estimation, we propose Silhouette-based object pose Refinement (SilRef), a novel pose refinement approach leveraging object silhouette detection and geometric cues, circumventing the need for depth maps or realistic rendering making it robust to environment change. Our proposed formulation directly optimizes the poses by gradient descent based on 3D models rendering and benefits from a large convergence basin. SilRef is evaluated on the Keypose dataset and the newly collected Tracebot In-Gripper dataset. 
Results show an improvement of 2.8x and 2.7x in Average Distance of Model Points-Symmetric (ADD-S@0.01 m) when the object is standing on a surface and when the object is already grasped, respectively, compared to Megapose6D and ICP (Iterative Closest Point).","PeriodicalId":13241,"journal":{"name":"IEEE Robotics and Automation Letters","volume":"11 3","pages":"2490-2497"},"PeriodicalIF":5.3,"publicationDate":"2026-01-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11346999","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146001866","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
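The silhouette-driven gradient-descent refinement behind SilRef can be illustrated in a heavily simplified 2-D form: align a model contour to an observed silhouette contour by descending a chamfer-style loss. This toy uses a translation-only pose and numerical gradients instead of a differentiable renderer; all names and parameters are illustrative assumptions, not the paper's method.

```python
import numpy as np

def silhouette_loss(pose_xy, model_pts, target_pts):
    """Mean squared nearest-neighbour distance from the translated model
    contour points to the observed silhouette contour points."""
    moved = model_pts + pose_xy
    d = np.linalg.norm(moved[:, None, :] - target_pts[None, :, :], axis=2)
    return (d.min(axis=1) ** 2).mean()

def refine_pose(model_pts, target_pts, steps=200, lr=0.1, eps=1e-4):
    """Refine a 2-DoF translation by numerical gradient descent on the loss."""
    pose = np.zeros(2)
    for _ in range(steps):
        grad = np.zeros(2)
        for i in range(2):
            dp = np.zeros(2)
            dp[i] = eps
            # central-difference estimate of the partial derivative
            grad[i] = (silhouette_loss(pose + dp, model_pts, target_pts)
                       - silhouette_loss(pose - dp, model_pts, target_pts)) / (2 * eps)
        pose -= lr * grad
    return pose
```

Because the objective depends only on contour geometry, no depth map or photorealistic rendering is needed, mirroring the robustness argument in the abstract.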