Pub Date: 2026-01-22 | DOI: 10.1109/LRA.2026.3656784
Arjun Gupta;Rishik Sathua;Saurabh Gupta
Many everyday mobile manipulation tasks require precise interaction with small objects, such as grasping a knob to open a cabinet or pressing a light switch. In this letter, we develop Visual Servoing with Vision Models (VSVM), a closed-loop framework that enables a mobile manipulator to tackle such precise tasks involving the manipulation of small objects. VSVM uses state-of-the-art vision foundation models to generate 3D targets for visual servoing, enabling diverse tasks in novel environments. Naively doing so fails because of occlusion by the end-effector. VSVM mitigates this using vision models that out-paint the end-effector, thereby significantly enhancing target localization. We demonstrate that, aided by out-painting methods, open-vocabulary object detectors can serve as a drop-in module for VSVM to seek semantic targets (e.g., knobs), and point tracking methods can help VSVM reliably pursue interaction sites indicated by user clicks. We conduct a large-scale evaluation spanning experiments in 10 novel environments across 6 buildings, covering 72 different object instances. VSVM obtains a 71% zero-shot success rate on manipulating unseen objects in novel environments in the real world, outperforming an open-loop control method by an absolute 42% and an imitation learning baseline trained on 1000+ demonstrations by an absolute 50%.
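To make the servoing core concrete, here is a minimal Python sketch of the step the abstract describes: given a target pixel detected (by an open-vocabulary detector or a tracked user click) in an out-painted frame, back-project it to a 3D point and compute a proportional velocity command toward it. This is an illustration only, not the authors' code; the function names, camera intrinsics, and gains below are hypothetical placeholders.

```python
# Illustrative sketch (not the authors' implementation) of one visual-servoing
# iteration: pixel + depth -> 3D target -> clipped proportional velocity command.
import numpy as np

def pixel_to_point(u, v, z, fx, fy, cx, cy):
    """Back-project pixel (u, v) with depth z (m) into the camera frame."""
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

def servo_command(target_cam, ee_cam, gain=0.5, max_speed=0.05):
    """Proportional 3D velocity toward the target, clipped for safety."""
    error = target_cam - ee_cam            # position error in the camera frame
    v = gain * error
    speed = np.linalg.norm(v)
    if speed > max_speed:
        v *= max_speed / speed
    return v

# Example: a knob detected at pixel (412, 236) with 0.35 m depth in the
# out-painted frame (all numbers are placeholders).
target = pixel_to_point(412, 236, 0.35, fx=615.0, fy=615.0, cx=320.0, cy=240.0)
ee_cam = np.array([0.02, -0.01, 0.30])     # current end-effector estimate
velocity = servo_command(target, ee_cam)
```

In the closed loop described above, the detection or point-tracking step would be rerun on each out-painted frame so the 3D target is continually refreshed as the end-effector approaches it.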
{"title":"Precise Mobile Manipulation of Small Everyday Objects","authors":"Arjun Gupta;Rishik Sathua;Saurabh Gupta","doi":"10.1109/LRA.2026.3656784","DOIUrl":"https://doi.org/10.1109/LRA.2026.3656784","url":null,"abstract":"Many everyday mobile manipulation tasks require precise interaction with small objects, such as grasping a knob to open a cabinet or pressing a light switch. In this letter, we develop Visual Servoing with Vision Models (VSVM), a closed-loop framework that enables a mobile manipulator to tackle such precise tasks involving the manipulation of small objects. VSVM uses state-of-the-art vision foundation models to generate 3D targets for visual servoing to enable diverse tasks in novel environments. Naively doing so fails because of occlusion by the end-effector. VSVM mitigates this using vision models that out-paint the end-effector thereby significantly enhancing target localization. We demonstrate that aided by out-painting methods, open-vocabulary object detectors can serve as a drop-in module for VSVM to seek semantic targets (e.g. knobs) and point tracking methods can help VSVM reliably pursue interaction sites indicated by user clicks. We conduct a large-scale evaluation spanning experiments in 10 novel environments across 6 buildings including 72 different object instances. VSVM obtains a 71% zero-shot success rate on manipulating unseen objects in novel environments in the real world, outperforming an open-loop control method by an absolute 42% and an imitation learning baseline trained on 1000+ demonstrations also by an absolute success rate of 50% .","PeriodicalId":13241,"journal":{"name":"IEEE Robotics and Automation Letters","volume":"11 3","pages":"3214-3221"},"PeriodicalIF":5.3,"publicationDate":"2026-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146082068","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-22 | DOI: 10.1109/LRA.2026.3656795
Juana Valeria Hurtado;Rohit Mohan;Abhinav Valada
Hyperspectral imaging (HSI) captures spatial information along with dense spectral measurements across numerous narrow wavelength bands. This rich spectral content has the potential to facilitate robust robotic perception, particularly in environments with complex material compositions, varying illumination, or other visually challenging conditions. However, current HSI semantic segmentation methods underperform due to their reliance on architectures and learning frameworks optimized for RGB inputs. In this work, we propose a novel hyperspectral adapter that leverages pretrained vision foundation models to effectively learn from hyperspectral data. Our architecture incorporates a spectral transformer and a spectrum-aware spatial prior module to extract rich spatial-spectral features. Additionally, we introduce a modality-aware interaction block that facilitates effective integration of hyperspectral representations and frozen vision Transformer features through dedicated extraction and injection mechanisms. Extensive evaluations on three benchmark autonomous driving datasets demonstrate that our architecture achieves state-of-the-art semantic segmentation performance while directly using HSI inputs, outperforming both vision-based and hyperspectral segmentation methods.
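As an illustration of the adapter pattern the abstract describes (a spectral branch feeding a frozen vision Transformer through dedicated extraction and injection mechanisms), here is a minimal PyTorch-style sketch. The layer counts, dimensions, and module names are assumptions for illustration, not the paper's architecture.

```python
# Minimal sketch, assuming a per-pixel spectral transformer and cross-attention
# based extraction/injection between spectral tokens and frozen ViT tokens.
import torch
import torch.nn as nn

class SpectralEncoder(nn.Module):
    """Treats each pixel's spectrum as a token sequence over wavelength bands."""
    def __init__(self, num_bands, dim=256):
        super().__init__()
        self.band_embed = nn.Linear(1, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2)
        self.proj = nn.Linear(dim, dim)

    def forward(self, hsi):                       # hsi: (B, num_bands, H, W)
        B, C, H, W = hsi.shape
        tokens = hsi.permute(0, 2, 3, 1).reshape(B * H * W, C, 1)
        feats = self.encoder(self.band_embed(tokens)).mean(dim=1)
        return self.proj(feats).view(B, H, W, -1).permute(0, 3, 1, 2)

class InteractionBlock(nn.Module):
    """Exchanges information between frozen ViT tokens and spectral tokens."""
    def __init__(self, dim=256):
        super().__init__()
        self.extract = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.inject = nn.MultiheadAttention(dim, 8, batch_first=True)

    def forward(self, vit_tokens, spec_tokens):   # both: (B, N, dim)
        # Extraction: spectral branch queries the frozen ViT features.
        spec_tokens = spec_tokens + self.extract(spec_tokens, vit_tokens, vit_tokens)[0]
        # Injection: ViT tokens are updated with spectral context.
        vit_tokens = vit_tokens + self.inject(vit_tokens, spec_tokens, spec_tokens)[0]
        return vit_tokens, spec_tokens
```

A full model along these lines would stack several such interaction blocks alongside the frozen Transformer and decode the fused features into segmentation logits; the sketch only shows the coupling mechanism.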
{"title":"Hyperspectral Adapter for Semantic Segmentation With Vision Foundation Models","authors":"Juana Valeria Hurtado;Rohit Mohan;Abhinav Valada","doi":"10.1109/LRA.2026.3656795","DOIUrl":"https://doi.org/10.1109/LRA.2026.3656795","url":null,"abstract":"Hyperspectral imaging (HSI) captures spatial information along with dense spectral measurements across numerous narrow wavelength bands. This rich spectral content has the potential to facilitate robust robotic perception, particularly in environments with complex material compositions, varying illumination, or other visually challenging conditions. However, current HSI semantic segmentation methods underperform due to their reliance on architectures and learning frameworks optimized for RGB inputs. In this work, we propose a novel hyperspectral adapter that leverages pretrained vision foundation models to effectively learn from hyperspectral data. Our architecture incorporates a spectral transformer and a spectrum-aware spatial prior module to extract rich spatial-spectral features. Additionally, we introduce a modality-aware interaction block that facilitates effective integration of hyperspectral representations and frozen vision Transformer features through dedicated extraction and injection mechanisms. Extensive evaluations on three benchmark autonomous driving datasets demonstrate that our architecture achieves state-of-the-art semantic segmentation performance while directly using HSI inputs, outperforming both vision-based and hyperspectral segmentation methods.","PeriodicalId":13241,"journal":{"name":"IEEE Robotics and Automation Letters","volume":"11 3","pages":"3606-3613"},"PeriodicalIF":5.3,"publicationDate":"2026-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146175693","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-22 | DOI: 10.1109/LRA.2026.3656774
Jiayin Wang;Yanran Wei;Lei Jiang;Xiaoyu Guo;Ayong Zheng;Weidong Zhao;Zhongkui Li
Autonomous control of the laparoscope in robot-assisted Minimally Invasive Surgery (MIS) has received considerable research interest due to its potential to improve surgical safety. Despite progress in pixel-level Image-Based Visual Servoing (IBVS) control, the requirement of continuous visibility and the existence of complex disturbances, such as parameterization error, measurement noise, and uncertainties of payloads, could degrade the surgeon’s visual experience and compromise procedural safety. To address these limitations, this letter proposes VisionSafeEnhanced Visual Predictive Control (VPC), a robust and uncertainty-adaptive framework that guarantees Field of View (FoV) safety under uncertainty. First, Gaussian Process Regression (GPR) is utilized to perform hybrid quantification of operational uncertainties, including residual model uncertainties, stochastic uncertainties, and external disturbances. Based on this uncertainty quantification, a novel safety-aware trajectory optimization framework with probabilistic guarantees is proposed, where an uncertainty-adaptive safety Control Barrier Function (CBF) condition is derived from uncertainty propagation and chance constraints are simultaneously formulated via probabilistic approximation. This uncertainty-aware formulation enables adaptive control effort allocation, minimizing unnecessary camera motion while maintaining robustness. The proposed method is validated through comparative simulations and experiments on a commercial surgical robot platform (MicroPort MedBot Toumai) performing a sequential multi-target lymph node dissection. Compared with baseline methods, the framework maintains near-perfect target visibility (>99.9%), reduces tracking errors by over 77% under uncertainty, and lowers control effort by more than an order of magnitude.
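For reference, one standard way to combine a discrete-time CBF condition with Gaussian (e.g., GPR-predicted) uncertainty is sketched below; the letter's exact uncertainty-adaptive condition and chance-constraint formulation may differ, so treat this as a generic illustration rather than the proposed method.

```latex
% Generic chance-constrained discrete-time CBF condition (illustrative):
\Pr\!\big[\,h(x_{k+1}) \ge (1-\gamma)\,h(x_k)\,\big] \;\ge\; 1-\epsilon,
\qquad 0 < \gamma \le 1.
% For a Gaussian one-step prediction x_{k+1} ~ N(mu_{k+1}, Sigma_{k+1}) and a
% locally linearized barrier h, this can be tightened deterministically as
h(\mu_{k+1}) - (1-\gamma)\,h(x_k)
\;\ge\;
z_{1-\epsilon}\,\sqrt{\nabla h(\mu_{k+1})^{\top}\,\Sigma_{k+1}\,\nabla h(\mu_{k+1})},
% where z_{1-eps} is the standard normal quantile.
```

Larger predicted uncertainty shrinks the admissible set, which is the general mechanism behind an "uncertainty-adaptive" safety condition of this kind.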
{"title":"VisionSafeEnhanced VPC: Cautious Predictive Control With Visibility Constraints Under Uncertainty for Autonomous Robotic Surgery","authors":"Jiayin Wang;Yanran Wei;Lei Jiang;Xiaoyu Guo;Ayong Zheng;Weidong Zhao;Zhongkui Li","doi":"10.1109/LRA.2026.3656774","DOIUrl":"https://doi.org/10.1109/LRA.2026.3656774","url":null,"abstract":"Autonomous control of the laparoscope in robot-assisted Minimally Invasive Surgery (MIS) has received considerable research interest due to its potential to improve surgical safety. Despite progress in pixel-level Image-Based Visual Servoing (IBVS) control, the requirement of continuous visibility and the existence of complex disturbances, such as parameterization error, measurement noise, and uncertainties of payloads, could degrade the surgeon’s visual experience and compromise procedural safety. To address these limitations, this letter proposes VisionSafeEnhanced Visual Predictive Control (VPC), a robust and uncertainty-adaptive framework that guarantees Field of View (FoV) safety under uncertainty. Firstly, Gaussian Process Regression (GPR) is utilized to perform hybrid quantification of operational uncertainties including residual model uncertainties, stochastic uncertainties, and external disturbances. Based on uncertainty quantification, a novel safety-aware trajectory optimization framework with probabilistic guarantees is proposed, where an uncertainty-adaptive safety Control Barrier Function (CBF) condition is given based on uncertainty propagation, and chance constraints are simultaneously formulated based on probabilistic approximation. This uncertainty aware formulation enables adaptive control effort allocation, minimizing unnecessary camera motion while maintaining robustness. The proposed method is validated through comparative simulations and experiments on a commercial surgical robot platform (MicroPort MedBot Toumai) performing a sequential multi-target lymph node dissection. Compared with baseline methods, the framework maintains near-perfect target visibility (<inline-formula><tex-math>$> 99.9%$</tex-math></inline-formula>), reduces tracking errors by over 77% under uncertainty, and lowers control effort by more than an order of magnitude.","PeriodicalId":13241,"journal":{"name":"IEEE Robotics and Automation Letters","volume":"11 3","pages":"3590-3597"},"PeriodicalIF":5.3,"publicationDate":"2026-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146175751","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-21 | DOI: 10.1109/LRA.2026.3656780
Yongjae Lim;Seungwoo Jung;Dabin Kim;Dongjae Lee;H. Jin Kim
Fast replanning of the local trajectory is essential for autonomous robots to ensure safe navigation in crowded environments, as such environments require the robot to frequently update its trajectory due to unexpected and dynamic obstacles. In such settings, relying on a single trajectory optimization may not provide sufficient alternatives, making it harder to quickly switch to a safer trajectory and increasing the risk of collisions. While parallel trajectory optimization can address this limitation by considering multiple candidates, it depends heavily on well-defined initial guidance, which is difficult to obtain in complex environments. In this work, we propose a method for identifying the multimodality of the optimal trajectory distribution for safe navigation in crowded 3D environments without initial guidance. Our approach ensures safe trajectory generation by projecting sampled trajectories onto safe constraint sets and clustering them based on their potential to converge to the same locally optimal trajectory. This process naturally produces diverse trajectory options without requiring predefined initial guidance. Finally, for each trajectory cluster, we utilize the Model Predictive Path Integral framework to determine the optimal control input sequence, which corresponds to a local maximum of the multi-modal optimal trajectory distribution. We first validate our approach in simulations, achieving higher success rates than existing methods. Subsequent hardware experiments demonstrate that our fast local trajectory replanning strategy enables a drone to safely navigate crowded environments.
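As a reference for the last step, the standard Model Predictive Path Integral (MPPI) update that turns weighted rollouts into a control sequence looks like the sketch below. This is generic MPPI, not the authors' per-cluster implementation; the dynamics rollout and cost function are left abstract.

```python
# Standard information-theoretic MPPI update (illustrative sketch).
import numpy as np

def mppi_update(nominal_u, noise, costs, lam=1.0):
    """
    nominal_u : (T, m) current nominal control sequence
    noise     : (K, T, m) sampled perturbations; rollout k used nominal_u + noise[k]
    costs     : (K,) total rollout cost S_k of each perturbed sequence
    lam       : temperature; smaller values weight the best rollouts more heavily
    """
    costs = np.asarray(costs, dtype=float)
    weights = np.exp(-(costs - costs.min()) / lam)   # importance weights
    weights /= weights.sum()
    # Shift the nominal sequence by the cost-weighted average perturbation.
    return nominal_u + np.einsum("k,ktm->tm", weights, noise)
```

In a clustered setting such as the one described above, this update would be applied separately to the rollouts of each trajectory cluster, yielding one candidate control sequence per mode.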
{"title":"Safe Multimodal Replanning via Projection-Based Trajectory Clustering in Crowded Environments","authors":"Yongjae Lim;Seungwoo Jung;Dabin Kim;Dongjae Lee;H. Jin Kim","doi":"10.1109/LRA.2026.3656780","DOIUrl":"https://doi.org/10.1109/LRA.2026.3656780","url":null,"abstract":"Fast replanning of the local trajectory is essential for autonomous robots to ensure safe navigation in crowded environments, as such environments require the robot to frequently update its trajectory due to unexpected and dynamic obstacles. In such settings, relying on the single trajectory optimization may not provide sufficient alternatives, making it harder to quickly switch to a safer trajectory and increasing the risk of collisions. While parallel trajectory optimization can address this limitation by considering multiple candidates, it depends heavily on well-defined initial guidance, which is difficult to obtain in complex environments. In this work, we propose a method for identifying the multimodality of the optimal trajectory distribution for safe navigation in crowded 3D environments without initial guidance. Our approach ensures safe trajectory generation by projecting sampled trajectories onto safe constraint sets and clustering them based on their potential to converge to the same locally optimal trajectory. This process naturally produces diverse trajectory options without requiring predefined initial guidance. Finally, for each trajectory cluster, we utilize the Model Predictive Path Integral framework to determine the optimal control input sequence, which corresponds to the local maxima of a multi-modal optimal trajectory distribution. We first validate our approach in simulations, achieving higher success rates than existing methods. Subsequent hardware experiments demonstrate that our fast local trajectory replanning strategy enables a drone to safely navigate crowded environments.","PeriodicalId":13241,"journal":{"name":"IEEE Robotics and Automation Letters","volume":"11 3","pages":"3558-3565"},"PeriodicalIF":5.3,"publicationDate":"2026-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146175739","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-21 | DOI: 10.1109/LRA.2026.3656754
Yuchu Lu;Chenpeng Yao;Jiayuan Du;Chengju Liu;Qijun Chen
LiDAR odometry, fused with inertial measurement units (IMUs), is an essential task in robotic navigation. Unlike mainstream methods, which compensate for the motion distortion of LiDAR data using high-frequency inertial sensors, this letter handles the distortion with a continuous-time trajectory representation and achieves competitive performance against the state of the art. We propose a compact LiDAR odometry framework with an adaptive non-uniform B-spline trajectory representation, formulating odometry as a continuous-time estimation problem. We deploy point-to-plane registration and pseudo-velocity smoothing constraints to fully utilize the geometric and kinematic information of odometry. For faster convergence of the optimization, the analytical Jacobians of the constraints are derived to solve the non-linear least squares minimization. For a more efficient B-spline representation, an adaptive knot spacing technique is proposed to adjust the time interval between the spline's control poses. Extensive experiments on public and realistic datasets demonstrate the validity and efficiency of our system compared with other LiDAR and LiDAR-inertial methods.
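For context, a common way to represent a continuous-time pose trajectory with cumulative B-splines is shown below; with non-uniform knots the cumulative basis functions depend on the knot spacing, which is exactly what an adaptive knot-spacing scheme adjusts. This is the generic formulation, not necessarily the letter's exact parameterization (it may, for example, use separate rotation and translation splines).

```latex
% Cumulative B-spline pose interpolation (generic form, spline order k),
% with control poses T_i in SE(3) and relative increments in the Lie algebra:
\mathbf{T}(t) \;=\; \mathbf{T}_{i}\,\prod_{j=1}^{k-1}\exp\!\big(\tilde{B}_{j}(t)\,\boldsymbol{\Omega}_{i+j}\big),
\qquad
\boldsymbol{\Omega}_{m} \;=\; \log\!\big(\mathbf{T}_{m-1}^{-1}\,\mathbf{T}_{m}\big)\in\mathfrak{se}(3).
% The cumulative basis functions B~_j(t) are determined by the (possibly
% non-uniform) knot vector; widening or shrinking knot intervals trades
% smoothness against responsiveness of the estimated trajectory.
```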
{"title":"A&B-LO: Continuous-Time LiDAR Odometry With Adaptive Non-Uniform B-Spline Trajectory Representation","authors":"Yuchu Lu;Chenpeng Yao;Jiayuan Du;Chengju Liu;Qijun Chen","doi":"10.1109/LRA.2026.3656754","DOIUrl":"https://doi.org/10.1109/LRA.2026.3656754","url":null,"abstract":"LiDAR odometry, fused by inertial measurement units (IMU), is an essential task in robotics navigation. Unlike the mainstream methods compensate the motion distortion of LiDAR data by high frequency inertial sensors, this letter deals with the distortion with continuous-time trajectory representation, and achieved competitive performance against state-of-the-art. We propose a compact framework of LiDAR odometry with adaptive non-uniform B-spline trajectory representation to formulate it as continuous-time estimation problem. We deploy point-to-plane registration and pseudo-velocity smoothing constraints to fully utilize geometric and kinematic information of odometry. For faster convergence of optimization, analytical Jacobian of constraints is derived to solve the non-linear least squares minimization. For more efficient B-spline representation, an adaptive knot spacing technique is proposed to adjust the time interval of control poses of spline. Extensive experiments on public and realistic datasets demonstrate validation and efficiency of our system compared with other LiDAR or LiDAR-inertial methods.","PeriodicalId":13241,"journal":{"name":"IEEE Robotics and Automation Letters","volume":"11 3","pages":"3550-3557"},"PeriodicalIF":5.3,"publicationDate":"2026-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146175763","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-21 | DOI: 10.1109/LRA.2026.3656725
Xuan Xiao;Kefeng Zhang;Jiaqi Zhu;Jianming Wang;Runtian Zhu
Snake robots exhibit remarkable locomotion capabilities in complex environments as their degrees of freedom (DOFs) increase, but at the cost of increased energy consumption. To address this issue, this article proposes a cooperation strategy for snake robots based on a head-tail docking mechanism, which allows multiple short snake robots to combine into a longer one, enabling the execution of complex tasks. The mechanical design and implementation of the dockable snake robots are introduced, featuring passive docking mechanisms at both the head and tail, an embedded controller and a vision camera mounted on the head, and a distributed power supply system. Furthermore, control strategies for the combined robots have been developed to perform the crawler gait and the motion of spanning between parallel pipes. Experiments are conducted to demonstrate the feasibility and performance of the proposed docking mechanism and cooperative control methods. Specifically, two snake robots can autonomously dock under visual guidance. After docking, the combined robots can rapidly traverse flat surfaces by performing the crawler gait at an average speed of 0.168 m/s. Additionally, by separating, the robots can concurrently perform pipe-spanning and pipe inspection tasks.
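For intuition only: many snake-robot gaits, including crawler-style gaits, are generated by propagating a phase-shifted wave along the joint chain. The sketch below shows that generic pattern; the letter's actual crawler-gait formulation and parameters are not given in the abstract, so the amplitude, frequency, and phase values here are illustrative placeholders.

```python
# Generic traveling-wave joint command often used as the basis of snake gaits
# (illustrative sketch, not the authors' controller).
import math

def gait_joint_angle(i, t, A=0.4, omega=2.0, phase=0.6, offset=0.0):
    """Commanded angle (rad) for the i-th joint at time t (s)."""
    return A * math.sin(omega * t - i * phase) + offset

# Example: commands for a 10-joint module at t = 1.0 s.
angles = [gait_joint_angle(i, 1.0) for i in range(10)]
```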
{"title":"Design and Validation of Docking-Based Cooperative Strategies for Snake Robots in Complex Environments","authors":"Xuan Xiao;Kefeng Zhang;Jiaqi Zhu;Jianming Wang;Runtian Zhu","doi":"10.1109/LRA.2026.3656725","DOIUrl":"https://doi.org/10.1109/LRA.2026.3656725","url":null,"abstract":"Snake robots exhibit remarkable locomotion capabilities in complex environments as the degrees of freedom (DOFs) increase, but at the cost of energy consumption. To address this issue, this article proposes a cooperation strategy for snake robots based on a head-tail docking mechanism, which allows multiple short snake robots to combine into a longer one, enabling the execution of complex tasks. The mechanical design and the implementation of the dockable snake robots are introduced, featuring passive docking mechanisms at both the head and tail, an embedded controller and a vision camera mounted on the head, and a distributed power supply system. Furthermore, the control strategies for the combined robots have been developed to perform the crawler gait and the motion of spanning between parallel pipes. As a result, experiments are conducted to demonstrate the feasibility and performance of the proposed docking mechanism and cooperative control methods. Specifically, two snake robots can autonomously dock under visual guidance. After docking, the combined robots can rapidly traverse flat surfaces by performing the crawler gait at an average speed of 0.168 m/s. Additionally, the robots can perform spanning between parallel pipes and pipe inspection tasks concurrently by separating.","PeriodicalId":13241,"journal":{"name":"IEEE Robotics and Automation Letters","volume":"11 3","pages":"3190-3197"},"PeriodicalIF":5.3,"publicationDate":"2026-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146082183","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-21 | DOI: 10.1109/LRA.2026.3656724
Yuquan Hu;Alessandro Gardi
Conventional Visual-Inertial Navigation Systems (VINS) are developed under the assumption of static environments, leading to significant performance degradation in dynamic scenarios. In recent years, many dynamic-feature-aware VINS implementations have been proposed, but most of them rely on prior semantic information and lack generalizability. To address these limitations, we propose a robust monocular method called VINS-Mah, which is capable of identifying both dynamic and unreliable features without prior semantic information. First, the covariances related to the feature reprojection errors are computed via the proposed uncertainty estimator. Subsequently, a dynamic feature filter module combines the feature reprojection errors and the computed covariances to determine the Mahalanobis distance, and then applies a Chi-square test to filter out dynamic features. The proposed method is verified against several publicly available datasets, covering both simulated and real-world scenes. Experimental results demonstrate that VINS-Mah outperforms other state-of-the-art methods in dynamic scenarios, while not degrading in static environments.
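The dynamic-feature gate described above can be illustrated in a few lines of Python: compute the Mahalanobis distance of a feature's reprojection error under its estimated covariance and compare it against a Chi-square threshold (2 degrees of freedom for a 2D pixel error). This is a sketch of the test, not the authors' code, and the confidence level is an assumed value.

```python
# Mahalanobis distance + Chi-square gate for dynamic/unreliable features.
import numpy as np
from scipy.stats import chi2

def is_dynamic(reproj_error, cov, confidence=0.95):
    """
    reproj_error : (2,) pixel reprojection error of one feature
    cov          : (2, 2) covariance of that error (from the uncertainty estimator)
    Returns True if the feature fails the Chi-square test, i.e. is treated as
    dynamic or unreliable and excluded from the state estimation.
    """
    d2 = reproj_error @ np.linalg.solve(cov, reproj_error)  # squared Mahalanobis distance
    return d2 > chi2.ppf(confidence, df=2)
```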
{"title":"VINS-Mah: A Robust Monocular Visual-Inertial State Estimator for Dynamic Environments","authors":"Yuquan Hu;Alessandro Gardi","doi":"10.1109/LRA.2026.3656724","DOIUrl":"https://doi.org/10.1109/LRA.2026.3656724","url":null,"abstract":"Conventional Visual-Inertial Navigation Systems (VINS) are developed under the assumption of static environments, leading to significant performance degradation in dynamic scenarios. In recent years, many dynamic-feature-aware VINS implementations have been proposed, but most of them rely on prior semantic information and lack generalizability. To address these limitations, we propose a robust monocular method called VINS-Mah, which is capable of identifying both dynamic and unreliable features without prior semantic information. First, the covariances related to the feature reprojection errors are computed via the proposed uncertainty estimator. Subsequently, a dynamic feature filter module combines the feature reprojection errors and the computed covariances to determine the Mahalanobis distance, and then applies a Chi-square test to filter out dynamic features. The proposed method is verified against several publicly available datasets, covering both simulated and real-world scenes. Experimental results demonstrate that VINS-Mah outperforms other state-of-the-art methods in dynamic scenarios, while not degrading in static environments.","PeriodicalId":13241,"journal":{"name":"IEEE Robotics and Automation Letters","volume":"11 3","pages":"3614-3621"},"PeriodicalIF":5.3,"publicationDate":"2026-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11359677","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146175687","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-19 | DOI: 10.1109/LRA.2026.3655202
Bu Jin;Songen Gu;Xiaotao Hu;Yupeng Zheng;Xiaoyang Guo;Qian Zhang;Xiaoxiao Long;Wei Yin
In this paper, we propose OccTENS, a generative occupancy world model that enables controllable, high-fidelity long-term occupancy generation while maintaining computational efficiency. Unlike visual generation, an occupancy world model must capture the fine-grained 3D geometry and dynamic evolution of 3D scenes, posing great challenges for generative models. Recent approaches based on autoregression (AR) have demonstrated the potential to predict vehicle movement and future occupancy scenes simultaneously from historical observations, but they typically suffer from inefficiency, temporal degradation in long-term generation, and lack of controllability. To holistically address these issues, we reformulate the occupancy world model as a temporal next-scale prediction (TENS) task, which decomposes the temporal sequence modeling problem into spatial scale-by-scale generation and temporal scene-by-scene prediction. With a TensFormer, OccTENS can effectively manage the temporal causality and spatial relationships of occupancy sequences in a flexible and scalable way. To enhance pose controllability, we further propose a holistic pose aggregation strategy, which features unified sequence modeling for occupancy and ego-motion. Experiments show that OccTENS outperforms the state-of-the-art method with both higher occupancy quality and faster inference time.
{"title":"OccTENS: 3D Occupancy World Model via Temporal Next-Scale Prediction","authors":"Bu Jin;Songen Gu;Xiaotao Hu;Yupeng Zheng;Xiaoyang Guo;Qian Zhang;Xiaoxiao Long;Wei Yin","doi":"10.1109/LRA.2026.3655202","DOIUrl":"https://doi.org/10.1109/LRA.2026.3655202","url":null,"abstract":"In this paper, we propose OccTENS, a generative occupancy world model that enables controllable, high-fidelity long-term occupancy generation while maintaining computational efficiency. Different from visual generation, the occupancy world model must capture the fine-grained 3D geometry and dynamic evolution of the 3D scenes, posing great challenges for the generative models. Recent approaches based on autoregression (AR) have demonstrated the potential to predict vehicle movement and future occupancy scenes simultaneously from historical observations, but they typically suffer from <bold>inefficiency</b>, <bold>temporal degradation</b> in long-term generation and <bold>lack of controllability</b>. To holistically address these issues, we reformulate the occupancy world model as a temporal next-scale prediction (TENS) task, which decomposes the temporal sequence modeling problem into the modeling of spatial scale-by-scale generation and temporal scene-by-scene prediction. With a <bold>TensFormer</b>, OccTENS can effectively manage the temporal causality and spatial relationships of occupancy sequences in a flexible and scalable way. To enhance the pose controllability, we further propose a holistic pose aggregation strategy, which features a unified sequence modeling for occupancy and ego-motion. Experiments show that OccTENS outperforms the state-of-the-art method with both higher occupancy quality and faster inference time.","PeriodicalId":13241,"journal":{"name":"IEEE Robotics and Automation Letters","volume":"11 3","pages":"3566-3573"},"PeriodicalIF":5.3,"publicationDate":"2026-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146175728","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-19 | DOI: 10.1109/LRA.2026.3655281
Haoyu Wang;Zhiqiang Miao;Weiwei Zhan;Xiangke Wang;Wei He;Yaonan Wang
To address the limited maneuverability and low energy efficiency of autonomous aerial vehicles (AAVs) in confined spaces, we design and implement the Hybrid Sprawl-Tuned Vehicle (HSTV), a deformable multi-modal robotic platform specifically engineered for operation in complex and spatially constrained environments. Based on the “FSTAR” platform, the HSTV is equipped with passive front wheels and actively driven rear wheels. A gear transmission mechanism drives the rear wheels without the need for dedicated motors, simplifying the system architecture. For both flying and driving modes, detailed kinematics and dynamics models, integrated with a mode-switching strategy, are constructed using the Newton-Euler method. Based on the developed models, a constrained nonlinear model predictive controller is designed to achieve accurate motion performance in both flying and driving modes. Comprehensive experimental results and comparative analysis demonstrate that the HSTV achieves high trajectory tracking accuracy across both flying and driving modes, while reducing energy consumption by up to 70.9% without significantly increasing structural complexity (maintained at 98.6% simplicity).
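For reference, a constrained NMPC of the kind mentioned above typically solves, at every control step, an optimal control problem of the generic form below, with the dynamics f given by the Newton-Euler model of the currently active mode (flying or driving). The specific costs and constraints of the HSTV controller are not detailed in the abstract, so this is only a generic sketch.

```latex
% Generic receding-horizon NMPC problem (illustrative form):
\min_{u_{0},\dots,u_{N-1}}\;
\sum_{k=0}^{N-1}\Big(\|x_{k}-x_{k}^{\mathrm{ref}}\|_{Q}^{2}+\|u_{k}\|_{R}^{2}\Big)
+\|x_{N}-x_{N}^{\mathrm{ref}}\|_{P}^{2}
\quad\text{s.t.}\quad
x_{k+1}=f(x_{k},u_{k}),\;\;
x_{0}=x(t),\;\;
u_{k}\in\mathcal{U},\;\;
x_{k}\in\mathcal{X}.
% Only the first input u_0 is applied to the vehicle; the problem is re-solved
% at the next sampling instant with the updated state x(t).
```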
{"title":"Design and NMPC-Based Control of a Hybrid Sprawl-Tuned Vehicle With Flying and Driving Modes","authors":"Haoyu Wang;Zhiqiang Miao;Weiwei Zhan;Xiangke Wang;Wei He;Yaonan Wang","doi":"10.1109/LRA.2026.3655281","DOIUrl":"https://doi.org/10.1109/LRA.2026.3655281","url":null,"abstract":"To address the limited maneuverability and low energy efficiency of autonomous aerial vehicles (AAVs) in confined spaces, we design and implement the Hybrid Sprawl-Tuned Vehicle (HSTV) - a deformable multi-modal robotic platform specifically engineered for operation in complex and spatially constrained environments. Based on the “FSTAR” platform, HSTV is equipped with passive front wheels and actively driven rear wheels. The gear transmission mechanism enables the actively driven wheels to be driven without the need for dedicated motors, simplifying the architecture of the system. For both flying and driving modes, detailed kinematics and dynamics models integrated with a mode switching strategy are constructed by using the Newton-Euler method. Based on the developed models, the constrained nonlinear model predictive controller is designed to achieve the accurate motion performance in flying and driving mode. Comprehensive experimental results and comparative analysis demonstrate that HSTV achieves significant trajectory tracking accuracy across both flying and driving modes, while saving energy by up to 70.9% with no significantly increasing structural complexity (maintained at 98.6% simplicity).","PeriodicalId":13241,"journal":{"name":"IEEE Robotics and Automation Letters","volume":"11 3","pages":"3222-3229"},"PeriodicalIF":5.3,"publicationDate":"2026-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146082066","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-19 | DOI: 10.1109/LRA.2026.3655311
Cong Li;Qin Rao;Zheng Tian;Jun Yang
The rubber-tired container gantry crane (RTG) is a type of heavy-duty lifting equipment commonly used in container yards; it is driven by rubber tires on both sides and steered via differential drive. While moving along the desired path, the RTG must remain centered in the lane with a restricted heading angle, as deviations may compromise the safety of subsequent yard operations. Due to its underactuated nature and the presence of external disturbances, achieving accurate lane-keeping poses a significant control challenge. To address this issue, a robust safety-critical steering control strategy integrating a disturbance-rejection vector field (VF) with a new state-interlocked control barrier function (SICBF) is proposed. The strategy employs a VF path-following method as the nominal controller. By strategically shrinking the safe set, the SICBF overcomes the limitations of traditional CBFs, such as state coupling in the inequality verification and infeasibility when the control coefficient tends to zero. Furthermore, by incorporating a disturbance observer (DOB) into the quadratic programming (QP) framework, the robustness and safety of the control system are significantly enhanced. Comprehensive simulations and experiments are conducted on a practical RTG with a 40-ton load capacity. To the best of our knowledge, the proposed method is one of very few that have been successfully applied to practical RTG systems.
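As background, a disturbance-aware CBF quadratic program of the general kind described above filters the nominal VF command through a safety constraint; the sketch below shows the standard relative-degree-one form with the disturbance-observer estimate entering the constraint. The SICBF's state-interlocked condition and safe-set shrinking are specific to the letter and are not reproduced here.

```latex
% Generic disturbance-aware CBF-QP safety filter (illustrative form):
u^{\star} \;=\; \arg\min_{u}\;\tfrac{1}{2}\,\|u-u_{\mathrm{nom}}(x)\|^{2}
\quad\text{s.t.}\quad
L_{f}h(x)+L_{g}h(x)\,u+\frac{\partial h}{\partial x}\,\hat{d}
\;\ge\; -\alpha\big(h(x)\big),
% where h defines the safe set (lane-centering and heading limits), u_nom is the
% VF path-following command, d-hat is the disturbance-observer estimate, and
% alpha is an extended class-K function.
```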
{"title":"Safety-Critical Steering Control for Rubber-Tired Container Gantry Cranes: A State-Interlocked CBF Approach","authors":"Cong Li;Qin Rao;Zheng Tian;Jun Yang","doi":"10.1109/LRA.2026.3655311","DOIUrl":"https://doi.org/10.1109/LRA.2026.3655311","url":null,"abstract":"The rubber-tired container gantry crane (RTG) is a type of heavy-duty lifting equipment commonly used in container yards, which is driven by two-side rubber tires and steered via differential drive. While moving along the desired path, the RTG must remain centered of the lane with restricted heading angle, as deviations may compromise the safety of subsequent yard operations. Due to its underactuated nature and the presence of external disturbances, achieving accurate lane-keeping poses a significant control challenge. To address this issue, a robust safety-critical steering control strategy integrating disturbance rejection vector field (VF) with a new state-interlocked control barrier function (SICBF) is proposed. The strategy initially employs a VF path-following method as the nominal controller. By strategically shrinking the safe set, the SICBF overcomes the limitations of traditional CBFs, such as state coupling in the inequality verification and infeasibility when the control coefficient tends to zero. Furthermore, by incorporating a disturbance observer (DOB) into the quadratic programming (QP) framework, the robustness and safety of the control system are significantly enhanced. Comprehensive simulation and experiment are conducted on a practical RTG with a 40-ton load capacity. To our best knowledge, the proposed method is one of the very few methods that have demonstrated successful application to the practical RTG systems.","PeriodicalId":13241,"journal":{"name":"IEEE Robotics and Automation Letters","volume":"11 3","pages":"3238-3245"},"PeriodicalIF":5.3,"publicationDate":"2026-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146082249","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}