LMMCoDrive: Cooperative Driving with Large Multimodal Model
Haichao Liu, Ruoyu Yao, Zhenmin Huang, Shaojie Shen, Jun Ma
To address the intricate challenges of decentralized cooperative scheduling and motion planning in Autonomous Mobility-on-Demand (AMoD) systems, this paper introduces LMMCoDrive, a novel cooperative driving framework that leverages a Large Multimodal Model (LMM) to enhance traffic efficiency in dynamic urban environments. The framework seamlessly integrates scheduling and motion planning to ensure the effective operation of Cooperative Autonomous Vehicles (CAVs). The spatial relationship between CAVs and passenger requests is abstracted into a Bird's-Eye View (BEV) to fully exploit the potential of the LMM. In addition, trajectories are carefully refined for each CAV while collision avoidance is ensured through safety constraints. A decentralized optimization strategy, facilitated by the Alternating Direction Method of Multipliers (ADMM) within the LMM framework, is proposed to drive the graph evolution of CAVs. Simulation results demonstrate the pivotal role and significant impact of the LMM in optimizing CAV scheduling and enhancing the decentralized cooperative optimization process for each vehicle. This marks a substantial stride towards practical, efficient, and safe AMoD systems that are poised to revolutionize urban transportation. The code is available at https://github.com/henryhcliu/LMMCoDrive.
{"title":"LMMCoDrive: Cooperative Driving with Large Multimodal Model","authors":"Haichao Liu, Ruoyu Yao, Zhenmin Huang, Shaojie Shen, Jun Ma","doi":"arxiv-2409.11981","DOIUrl":"https://doi.org/arxiv-2409.11981","url":null,"abstract":"To address the intricate challenges of decentralized cooperative scheduling\u0000and motion planning in Autonomous Mobility-on-Demand (AMoD) systems, this paper\u0000introduces LMMCoDrive, a novel cooperative driving framework that leverages a\u0000Large Multimodal Model (LMM) to enhance traffic efficiency in dynamic urban\u0000environments. This framework seamlessly integrates scheduling and motion\u0000planning processes to ensure the effective operation of Cooperative Autonomous\u0000Vehicles (CAVs). The spatial relationship between CAVs and passenger requests\u0000is abstracted into a Bird's-Eye View (BEV) to fully exploit the potential of\u0000the LMM. Besides, trajectories are cautiously refined for each CAV while\u0000ensuring collision avoidance through safety constraints. A decentralized\u0000optimization strategy, facilitated by the Alternating Direction Method of\u0000Multipliers (ADMM) within the LMM framework, is proposed to drive the graph\u0000evolution of CAVs. Simulation results demonstrate the pivotal role and\u0000significant impact of LMM in optimizing CAV scheduling and enhancing\u0000decentralized cooperative optimization process for each vehicle. This marks a\u0000substantial stride towards achieving practical, efficient, and safe AMoD\u0000systems that are poised to revolutionize urban transportation. The code is\u0000available at https://github.com/henryhcliu/LMMCoDrive.","PeriodicalId":501031,"journal":{"name":"arXiv - CS - Robotics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267036","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SLAM assisted 3D tracking system for laparoscopic surgery
Jingwei Song, Ray Zhang, Wenwei Zhang, Hao Zhou, Maani Ghaffari
A major limitation of minimally invasive surgery is the difficulty of accurately locating the internal anatomical structures of the target organ due to the lack of tactile feedback and transparency. Augmented reality (AR) offers a promising solution to overcome this challenge. Numerous studies have shown that combining learning-based and geometric methods can achieve accurate preoperative and intraoperative data registration. This work proposes a real-time monocular 3D tracking algorithm for post-registration tasks. The ORB-SLAM2 framework is adopted and modified for prior-based 3D tracking. The primitive 3D shape is used for fast initialization of the monocular SLAM. A pseudo-segmentation strategy is employed to separate the target organ from the background for tracking purposes, and the geometric prior of the 3D shape is incorporated as an additional constraint in the pose graph. In-vivo and ex-vivo experiments demonstrate that the proposed 3D tracking system provides robust 3D tracking and effectively handles typical challenges such as fast motion, out-of-field-of-view scenarios, partial visibility, and "organ-background" relative motion.
{"title":"SLAM assisted 3D tracking system for laparoscopic surgery","authors":"Jingwei Song, Ray Zhang, Wenwei Zhang, Hao Zhou, Maani Ghaffari","doi":"arxiv-2409.11688","DOIUrl":"https://doi.org/arxiv-2409.11688","url":null,"abstract":"A major limitation of minimally invasive surgery is the difficulty in\u0000accurately locating the internal anatomical structures of the target organ due\u0000to the lack of tactile feedback and transparency. Augmented reality (AR) offers\u0000a promising solution to overcome this challenge. Numerous studies have shown\u0000that combining learning-based and geometric methods can achieve accurate\u0000preoperative and intraoperative data registration. This work proposes a\u0000real-time monocular 3D tracking algorithm for post-registration tasks. The\u0000ORB-SLAM2 framework is adopted and modified for prior-based 3D tracking. The\u0000primitive 3D shape is used for fast initialization of the monocular SLAM. A\u0000pseudo-segmentation strategy is employed to separate the target organ from the\u0000background for tracking purposes, and the geometric prior of the 3D shape is\u0000incorporated as an additional constraint in the pose graph. Experiments from\u0000in-vivo and ex-vivo tests demonstrate that the proposed 3D tracking system\u0000provides robust 3D tracking and effectively handles typical challenges such as\u0000fast motion, out-of-field-of-view scenarios, partial visibility, and\u0000\"organ-background\" relative motion.","PeriodicalId":501031,"journal":{"name":"arXiv - CS - Robotics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266860","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Haptic-ACT: Bridging Human Intuition with Compliant Robotic Manipulation via Immersive VR
Kelin Li, Shubham M Wagh, Nitish Sharma, Saksham Bhadani, Wei Chen, Chang Liu, Petar Kormushev
Robotic manipulation is essential for the widespread adoption of robots in industrial and home settings and has long been a focus within the robotics community. Advances in artificial intelligence have introduced promising learning-based methods to address this challenge, with imitation learning emerging as particularly effective. However, efficiently acquiring high-quality demonstrations remains a challenge. In this work, we introduce an immersive VR-based teleoperation setup designed to collect demonstrations from a remote human user. We also propose an imitation learning framework called Haptic Action Chunking with Transformers (Haptic-ACT). To evaluate the platform, we conducted a pick-and-place task and collected 50 demonstration episodes. Results indicate that the immersive VR platform significantly reduces demonstrator fingertip forces compared to systems without haptic feedback, enabling more delicate manipulation. Additionally, evaluations of the Haptic-ACT framework in both the MuJoCo simulator and on a real robot demonstrate its effectiveness in teaching robots more compliant manipulation compared to the original ACT. Additional materials are available at https://sites.google.com/view/hapticact.
{"title":"Haptic-ACT: Bridging Human Intuition with Compliant Robotic Manipulation via Immersive VR","authors":"Kelin Li, Shubham M Wagh, Nitish Sharma, Saksham Bhadani, Wei Chen, Chang Liu, Petar Kormushev","doi":"arxiv-2409.11925","DOIUrl":"https://doi.org/arxiv-2409.11925","url":null,"abstract":"Robotic manipulation is essential for the widespread adoption of robots in\u0000industrial and home settings and has long been a focus within the robotics\u0000community. Advances in artificial intelligence have introduced promising\u0000learning-based methods to address this challenge, with imitation learning\u0000emerging as particularly effective. However, efficiently acquiring high-quality\u0000demonstrations remains a challenge. In this work, we introduce an immersive\u0000VR-based teleoperation setup designed to collect demonstrations from a remote\u0000human user. We also propose an imitation learning framework called Haptic\u0000Action Chunking with Transformers (Haptic-ACT). To evaluate the platform, we\u0000conducted a pick-and-place task and collected 50 demonstration episodes.\u0000Results indicate that the immersive VR platform significantly reduces\u0000demonstrator fingertip forces compared to systems without haptic feedback,\u0000enabling more delicate manipulation. Additionally, evaluations of the\u0000Haptic-ACT framework in both the MuJoCo simulator and on a real robot\u0000demonstrate its effectiveness in teaching robots more compliant manipulation\u0000compared to the original ACT. Additional materials are available at\u0000https://sites.google.com/view/hapticact.","PeriodicalId":501031,"journal":{"name":"arXiv - CS - Robotics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266825","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fusion in Context: A Multimodal Approach to Affective State Recognition
Youssef Mohamed, Severin Lemaignan, Arzu Guneysu, Patric Jensfelt, Christian Smith
Accurate recognition of human emotions is a crucial challenge in affective computing and human-robot interaction (HRI). Emotional states play a vital role in shaping behaviors, decisions, and social interactions. However, emotional expressions can be influenced by contextual factors, leading to misinterpretations if context is not considered. Multimodal fusion, combining modalities like facial expressions, speech, and physiological signals, has shown promise in improving affect recognition. This paper proposes a transformer-based multimodal fusion approach that leverages facial thermal data, facial action units, and textual context information for context-aware emotion recognition. We explore modality-specific encoders to learn tailored representations, which are then fused using additive fusion and processed by a shared transformer encoder to capture temporal dependencies and interactions. The proposed method is evaluated on a dataset collected from participants engaged in a tangible tabletop Pacman game designed to induce various affective states. Our results demonstrate the effectiveness of incorporating contextual information and multimodal fusion for affective state recognition.
{"title":"Fusion in Context: A Multimodal Approach to Affective State Recognition","authors":"Youssef Mohamed, Severin Lemaignan, Arzu Guneysu, Patric Jensfelt, Christian Smith","doi":"arxiv-2409.11906","DOIUrl":"https://doi.org/arxiv-2409.11906","url":null,"abstract":"Accurate recognition of human emotions is a crucial challenge in affective\u0000computing and human-robot interaction (HRI). Emotional states play a vital role\u0000in shaping behaviors, decisions, and social interactions. However, emotional\u0000expressions can be influenced by contextual factors, leading to\u0000misinterpretations if context is not considered. Multimodal fusion, combining\u0000modalities like facial expressions, speech, and physiological signals, has\u0000shown promise in improving affect recognition. This paper proposes a\u0000transformer-based multimodal fusion approach that leverages facial thermal\u0000data, facial action units, and textual context information for context-aware\u0000emotion recognition. We explore modality-specific encoders to learn tailored\u0000representations, which are then fused using additive fusion and processed by a\u0000shared transformer encoder to capture temporal dependencies and interactions.\u0000The proposed method is evaluated on a dataset collected from participants\u0000engaged in a tangible tabletop Pacman game designed to induce various affective\u0000states. Our results demonstrate the effectiveness of incorporating contextual\u0000information and multimodal fusion for affective state recognition.","PeriodicalId":501031,"journal":{"name":"arXiv - CS - Robotics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142269794","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Uncertainty-Aware Visual-Inertial SLAM with Volumetric Occupancy Mapping
Jaehyung Jung, Simon Boche, Sebastian Barbas Laina, Stefan Leutenegger
We propose a visual-inertial simultaneous localization and mapping system that tightly couples sparse reprojection errors, inertial measurement unit pre-integrals, and relative pose factors with dense volumetric occupancy mapping, whereby depth predictions from a deep neural network are fused in a fully probabilistic manner. Specifically, our method is rigorously uncertainty-aware: first, we use depth and uncertainty predictions from a deep network not only on the robot's stereo rig, but also probabilistically fuse motion stereo, which provides depth information across a range of baselines and therefore drastically increases mapping accuracy. Next, predicted and fused depth uncertainty propagates not only into occupancy probabilities but also into alignment factors between generated dense submaps that enter the probabilistic nonlinear least-squares estimator. This submap representation offers globally consistent geometry at scale. Our method is thoroughly evaluated on two benchmark datasets, yielding localization and mapping accuracy that exceeds the state of the art while simultaneously offering volumetric occupancy directly usable for downstream robotic planning and control in real time.
{"title":"Uncertainty-Aware Visual-Inertial SLAM with Volumetric Occupancy Mapping","authors":"Jaehyung Jung, Simon Boche, Sebastian Barbas Laina, Stefan Leutenegger","doi":"arxiv-2409.12051","DOIUrl":"https://doi.org/arxiv-2409.12051","url":null,"abstract":"We propose visual-inertial simultaneous localization and mapping that tightly\u0000couples sparse reprojection errors, inertial measurement unit pre-integrals,\u0000and relative pose factors with dense volumetric occupancy mapping. Hereby depth\u0000predictions from a deep neural network are fused in a fully probabilistic\u0000manner. Specifically, our method is rigorously uncertainty-aware: first, we use\u0000depth and uncertainty predictions from a deep network not only from the robot's\u0000stereo rig, but we further probabilistically fuse motion stereo that provides\u0000depth information across a range of baselines, therefore drastically increasing\u0000mapping accuracy. Next, predicted and fused depth uncertainty propagates not\u0000only into occupancy probabilities but also into alignment factors between\u0000generated dense submaps that enter the probabilistic nonlinear least squares\u0000estimator. This submap representation offers globally consistent geometry at\u0000scale. Our method is thoroughly evaluated in two benchmark datasets, resulting\u0000in localization and mapping accuracy that exceeds the state of the art, while\u0000simultaneously offering volumetric occupancy directly usable for downstream\u0000robotic planning and control in real-time.","PeriodicalId":501031,"journal":{"name":"arXiv - CS - Robotics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267032","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A machine learning framework for acoustic reflector mapping
Usama Saqib, Letizia Marchegiani, Jesper Rindom Jensen
Sonar-based indoor mapping systems have been widely employed in robotics for several decades. While such systems remain the mainstream in underwater and pipe-inspection settings, their vulnerability to noise has, over time, reduced their widespread usage in favour of other modalities (e.g., cameras, lidars), whose technologies have instead seen extraordinary advancements. Nevertheless, mapping physical environments using acoustic signals and echolocation can bring significant benefits to robot navigation in adverse scenarios, thanks to their complementary characteristics compared to other sensors. Cameras and lidars, indeed, struggle in harsh weather conditions, under poor illumination, or with non-reflective walls. Yet, for acoustic sensors to generate accurate maps, noise has to be handled properly and effectively, and traditional signal processing techniques are not always a solution in those cases. In this paper, we propose a framework in which machine learning is exploited to aid more traditional signal processing methods in coping with background noise, by removing outliers and artefacts from the maps generated with acoustic sensors. Our goal is to demonstrate that the performance of traditional echolocation mapping techniques can be greatly enhanced, even in particularly noisy conditions, facilitating the adoption of acoustic sensors in state-of-the-art multi-modal robot navigation systems. Our simulated evaluation demonstrates that the system can reliably operate at an SNR of -10 dB. Moreover, we show that the proposed method is capable of operating in different reverberant environments. Finally, we use the proposed method to map the outline of a simulated room using a robotic platform.
{"title":"A machine learning framework for acoustic reflector mapping","authors":"Usama Saqib, Letizia Marchegiani, Jesper Rindom Jensen","doi":"arxiv-2409.12094","DOIUrl":"https://doi.org/arxiv-2409.12094","url":null,"abstract":"Sonar-based indoor mapping systems have been widely employed in robotics for\u0000several decades. While such systems are still the mainstream in underwater and\u0000pipe inspection settings, the vulnerability to noise reduced, over time, their\u0000general widespread usage in favour of other modalities(textit{e.g.}, cameras,\u0000lidars), whose technologies were encountering, instead, extraordinary\u0000advancements. Nevertheless, mapping physical environments using acoustic\u0000signals and echolocation can bring significant benefits to robot navigation in\u0000adverse scenarios, thanks to their complementary characteristics compared to\u0000other sensors. Cameras and lidars, indeed, struggle in harsh weather\u0000conditions, when dealing with lack of illumination, or with non-reflective\u0000walls. Yet, for acoustic sensors to be able to generate accurate maps, noise\u0000has to be properly and effectively handled. Traditional signal processing\u0000techniques are not always a solution in those cases. In this paper, we propose\u0000a framework where machine learning is exploited to aid more traditional signal\u0000processing methods to cope with background noise, by removing outliers and\u0000artefacts from the generated maps using acoustic sensors. Our goal is to\u0000demonstrate that the performance of traditional echolocation mapping techniques\u0000can be greatly enhanced, even in particularly noisy conditions, facilitating\u0000the employment of acoustic sensors in state-of-the-art multi-modal robot\u0000navigation systems. Our simulated evaluation demonstrates that the system can\u0000reliably operate at an SNR of $-10$dB. Moreover, we also show that the proposed\u0000method is capable of operating in different reverberate environments. In this\u0000paper, we also use the proposed method to map the outline of a simulated room\u0000using a robotic platform.","PeriodicalId":501031,"journal":{"name":"arXiv - CS - Robotics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267030","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Discovering Conceptual Knowledge with Analytic Ontology Templates for Articulated Objects
Jianhua Sun, Yuxuan Li, Longfei Xu, Jiude Wei, Liang Chai, Cewu Lu
Human cognition can leverage fundamental conceptual knowledge, such as geometric and kinematic knowledge, to appropriately perceive, comprehend, and interact with novel objects. Motivated by this finding, we aim to endow machine intelligence with an analogous capability by operating at the conceptual level, in order to understand and then interact with articulated objects, especially those in novel categories, which is challenging due to their intricate geometric structures and diverse joint types. To achieve this goal, we propose the Analytic Ontology Template (AOT), a parameterized and differentiable program description of generalized conceptual ontologies. A baseline approach called AOTNet, driven by AOTs, is designed accordingly to equip intelligent agents with these generalized concepts and then empower them to effectively discover conceptual knowledge about the structure and affordances of articulated objects. The AOT-driven approach yields benefits in three key respects: i) enabling concept-level understanding of articulated objects without relying on any real training data, ii) providing analytic structure information, and iii) introducing rich affordance information indicating proper ways of interaction. We conduct extensive experiments, and the results demonstrate the superiority of our approach in understanding and then interacting with articulated objects.
{"title":"Discovering Conceptual Knowledge with Analytic Ontology Templates for Articulated Objects","authors":"Jianhua Sun, Yuxuan Li, Longfei Xu, Jiude Wei, Liang Chai, Cewu Lu","doi":"arxiv-2409.11702","DOIUrl":"https://doi.org/arxiv-2409.11702","url":null,"abstract":"Human cognition can leverage fundamental conceptual knowledge, like geometric\u0000and kinematic ones, to appropriately perceive, comprehend and interact with\u0000novel objects. Motivated by this finding, we aim to endow machine intelligence\u0000with an analogous capability through performing at the conceptual level, in\u0000order to understand and then interact with articulated objects, especially for\u0000those in novel categories, which is challenging due to the intricate geometric\u0000structures and diverse joint types of articulated objects. To achieve this\u0000goal, we propose Analytic Ontology Template (AOT), a parameterized and\u0000differentiable program description of generalized conceptual ontologies. A\u0000baseline approach called AOTNet driven by AOTs is designed accordingly to equip\u0000intelligent agents with these generalized concepts, and then empower the agents\u0000to effectively discover the conceptual knowledge on the structure and\u0000affordance of articulated objects. The AOT-driven approach yields benefits in\u0000three key perspectives: i) enabling concept-level understanding of articulated\u0000objects without relying on any real training data, ii) providing analytic\u0000structure information, and iii) introducing rich affordance information\u0000indicating proper ways of interaction. We conduct exhaustive experiments and\u0000the results demonstrate the superiority of our approach in understanding and\u0000then interacting with articulated objects.","PeriodicalId":501031,"journal":{"name":"arXiv - CS - Robotics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Online Refractive Camera Model Calibration in Visual Inertial Odometry
Mohit Singh, Kostas Alexis
This paper presents a general refractive camera model and the online co-estimation of odometry and the refractive index of an unknown medium. This enables operation in diverse and varying refractive fluids, given only the camera calibration in air. The refractive index is estimated online as a state variable of a monocular visual-inertial odometry framework in an iterative formulation using the proposed camera model. The method was verified on data collected with an underwater robot traversing inside a pool. The evaluations demonstrate convergence to the ideal refractive index for water despite significant perturbations in the initialization. Simultaneously, the approach achieves on-par visual-inertial odometry performance in refractive media without prior knowledge of the refractive index or the need for medium-specific camera calibration.
{"title":"Online Refractive Camera Model Calibration in Visual Inertial Odometry","authors":"Mohit Singh, Kostas Alexis","doi":"arxiv-2409.12074","DOIUrl":"https://doi.org/arxiv-2409.12074","url":null,"abstract":"This paper presents a general refractive camera model and online\u0000co-estimation of odometry and the refractive index of unknown media. This\u0000enables operation in diverse and varying refractive fluids, given only the\u0000camera calibration in air. The refractive index is estimated online as a state\u0000variable of a monocular visual-inertial odometry framework in an iterative\u0000formulation using the proposed camera model. The method was verified on data\u0000collected using an underwater robot traversing inside a pool. The evaluations\u0000demonstrate convergence to the ideal refractive index for water despite\u0000significant perturbations in the initialization. Simultaneously, the approach\u0000enables on-par visual-inertial odometry performance in refractive media without\u0000prior knowledge of the refractive index or requirement of medium-specific\u0000camera calibration.","PeriodicalId":501031,"journal":{"name":"arXiv - CS - Robotics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267087","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Learning Task Planning from Multi-Modal Demonstration for Multi-Stage Contact-Rich Manipulation
Kejia Chen, Zheng Shen, Yue Zhang, Lingyun Chen, Fan Wu, Zhenshan Bing, Sami Haddadin, Alois Knoll
Large Language Models (LLMs) have gained popularity in task planning for long-horizon manipulation tasks. To enhance the validity of LLM-generated plans, visual demonstrations and online videos have been widely employed to guide the planning process. However, for manipulation tasks involving subtle movements but rich contact interactions, visual perception alone may be insufficient for the LLM to fully interpret the demonstration. Additionally, visual data provides limited information on force-related parameters and conditions, which are crucial for effective execution on real robots. In this paper, we introduce an in-context learning framework that incorporates tactile and force-torque information from human demonstrations to enhance LLMs' ability to generate plans for new task scenarios. We propose a bootstrapped reasoning pipeline that sequentially integrates each modality into a comprehensive task plan. This task plan is then used as a reference for planning in new task configurations. Real-world experiments on two different sequential manipulation tasks demonstrate the effectiveness of our framework in improving LLMs' understanding of multi-modal demonstrations and enhancing the overall planning performance.
{"title":"Learning Task Planning from Multi-Modal Demonstration for Multi-Stage Contact-Rich Manipulation","authors":"Kejia Chen, Zheng Shen, Yue Zhang, Lingyun Chen, Fan Wu, Zhenshan Bing, Sami Haddadin, Alois Knoll","doi":"arxiv-2409.11863","DOIUrl":"https://doi.org/arxiv-2409.11863","url":null,"abstract":"Large Language Models (LLMs) have gained popularity in task planning for\u0000long-horizon manipulation tasks. To enhance the validity of LLM-generated\u0000plans, visual demonstrations and online videos have been widely employed to\u0000guide the planning process. However, for manipulation tasks involving subtle\u0000movements but rich contact interactions, visual perception alone may be\u0000insufficient for the LLM to fully interpret the demonstration. Additionally,\u0000visual data provides limited information on force-related parameters and\u0000conditions, which are crucial for effective execution on real robots. In this paper, we introduce an in-context learning framework that\u0000incorporates tactile and force-torque information from human demonstrations to\u0000enhance LLMs' ability to generate plans for new task scenarios. We propose a\u0000bootstrapped reasoning pipeline that sequentially integrates each modality into\u0000a comprehensive task plan. This task plan is then used as a reference for\u0000planning in new task configurations. Real-world experiments on two different\u0000sequential manipulation tasks demonstrate the effectiveness of our framework in\u0000improving LLMs' understanding of multi-modal demonstrations and enhancing the\u0000overall planning performance.","PeriodicalId":501031,"journal":{"name":"arXiv - CS - Robotics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267079","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
WeHelp: A Shared Autonomy System for Wheelchair Users
Abulikemu Abuduweili, Alice Wu, Tianhao Wei, Weiye Zhao
There is a large population of wheelchair users, and most of them need help with daily tasks. However, according to recent reports, their needs are not properly met due to the lack of caregivers. Therefore, in this project we develop WeHelp, a shared autonomy system aimed at wheelchair users. A robot with the WeHelp system has three modes: a following mode, a remote-control mode, and a teleoperation mode. In the following mode, the robot follows the wheelchair user automatically via visual tracking; the user can ask the robot to follow from behind, on the left, or on the right. When the wheelchair user asks for help, the robot recognizes the command via speech recognition and switches to the teleoperation mode or the remote-control mode. In the teleoperation mode, the wheelchair user takes over the robot with a joystick and controls it to complete complex tasks, such as opening doors, moving obstacles out of the way, or reaching objects on a high shelf or low on the ground. In the remote-control mode, a remote assistant takes over the robot and helps the wheelchair user complete such tasks. Our evaluation shows that the pipeline is useful and practical for wheelchair users. Source code and a demo of the paper are available at https://github.com/Walleclipse/WeHelp.
{"title":"WeHelp: A Shared Autonomy System for Wheelchair Users","authors":"Abulikemu Abuduweili, Alice Wu, Tianhao Wei, Weiye Zhao","doi":"arxiv-2409.12159","DOIUrl":"https://doi.org/arxiv-2409.12159","url":null,"abstract":"There is a large population of wheelchair users. Most of the wheelchair users\u0000need help with daily tasks. However, according to recent reports, their needs\u0000are not properly satisfied due to the lack of caregivers. Therefore, in this\u0000project, we develop WeHelp, a shared autonomy system aimed for wheelchair\u0000users. A robot with a WeHelp system has three modes, following mode, remote\u0000control mode and tele-operation mode. In the following mode, the robot follows\u0000the wheelchair user automatically via visual tracking. The wheelchair user can\u0000ask the robot to follow them from behind, by the left or by the right. When the\u0000wheelchair user asks for help, the robot will recognize the command via speech\u0000recognition, and then switch to the teleoperation mode or remote control mode.\u0000In the teleoperation mode, the wheelchair user takes over the robot with a joy\u0000stick and controls the robot to complete some complex tasks for their needs,\u0000such as opening doors, moving obstacles on the way, reaching objects on a high\u0000shelf or on the low ground, etc. In the remote control mode, a remote assistant\u0000takes over the robot and helps the wheelchair user complete some complex tasks\u0000for their needs. Our evaluation shows that the pipeline is useful and practical\u0000for wheelchair users. Source code and demo of the paper are available at\u0000url{https://github.com/Walleclipse/WeHelp}.","PeriodicalId":501031,"journal":{"name":"arXiv - CS - Robotics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267025","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}