Keshu Wu, Yang Zhou, Haotian Shi, Dominique Lord, Bin Ran, Xinyue Ye
The intricate nature of real-world driving environments, characterized by dynamic and diverse interactions among multiple vehicles and their possible future states, presents considerable challenges in accurately predicting the motion states of vehicles and handling the uncertainty inherent in the predictions. Addressing these challenges requires comprehensive modeling and reasoning to capture the implicit relations among vehicles and the corresponding diverse behaviors. This research introduces an integrated framework for autonomous vehicle (AV) motion prediction to address these complexities, utilizing a novel Relational Hypergraph Interaction-informed Neural mOtion generator (RHINO). RHINO leverages hypergraph-based relational reasoning by integrating a multi-scale hypergraph neural network to model group-wise interactions among multiple vehicles and their multi-modal driving behaviors, thereby enhancing motion prediction accuracy and reliability. Experimental validation on real-world datasets demonstrates the superior performance of this framework in improving predictive accuracy and fostering socially aware automated driving in dynamic traffic scenarios.
{"title":"Hypergraph-based Motion Generation with Multi-modal Interaction Relational Reasoning","authors":"Keshu Wu, Yang Zhou, Haotian Shi, Dominique Lord, Bin Ran, Xinyue Ye","doi":"arxiv-2409.11676","DOIUrl":"https://doi.org/arxiv-2409.11676","url":null,"abstract":"The intricate nature of real-world driving environments, characterized by\u0000dynamic and diverse interactions among multiple vehicles and their possible\u0000future states, presents considerable challenges in accurately predicting the\u0000motion states of vehicles and handling the uncertainty inherent in the\u0000predictions. Addressing these challenges requires comprehensive modeling and\u0000reasoning to capture the implicit relations among vehicles and the\u0000corresponding diverse behaviors. This research introduces an integrated\u0000framework for autonomous vehicles (AVs) motion prediction to address these\u0000complexities, utilizing a novel Relational Hypergraph Interaction-informed\u0000Neural mOtion generator (RHINO). RHINO leverages hypergraph-based relational\u0000reasoning by integrating a multi-scale hypergraph neural network to model\u0000group-wise interactions among multiple vehicles and their multi-modal driving\u0000behaviors, thereby enhancing motion prediction accuracy and reliability.\u0000Experimental validation using real-world datasets demonstrates the superior\u0000performance of this framework in improving predictive accuracy and fostering\u0000socially aware automated driving in dynamic traffic scenarios.","PeriodicalId":501031,"journal":{"name":"arXiv - CS - Robotics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266861","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
When a robot grasps an object using dexterous hands or grippers, it should understand the Task-Oriented Affordances of the Object (TOAO), as different tasks often require attention to specific parts of the object. To address this challenge, we propose GauTOAO, a Gaussian-based framework for Task-Oriented Affordance of Objects, which leverages vision-language models in a zero-shot manner to predict affordance-relevant regions of an object given a natural language query. Our approach introduces a new paradigm: "static camera, moving object," allowing the robot to better observe and understand the object in hand during manipulation. GauTOAO addresses the limitations of existing methods, which often lack effective spatial grouping, by extracting a comprehensive 3D object mask using DINO features. This mask is then used to conditionally query Gaussians, producing a refined semantic distribution over the object for the specified task. This approach results in more accurate TOAO extraction, enhancing the robot's understanding of the object and improving task performance. We validate the effectiveness of GauTOAO through real-world experiments, demonstrating its capability to generalize across various tasks.
{"title":"GauTOAO: Gaussian-based Task-Oriented Affordance of Objects","authors":"Jiawen Wang, Dingsheng Luo","doi":"arxiv-2409.11941","DOIUrl":"https://doi.org/arxiv-2409.11941","url":null,"abstract":"When your robot grasps an object using dexterous hands or grippers, it should\u0000understand the Task-Oriented Affordances of the Object(TOAO), as different\u0000tasks often require attention to specific parts of the object. To address this\u0000challenge, we propose GauTOAO, a Gaussian-based framework for Task-Oriented\u0000Affordance of Objects, which leverages vision-language models in a zero-shot\u0000manner to predict affordance-relevant regions of an object, given a natural\u0000language query. Our approach introduces a new paradigm: \"static camera, moving\u0000object,\" allowing the robot to better observe and understand the object in hand\u0000during manipulation. GauTOAO addresses the limitations of existing methods,\u0000which often lack effective spatial grouping, by extracting a comprehensive 3D\u0000object mask using DINO features. This mask is then used to conditionally query\u0000gaussians, producing a refined semantic distribution over the object for the\u0000specified task. This approach results in more accurate TOAO extraction,\u0000enhancing the robot's understanding of the object and improving task\u0000performance. We validate the effectiveness of GauTOAO through real-world\u0000experiments, demonstrating its capability to generalize across various tasks.","PeriodicalId":501031,"journal":{"name":"arXiv - CS - Robotics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Martin Schuck, Jan Brüdigam, Sandra Hirche, Angela Schoellig
Handling orientations of robots and objects is a crucial aspect of many applications. Yet, all too often, orientations are handled without mathematical correctness, especially in learning pipelines involving, for example, artificial neural networks. In this paper, we investigate reinforcement learning with orientations and propose a simple modification of the network's input and output that adheres to the Lie group structure of orientations. As a result, we obtain an easy and efficient implementation that is directly usable with existing learning libraries and achieves significantly better performance than other common orientation representations. We briefly introduce Lie theory specifically for orientations in robotics to motivate and outline our approach. Subsequently, a thorough empirical evaluation of different combinations of orientation representations for states and actions demonstrates the superior performance of our proposed approach in different scenarios, including direct orientation control, end-effector orientation control, and pick-and-place tasks.
{"title":"Reinforcement Learning with Lie Group Orientations for Robotics","authors":"Martin Schuck, Jan Brüdigam, Sandra Hirche, Angela Schoellig","doi":"arxiv-2409.11935","DOIUrl":"https://doi.org/arxiv-2409.11935","url":null,"abstract":"Handling orientations of robots and objects is a crucial aspect of many\u0000applications. Yet, ever so often, there is a lack of mathematical correctness\u0000when dealing with orientations, especially in learning pipelines involving, for\u0000example, artificial neural networks. In this paper, we investigate\u0000reinforcement learning with orientations and propose a simple modification of\u0000the network's input and output that adheres to the Lie group structure of\u0000orientations. As a result, we obtain an easy and efficient implementation that\u0000is directly usable with existing learning libraries and achieves significantly\u0000better performance than other common orientation representations. We briefly\u0000introduce Lie theory specifically for orientations in robotics to motivate and\u0000outline our approach. Subsequently, a thorough empirical evaluation of\u0000different combinations of orientation representations for states and actions\u0000demonstrates the superior performance of our proposed approach in different\u0000scenarios, including: direct orientation control, end effector orientation\u0000control, and pick-and-place tasks.","PeriodicalId":501031,"journal":{"name":"arXiv - CS - Robotics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266824","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Robotic assistive feeding holds significant promise for improving the quality of life for individuals with eating disabilities. However, acquiring diverse food items under varying conditions and generalizing to unseen food presents unique challenges. Existing methods that rely on surface-level geometric information (e.g., bounding box and pose) derived from visual cues (e.g., color, shape, and texture) often lack adaptability and robustness, especially when foods share similar physical properties but differ in visual appearance. We employ imitation learning (IL) to learn a policy for food acquisition. Existing IL and Reinforcement Learning (RL) methods typically build policies on off-the-shelf image encoders such as ResNet-50; however, such representations are not robust and struggle to generalize across diverse acquisition scenarios. To address these limitations, we propose a novel approach, IMRL (Integrated Multi-Dimensional Representation Learning), which integrates visual, physical, temporal, and geometric representations to enhance the robustness and generalizability of IL for food acquisition. Our approach captures food types and physical properties (e.g., solid, semi-solid, granular, liquid, and mixture), models the temporal dynamics of acquisition actions, and introduces geometric information to determine optimal scooping points and assess bowl fullness. IMRL enables IL to adaptively adjust scooping strategies based on context, improving the robot's capability to handle diverse food acquisition scenarios. Experiments on a real robot demonstrate our approach's robustness and adaptability across various foods and bowl configurations, including zero-shot generalization to unseen settings. Our approach achieves up to a 35% improvement in success rate over the best-performing baseline.
{"title":"IMRL: Integrating Visual, Physical, Temporal, and Geometric Representations for Enhanced Food Acquisition","authors":"Rui Liu, Zahiruddin Mahammad, Amisha Bhaskar, Pratap Tokekar","doi":"arxiv-2409.12092","DOIUrl":"https://doi.org/arxiv-2409.12092","url":null,"abstract":"Robotic assistive feeding holds significant promise for improving the quality\u0000of life for individuals with eating disabilities. However, acquiring diverse\u0000food items under varying conditions and generalizing to unseen food presents\u0000unique challenges. Existing methods that rely on surface-level geometric\u0000information (e.g., bounding box and pose) derived from visual cues (e.g.,\u0000color, shape, and texture) often lacks adaptability and robustness, especially\u0000when foods share similar physical properties but differ in visual appearance.\u0000We employ imitation learning (IL) to learn a policy for food acquisition.\u0000Existing methods employ IL or Reinforcement Learning (RL) to learn a policy\u0000based on off-the-shelf image encoders such as ResNet-50. However, such\u0000representations are not robust and struggle to generalize across diverse\u0000acquisition scenarios. To address these limitations, we propose a novel\u0000approach, IMRL (Integrated Multi-Dimensional Representation Learning), which\u0000integrates visual, physical, temporal, and geometric representations to enhance\u0000the robustness and generalizability of IL for food acquisition. Our approach\u0000captures food types and physical properties (e.g., solid, semi-solid, granular,\u0000liquid, and mixture), models temporal dynamics of acquisition actions, and\u0000introduces geometric information to determine optimal scooping points and\u0000assess bowl fullness. IMRL enables IL to adaptively adjust scooping strategies\u0000based on context, improving the robot's capability to handle diverse food\u0000acquisition scenarios. Experiments on a real robot demonstrate our approach's\u0000robustness and adaptability across various foods and bowl configurations,\u0000including zero-shot generalization to unseen settings. Our approach achieves\u0000improvement up to $35%$ in success rate compared with the best-performing\u0000baseline.","PeriodicalId":501031,"journal":{"name":"arXiv - CS - Robotics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266819","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Autonomous driving technology has witnessed rapid advancements, with foundation models improving interactivity and user experiences. However, current autonomous vehicles (AVs) face significant limitations in delivering command-based driving styles. Most existing methods either rely on predefined driving styles that require expert input or use data-driven techniques like Inverse Reinforcement Learning to extract styles from driving data. These approaches, though effective in some cases, face challenges: difficulty obtaining specific driving data for style matching (e.g., in Robotaxis), inability to align driving style metrics with user preferences, and restriction to pre-existing styles, which limits customization and generalization to new commands. This paper introduces Words2Wheels, a framework that automatically generates customized driving policies based on natural language user commands. Words2Wheels employs a Style-Customized Reward Function to generate a Style-Customized Driving Policy without relying on prior driving data. By leveraging large language models and a Driving Style Database, the framework efficiently retrieves, adapts, and generalizes driving styles. A Statistical Evaluation module ensures alignment with user preferences. Experimental results demonstrate that Words2Wheels outperforms existing methods in accuracy, generalization, and adaptability, offering a novel solution for customized AV driving behavior. Code and demo are available at https://yokhon.github.io/Words2Wheels/.
{"title":"From Words to Wheels: Automated Style-Customized Policy Generation for Autonomous Driving","authors":"Xu Han, Xianda Chen, Zhenghan Cai, Pinlong Cai, Meixin Zhu, Xiaowen Chu","doi":"arxiv-2409.11694","DOIUrl":"https://doi.org/arxiv-2409.11694","url":null,"abstract":"Autonomous driving technology has witnessed rapid advancements, with\u0000foundation models improving interactivity and user experiences. However,\u0000current autonomous vehicles (AVs) face significant limitations in delivering\u0000command-based driving styles. Most existing methods either rely on predefined\u0000driving styles that require expert input or use data-driven techniques like\u0000Inverse Reinforcement Learning to extract styles from driving data. These\u0000approaches, though effective in some cases, face challenges: difficulty\u0000obtaining specific driving data for style matching (e.g., in Robotaxis),\u0000inability to align driving style metrics with user preferences, and limitations\u0000to pre-existing styles, restricting customization and generalization to new\u0000commands. This paper introduces Words2Wheels, a framework that automatically\u0000generates customized driving policies based on natural language user commands.\u0000Words2Wheels employs a Style-Customized Reward Function to generate a\u0000Style-Customized Driving Policy without relying on prior driving data. By\u0000leveraging large language models and a Driving Style Database, the framework\u0000efficiently retrieves, adapts, and generalizes driving styles. A Statistical\u0000Evaluation module ensures alignment with user preferences. Experimental results\u0000demonstrate that Words2Wheels outperforms existing methods in accuracy,\u0000generalization, and adaptability, offering a novel solution for customized AV\u0000driving behavior. Code and demo available at\u0000https://yokhon.github.io/Words2Wheels/.","PeriodicalId":501031,"journal":{"name":"arXiv - CS - Robotics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266859","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Finn Lukas Busch, Timon Homberger, Jesús Ortega-Peimbert, Quantao Yang, Olov Andersson
The capability to efficiently search for objects in complex environments is fundamental for many real-world robot applications. Recent advances in open-vocabulary vision models have resulted in semantically-informed object navigation methods that allow a robot to search for an arbitrary object without prior training. However, these zero-shot methods have so far treated the environment as unknown for each consecutive query. In this paper we introduce a new benchmark for zero-shot multi-object navigation, allowing the robot to leverage information gathered from previous searches to more efficiently find new objects. To address this problem we build a reusable open-vocabulary feature map tailored for real-time object search. We further propose a probabilistic-semantic map update that mitigates common sources of errors in semantic feature extraction and leverage this semantic uncertainty for informed multi-object exploration. We evaluate our method on a set of object navigation tasks both in simulation and with a real robot, running in real time on a Jetson Orin AGX. We demonstrate that it outperforms existing state-of-the-art approaches on both single- and multi-object navigation tasks. Additional videos, code, and the multi-object navigation benchmark will be available at https://finnbsch.github.io/OneMap.
{"title":"One Map to Find Them All: Real-time Open-Vocabulary Mapping for Zero-shot Multi-Object Navigation","authors":"Finn Lukas Busch, Timon Homberger, Jesús Ortega-Peimbert, Quantao Yang, Olov Andersson","doi":"arxiv-2409.11764","DOIUrl":"https://doi.org/arxiv-2409.11764","url":null,"abstract":"The capability to efficiently search for objects in complex environments is\u0000fundamental for many real-world robot applications. Recent advances in\u0000open-vocabulary vision models have resulted in semantically-informed object\u0000navigation methods that allow a robot to search for an arbitrary object without\u0000prior training. However, these zero-shot methods have so far treated the\u0000environment as unknown for each consecutive query. In this paper we introduce a\u0000new benchmark for zero-shot multi-object navigation, allowing the robot to\u0000leverage information gathered from previous searches to more efficiently find\u0000new objects. To address this problem we build a reusable open-vocabulary\u0000feature map tailored for real-time object search. We further propose a\u0000probabilistic-semantic map update that mitigates common sources of errors in\u0000semantic feature extraction and leverage this semantic uncertainty for informed\u0000multi-object exploration. We evaluate our method on a set of object navigation\u0000tasks in both simulation as well as with a real robot, running in real-time on\u0000a Jetson Orin AGX. We demonstrate that it outperforms existing state-of-the-art\u0000approaches both on single and multi-object navigation tasks. Additional videos,\u0000code and the multi-object navigation benchmark will be available on\u0000https://finnbsch.github.io/OneMap.","PeriodicalId":501031,"journal":{"name":"arXiv - CS - Robotics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266856","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Robots can influence people to accomplish their tasks more efficiently: autonomous cars can inch forward at an intersection to pass through, and tabletop manipulators can go for an object on the table first. However, a robot's ability to influence can also compromise the safety of nearby people if naively executed. In this work, we pose and solve a novel robust reach-avoid dynamic game which enables robots to be maximally influential, but only when a safety backup control exists. On the human side, we model the human's behavior as goal-driven but conditioned on the robot's plan, enabling us to capture influence. On the robot side, we solve the dynamic game in the joint physical and belief space, enabling the robot to reason about how its uncertainty in human behavior will evolve over time. We instantiate our method, called SLIDE (Safely Leveraging Influence in Dynamic Environments), in a high-dimensional (39-D) simulated human-robot collaborative manipulation task solved via offline game-theoretic reinforcement learning. We compare our approach to a robust baseline that treats the human as a worst-case adversary, a safety controller that does not explicitly reason about influence, and an energy-function-based safety shield. We find that SLIDE consistently enables the robot to leverage the influence it has on the human when it is safe to do so, ultimately allowing the robot to be less conservative while still ensuring a high safety rate during task execution.
{"title":"Robots that Learn to Safely Influence via Prediction-Informed Reach-Avoid Dynamic Games","authors":"Ravi Pandya, Changliu Liu, Andrea Bajcsy","doi":"arxiv-2409.12153","DOIUrl":"https://doi.org/arxiv-2409.12153","url":null,"abstract":"Robots can influence people to accomplish their tasks more efficiently:\u0000autonomous cars can inch forward at an intersection to pass through, and\u0000tabletop manipulators can go for an object on the table first. However, a\u0000robot's ability to influence can also compromise the safety of nearby people if\u0000naively executed. In this work, we pose and solve a novel robust reach-avoid\u0000dynamic game which enables robots to be maximally influential, but only when a\u0000safety backup control exists. On the human side, we model the human's behavior\u0000as goal-driven but conditioned on the robot's plan, enabling us to capture\u0000influence. On the robot side, we solve the dynamic game in the joint physical\u0000and belief space, enabling the robot to reason about how its uncertainty in\u0000human behavior will evolve over time. We instantiate our method, called SLIDE\u0000(Safely Leveraging Influence in Dynamic Environments), in a high-dimensional\u0000(39-D) simulated human-robot collaborative manipulation task solved via offline\u0000game-theoretic reinforcement learning. We compare our approach to a robust\u0000baseline that treats the human as a worst-case adversary, a safety controller\u0000that does not explicitly reason about influence, and an energy-function-based\u0000safety shield. We find that SLIDE consistently enables the robot to leverage\u0000the influence it has on the human when it is safe to do so, ultimately allowing\u0000the robot to be less conservative while still ensuring a high safety rate\u0000during task execution.","PeriodicalId":501031,"journal":{"name":"arXiv - CS - Robotics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267027","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tim Engelbracht, René Zurbrügg, Marc Pollefeys, Hermann Blum, Zuria Bauer
Despite increasing research efforts on household robotics, robots intended for deployment in domestic settings still struggle with more complex tasks such as interacting with functional elements like drawers or light switches, largely due to limited task-specific understanding and interaction capabilities. These tasks require not only detection and pose estimation but also an understanding of the affordances these elements provide. To address these challenges and enhance robotic scene understanding, we introduce SpotLight: a comprehensive framework for robotic interaction with functional elements, specifically light switches. Furthermore, this framework enables robots to improve their environmental understanding through interaction. Leveraging VLM-based affordance prediction to estimate motion primitives for light switch interaction, we achieve up to 84% operation success in real-world experiments. We further introduce a specialized dataset containing 715 images as well as a custom detection model for light switch detection. We demonstrate how the framework can facilitate robot learning through physical interaction by having the robot explore the environment and discover previously unknown relationships in a scene graph representation. Lastly, we propose an extension to the framework to accommodate other functional interactions such as swing doors, showcasing its flexibility. Videos and code: timengelbracht.github.io/SpotLight/
{"title":"SpotLight: Robotic Scene Understanding through Interaction and Affordance Detection","authors":"Tim Engelbracht, René Zurbrügg, Marc Pollefeys, Hermann Blum, Zuria Bauer","doi":"arxiv-2409.11870","DOIUrl":"https://doi.org/arxiv-2409.11870","url":null,"abstract":"Despite increasing research efforts on household robotics, robots intended\u0000for deployment in domestic settings still struggle with more complex tasks such\u0000as interacting with functional elements like drawers or light switches, largely\u0000due to limited task-specific understanding and interaction capabilities. These\u0000tasks require not only detection and pose estimation but also an understanding\u0000of the affordances these elements provide. To address these challenges and\u0000enhance robotic scene understanding, we introduce SpotLight: A comprehensive\u0000framework for robotic interaction with functional elements, specifically light\u0000switches. Furthermore, this framework enables robots to improve their\u0000environmental understanding through interaction. Leveraging VLM-based\u0000affordance prediction to estimate motion primitives for light switch\u0000interaction, we achieve up to 84% operation success in real world experiments.\u0000We further introduce a specialized dataset containing 715 images as well as a\u0000custom detection model for light switch detection. We demonstrate how the\u0000framework can facilitate robot learning through physical interaction by having\u0000the robot explore the environment and discover previously unknown relationships\u0000in a scene graph representation. Lastly, we propose an extension to the\u0000framework to accommodate other functional interactions such as swing doors,\u0000showcasing its flexibility. Videos and Code:\u0000timengelbracht.github.io/SpotLight/","PeriodicalId":501031,"journal":{"name":"arXiv - CS - Robotics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Non-stationarity poses a fundamental challenge in Multi-Agent Reinforcement Learning (MARL), arising from agents simultaneously learning and altering their policies. This creates a non-stationary environment from the perspective of each individual agent, often leading to suboptimal or even unconverged learning outcomes. We propose an open-source framework named XP-MARL, which augments MARL with auxiliary prioritization to address this challenge in cooperative settings. XP-MARL is 1) founded upon our hypothesis that prioritizing agents and letting higher-priority agents establish their actions first would stabilize the learning process and thus mitigate non-stationarity and 2) enabled by our proposed mechanism called action propagation, where higher-priority agents act first and communicate their actions, providing a more stationary environment for others. Moreover, instead of using a predefined or heuristic priority assignment, XP-MARL learns priority-assignment policies with an auxiliary MARL problem, leading to a joint learning scheme. Experiments in a motion-planning scenario involving Connected and Automated Vehicles (CAVs) demonstrate that XP-MARL improves the safety of a baseline model by 84.4% and outperforms a state-of-the-art approach, which improves the baseline by only 12.8%. Code: github.com/cas-lab-munich/sigmarl
{"title":"XP-MARL: Auxiliary Prioritization in Multi-Agent Reinforcement Learning to Address Non-Stationarity","authors":"Jianye Xu, Omar Sobhy, Bassam Alrifaee","doi":"arxiv-2409.11852","DOIUrl":"https://doi.org/arxiv-2409.11852","url":null,"abstract":"Non-stationarity poses a fundamental challenge in Multi-Agent Reinforcement\u0000Learning (MARL), arising from agents simultaneously learning and altering their\u0000policies. This creates a non-stationary environment from the perspective of\u0000each individual agent, often leading to suboptimal or even unconverged learning\u0000outcomes. We propose an open-source framework named XP-MARL, which augments\u0000MARL with auxiliary prioritization to address this challenge in cooperative\u0000settings. XP-MARL is 1) founded upon our hypothesis that prioritizing agents\u0000and letting higher-priority agents establish their actions first would\u0000stabilize the learning process and thus mitigate non-stationarity and 2)\u0000enabled by our proposed mechanism called action propagation, where\u0000higher-priority agents act first and communicate their actions, providing a\u0000more stationary environment for others. Moreover, instead of using a predefined\u0000or heuristic priority assignment, XP-MARL learns priority-assignment policies\u0000with an auxiliary MARL problem, leading to a joint learning scheme. Experiments\u0000in a motion-planning scenario involving Connected and Automated Vehicles (CAVs)\u0000demonstrate that XP-MARL improves the safety of a baseline model by 84.4% and\u0000outperforms a state-of-the-art approach, which improves the baseline by only\u000012.8%. Code: github.com/cas-lab-munich/sigmarl","PeriodicalId":501031,"journal":{"name":"arXiv - CS - Robotics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficiently and completely capturing the three-dimensional data of an object is a fundamental problem in industrial and robotic applications. The task of next-best-view (NBV) planning is to infer the pose of the next viewpoint based on the current data and gradually realize the complete three-dimensional reconstruction. Many existing algorithms, however, suffer from a large computational burden due to the use of ray-casting. To address this, this paper proposes a projection-based NBV planning framework that can select the next best view extremely quickly while ensuring complete scanning of the object. Specifically, the framework refits different types of voxel clusters into ellipsoids based on the voxel structure. Then, the next best view is selected from the candidate views using a projection-based viewpoint quality evaluation function in conjunction with a global partitioning strategy. This process replaces the ray-casting in voxel structures, significantly improving the computational efficiency. Comparative experiments with other algorithms in a simulation environment show that the proposed framework achieves a tenfold efficiency improvement while capturing roughly the same coverage. Real-world experiments further demonstrate the efficiency and feasibility of the framework.
{"title":"An Efficient Projection-Based Next-best-view Planning Framework for Reconstruction of Unknown Objects","authors":"Zhizhou Jia, Shaohui Zhang, Qun Hao","doi":"arxiv-2409.12096","DOIUrl":"https://doi.org/arxiv-2409.12096","url":null,"abstract":"Efficiently and completely capturing the three-dimensional data of an object\u0000is a fundamental problem in industrial and robotic applications. The task of\u0000next-best-view (NBV) planning is to infer the pose of the next viewpoint based\u0000on the current data, and gradually realize the complete three-dimensional\u0000reconstruction. Many existing algorithms, however, suffer a large computational\u0000burden due to the use of ray-casting. To address this, this paper proposes a\u0000projection-based NBV planning framework. It can select the next best view at an\u0000extremely fast speed while ensuring the complete scanning of the object.\u0000Specifically, this framework refits different types of voxel clusters into\u0000ellipsoids based on the voxel structure.Then, the next best view is selected\u0000from the candidate views using a projection-based viewpoint quality evaluation\u0000function in conjunction with a global partitioning strategy. This process\u0000replaces the ray-casting in voxel structures, significantly improving the\u0000computational efficiency. Comparative experiments with other algorithms in a\u0000simulation environment show that the framework proposed in this paper can\u0000achieve 10 times efficiency improvement on the basis of capturing roughly the\u0000same coverage. The real-world experimental results also prove the efficiency\u0000and feasibility of the framework.","PeriodicalId":501031,"journal":{"name":"arXiv - CS - Robotics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267029","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}