Conference on Robot Learning: Latest Papers (Page 4)

Last-Mile Embodied Visual Navigation

Conference on Robot Learning

Pub Date : 2022-11-21 DOI: 10.48550/arXiv.2211.11746

Justin Wasserman, Karmesh Yadav, Girish V. Chowdhary, Abhi Gupta, Unnat Jain

Realistic long-horizon tasks like image-goal navigation involve exploratory and exploitative phases. Assigned with an image of the goal, an embodied agent must explore to discover the goal, i.e., search efficiently using learned priors. Once the goal is discovered, the agent must accurately calibrate the last-mile of navigation to the goal. As with any robust system, switches between exploratory goal discovery and exploitative last-mile navigation enable better recovery from errors. Following these intuitive guide rails, we propose SLING to improve the performance of existing image-goal navigation systems. Entirely complementing prior methods, we focus on last-mile navigation and leverage the underlying geometric structure of the problem with neural descriptors. With simple but effective switches, we can easily connect SLING with heuristic, reinforcement learning, and neural modular policies. On a standardized image-goal navigation benchmark (Hahn et al. 2021), we improve performance across policies, scenes, and episode complexity, raising the state-of-the-art from 45% to 55% success rate. Beyond photorealistic simulation, we conduct real-robot experiments in three physical scenes and find these improvements to transfer well to real environments.

Citations: 12

Deep Projective Rotation Estimation through Relative Supervision

Conference on Robot Learning

Pub Date : 2022-11-21 DOI: 10.48550/arXiv.2211.11182

Brian Okorn, Chuer Pan, M. Hebert, David Held

Orientation estimation is core to a variety of vision and robotics tasks such as camera and object pose estimation. Deep learning has offered a way to develop image-based orientation estimators; however, such estimators often require training on a large labeled dataset, which can be time-intensive to collect. In this work, we explore whether self-supervised learning from unlabeled data can be used to alleviate this issue. Specifically, we assume access to estimates of the relative orientation between neighboring poses, such as can be obtained via a local alignment method. While self-supervised learning has been used successfully for translational object keypoints, in this work, we show that naively applying relative supervision to the rotational group $SO(3)$ will often fail to converge due to the non-convexity of the rotational space. To tackle this challenge, we propose a new algorithm for self-supervised orientation estimation which utilizes Modified Rodrigues Parameters to stereographically project the closed manifold of $SO(3)$ to the open manifold of $\mathbb{R}^{3}$, allowing the optimization to be done in an open Euclidean space. We empirically validate the benefits of the proposed algorithm for the rotational averaging problem in two settings: (1) direct optimization on rotation parameters, and (2) optimization of parameters of a convolutional neural network that predicts object orientations from images. In both settings, we demonstrate that our proposed algorithm is able to converge to a consistent relative orientation frame much faster than algorithms that purely operate in the $SO(3)$ space. Additional information can be found at https://sites.google.com/view/deep-projective-rotation/home.
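
The stereographic projection mentioned in the abstract can be illustrated with the textbook definition of Modified Rodrigues Parameters (MRPs). The sketch below is not the authors' code: it assumes unit quaternions in (w, x, y, z) order, and the function names are hypothetical. It only shows the mapping between a rotation and a point in R^3, not the self-supervised training procedure.

```python
import math

def axis_angle_to_quat(axis, angle):
    """Unit quaternion (w, x, y, z) for a rotation of `angle` radians about `axis`."""
    norm = math.sqrt(sum(a * a for a in axis))
    half = 0.5 * angle
    s = math.sin(half) / norm
    return (math.cos(half), axis[0] * s, axis[1] * s, axis[2] * s)

def quat_to_mrp(q):
    """Stereographic projection of a unit quaternion onto R^3 (the MRP vector)."""
    w, x, y, z = q
    # q and -q encode the same rotation; picking w >= 0 keeps the result
    # away from the projection's singularity at w = -1.
    if w < 0.0:
        w, x, y, z = -w, -x, -y, -z
    d = 1.0 + w
    return (x / d, y / d, z / d)

def mrp_to_quat(p):
    """Inverse projection: MRP vector back to a unit quaternion (w, x, y, z)."""
    n2 = sum(c * c for c in p)
    s = 1.0 + n2
    return ((1.0 - n2) / s, 2.0 * p[0] / s, 2.0 * p[1] / s, 2.0 * p[2] / s)

# A 90-degree rotation about the z-axis maps to an MRP of length tan(angle / 4).
q = axis_angle_to_quat((0.0, 0.0, 1.0), math.pi / 2)
p = quat_to_mrp(q)
q_back = mrp_to_quat(p)
```

Because the MRP vector lives in ordinary Euclidean space, a gradient step on `p` never leaves the valid parameter set, which is one way to read the abstract's point about optimizing in an open space.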

Citations: 0

Safe Control Under Input Limits with Neural Control Barrier Functions

Conference on Robot Learning

Pub Date : 2022-11-20 DOI: 10.48550/arXiv.2211.11056

Simin Liu, Changliu Liu, J. Dolan

We propose new methods to synthesize control barrier function (CBF)-based safe controllers that avoid input saturation, which can cause safety violations. In particular, our method is created for high-dimensional, general nonlinear systems, for which such tools are scarce. We leverage techniques from machine learning, like neural networks and deep learning, to simplify this challenging problem in nonlinear control design. The method consists of a learner-critic architecture, in which the critic gives counterexamples of input saturation and the learner optimizes a neural CBF to eliminate those counterexamples. We provide empirical results on a 10D state, 4D input quadcopter-pendulum system. Our learned CBF avoids input saturation and maintains safety over nearly 100% of trials.
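
To make the idea of an "input saturation counterexample" concrete, here is a toy sketch that is deliberately much simpler than the paper's setting. It assumes a double integrator (x' = v, v' = u) with |u| <= U_MAX and a position limit x <= X_MAX, and it replaces the neural CBF and learned critic with two hand-written candidate functions and a search over boundary states. All names and constants are illustrative assumptions. A boundary state counts as a counterexample when even the best admissible input cannot satisfy the CBF condition dh/dt >= -ALPHA * h.

```python
U_MAX = 1.0   # input limit |u| <= U_MAX (illustrative value)
X_MAX = 0.0   # position limit x <= X_MAX (illustrative value)
ALPHA = 1.0   # gain in the CBF condition dh/dt >= -ALPHA * h
C = 0.5       # velocity weight of the linear candidate (illustrative value)

def best_case_margin(h, grad_h, x, v):
    """Largest achievable dh/dt + ALPHA*h over admissible inputs, for x' = v, v' = u."""
    dh_dx, dh_dv = grad_h(x, v)
    # dh/dt = dh_dx * v + dh_dv * u is maximized by u = U_MAX * sign(dh_dv).
    return dh_dx * v + abs(dh_dv) * U_MAX + ALPHA * h(x, v)

# Candidate 1: linear in velocity; ignores how much braking is actually available.
def h_lin(x, v):
    return X_MAX - x - C * v

def grad_lin(x, v):
    return (-1.0, -C)

def boundary_lin(v):
    return X_MAX - C * v          # x such that h_lin(x, v) = 0

# Candidate 2: subtracts the stopping distance v^2 / (2 * U_MAX) under full braking.
def h_brake(x, v):
    return X_MAX - x - max(v, 0.0) ** 2 / (2.0 * U_MAX)

def grad_brake(x, v):
    return (-1.0, -max(v, 0.0) / U_MAX)

def boundary_brake(v):
    return X_MAX - max(v, 0.0) ** 2 / (2.0 * U_MAX)   # x such that h_brake(x, v) = 0

def critic(h, grad_h, boundary_x, speeds):
    """Boundary states (h = 0) where no admissible input satisfies the CBF condition."""
    bad = []
    for v in speeds:
        x = boundary_x(v)
        if best_case_margin(h, grad_h, x, v) < -1e-9:
            bad.append((x, v))
    return bad

# Only v >= 0 (moving toward the limit) is examined in this sketch.
SPEEDS = [0.0, 0.25, 0.5, 1.0, 2.0]
bad_lin = critic(h_lin, grad_lin, boundary_lin, SPEEDS)
bad_brake = critic(h_brake, grad_brake, boundary_brake, SPEEDS)
```

In this toy case the linear candidate yields counterexamples at the higher speeds, while the braking-distance candidate yields none on the sampled states. In the learner-critic loop described in the abstract, such counterexamples are what the learner uses to adjust the CBF.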

Citations: 11

DexPoint: Generalizable Point Cloud Reinforcement Learning for Sim-to-Real Dexterous Manipulation

Conference on Robot Learning

Pub Date : 2022-11-17 DOI: 10.48550/arXiv.2211.09423

Yuzhe Qin, Binghao Huang, Zhao-Heng Yin, Hao Su, Xiaolong Wang

We propose a sim-to-real framework for dexterous manipulation which can generalize to new objects of the same category in the real world. The key to our framework is to train the manipulation policy with point cloud inputs and dexterous hands. We propose two new techniques to enable joint learning on multiple objects and sim-to-real generalization: (i) using imagined hand point clouds as augmented inputs; and (ii) designing novel contact-based rewards. We empirically evaluate our method using an Allegro Hand to grasp novel objects in both simulation and the real world. To the best of our knowledge, this is the first policy learning-based framework that achieves such generalization results with dexterous hands. Our project page is available at https://yzqin.github.io/dexpoint

Citations: 24

TAX-Pose: Task-Specific Cross-Pose Estimation for Robot Manipulation

Conference on Robot Learning

Pub Date : 2022-11-17 DOI: 10.48550/arXiv.2211.09325

Chuer Pan, Brian Okorn, Harry Zhang, Ben Eisner, David Held

How do we imbue robots with the ability to efficiently manipulate unseen objects and transfer relevant skills based on demonstrations? End-to-end learning methods often fail to generalize to novel objects or unseen configurations. Instead, we focus on the task-specific pose relationship between relevant parts of interacting objects. We conjecture that this relationship is a generalizable notion of a manipulation task that can transfer to new objects in the same category; examples include the relationship between the pose of a pan relative to an oven or the pose of a mug relative to a mug rack. We call this task-specific pose relationship "cross-pose" and provide a mathematical definition of this concept. We propose a vision-based system that learns to estimate the cross-pose between two objects for a given manipulation task using learned cross-object correspondences. The estimated cross-pose is then used to guide a downstream motion planner to manipulate the objects into the desired pose relationship (placing a pan into the oven or the mug onto the mug rack). We demonstrate our method's capability to generalize to unseen objects, in some cases after training on only 10 demonstrations in the real world. Results show that our system achieves state-of-the-art performance in both simulated and real-world experiments across a number of tasks. Supplementary information and videos can be found at https://sites.google.com/view/tax-pose/home.

Citations: 10

SE(3)-Equivariant Relational Rearrangement with Neural Descriptor Fields

Conference on Robot Learning

Pub Date : 2022-11-17 DOI: 10.48550/arXiv.2211.09786

A. Simeonov, Yilun Du, Lin Yen-Chen, Alberto Rodriguez, L. Kaelbling, Tomas Lozano-Perez, Pulkit Agrawal

We present a method for performing tasks involving spatial relations between novel object instances initialized in arbitrary poses directly from point cloud observations. Our framework provides a scalable way for specifying new tasks using only 5-10 demonstrations. Object rearrangement is formalized as the question of finding actions that configure task-relevant parts of the object in a desired alignment. This formalism is implemented in three steps: assigning a consistent local coordinate frame to the task-relevant object parts, determining the location and orientation of this coordinate frame on unseen object instances, and executing an action that brings these frames into the desired alignment. We overcome the key technical challenge of determining task-relevant local coordinate frames from a few demonstrations by developing an optimization method based on Neural Descriptor Fields (NDFs) and a single annotated 3D keypoint. An energy-based learning scheme to model the joint configuration of the objects that satisfies a desired relational task further improves performance. The method is tested on three multi-object rearrangement tasks in simulation and on a real robot. Project website, videos, and code: https://anthonysimeonov.github.io/r-ndf/

Citations: 23

Interpretable Self-Aware Neural Networks for Robust Trajectory Prediction

Conference on Robot Learning

Pub Date : 2022-11-16 DOI: 10.48550/arXiv.2211.08701

Masha Itkina, Mykel J. Kochenderfer

Although neural networks have seen tremendous success as predictive models in a variety of domains, they can be overly confident in their predictions on out-of-distribution (OOD) data. To be viable for safety-critical applications, like autonomous vehicles, neural networks must accurately estimate their epistemic or model uncertainty, achieving a level of system self-awareness. Techniques for epistemic uncertainty quantification often require OOD data during training or multiple neural network forward passes during inference. These approaches may not be suitable for real-time performance on high-dimensional inputs. Furthermore, existing methods lack interpretability of the estimated uncertainty, which limits their usefulness both to engineers for further system development and to downstream modules in the autonomy stack. We propose the use of evidential deep learning to estimate the epistemic uncertainty over a low-dimensional, interpretable latent space in a trajectory prediction setting. We introduce an interpretable paradigm for trajectory prediction that distributes the uncertainty among the semantic concepts: past agent behavior, road structure, and social context. We validate our approach on real-world autonomous driving data, demonstrating superior performance over state-of-the-art baselines. Our code is available at: https://github.com/sisl/InterpretableSelfAwarePrediction.

Citations: 4

ToolFlowNet: Robotic Manipulation with Tools via Predicting Tool Flow from Point Clouds

Conference on Robot Learning

Pub Date : 2022-11-16 DOI: 10.48550/arXiv.2211.09006

Daniel Seita, Yufei Wang, Sarthak J. Shetty, Edward Li, Zackory M. Erickson, David Held

Point clouds are a widely available and canonical data modality which convey the 3D geometry of a scene. Despite significant progress in classification and segmentation from point clouds, policy learning from such a modality remains challenging, and most prior works in imitation learning focus on learning policies from images or state information. In this paper, we propose a novel framework for learning policies from point clouds for robotic manipulation with tools. We use a novel neural network, ToolFlowNet, which predicts dense per-point flow on the tool that the robot controls, and then uses the flow to derive the transformation that the robot should execute. We apply this framework to imitation learning of challenging deformable object manipulation tasks with continuous movement of tools, including scooping and pouring, and demonstrate significantly improved performance over baselines which do not use flow. We perform 50 physical scooping experiments with ToolFlowNet and attain 82% scooping success. See https://tinyurl.com/toolflownet for supplementary material.
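
The abstract does not say how the transformation is derived from the predicted flow. One common choice for this kind of step is a least-squares rigid fit between the tool points and the same points displaced by their flow. The sketch below illustrates that idea in 2D only, where the fit has a closed form; a 3D version would typically use an SVD-based alignment instead. The function name and the synthetic data are assumptions made for illustration, not ToolFlowNet's code.

```python
import math

def rigid_from_flow_2d(points, flow):
    """Least-squares rotation angle and translation mapping points to points + flow."""
    n = len(points)
    targets = [(px + fx, py + fy) for (px, py), (fx, fy) in zip(points, flow)]
    # Centroids of the source points and of the flow-displaced points.
    cpx = sum(p[0] for p in points) / n
    cpy = sum(p[1] for p in points) / n
    cqx = sum(q[0] for q in targets) / n
    cqy = sum(q[1] for q in targets) / n
    dot = 0.0
    cross = 0.0
    for (px, py), (qx, qy) in zip(points, targets):
        ax, ay = px - cpx, py - cpy
        bx, by = qx - cqx, qy - cqy
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx
    theta = math.atan2(cross, dot)      # optimal rotation angle in 2D
    c, s = math.cos(theta), math.sin(theta)
    tx = cqx - (c * cpx - s * cpy)      # translation aligns the rotated centroid
    ty = cqy - (s * cpx + c * cpy)
    return theta, (tx, ty)

# Synthetic check: displace four tool points by a known rigid motion.
tool_points = [(0.0, 0.0), (1.0, 0.0), (1.0, 2.0), (0.0, 2.0)]
true_theta, true_t = math.pi / 6, (1.0, -2.0)
ct, st = math.cos(true_theta), math.sin(true_theta)
tool_flow = [
    (ct * x - st * y + true_t[0] - x, st * x + ct * y + true_t[1] - y)
    for x, y in tool_points
]
theta, t = rigid_from_flow_2d(tool_points, tool_flow)
```

Fitting a single rigid motion to dense per-point flow averages out per-point prediction noise, which is a plausible reason to predict flow first rather than regress the transformation directly.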

Citations: 15

Towards Long-Tailed 3D Detection

Conference on Robot Learning

Pub Date : 2022-11-16 DOI: 10.48550/arXiv.2211.08691

Neehar Peri, Achal Dave, Deva Ramanan, Shu Kong

Contemporary autonomous vehicle (AV) benchmarks have advanced techniques for training 3D detectors, particularly on large-scale lidar data. Surprisingly, although semantic class labels naturally follow a long-tailed distribution, contemporary benchmarks focus on only a few common classes (e.g., pedestrian and car) and neglect many rare classes in-the-tail (e.g., debris and stroller). However, AVs must still detect rare classes to ensure safe operation. Moreover, semantic classes are often organized within a hierarchy, e.g., tail classes such as child and construction-worker are arguably subclasses of pedestrian. However, such hierarchical relationships are often ignored, which may lead to misleading estimates of performance and missed opportunities for algorithmic innovation. We address these challenges by formally studying the problem of Long-Tailed 3D Detection (LT3D), which evaluates on all classes, including those in-the-tail. We evaluate and innovate upon popular 3D detection codebases, such as CenterPoint and PointPillars, adapting them for LT3D. We develop hierarchical losses that promote feature sharing across common-vs-rare classes, as well as improved detection metrics that award partial credit to "reasonable" mistakes respecting the hierarchy (e.g., mistaking a child for an adult). Finally, we point out that fine-grained tail class accuracy is particularly improved via multimodal fusion of RGB images with LiDAR; simply put, small fine-grained classes are challenging to identify from sparse (lidar) geometry alone, suggesting that multimodal cues are crucial to long-tailed 3D detection. Our modifications improve accuracy by 5% AP on average for all classes, and dramatically improve AP for rare classes (e.g., stroller AP improves from 3.6 to 31.6)! Our code is available at https://github.com/neeharperi/LT3D

Citations: 4

Legged Locomotion in Challenging Terrains using Egocentric Vision

Conference on Robot Learning

Pub Date : 2022-11-14 DOI: 10.48550/arXiv.2211.07638

Ananye Agarwal, Ashish Kumar, Jitendra Malik, Deepak Pathak

Animals are capable of precise and agile locomotion using vision. Replicating this ability has been a long-standing goal in robotics. The traditional approach has been to decompose this problem into elevation mapping and foothold planning phases. The elevation mapping, however, is susceptible to failure and large noise artifacts, requires specialized hardware, and is biologically implausible. In this paper, we present the first end-to-end locomotion system capable of traversing stairs, curbs, stepping stones, and gaps. We show this result on a medium-sized quadruped robot using a single front-facing depth camera. The small size of the robot necessitates discovering specialized gait patterns not seen elsewhere. The egocentric camera requires the policy to remember past information to estimate the terrain under its hind feet. We train our policy in simulation. Training has two phases: in the first, we train a policy using reinforcement learning with a cheap-to-compute variant of the depth image; in the second, we distill it, using supervised learning, into the final policy that takes depth as input. The resulting policy transfers to the real world and is able to run in real-time on the limited compute of the robot. It can traverse a large variety of terrain while being robust to perturbations like pushes, slippery surfaces, and rocky terrain. Videos are at https://vision-locomotion.github.io

Citations: 65