Lei Cheng, Junpeng Hu, Haodong Yan, Mariia Gladkova, Tianyu Huang, Yun-Hui Liu, Daniel Cremers, Haoang Li
Photometric bundle adjustment (PBA) is widely used to estimate camera pose and 3D geometry under the assumption of a Lambertian world. However, this photometric-consistency assumption is often violated because non-diffuse reflection is common in real-world environments, and the resulting photometric inconsistency significantly degrades the reliability of existing PBA methods. To solve this problem, we propose a novel physically-based PBA method. Specifically, we introduce physically-based weights for material, illumination, and light path; these weights distinguish pixel pairs with different levels of photometric inconsistency. We also design corresponding models for material estimation from sequential images and illumination estimation from point clouds. In addition, we establish the first SLAM-related dataset of non-Lambertian scenes with complete ground truth for illumination and material. Extensive experiments demonstrate that our PBA method outperforms existing approaches in accuracy.
{"title":"Physically-Based Photometric Bundle Adjustment in Non-Lambertian Environments","authors":"Lei Cheng, Junpeng Hu, Haodong Yan, Mariia Gladkova, Tianyu Huang, Yun-Hui Liu, Daniel Cremers, Haoang Li","doi":"arxiv-2409.11854","DOIUrl":"https://doi.org/arxiv-2409.11854","url":null,"abstract":"Photometric bundle adjustment (PBA) is widely used in estimating the camera\u0000pose and 3D geometry by assuming a Lambertian world. However, the assumption of\u0000photometric consistency is often violated since the non-diffuse reflection is\u0000common in real-world environments. The photometric inconsistency significantly\u0000affects the reliability of existing PBA methods. To solve this problem, we\u0000propose a novel physically-based PBA method. Specifically, we introduce the\u0000physically-based weights regarding material, illumination, and light path.\u0000These weights distinguish the pixel pairs with different levels of photometric\u0000inconsistency. We also design corresponding models for material estimation\u0000based on sequential images and illumination estimation based on point clouds.\u0000In addition, we establish the first SLAM-related dataset of non-Lambertian\u0000scenes with complete ground truth of illumination and material. Extensive\u0000experiments demonstrated that our PBA method outperforms existing approaches in\u0000accuracy.","PeriodicalId":501031,"journal":{"name":"arXiv - CS - Robotics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266989","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bundle adjustment (BA) is a critical technique in various robotic applications, such as simultaneous localization and mapping (SLAM), augmented reality (AR), and photogrammetry. BA optimizes parameters such as camera poses and 3D landmarks to align them with observations. With the growing importance of deep learning in perception systems, there is an increasing need to integrate BA with deep learning frameworks for enhanced reliability and performance. However, widely used C++-based BA frameworks, such as GTSAM, g2o, and Ceres, lack native integration with modern deep learning libraries like PyTorch. This limitation affects their flexibility, adaptability, ease of debugging, and overall implementation efficiency. To address this gap, we introduce an eager-mode BA framework seamlessly integrated with PyPose, providing PyTorch-compatible interfaces with high efficiency. Our approach includes GPU-accelerated, differentiable, and sparse operations designed for 2nd-order optimization, Lie group and Lie algebra operations, and linear solvers. Our eager-mode BA on GPU demonstrates substantial runtime efficiency, achieving an average speedup of 18.5×, 22×, and 23× compared to GTSAM, g2o, and Ceres, respectively.
{"title":"Bundle Adjustment in the Eager Mode","authors":"Zitong Zhan, Huan Xu, Zihang Fang, Xinpeng Wei, Yaoyu Hu, Chen Wang","doi":"arxiv-2409.12190","DOIUrl":"https://doi.org/arxiv-2409.12190","url":null,"abstract":"Bundle adjustment (BA) is a critical technique in various robotic\u0000applications, such as simultaneous localization and mapping (SLAM), augmented\u0000reality (AR), and photogrammetry. BA optimizes parameters such as camera poses\u0000and 3D landmarks to align them with observations. With the growing importance\u0000of deep learning in perception systems, there is an increasing need to\u0000integrate BA with deep learning frameworks for enhanced reliability and\u0000performance. However, widely-used C++-based BA frameworks, such as GTSAM,\u0000g$^2$o, and Ceres, lack native integration with modern deep learning libraries\u0000like PyTorch. This limitation affects their flexibility, adaptability, ease of\u0000debugging, and overall implementation efficiency. To address this gap, we\u0000introduce an eager-mode BA framework seamlessly integrated with PyPose,\u0000providing PyTorch-compatible interfaces with high efficiency. Our approach\u0000includes GPU-accelerated, differentiable, and sparse operations designed for\u00002nd-order optimization, Lie group and Lie algebra operations, and linear\u0000solvers. Our eager-mode BA on GPU demonstrates substantial runtime efficiency,\u0000achieving an average speedup of 18.5$times$, 22$times$, and 23$times$\u0000compared to GTSAM, g$^2$o, and Ceres, respectively.","PeriodicalId":501031,"journal":{"name":"arXiv - CS - Robotics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267024","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Imitation learning has proven to be a powerful tool for training complex visuomotor policies. However, current methods often require hundreds to thousands of expert demonstrations to handle high-dimensional visual observations. A key reason for this poor data efficiency is that visual representations are predominantly either pretrained on out-of-domain data or trained directly through a behavior cloning objective. In this work, we present DynaMo, a new in-domain, self-supervised method for learning visual representations. Given a set of expert demonstrations, we jointly learn a latent inverse dynamics model and a forward dynamics model over a sequence of image embeddings, predicting the next frame in latent space, without augmentations, contrastive sampling, or access to ground-truth actions. Importantly, DynaMo does not require any out-of-domain data such as Internet datasets or cross-embodied datasets. On a suite of six simulated and real environments, we show that representations learned with DynaMo significantly improve downstream imitation learning performance over prior self-supervised learning objectives and pretrained representations. Gains from using DynaMo hold across policy classes such as Behavior Transformer, Diffusion Policy, MLP, and nearest neighbors. Finally, we ablate over key components of DynaMo and measure their impact on downstream policy performance. Robot videos are best viewed at https://dynamo-ssl.github.io
{"title":"DynaMo: In-Domain Dynamics Pretraining for Visuo-Motor Control","authors":"Zichen Jeff Cui, Hengkai Pan, Aadhithya Iyer, Siddhant Haldar, Lerrel Pinto","doi":"arxiv-2409.12192","DOIUrl":"https://doi.org/arxiv-2409.12192","url":null,"abstract":"Imitation learning has proven to be a powerful tool for training complex\u0000visuomotor policies. However, current methods often require hundreds to\u0000thousands of expert demonstrations to handle high-dimensional visual\u0000observations. A key reason for this poor data efficiency is that visual\u0000representations are predominantly either pretrained on out-of-domain data or\u0000trained directly through a behavior cloning objective. In this work, we present\u0000DynaMo, a new in-domain, self-supervised method for learning visual\u0000representations. Given a set of expert demonstrations, we jointly learn a\u0000latent inverse dynamics model and a forward dynamics model over a sequence of\u0000image embeddings, predicting the next frame in latent space, without\u0000augmentations, contrastive sampling, or access to ground truth actions.\u0000Importantly, DynaMo does not require any out-of-domain data such as Internet\u0000datasets or cross-embodied datasets. On a suite of six simulated and real\u0000environments, we show that representations learned with DynaMo significantly\u0000improve downstream imitation learning performance over prior self-supervised\u0000learning objectives, and pretrained representations. Gains from using DynaMo\u0000hold across policy classes such as Behavior Transformer, Diffusion Policy, MLP,\u0000and nearest neighbors. Finally, we ablate over key components of DynaMo and\u0000measure its impact on downstream policy performance. Robot videos are best\u0000viewed at https://dynamo-ssl.github.io","PeriodicalId":501031,"journal":{"name":"arXiv - CS - Robotics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267022","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Re-identification (ReID) is a critical challenge in computer vision, predominantly studied in the context of pedestrians and vehicles. However, robust object-instance ReID, which has significant implications for tasks such as autonomous exploration, long-term perception, and scene understanding, remains underexplored. In this work, we address this gap by proposing a novel dual-path object-instance re-identification transformer architecture that integrates multimodal RGB and depth information. By leveraging depth data, we demonstrate improvements in ReID across scenes that are cluttered or have varying illumination conditions. Additionally, we develop a ReID-based localization framework that enables accurate camera localization and pose identification across different viewpoints. We validate our methods using two custom-built RGB-D datasets, as well as multiple sequences from the open-source TUM RGB-D dataset. Our approach demonstrates significant improvements in both object-instance ReID (mAP of 75.18) and localization accuracy (success rate of 83% on TUM RGB-D), highlighting the essential role of object ReID in advancing robotic perception. Our models, frameworks, and datasets have been made publicly available.
{"title":"Towards Global Localization using Multi-Modal Object-Instance Re-Identification","authors":"Aneesh Chavan, Vaibhav Agrawal, Vineeth Bhat, Sarthak Chittawar, Siddharth Srivastava, Chetan Arora, K Madhava Krishna","doi":"arxiv-2409.12002","DOIUrl":"https://doi.org/arxiv-2409.12002","url":null,"abstract":"Re-identification (ReID) is a critical challenge in computer vision,\u0000predominantly studied in the context of pedestrians and vehicles. However,\u0000robust object-instance ReID, which has significant implications for tasks such\u0000as autonomous exploration, long-term perception, and scene understanding,\u0000remains underexplored. In this work, we address this gap by proposing a novel\u0000dual-path object-instance re-identification transformer architecture that\u0000integrates multimodal RGB and depth information. By leveraging depth data, we\u0000demonstrate improvements in ReID across scenes that are cluttered or have\u0000varying illumination conditions. Additionally, we develop a ReID-based\u0000localization framework that enables accurate camera localization and pose\u0000identification across different viewpoints. We validate our methods using two\u0000custom-built RGB-D datasets, as well as multiple sequences from the open-source\u0000TUM RGB-D datasets. Our approach demonstrates significant improvements in both\u0000object instance ReID (mAP of 75.18) and localization accuracy (success rate of\u000083% on TUM-RGBD), highlighting the essential role of object ReID in advancing\u0000robotic perception. Our models, frameworks, and datasets have been made\u0000publicly available.","PeriodicalId":501031,"journal":{"name":"arXiv - CS - Robotics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267035","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jikai Ye, Wanze Li, Shiraz Khan, Gregory S. Chirikjian
Cloth state estimation is an important problem in robotics. It is essential for a robot to know the cloth's state accurately in order to manipulate it and execute tasks such as robotic dressing, stitching, and covering/uncovering human beings. However, estimating cloth state accurately remains challenging due to its high flexibility and self-occlusion. This paper proposes a diffusion model-based pipeline that formulates cloth state estimation as an image generation problem by representing the cloth state as an RGB image that describes the point-wise translation (translation map) between a pre-defined flattened mesh and the deformed mesh in a canonical space. We then train a conditional diffusion-based image generation model to predict the translation map based on an observation. Experiments are conducted in both simulation and the real world to validate the performance of our method. Results indicate that our method outperforms two recent methods in both accuracy and speed.
{"title":"RaggeDi: Diffusion-based State Estimation of Disordered Rags, Sheets, Towels and Blankets","authors":"Jikai Ye, Wanze Li, Shiraz Khan, Gregory S. Chirikjian","doi":"arxiv-2409.11831","DOIUrl":"https://doi.org/arxiv-2409.11831","url":null,"abstract":"Cloth state estimation is an important problem in robotics. It is essential\u0000for the robot to know the accurate state to manipulate cloth and execute tasks\u0000such as robotic dressing, stitching, and covering/uncovering human beings.\u0000However, estimating cloth state accurately remains challenging due to its high\u0000flexibility and self-occlusion. This paper proposes a diffusion model-based\u0000pipeline that formulates the cloth state estimation as an image generation\u0000problem by representing the cloth state as an RGB image that describes the\u0000point-wise translation (translation map) between a pre-defined flattened mesh\u0000and the deformed mesh in a canonical space. Then we train a conditional\u0000diffusion-based image generation model to predict the translation map based on\u0000an observation. Experiments are conducted in both simulation and the real world\u0000to validate the performance of our method. Results indicate that our method\u0000outperforms two recent methods in both accuracy and speed.","PeriodicalId":501031,"journal":{"name":"arXiv - CS - Robotics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266854","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cory M. Simon, Jeffrey Richley, Lucas Overbey, Darleen Perez-Lavin
Teams of mobile [aerial, ground, or aquatic] robots have applications in resource delivery, patrolling, information-gathering, agriculture, forest fire fighting, chemical plume source localization and mapping, and search-and-rescue. Robot teams traversing hazardous environments -- with e.g. rough terrain or seas, strong winds, or adversaries capable of attacking or capturing robots -- should plan and coordinate their trails in consideration of risks of disablement, destruction, or capture. Specifically, the robots should take the safest trails, coordinate their trails to cooperatively achieve the team-level objective with robustness to robot failures, and balance the reward from visiting locations against risks of robot losses. Herein, we consider bi-objective trail-planning for a mobile team of robots orienteering in a hazardous environment. The hazardous environment is abstracted as a directed graph whose arcs, when traversed by a robot, present known probabilities of survival. Each node of the graph offers a reward to the team if visited by a robot (which e.g. delivers a good to or images the node). We wish to search for the Pareto-optimal robot-team trail plans that maximize two [conflicting] team objectives: the expected (i) team reward and (ii) number of robots that survive the mission. A human decision-maker can then select trail plans that balance, according to their values, reward and robot survival. We implement ant colony optimization, guided by heuristics, to search for the Pareto-optimal set of robot team trail plans. As a case study, we illustrate with an information-gathering mission in an art museum.
{"title":"Bi-objective trail-planning for a robot team orienteering in a hazardous environment","authors":"Cory M. Simon, Jeffrey Richley, Lucas Overbey, Darleen Perez-Lavin","doi":"arxiv-2409.12114","DOIUrl":"https://doi.org/arxiv-2409.12114","url":null,"abstract":"Teams of mobile [aerial, ground, or aquatic] robots have applications in\u0000resource delivery, patrolling, information-gathering, agriculture, forest fire\u0000fighting, chemical plume source localization and mapping, and\u0000search-and-rescue. Robot teams traversing hazardous environments -- with e.g.\u0000rough terrain or seas, strong winds, or adversaries capable of attacking or\u0000capturing robots -- should plan and coordinate their trails in consideration of\u0000risks of disablement, destruction, or capture. Specifically, the robots should\u0000take the safest trails, coordinate their trails to cooperatively achieve the\u0000team-level objective with robustness to robot failures, and balance the reward\u0000from visiting locations against risks of robot losses. Herein, we consider\u0000bi-objective trail-planning for a mobile team of robots orienteering in a\u0000hazardous environment. The hazardous environment is abstracted as a directed\u0000graph whose arcs, when traversed by a robot, present known probabilities of\u0000survival. Each node of the graph offers a reward to the team if visited by a\u0000robot (which e.g. delivers a good to or images the node). We wish to search for\u0000the Pareto-optimal robot-team trail plans that maximize two [conflicting] team\u0000objectives: the expected (i) team reward and (ii) number of robots that survive\u0000the mission. A human decision-maker can then select trail plans that balance,\u0000according to their values, reward and robot survival. We implement ant colony\u0000optimization, guided by heuristics, to search for the Pareto-optimal set of\u0000robot team trail plans. As a case study, we illustrate with an\u0000information-gathering mission in an art museum.","PeriodicalId":501031,"journal":{"name":"arXiv - CS - Robotics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267028","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zhaxizhuoma, Pengan Chen, Ziniu Wu, Jiawei Sun, Dong Wang, Peng Zhou, Nieqing Cao, Yan Ding, Bin Zhao, Xuelong Li
This paper presents AlignBot, a novel framework designed to optimize VLM-powered customized task planning for household robots by effectively aligning with user reminders. In domestic settings, aligning task planning with user reminders poses significant challenges due to the limited quantity, diversity, and multimodal nature of the reminders. To address these challenges, AlignBot employs a fine-tuned LLaVA-7B model that functions as an adapter for GPT-4o. This adapter model internalizes diverse forms of user reminders, such as personalized preferences, corrective guidance, and contextual assistance, into structured, instruction-formatted cues that prompt GPT-4o to generate customized task plans. Additionally, AlignBot integrates a dynamic retrieval mechanism that selects task-relevant historical successes as prompts for GPT-4o, further enhancing task planning accuracy. To validate the effectiveness of AlignBot, experiments are conducted in real-world household environments constructed within the laboratory to replicate typical household settings. A multimodal dataset with over 1,500 entries derived from volunteer reminders is used for training and evaluation. The results demonstrate that AlignBot significantly improves customized task planning, outperforming existing LLM- and VLM-powered planners by interpreting and aligning with user reminders: it achieves an 86.8% success rate compared to 21.6% for the vanilla GPT-4o baseline, a 65-percentage-point improvement and over four times greater effectiveness. Supplementary materials are available at: https://yding25.com/AlignBot/
{"title":"AlignBot: Aligning VLM-powered Customized Task Planning with User Reminders Through Fine-Tuning for Household Robots","authors":"Zhaxizhuoma, Pengan Chen, Ziniu Wu, Jiawei Sun, Dong Wang, Peng Zhou, Nieqing Cao, Yan Ding, Bin Zhao, Xuelong Li","doi":"arxiv-2409.11905","DOIUrl":"https://doi.org/arxiv-2409.11905","url":null,"abstract":"This paper presents AlignBot, a novel framework designed to optimize\u0000VLM-powered customized task planning for household robots by effectively\u0000aligning with user reminders. In domestic settings, aligning task planning with\u0000user reminders poses significant challenges due to the limited quantity,\u0000diversity, and multimodal nature of the reminders. To address these challenges,\u0000AlignBot employs a fine-tuned LLaVA-7B model, functioning as an adapter for\u0000GPT-4o. This adapter model internalizes diverse forms of user reminders-such as\u0000personalized preferences, corrective guidance, and contextual assistance-into\u0000structured instruction-formatted cues that prompt GPT-4o in generating\u0000customized task plans. Additionally, AlignBot integrates a dynamic retrieval\u0000mechanism that selects task-relevant historical successes as prompts for\u0000GPT-4o, further enhancing task planning accuracy. To validate the effectiveness\u0000of AlignBot, experiments are conducted in real-world household environments,\u0000which are constructed within the laboratory to replicate typical household\u0000settings. A multimodal dataset with over 1,500 entries derived from volunteer\u0000reminders is used for training and evaluation. The results demonstrate that\u0000AlignBot significantly improves customized task planning, outperforming\u0000existing LLM- and VLM-powered planners by interpreting and aligning with user\u0000reminders, achieving 86.8% success rate compared to the vanilla GPT-4o baseline\u0000at 21.6%, reflecting a 65% improvement and over four times greater\u0000effectiveness. Supplementary materials are available at:\u0000https://yding25.com/AlignBot/","PeriodicalId":501031,"journal":{"name":"arXiv - CS - Robotics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266829","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Manuel Bianchi Bazzi, Asad Ali Shahid, Christopher Agia, John Alora, Marco Forgione, Dario Piga, Francesco Braghin, Marco Pavone, Loris Roveda
The landscape of Deep Learning has experienced a major shift with the pervasive adoption of Transformer-based architectures, particularly in Natural Language Processing (NLP). Novel avenues for physical applications, such as solving Partial Differential Equations and Image Vision, have been explored. However, in domains like robotics, where high non-linearity poses significant challenges, Transformer-based applications remain scarce. While Transformers have been used to provide robots with knowledge about high-level tasks, few efforts have been made to perform system identification. This paper proposes a novel methodology to learn a meta-dynamical model of a high-dimensional physical system, such as the Franka robotic arm, using a Transformer-based architecture without prior knowledge of the system's physical parameters. The objective is to predict quantities of interest (end-effector pose and joint positions) given the torque signals for each joint. This prediction can be useful as a component of Deep Model Predictive Control frameworks in robotics. The meta-model establishes the correlation between torques and positions and predicts the output for the complete trajectory. This work provides empirical evidence of the efficacy of the in-context learning paradigm, suggesting future improvements in learning the dynamics of robotic systems without explicit knowledge of physical parameters. Code, videos, and supplementary materials can be found at the project website: https://sites.google.com/view/robomorph/
{"title":"RoboMorph: In-Context Meta-Learning for Robot Dynamics Modeling","authors":"Manuel Bianchi Bazzi, Asad Ali Shahid, Christopher Agia, John Alora, Marco Forgione, Dario Piga, Francesco Braghin, Marco Pavone, Loris Roveda","doi":"arxiv-2409.11815","DOIUrl":"https://doi.org/arxiv-2409.11815","url":null,"abstract":"The landscape of Deep Learning has experienced a major shift with the\u0000pervasive adoption of Transformer-based architectures, particularly in Natural\u0000Language Processing (NLP). Novel avenues for physical applications, such as\u0000solving Partial Differential Equations and Image Vision, have been explored.\u0000However, in challenging domains like robotics, where high non-linearity poses\u0000significant challenges, Transformer-based applications are scarce. While\u0000Transformers have been used to provide robots with knowledge about high-level\u0000tasks, few efforts have been made to perform system identification. This paper\u0000proposes a novel methodology to learn a meta-dynamical model of a\u0000high-dimensional physical system, such as the Franka robotic arm, using a\u0000Transformer-based architecture without prior knowledge of the system's physical\u0000parameters. The objective is to predict quantities of interest (end-effector\u0000pose and joint positions) given the torque signals for each joint. This\u0000prediction can be useful as a component for Deep Model Predictive Control\u0000frameworks in robotics. The meta-model establishes the correlation between\u0000torques and positions and predicts the output for the complete trajectory. This\u0000work provides empirical evidence of the efficacy of the in-context learning\u0000paradigm, suggesting future improvements in learning the dynamics of robotic\u0000systems without explicit knowledge of physical parameters. Code, videos, and\u0000supplementary materials can be found at project website. See\u0000https://sites.google.com/view/robomorph/","PeriodicalId":501031,"journal":{"name":"arXiv - CS - Robotics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266855","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent advances in machine learning have paved the way for the development of musical and entertainment robots. However, human-robot cooperative instrument playing remains a challenge, particularly due to the intricate motor coordination and temporal synchronization it requires. In this paper, we propose a theoretical framework for human-robot cooperative piano playing based on non-verbal cues. First, we present a music improvisation model that employs a recurrent neural network (RNN) to predict appropriate chord progressions based on the human's melodic input. Second, we propose a behavior-adaptive controller to facilitate seamless temporal synchronization, allowing the cobot to generate harmonious acoustics. The collaboration takes into account the bidirectional information flow between the human and the robot. We have developed an entropy-based system to assess the quality of cooperation by analyzing the impact of different communication modalities during human-robot collaboration. Experiments demonstrate that our RNN-based improvisation achieves a 93% accuracy rate. Meanwhile, with the model-predictive-control (MPC) adaptive controller, the robot can respond to its human teammate in homophonic performances with real-time accompaniment. The proposed framework has been validated as effective in enabling humans and robots to collaborate on the artistic piano-playing task.
{"title":"Human-Robot Cooperative Piano Playing with Learning-Based Real-Time Music Accompaniment","authors":"Huijiang Wang, Xiaoping Zhang, Fumiya Iida","doi":"arxiv-2409.11952","DOIUrl":"https://doi.org/arxiv-2409.11952","url":null,"abstract":"Recent advances in machine learning have paved the way for the development of\u0000musical and entertainment robots. However, human-robot cooperative instrument\u0000playing remains a challenge, particularly due to the intricate motor\u0000coordination and temporal synchronization. In this paper, we propose a\u0000theoretical framework for human-robot cooperative piano playing based on\u0000non-verbal cues. First, we present a music improvisation model that employs a\u0000recurrent neural network (RNN) to predict appropriate chord progressions based\u0000on the human's melodic input. Second, we propose a behavior-adaptive controller\u0000to facilitate seamless temporal synchronization, allowing the cobot to generate\u0000harmonious acoustics. The collaboration takes into account the bidirectional\u0000information flow between the human and robot. We have developed an\u0000entropy-based system to assess the quality of cooperation by analyzing the\u0000impact of different communication modalities during human-robot collaboration.\u0000Experiments demonstrate that our RNN-based improvisation can achieve a 93%\u0000accuracy rate. Meanwhile, with the MPC adaptive controller, the robot could\u0000respond to the human teammate in homophony performances with real-time\u0000accompaniment. Our designed framework has been validated to be effective in\u0000allowing humans and robots to work collaboratively in the artistic\u0000piano-playing task.","PeriodicalId":501031,"journal":{"name":"arXiv - CS - Robotics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266822","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Alessandro Saviolo, Niko Picello, Rishabh Verma, Giuseppe Loianno
Reactive collision avoidance is essential for agile robots navigating complex and dynamic environments, enabling real-time obstacle response. However, this task is inherently challenging because it requires a tight integration of perception, planning, and control, which traditional methods often handle separately, resulting in compounded errors and delays. This paper introduces a novel approach that unifies these tasks into a single reactive framework using solely onboard sensing and computing. Our method combines nonlinear model predictive control with adaptive control barrier functions, directly linking perception-driven constraints to real-time planning and control. Constraints are determined by using a neural network to refine noisy RGB-D data, enhancing depth accuracy, and selecting points with the minimum time-to-collision to prioritize the most immediate threats. To maintain a balance between safety and agility, a heuristic dynamically adjusts the optimization process, preventing overconstraints in real time. Extensive experiments with an agile quadrotor demonstrate effective collision avoidance across diverse indoor and outdoor environments, without requiring environment-specific tuning or explicit mapping.
{"title":"Reactive Collision Avoidance for Safe Agile Navigation","authors":"Alessandro Saviolo, Niko Picello, Rishabh Verma, Giuseppe Loianno","doi":"arxiv-2409.11962","DOIUrl":"https://doi.org/arxiv-2409.11962","url":null,"abstract":"Reactive collision avoidance is essential for agile robots navigating complex\u0000and dynamic environments, enabling real-time obstacle response. However, this\u0000task is inherently challenging because it requires a tight integration of\u0000perception, planning, and control, which traditional methods often handle\u0000separately, resulting in compounded errors and delays. This paper introduces a\u0000novel approach that unifies these tasks into a single reactive framework using\u0000solely onboard sensing and computing. Our method combines nonlinear model\u0000predictive control with adaptive control barrier functions, directly linking\u0000perception-driven constraints to real-time planning and control. Constraints\u0000are determined by using a neural network to refine noisy RGB-D data, enhancing\u0000depth accuracy, and selecting points with the minimum time-to-collision to\u0000prioritize the most immediate threats. To maintain a balance between safety and\u0000agility, a heuristic dynamically adjusts the optimization process, preventing\u0000overconstraints in real time. Extensive experiments with an agile quadrotor\u0000demonstrate effective collision avoidance across diverse indoor and outdoor\u0000environments, without requiring environment-specific tuning or explicit\u0000mapping.","PeriodicalId":501031,"journal":{"name":"arXiv - CS - Robotics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142269795","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}