Lei Cheng, Junpeng Hu, Haodong Yan, Mariia Gladkova, Tianyu Huang, Yun-Hui Liu, Daniel Cremers, Haoang Li
Photometric bundle adjustment (PBA) is widely used to estimate camera pose and 3D geometry under the assumption of a Lambertian world. However, this photometric-consistency assumption is often violated because non-diffuse reflection is common in real-world environments, and the resulting photometric inconsistency significantly degrades the reliability of existing PBA methods. To solve this problem, we propose a novel physically-based PBA method. Specifically, we introduce physically-based weights for material, illumination, and light path; these weights distinguish pixel pairs with different levels of photometric inconsistency. We also design corresponding models for material estimation from sequential images and illumination estimation from point clouds. In addition, we establish the first SLAM-related dataset of non-Lambertian scenes with complete ground truth for illumination and material. Extensive experiments demonstrate that our PBA method outperforms existing approaches in accuracy.
{"title":"Physically-Based Photometric Bundle Adjustment in Non-Lambertian Environments","authors":"Lei Cheng, Junpeng Hu, Haodong Yan, Mariia Gladkova, Tianyu Huang, Yun-Hui Liu, Daniel Cremers, Haoang Li","doi":"arxiv-2409.11854","DOIUrl":"https://doi.org/arxiv-2409.11854","url":null,"abstract":"Photometric bundle adjustment (PBA) is widely used in estimating the camera\u0000pose and 3D geometry by assuming a Lambertian world. However, the assumption of\u0000photometric consistency is often violated since the non-diffuse reflection is\u0000common in real-world environments. The photometric inconsistency significantly\u0000affects the reliability of existing PBA methods. To solve this problem, we\u0000propose a novel physically-based PBA method. Specifically, we introduce the\u0000physically-based weights regarding material, illumination, and light path.\u0000These weights distinguish the pixel pairs with different levels of photometric\u0000inconsistency. We also design corresponding models for material estimation\u0000based on sequential images and illumination estimation based on point clouds.\u0000In addition, we establish the first SLAM-related dataset of non-Lambertian\u0000scenes with complete ground truth of illumination and material. Extensive\u0000experiments demonstrated that our PBA method outperforms existing approaches in\u0000accuracy.","PeriodicalId":501031,"journal":{"name":"arXiv - CS - Robotics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266989","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bundle adjustment (BA) is a critical technique in various robotic applications, such as simultaneous localization and mapping (SLAM), augmented reality (AR), and photogrammetry. BA optimizes parameters such as camera poses and 3D landmarks to align them with observations. With the growing importance of deep learning in perception systems, there is an increasing need to integrate BA with deep learning frameworks for enhanced reliability and performance. However, widely used C++-based BA frameworks, such as GTSAM, g2o, and Ceres, lack native integration with modern deep learning libraries like PyTorch. This limitation affects their flexibility, adaptability, ease of debugging, and overall implementation efficiency. To address this gap, we introduce an eager-mode BA framework seamlessly integrated with PyPose, providing PyTorch-compatible interfaces with high efficiency. Our approach includes GPU-accelerated, differentiable, and sparse operations designed for 2nd-order optimization, Lie group and Lie algebra operations, and linear solvers. Our eager-mode BA on GPU demonstrates substantial runtime efficiency, achieving an average speedup of 18.5×, 22×, and 23× compared to GTSAM, g2o, and Ceres, respectively.
{"title":"Bundle Adjustment in the Eager Mode","authors":"Zitong Zhan, Huan Xu, Zihang Fang, Xinpeng Wei, Yaoyu Hu, Chen Wang","doi":"arxiv-2409.12190","DOIUrl":"https://doi.org/arxiv-2409.12190","url":null,"abstract":"Bundle adjustment (BA) is a critical technique in various robotic\u0000applications, such as simultaneous localization and mapping (SLAM), augmented\u0000reality (AR), and photogrammetry. BA optimizes parameters such as camera poses\u0000and 3D landmarks to align them with observations. With the growing importance\u0000of deep learning in perception systems, there is an increasing need to\u0000integrate BA with deep learning frameworks for enhanced reliability and\u0000performance. However, widely-used C++-based BA frameworks, such as GTSAM,\u0000g$^2$o, and Ceres, lack native integration with modern deep learning libraries\u0000like PyTorch. This limitation affects their flexibility, adaptability, ease of\u0000debugging, and overall implementation efficiency. To address this gap, we\u0000introduce an eager-mode BA framework seamlessly integrated with PyPose,\u0000providing PyTorch-compatible interfaces with high efficiency. Our approach\u0000includes GPU-accelerated, differentiable, and sparse operations designed for\u00002nd-order optimization, Lie group and Lie algebra operations, and linear\u0000solvers. Our eager-mode BA on GPU demonstrates substantial runtime efficiency,\u0000achieving an average speedup of 18.5$times$, 22$times$, and 23$times$\u0000compared to GTSAM, g$^2$o, and Ceres, respectively.","PeriodicalId":501031,"journal":{"name":"arXiv - CS - Robotics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267024","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Imitation learning has proven to be a powerful tool for training complex visuomotor policies. However, current methods often require hundreds to thousands of expert demonstrations to handle high-dimensional visual observations. A key reason for this poor data efficiency is that visual representations are predominantly either pretrained on out-of-domain data or trained directly through a behavior cloning objective. In this work, we present DynaMo, a new in-domain, self-supervised method for learning visual representations. Given a set of expert demonstrations, we jointly learn a latent inverse dynamics model and a forward dynamics model over a sequence of image embeddings, predicting the next frame in latent space, without augmentations, contrastive sampling, or access to ground-truth actions. Importantly, DynaMo does not require any out-of-domain data such as Internet datasets or cross-embodied datasets. On a suite of six simulated and real environments, we show that representations learned with DynaMo significantly improve downstream imitation learning performance over prior self-supervised learning objectives and pretrained representations. Gains from using DynaMo hold across policy classes such as Behavior Transformer, Diffusion Policy, MLP, and nearest neighbors. Finally, we ablate over key components of DynaMo and measure their impact on downstream policy performance. Robot videos are best viewed at https://dynamo-ssl.github.io
{"title":"DynaMo: In-Domain Dynamics Pretraining for Visuo-Motor Control","authors":"Zichen Jeff Cui, Hengkai Pan, Aadhithya Iyer, Siddhant Haldar, Lerrel Pinto","doi":"arxiv-2409.12192","DOIUrl":"https://doi.org/arxiv-2409.12192","url":null,"abstract":"Imitation learning has proven to be a powerful tool for training complex\u0000visuomotor policies. However, current methods often require hundreds to\u0000thousands of expert demonstrations to handle high-dimensional visual\u0000observations. A key reason for this poor data efficiency is that visual\u0000representations are predominantly either pretrained on out-of-domain data or\u0000trained directly through a behavior cloning objective. In this work, we present\u0000DynaMo, a new in-domain, self-supervised method for learning visual\u0000representations. Given a set of expert demonstrations, we jointly learn a\u0000latent inverse dynamics model and a forward dynamics model over a sequence of\u0000image embeddings, predicting the next frame in latent space, without\u0000augmentations, contrastive sampling, or access to ground truth actions.\u0000Importantly, DynaMo does not require any out-of-domain data such as Internet\u0000datasets or cross-embodied datasets. On a suite of six simulated and real\u0000environments, we show that representations learned with DynaMo significantly\u0000improve downstream imitation learning performance over prior self-supervised\u0000learning objectives, and pretrained representations. Gains from using DynaMo\u0000hold across policy classes such as Behavior Transformer, Diffusion Policy, MLP,\u0000and nearest neighbors. Finally, we ablate over key components of DynaMo and\u0000measure its impact on downstream policy performance. Robot videos are best\u0000viewed at https://dynamo-ssl.github.io","PeriodicalId":501031,"journal":{"name":"arXiv - CS - Robotics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267022","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Re-identification (ReID) is a critical challenge in computer vision, predominantly studied in the context of pedestrians and vehicles. However, robust object-instance ReID, which has significant implications for tasks such as autonomous exploration, long-term perception, and scene understanding, remains underexplored. In this work, we address this gap by proposing a novel dual-path object-instance re-identification transformer architecture that integrates multimodal RGB and depth information. By leveraging depth data, we demonstrate improvements in ReID across scenes that are cluttered or have varying illumination conditions. Additionally, we develop a ReID-based localization framework that enables accurate camera localization and pose identification across different viewpoints. We validate our methods using two custom-built RGB-D datasets, as well as multiple sequences from the open-source TUM RGB-D dataset. Our approach demonstrates significant improvements in both object-instance ReID (mAP of 75.18) and localization accuracy (success rate of 83% on TUM RGB-D), highlighting the essential role of object ReID in advancing robotic perception. Our models, frameworks, and datasets have been made publicly available.
{"title":"Towards Global Localization using Multi-Modal Object-Instance Re-Identification","authors":"Aneesh Chavan, Vaibhav Agrawal, Vineeth Bhat, Sarthak Chittawar, Siddharth Srivastava, Chetan Arora, K Madhava Krishna","doi":"arxiv-2409.12002","DOIUrl":"https://doi.org/arxiv-2409.12002","url":null,"abstract":"Re-identification (ReID) is a critical challenge in computer vision,\u0000predominantly studied in the context of pedestrians and vehicles. However,\u0000robust object-instance ReID, which has significant implications for tasks such\u0000as autonomous exploration, long-term perception, and scene understanding,\u0000remains underexplored. In this work, we address this gap by proposing a novel\u0000dual-path object-instance re-identification transformer architecture that\u0000integrates multimodal RGB and depth information. By leveraging depth data, we\u0000demonstrate improvements in ReID across scenes that are cluttered or have\u0000varying illumination conditions. Additionally, we develop a ReID-based\u0000localization framework that enables accurate camera localization and pose\u0000identification across different viewpoints. We validate our methods using two\u0000custom-built RGB-D datasets, as well as multiple sequences from the open-source\u0000TUM RGB-D datasets. Our approach demonstrates significant improvements in both\u0000object instance ReID (mAP of 75.18) and localization accuracy (success rate of\u000083% on TUM-RGBD), highlighting the essential role of object ReID in advancing\u0000robotic perception. Our models, frameworks, and datasets have been made\u0000publicly available.","PeriodicalId":501031,"journal":{"name":"arXiv - CS - Robotics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267035","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jikai Ye, Wanze Li, Shiraz Khan, Gregory S. Chirikjian
Cloth state estimation is an important problem in robotics. It is essential for a robot to know the cloth's state accurately in order to manipulate it and execute tasks such as robotic dressing, stitching, and covering/uncovering human beings. However, estimating cloth state accurately remains challenging due to its high flexibility and self-occlusion. This paper proposes a diffusion model-based pipeline that formulates cloth state estimation as an image generation problem by representing the cloth state as an RGB image that describes the point-wise translation (translation map) between a pre-defined flattened mesh and the deformed mesh in a canonical space. We then train a conditional diffusion-based image generation model to predict the translation map based on an observation. Experiments are conducted in both simulation and the real world to validate the performance of our method. Results indicate that our method outperforms two recent methods in both accuracy and speed.
{"title":"RaggeDi: Diffusion-based State Estimation of Disordered Rags, Sheets, Towels and Blankets","authors":"Jikai Ye, Wanze Li, Shiraz Khan, Gregory S. Chirikjian","doi":"arxiv-2409.11831","DOIUrl":"https://doi.org/arxiv-2409.11831","url":null,"abstract":"Cloth state estimation is an important problem in robotics. It is essential\u0000for the robot to know the accurate state to manipulate cloth and execute tasks\u0000such as robotic dressing, stitching, and covering/uncovering human beings.\u0000However, estimating cloth state accurately remains challenging due to its high\u0000flexibility and self-occlusion. This paper proposes a diffusion model-based\u0000pipeline that formulates the cloth state estimation as an image generation\u0000problem by representing the cloth state as an RGB image that describes the\u0000point-wise translation (translation map) between a pre-defined flattened mesh\u0000and the deformed mesh in a canonical space. Then we train a conditional\u0000diffusion-based image generation model to predict the translation map based on\u0000an observation. Experiments are conducted in both simulation and the real world\u0000to validate the performance of our method. Results indicate that our method\u0000outperforms two recent methods in both accuracy and speed.","PeriodicalId":501031,"journal":{"name":"arXiv - CS - Robotics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266854","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cory M. Simon, Jeffrey Richley, Lucas Overbey, Darleen Perez-Lavin
Teams of mobile [aerial, ground, or aquatic] robots have applications in resource delivery, patrolling, information-gathering, agriculture, forest fire fighting, chemical plume source localization and mapping, and search-and-rescue. Robot teams traversing hazardous environments -- with e.g. rough terrain or seas, strong winds, or adversaries capable of attacking or capturing robots -- should plan and coordinate their trails in consideration of risks of disablement, destruction, or capture. Specifically, the robots should take the safest trails, coordinate their trails to cooperatively achieve the team-level objective with robustness to robot failures, and balance the reward from visiting locations against risks of robot losses. Herein, we consider bi-objective trail-planning for a mobile team of robots orienteering in a hazardous environment. The hazardous environment is abstracted as a directed graph whose arcs, when traversed by a robot, present known probabilities of survival. Each node of the graph offers a reward to the team if visited by a robot (which e.g. delivers a good to or images the node). We wish to search for the Pareto-optimal robot-team trail plans that maximize two [conflicting] team objectives: the expected (i) team reward and (ii) number of robots that survive the mission. A human decision-maker can then select trail plans that balance, according to their values, reward and robot survival. We implement ant colony optimization, guided by heuristics, to search for the Pareto-optimal set of robot team trail plans. As a case study, we illustrate with an information-gathering mission in an art museum.
{"title":"Bi-objective trail-planning for a robot team orienteering in a hazardous environment","authors":"Cory M. Simon, Jeffrey Richley, Lucas Overbey, Darleen Perez-Lavin","doi":"arxiv-2409.12114","DOIUrl":"https://doi.org/arxiv-2409.12114","url":null,"abstract":"Teams of mobile [aerial, ground, or aquatic] robots have applications in\u0000resource delivery, patrolling, information-gathering, agriculture, forest fire\u0000fighting, chemical plume source localization and mapping, and\u0000search-and-rescue. Robot teams traversing hazardous environments -- with e.g.\u0000rough terrain or seas, strong winds, or adversaries capable of attacking or\u0000capturing robots -- should plan and coordinate their trails in consideration of\u0000risks of disablement, destruction, or capture. Specifically, the robots should\u0000take the safest trails, coordinate their trails to cooperatively achieve the\u0000team-level objective with robustness to robot failures, and balance the reward\u0000from visiting locations against risks of robot losses. Herein, we consider\u0000bi-objective trail-planning for a mobile team of robots orienteering in a\u0000hazardous environment. The hazardous environment is abstracted as a directed\u0000graph whose arcs, when traversed by a robot, present known probabilities of\u0000survival. Each node of the graph offers a reward to the team if visited by a\u0000robot (which e.g. delivers a good to or images the node). We wish to search for\u0000the Pareto-optimal robot-team trail plans that maximize two [conflicting] team\u0000objectives: the expected (i) team reward and (ii) number of robots that survive\u0000the mission. A human decision-maker can then select trail plans that balance,\u0000according to their values, reward and robot survival. We implement ant colony\u0000optimization, guided by heuristics, to search for the Pareto-optimal set of\u0000robot team trail plans. As a case study, we illustrate with an\u0000information-gathering mission in an art museum.","PeriodicalId":501031,"journal":{"name":"arXiv - CS - Robotics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267028","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zhaxizhuoma, Pengan Chen, Ziniu Wu, Jiawei Sun, Dong Wang, Peng Zhou, Nieqing Cao, Yan Ding, Bin Zhao, Xuelong Li
This paper presents AlignBot, a novel framework designed to optimize VLM-powered customized task planning for household robots by effectively aligning with user reminders. In domestic settings, aligning task planning with user reminders poses significant challenges due to the limited quantity, diversity, and multimodal nature of the reminders. To address these challenges, AlignBot employs a fine-tuned LLaVA-7B model that functions as an adapter for GPT-4o. This adapter model internalizes diverse forms of user reminders, such as personalized preferences, corrective guidance, and contextual assistance, into structured, instruction-formatted cues that prompt GPT-4o to generate customized task plans. Additionally, AlignBot integrates a dynamic retrieval mechanism that selects task-relevant historical successes as prompts for GPT-4o, further enhancing task planning accuracy. To validate the effectiveness of AlignBot, experiments are conducted in real-world household environments constructed within the laboratory to replicate typical household settings. A multimodal dataset with over 1,500 entries derived from volunteer reminders is used for training and evaluation. The results demonstrate that AlignBot significantly improves customized task planning, outperforming existing LLM- and VLM-powered planners by interpreting and aligning with user reminders: it achieves an 86.8% success rate compared to 21.6% for the vanilla GPT-4o baseline, a 65-percentage-point improvement and over four times greater effectiveness. Supplementary materials are available at: https://yding25.com/AlignBot/
{"title":"AlignBot: Aligning VLM-powered Customized Task Planning with User Reminders Through Fine-Tuning for Household Robots","authors":"Zhaxizhuoma, Pengan Chen, Ziniu Wu, Jiawei Sun, Dong Wang, Peng Zhou, Nieqing Cao, Yan Ding, Bin Zhao, Xuelong Li","doi":"arxiv-2409.11905","DOIUrl":"https://doi.org/arxiv-2409.11905","url":null,"abstract":"This paper presents AlignBot, a novel framework designed to optimize\u0000VLM-powered customized task planning for household robots by effectively\u0000aligning with user reminders. In domestic settings, aligning task planning with\u0000user reminders poses significant challenges due to the limited quantity,\u0000diversity, and multimodal nature of the reminders. To address these challenges,\u0000AlignBot employs a fine-tuned LLaVA-7B model, functioning as an adapter for\u0000GPT-4o. This adapter model internalizes diverse forms of user reminders-such as\u0000personalized preferences, corrective guidance, and contextual assistance-into\u0000structured instruction-formatted cues that prompt GPT-4o in generating\u0000customized task plans. Additionally, AlignBot integrates a dynamic retrieval\u0000mechanism that selects task-relevant historical successes as prompts for\u0000GPT-4o, further enhancing task planning accuracy. To validate the effectiveness\u0000of AlignBot, experiments are conducted in real-world household environments,\u0000which are constructed within the laboratory to replicate typical household\u0000settings. A multimodal dataset with over 1,500 entries derived from volunteer\u0000reminders is used for training and evaluation. The results demonstrate that\u0000AlignBot significantly improves customized task planning, outperforming\u0000existing LLM- and VLM-powered planners by interpreting and aligning with user\u0000reminders, achieving 86.8% success rate compared to the vanilla GPT-4o baseline\u0000at 21.6%, reflecting a 65% improvement and over four times greater\u0000effectiveness. Supplementary materials are available at:\u0000https://yding25.com/AlignBot/","PeriodicalId":501031,"journal":{"name":"arXiv - CS - Robotics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266829","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Manuel Bianchi Bazzi, Asad Ali Shahid, Christopher Agia, John Alora, Marco Forgione, Dario Piga, Francesco Braghin, Marco Pavone, Loris Roveda
The landscape of Deep Learning has experienced a major shift with the pervasive adoption of Transformer-based architectures, particularly in Natural Language Processing (NLP). Novel avenues for physical applications, such as solving Partial Differential Equations and Image Vision, have been explored. However, in domains like robotics, where high non-linearity poses significant challenges, Transformer-based applications remain scarce. While Transformers have been used to provide robots with knowledge about high-level tasks, few efforts have been made to perform system identification. This paper proposes a novel methodology to learn a meta-dynamical model of a high-dimensional physical system, such as the Franka robotic arm, using a Transformer-based architecture without prior knowledge of the system's physical parameters. The objective is to predict quantities of interest (end-effector pose and joint positions) given the torque signals for each joint. This prediction can be useful as a component of Deep Model Predictive Control frameworks in robotics. The meta-model establishes the correlation between torques and positions and predicts the output for the complete trajectory. This work provides empirical evidence of the efficacy of the in-context learning paradigm, suggesting future improvements in learning the dynamics of robotic systems without explicit knowledge of physical parameters. Code, videos, and supplementary materials can be found at the project website: https://sites.google.com/view/robomorph/
{"title":"RoboMorph: In-Context Meta-Learning for Robot Dynamics Modeling","authors":"Manuel Bianchi Bazzi, Asad Ali Shahid, Christopher Agia, John Alora, Marco Forgione, Dario Piga, Francesco Braghin, Marco Pavone, Loris Roveda","doi":"arxiv-2409.11815","DOIUrl":"https://doi.org/arxiv-2409.11815","url":null,"abstract":"The landscape of Deep Learning has experienced a major shift with the\u0000pervasive adoption of Transformer-based architectures, particularly in Natural\u0000Language Processing (NLP). Novel avenues for physical applications, such as\u0000solving Partial Differential Equations and Image Vision, have been explored.\u0000However, in challenging domains like robotics, where high non-linearity poses\u0000significant challenges, Transformer-based applications are scarce. While\u0000Transformers have been used to provide robots with knowledge about high-level\u0000tasks, few efforts have been made to perform system identification. This paper\u0000proposes a novel methodology to learn a meta-dynamical model of a\u0000high-dimensional physical system, such as the Franka robotic arm, using a\u0000Transformer-based architecture without prior knowledge of the system's physical\u0000parameters. The objective is to predict quantities of interest (end-effector\u0000pose and joint positions) given the torque signals for each joint. This\u0000prediction can be useful as a component for Deep Model Predictive Control\u0000frameworks in robotics. The meta-model establishes the correlation between\u0000torques and positions and predicts the output for the complete trajectory. This\u0000work provides empirical evidence of the efficacy of the in-context learning\u0000paradigm, suggesting future improvements in learning the dynamics of robotic\u0000systems without explicit knowledge of physical parameters. Code, videos, and\u0000supplementary materials can be found at project website. See\u0000https://sites.google.com/view/robomorph/","PeriodicalId":501031,"journal":{"name":"arXiv - CS - Robotics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266855","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent advances in machine learning have paved the way for the development of musical and entertainment robots. However, human-robot cooperative instrument playing remains a challenge, particularly due to the intricate motor coordination and temporal synchronization it requires. In this paper, we propose a theoretical framework for human-robot cooperative piano playing based on non-verbal cues. First, we present a music improvisation model that employs a recurrent neural network (RNN) to predict appropriate chord progressions based on the human's melodic input. Second, we propose a behavior-adaptive controller to facilitate seamless temporal synchronization, allowing the cobot to generate harmonious acoustics. The collaboration takes into account the bidirectional information flow between the human and the robot. We have developed an entropy-based system to assess the quality of cooperation by analyzing the impact of different communication modalities during human-robot collaboration. Experiments demonstrate that our RNN-based improvisation achieves a 93% accuracy rate. Meanwhile, with the model-predictive-control (MPC) adaptive controller, the robot can respond to its human teammate in homophonic performances with real-time accompaniment. The proposed framework has been validated as effective in enabling humans and robots to collaborate on the artistic piano-playing task.
{"title":"Human-Robot Cooperative Piano Playing with Learning-Based Real-Time Music Accompaniment","authors":"Huijiang Wang, Xiaoping Zhang, Fumiya Iida","doi":"arxiv-2409.11952","DOIUrl":"https://doi.org/arxiv-2409.11952","url":null,"abstract":"Recent advances in machine learning have paved the way for the development of\u0000musical and entertainment robots. However, human-robot cooperative instrument\u0000playing remains a challenge, particularly due to the intricate motor\u0000coordination and temporal synchronization. In this paper, we propose a\u0000theoretical framework for human-robot cooperative piano playing based on\u0000non-verbal cues. First, we present a music improvisation model that employs a\u0000recurrent neural network (RNN) to predict appropriate chord progressions based\u0000on the human's melodic input. Second, we propose a behavior-adaptive controller\u0000to facilitate seamless temporal synchronization, allowing the cobot to generate\u0000harmonious acoustics. The collaboration takes into account the bidirectional\u0000information flow between the human and robot. We have developed an\u0000entropy-based system to assess the quality of cooperation by analyzing the\u0000impact of different communication modalities during human-robot collaboration.\u0000Experiments demonstrate that our RNN-based improvisation can achieve a 93%\u0000accuracy rate. Meanwhile, with the MPC adaptive controller, the robot could\u0000respond to the human teammate in homophony performances with real-time\u0000accompaniment. Our designed framework has been validated to be effective in\u0000allowing humans and robots to work collaboratively in the artistic\u0000piano-playing task.","PeriodicalId":501031,"journal":{"name":"arXiv - CS - Robotics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266822","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Alessandro Saviolo, Niko Picello, Rishabh Verma, Giuseppe Loianno
Reactive collision avoidance is essential for agile robots navigating complex and dynamic environments, enabling real-time obstacle response. However, this task is inherently challenging because it requires a tight integration of perception, planning, and control, which traditional methods often handle separately, resulting in compounded errors and delays. This paper introduces a novel approach that unifies these tasks into a single reactive framework using solely onboard sensing and computing. Our method combines nonlinear model predictive control with adaptive control barrier functions, directly linking perception-driven constraints to real-time planning and control. Constraints are determined by using a neural network to refine noisy RGB-D data, enhancing depth accuracy, and selecting points with the minimum time-to-collision to prioritize the most immediate threats. To maintain a balance between safety and agility, a heuristic dynamically adjusts the optimization process, preventing overconstraints in real time. Extensive experiments with an agile quadrotor demonstrate effective collision avoidance across diverse indoor and outdoor environments, without requiring environment-specific tuning or explicit mapping.
{"title":"Reactive Collision Avoidance for Safe Agile Navigation","authors":"Alessandro Saviolo, Niko Picello, Rishabh Verma, Giuseppe Loianno","doi":"arxiv-2409.11962","DOIUrl":"https://doi.org/arxiv-2409.11962","url":null,"abstract":"Reactive collision avoidance is essential for agile robots navigating complex\u0000and dynamic environments, enabling real-time obstacle response. However, this\u0000task is inherently challenging because it requires a tight integration of\u0000perception, planning, and control, which traditional methods often handle\u0000separately, resulting in compounded errors and delays. This paper introduces a\u0000novel approach that unifies these tasks into a single reactive framework using\u0000solely onboard sensing and computing. Our method combines nonlinear model\u0000predictive control with adaptive control barrier functions, directly linking\u0000perception-driven constraints to real-time planning and control. Constraints\u0000are determined by using a neural network to refine noisy RGB-D data, enhancing\u0000depth accuracy, and selecting points with the minimum time-to-collision to\u0000prioritize the most immediate threats. To maintain a balance between safety and\u0000agility, a heuristic dynamically adjusts the optimization process, preventing\u0000overconstraints in real time. Extensive experiments with an agile quadrotor\u0000demonstrate effective collision avoidance across diverse indoor and outdoor\u0000environments, without requiring environment-specific tuning or explicit\u0000mapping.","PeriodicalId":501031,"journal":{"name":"arXiv - CS - Robotics","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142269795","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}