{"title":"GauTOAO: Gaussian-based Task-Oriented Affordance of Objects","authors":"Jiawen Wang, Dingsheng Luo","doi":"arxiv-2409.11941","DOIUrl":null,"url":null,"abstract":"When your robot grasps an object using dexterous hands or grippers, it should\nunderstand the Task-Oriented Affordances of the Object(TOAO), as different\ntasks often require attention to specific parts of the object. To address this\nchallenge, we propose GauTOAO, a Gaussian-based framework for Task-Oriented\nAffordance of Objects, which leverages vision-language models in a zero-shot\nmanner to predict affordance-relevant regions of an object, given a natural\nlanguage query. Our approach introduces a new paradigm: \"static camera, moving\nobject,\" allowing the robot to better observe and understand the object in hand\nduring manipulation. GauTOAO addresses the limitations of existing methods,\nwhich often lack effective spatial grouping, by extracting a comprehensive 3D\nobject mask using DINO features. This mask is then used to conditionally query\ngaussians, producing a refined semantic distribution over the object for the\nspecified task. This approach results in more accurate TOAO extraction,\nenhancing the robot's understanding of the object and improving task\nperformance. We validate the effectiveness of GauTOAO through real-world\nexperiments, demonstrating its capability to generalize across various tasks.","PeriodicalId":501031,"journal":{"name":"arXiv - CS - Robotics","volume":"49 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Robotics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.11941","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
When your robot grasps an object using dexterous hands or grippers, it should understand the Task-Oriented Affordances of the Object (TOAO), as different tasks often require attention to specific parts of the object. To address this challenge, we propose GauTOAO, a Gaussian-based framework for Task-Oriented Affordance of Objects, which leverages vision-language models in a zero-shot manner to predict the affordance-relevant regions of an object given a natural-language query. Our approach introduces a new paradigm, "static camera, moving object," allowing the robot to better observe and understand the object in hand during manipulation. GauTOAO addresses a limitation of existing methods, which often lack effective spatial grouping, by extracting a comprehensive 3D object mask using DINO features. This mask is then used to conditionally query the Gaussians, producing a refined semantic distribution over the object for the specified task. The result is more accurate TOAO extraction, enhancing the robot's understanding of the object and improving task performance. We validate the effectiveness of GauTOAO through real-world experiments, demonstrating its capability to generalize across various tasks.
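To make the "conditionally query the Gaussians" step concrete, the sketch below shows one plausible reading of the pipeline, not the authors' implementation. It assumes that per-Gaussian semantic features (e.g., distilled from a vision-language model), a DINO-derived binary object mask, and an embedding of the natural-language task query have already been computed; the function name `query_task_affordance` and all array shapes are hypothetical.

```python
import numpy as np

def query_task_affordance(gaussian_feats, object_mask, text_embedding):
    """Hypothetical sketch of the conditional query step.

    gaussian_feats : (N, D) per-Gaussian semantic features (assumed precomputed)
    object_mask    : (N,) boolean mask, True for Gaussians belonging to the object
    text_embedding : (D,) embedding of the natural-language task query
    Returns a relevance distribution over the object's Gaussians.
    """
    # Cosine similarity between each Gaussian's feature and the query embedding.
    feats = gaussian_feats / (np.linalg.norm(gaussian_feats, axis=1, keepdims=True) + 1e-8)
    query = text_embedding / (np.linalg.norm(text_embedding) + 1e-8)
    sim = feats @ query  # (N,)

    # Condition on the 3D object mask: Gaussians outside the object get zero weight.
    sim = np.where(object_mask, sim, -np.inf)

    # Softmax over the masked similarities -> semantic distribution over the object.
    sim = sim - sim[object_mask].max()
    weights = np.exp(sim)
    return weights / weights.sum()
```

Under these assumptions, the returned weights could be thresholded or visualized per Gaussian to highlight the task-relevant region of the in-hand object.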