利用空间和时间注意力构建类别图表征，实现视觉导航

IF 6 3区计算机科学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS ACM Transactions on Multimedia Computing Communications and Applications Pub Date : 2024-03-22 DOI:10.1145/3653714

Xiaobo Hu, Youfang Lin, HeHe Fan, Shuo Wang, Zhihao Wu, Kai Lv

{"title":"利用空间和时间注意力构建类别图表征，实现视觉导航","authors":"Xiaobo Hu, Youfang Lin, HeHe Fan, Shuo Wang, Zhihao Wu, Kai Lv","doi":"10.1145/3653714","DOIUrl":null,"url":null,"abstract":"<p>Given an object of interest, visual navigation aims to reach the object’s location based on a sequence of partial observations. To this end, an agent needs to 1) acquire specific knowledge about the relations of object categories in the world during training and 2) locate the target object based on the pre-learned object category relations and its trajectory in the current unseen environment. In this paper, we propose a Category Relation Graph (CRG) to learn the knowledge of object category layout relations and a Temporal-Spatial-Region attention (TSR) architecture to perceive the long-term spatial-temporal dependencies of objects, aiding navigation. We establish CRG to learn prior knowledge of object layout and deduce the positions of specific objects. Subsequently, we propose the TSR architecture to capture relationships among objects in temporal, spatial, and regions within observation trajectories. Specifically, we implement a Temporal attention module (T) to model the temporal structure of the observation sequence, implicitly encoding historical moving or trajectory information. Then, a Spatial attention module (S) uncovers the spatial context of the current observation objects based on CRG and past observations. Last, a Region attention module (R) shifts the attention to the target-relevant region. Leveraging the visual representation extracted by our method, the agent accurately perceives the environment and easily learns a superior navigation policy. Experiments on AI2-THOR demonstrate that our CRG-TSR method significantly outperforms existing methods in both effectiveness and efficiency. The supplementary material includes the code and will be publicly available.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"16 1","pages":""},"PeriodicalIF":6.0000,"publicationDate":"2024-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Building Category Graphs Representation with Spatial and Temporal Attention for Visual Navigation\",\"authors\":\"Xiaobo Hu, Youfang Lin, HeHe Fan, Shuo Wang, Zhihao Wu, Kai Lv\",\"doi\":\"10.1145/3653714\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>Given an object of interest, visual navigation aims to reach the object’s location based on a sequence of partial observations. To this end, an agent needs to 1) acquire specific knowledge about the relations of object categories in the world during training and 2) locate the target object based on the pre-learned object category relations and its trajectory in the current unseen environment. In this paper, we propose a Category Relation Graph (CRG) to learn the knowledge of object category layout relations and a Temporal-Spatial-Region attention (TSR) architecture to perceive the long-term spatial-temporal dependencies of objects, aiding navigation. We establish CRG to learn prior knowledge of object layout and deduce the positions of specific objects. Subsequently, we propose the TSR architecture to capture relationships among objects in temporal, spatial, and regions within observation trajectories. Specifically, we implement a Temporal attention module (T) to model the temporal structure of the observation sequence, implicitly encoding historical moving or trajectory information. Then, a Spatial attention module (S) uncovers the spatial context of the current observation objects based on CRG and past observations. Last, a Region attention module (R) shifts the attention to the target-relevant region. Leveraging the visual representation extracted by our method, the agent accurately perceives the environment and easily learns a superior navigation policy. Experiments on AI2-THOR demonstrate that our CRG-TSR method significantly outperforms existing methods in both effectiveness and efficiency. The supplementary material includes the code and will be publicly available.</p>\",\"PeriodicalId\":50937,\"journal\":{\"name\":\"ACM Transactions on Multimedia Computing Communications and Applications\",\"volume\":\"16 1\",\"pages\":\"\"},\"PeriodicalIF\":6.0000,\"publicationDate\":\"2024-03-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACM Transactions on Multimedia Computing Communications and Applications\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1145/3653714\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Multimedia Computing Communications and Applications","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3653714","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 0

摘要

给定一个感兴趣的物体，视觉导航的目的是根据一连串的部分观察结果找到该物体的位置。为此，代理需要：1）在训练过程中获取关于世界中物体类别关系的特定知识；2）根据预先学习的物体类别关系及其在当前未见环境中的轨迹定位目标物体。在本文中，我们提出了一个类别关系图（CRG）来学习物体类别布局关系的知识，并提出了一个时空区域注意（TSR）架构来感知物体的长期时空依赖关系，从而帮助导航。我们建立了 CRG 来学习物体布局的先验知识，并推断出特定物体的位置。随后，我们提出了 TSR 架构，以捕捉观察轨迹中物体在时间、空间和区域上的关系。具体来说，我们采用一个时间注意力模块（T）来模拟观察序列的时间结构，隐式编码历史移动或轨迹信息。然后，空间注意力模块（S）根据 CRG 和过去的观察结果，揭示当前观察对象的空间背景。最后，区域注意力模块（R）将注意力转移到目标相关区域。利用我们的方法所提取的视觉表征，代理可以准确地感知环境，并轻松地学习卓越的导航策略。在 AI2-THOR 上的实验表明，我们的 CRG-TSR 方法在效果和效率上都明显优于现有方法。补充材料包括代码，并将公开发布。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Building Category Graphs Representation with Spatial and Temporal Attention for Visual Navigation

Given an object of interest, visual navigation aims to reach the object’s location based on a sequence of partial observations. To this end, an agent needs to 1) acquire specific knowledge about the relations of object categories in the world during training and 2) locate the target object based on the pre-learned object category relations and its trajectory in the current unseen environment. In this paper, we propose a Category Relation Graph (CRG) to learn the knowledge of object category layout relations and a Temporal-Spatial-Region attention (TSR) architecture to perceive the long-term spatial-temporal dependencies of objects, aiding navigation. We establish CRG to learn prior knowledge of object layout and deduce the positions of specific objects. Subsequently, we propose the TSR architecture to capture relationships among objects in temporal, spatial, and regions within observation trajectories. Specifically, we implement a Temporal attention module (T) to model the temporal structure of the observation sequence, implicitly encoding historical moving or trajectory information. Then, a Spatial attention module (S) uncovers the spatial context of the current observation objects based on CRG and past observations. Last, a Region attention module (R) shifts the attention to the target-relevant region. Leveraging the visual representation extracted by our method, the agent accurately perceives the environment and easily learns a superior navigation policy. Experiments on AI2-THOR demonstrate that our CRG-TSR method significantly outperforms existing methods in both effectiveness and efficiency. The supplementary material includes the code and will be publicly available.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

ACM Transactions on Multimedia Computing Communications and Applications 工程技术-计算机：理论方法

CiteScore

8.50

自引率

5.90%

发文量

285

审稿时长

7.5 months

期刊介绍： The ACM Transactions on Multimedia Computing, Communications, and Applications is the flagship publication of the ACM Special Interest Group in Multimedia (SIGMM). It is soliciting paper submissions on all aspects of multimedia. Papers on single media (for instance, audio, video, animation) and their processing are also welcome. TOMM is a peer-reviewed, archival journal, available in both print form and digital form. The Journal is published quarterly; with roughly 7 23-page articles in each issue. In addition, all Special Issues are published online-only to ensure a timely publication. The transactions consists primarily of research papers. This is an archival journal and it is intended that the papers will have lasting importance and value over time. In general, papers whose primary focus is on particular multimedia products or the current state of the industry will not be included.