Transformer-based descriptors with fine-grained region supervisions for visual place recognition

IF 7.2 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Knowledge-Based Systems | Pub Date: 2023-09-19 | DOI: 10.1016/j.knosys.2023.110993
Yuwei Wang, Yuanying Qiu, Peitao Cheng, Junyu Zhang
{"title":"Transformer-based descriptors with fine-grained region supervisions for visual place recognition","authors":"Yuwei Wang,&nbsp;Yuanying Qiu,&nbsp;Peitao Cheng,&nbsp;Junyu Zhang","doi":"10.1016/j.knosys.2023.110993","DOIUrl":null,"url":null,"abstract":"<div><p>Visual place recognition is a fundamental component in autonomous systems<span> and robotics, which is easily limited in the real world with different viewpoints and changes in appearance. Existing approaches to tackle this problem mainly rely on dominant CNN-based architectures, which are difficult to model global correlation. Most recently, there has emerged little work that focuses on the effectiveness of the Transformer in modeling long-range dependencies, but this strategy ignores local interactions thus fails to localize the really important regions. To address the above issue, this paper proposes an effective Transformer-based architecture that takes full advantages of the strengths of Transformer related to global context modeling and local specific region capturing. We first design a dual-level Transformer descriptor encoder to successively perform self-attention within local windows and global extent of the CNN<span> feature map to obtain multi-scale spatial context, which combines local interaction and global information. Specifically, multi-layer classification tokens from the Transformer encoder are integrated to form the global image representation. Moreover, a Transformer-guided geometric verification module is introduced to leverage the strengths of the hierarchical Transformer’s inherent self-attention mechanism for fusing multi-level attention, which is employed to filter the output token to obtain key patches and associated attention weights for achieving spatial matching. Finally, we propose a descriptor refinement strategy that employs fine-grained region-level supervisions to further enhance the capability of the network to learn local discriminative features, which effectively alleviates the confusion caused from weak image-level labels. Extensive experiments on benchmark datasets show that our approach outperforms state-of-the-art methods with promising performance.</span></span></p></div>","PeriodicalId":49939,"journal":{"name":"Knowledge-Based Systems","volume":"280 ","pages":"Article 110993"},"PeriodicalIF":7.2000,"publicationDate":"2023-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Knowledge-Based Systems","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0950705123007438","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Visual place recognition is a fundamental component of autonomous systems and robotics, yet it is easily hindered in the real world by viewpoint differences and appearance changes. Existing approaches to this problem mainly rely on dominant CNN-based architectures, which struggle to model global correlation. Most recently, a small body of work has focused on the effectiveness of the Transformer in modeling long-range dependencies, but this strategy ignores local interactions and thus fails to localize the truly important regions. To address this issue, this paper proposes an effective Transformer-based architecture that takes full advantage of the Transformer's strengths in global context modeling and in capturing specific local regions. We first design a dual-level Transformer descriptor encoder that successively performs self-attention within local windows and over the global extent of the CNN feature map to obtain multi-scale spatial context, combining local interaction with global information. Specifically, multi-layer classification tokens from the Transformer encoder are integrated to form the global image representation. Moreover, a Transformer-guided geometric verification module is introduced to leverage the hierarchical Transformer's inherent self-attention mechanism for fusing multi-level attention, which is employed to filter the output tokens to obtain key patches and their associated attention weights for spatial matching. Finally, we propose a descriptor refinement strategy that employs fine-grained region-level supervision to further enhance the network's ability to learn locally discriminative features, which effectively alleviates the confusion caused by weak image-level labels. Extensive experiments on benchmark datasets show that our approach outperforms state-of-the-art methods with promising performance.
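As a rough illustration of the dual-level descriptor encoder described above, the following is a minimal PyTorch sketch: self-attention restricted to local windows of a CNN feature map, followed by global self-attention with a classification token whose copies from several layers are concatenated into the image descriptor. The layer count, window size, dimensions, and aggregation rule here are illustrative assumptions, not the authors' exact design.

```python
# Minimal sketch of a dual-level (window -> global) Transformer descriptor encoder.
# Hyperparameters and the multi-layer [CLS] concatenation are assumptions for illustration.
import torch
import torch.nn as nn


class DualLevelDescriptorEncoder(nn.Module):
    def __init__(self, dim=256, heads=4, window=4, global_layers=3):
        super().__init__()
        self.window = window
        # local stage: self-attention restricted to non-overlapping windows
        self.local_attn = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        # global stage: several layers over all tokens plus a [CLS] token
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.global_layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
            for _ in range(global_layers)])

    def forward(self, feat):                      # feat: (B, C, H, W) CNN feature map
        b, c, h, w = feat.shape
        win = self.window                         # assumes H and W divisible by win
        # split the map into (H/win * W/win) windows of win*win tokens each
        x = feat.reshape(b, c, h // win, win, w // win, win)
        x = x.permute(0, 2, 4, 3, 5, 1).reshape(-1, win * win, c)
        x = self.local_attn(x)                    # attention within each window only
        x = x.reshape(b, -1, c)                   # flatten back to all H*W tokens
        # prepend a classification token and run global self-attention
        x = torch.cat([self.cls.expand(b, -1, -1), x], dim=1)
        cls_per_layer = []
        for layer in self.global_layers:
            x = layer(x)
            cls_per_layer.append(x[:, 0])         # collect [CLS] from every layer
        # concatenate the multi-layer [CLS] tokens into one global descriptor
        return torch.cat(cls_per_layer, dim=-1)   # (B, dim * global_layers)


# usage: a 256-channel, 16x16 feature map yields one descriptor per image
desc = DualLevelDescriptorEncoder()(torch.randn(2, 256, 16, 16))
print(desc.shape)  # torch.Size([2, 768])
```

The paper's geometric verification and region-level refinement stages are not reproduced here; the sketch only conveys how window-level and global self-attention can be chained over a CNN feature map.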

Source Journal
Knowledge-Based Systems
Engineering & Technology - Computer Science: Artificial Intelligence
CiteScore: 14.80
Self-citation rate: 12.50%
Articles published: 1245
Review time: 7.8 months
Journal description: Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on knowledge-based and other artificial intelligence techniques-based systems. The journal aims to support human prediction and decision-making through data science and computation techniques, provide a balanced coverage of theory and practical study, and encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.
Latest articles from this journal
User disambiguation learning for precise shared-account marketing: A hierarchical self-attentive sequential recommendation method
OptNet: Optimization-inspired network beyond deep unfolding for structural artifact reduction
Graph out-of-distribution generalization through contrastive learning paradigm
Boosting semi-supervised regressor via confidence-weighted consistency regularization
Advanced deep learning framework for ECG arrhythmia classification using 1D-CNN with attention mechanism