MatchFormer: Interleaving Attention in Transformers for Feature Matching

Qing Wang, Jiaming Zhang, Kailun Yang, Kunyu Peng, Rainer Stiefelhagen
{"title":"MatchFormer: Interleaving Attention in Transformers for Feature Matching","authors":"Qing Wang, Jiaming Zhang, Kailun Yang, Kunyu Peng, R. Stiefelhagen","doi":"10.48550/arXiv.2203.09645","DOIUrl":null,"url":null,"abstract":"Local feature matching is a computationally intensive task at the subpixel level. While detector-based methods coupled with feature descriptors struggle in low-texture scenes, CNN-based methods with a sequential extract-to-match pipeline, fail to make use of the matching capacity of the encoder and tend to overburden the decoder for matching. In contrast, we propose a novel hierarchical extract-and-match transformer, termed as MatchFormer. Inside each stage of the hierarchical encoder, we interleave self-attention for feature extraction and cross-attention for feature matching, yielding a human-intuitive extract-and-match scheme. Such a match-aware encoder releases the overloaded decoder and makes the model highly efficient. Further, combining self- and cross-attention on multi-scale features in a hierarchical architecture improves matching robustness, particularly in low-texture indoor scenes or with less outdoor training data. Thanks to such a strategy, MatchFormer is a multi-win solution in efficiency, robustness, and precision. Compared to the previous best method in indoor pose estimation, our lite MatchFormer has only 45% GFLOPs, yet achieves a +1.3% precision gain and a 41% running speed boost. The large MatchFormer reaches state-of-the-art on four different benchmarks, including indoor pose estimation (ScanNet), outdoor pose estimation (MegaDepth), homography estimation and image matching (HPatch), and visual localization (InLoc).","PeriodicalId":87238,"journal":{"name":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2022-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"51","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2203.09645","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 51

Abstract

Local feature matching is a computationally intensive task at the subpixel level. While detector-based methods coupled with feature descriptors struggle in low-texture scenes, CNN-based methods with a sequential extract-to-match pipeline fail to make use of the matching capacity of the encoder and tend to overburden the decoder for matching. In contrast, we propose a novel hierarchical extract-and-match transformer, termed MatchFormer. Inside each stage of the hierarchical encoder, we interleave self-attention for feature extraction and cross-attention for feature matching, yielding a human-intuitive extract-and-match scheme. Such a match-aware encoder relieves the overloaded decoder and makes the model highly efficient. Further, combining self- and cross-attention on multi-scale features in a hierarchical architecture improves matching robustness, particularly in low-texture indoor scenes or with less outdoor training data. Thanks to this strategy, MatchFormer is a multi-win solution in efficiency, robustness, and precision. Compared to the previous best method in indoor pose estimation, our lite MatchFormer requires only 45% of the GFLOPs, yet achieves a +1.3% precision gain and a 41% running speed boost. The large MatchFormer reaches state of the art on four different benchmarks, including indoor pose estimation (ScanNet), outdoor pose estimation (MegaDepth), homography estimation and image matching (HPatches), and visual localization (InLoc).
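To make the extract-and-match idea concrete, below is a minimal PyTorch sketch of one encoder stage that interleaves self-attention (per-image feature extraction) with cross-attention (cross-image feature matching). This is an illustrative assumption, not the authors' implementation: the class name `InterleavedStage`, the block depth, the shared weights across the two images, and the use of plain `nn.MultiheadAttention` are placeholder choices (the paper operates on hierarchical multi-scale feature maps with an efficient attention design).

```python
# Minimal sketch of interleaved self-/cross-attention, assuming flattened
# (batch, tokens, dim) feature maps for the two images to be matched.
# Not the MatchFormer reference code; all module choices are illustrative.
import torch
import torch.nn as nn


class InterleavedStage(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4, depth: int = 2):
        super().__init__()
        # One (self-attention, cross-attention) pair per depth step;
        # weights are shared between the two images (siamese-style).
        self.self_blocks = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(depth)
        )
        self.cross_blocks = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(depth)
        )
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(2 * depth))

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor):
        for i, (sa, ca) in enumerate(zip(self.self_blocks, self.cross_blocks)):
            # Self-attention: each image attends to itself (extraction).
            feat_a = self.norms[2 * i](feat_a + sa(feat_a, feat_a, feat_a)[0])
            feat_b = self.norms[2 * i](feat_b + sa(feat_b, feat_b, feat_b)[0])
            # Cross-attention: each image queries the other (matching);
            # both updates read the pre-update features to stay symmetric.
            new_a = feat_a + ca(feat_a, feat_b, feat_b)[0]
            new_b = feat_b + ca(feat_b, feat_a, feat_a)[0]
            feat_a = self.norms[2 * i + 1](new_a)
            feat_b = self.norms[2 * i + 1](new_b)
        return feat_a, feat_b


# Usage: two 32x32 feature maps flattened to 1024 tokens of width 64.
stage = InterleavedStage(dim=64)
fa, fb = stage(torch.randn(1, 1024, 64), torch.randn(1, 1024, 64))
```

The design point the sketch illustrates is that cross-attention appears inside every encoder stage rather than only after it, so the encoder's features are already match-aware at each scale and the decoder is left with only lightweight matching work.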