MatchFormer: Interleaving Attention in Transformers for Feature Matching

Qing Wang, Jiaming Zhang, Kailun Yang, Kunyu Peng, Rainer Stiefelhagen
{"title":"MatchFormer: Interleaving Attention in Transformers for Feature Matching","authors":"Qing Wang, Jiaming Zhang, Kailun Yang, Kunyu Peng, R. Stiefelhagen","doi":"10.48550/arXiv.2203.09645","DOIUrl":null,"url":null,"abstract":"Local feature matching is a computationally intensive task at the subpixel level. While detector-based methods coupled with feature descriptors struggle in low-texture scenes, CNN-based methods with a sequential extract-to-match pipeline, fail to make use of the matching capacity of the encoder and tend to overburden the decoder for matching. In contrast, we propose a novel hierarchical extract-and-match transformer, termed as MatchFormer. Inside each stage of the hierarchical encoder, we interleave self-attention for feature extraction and cross-attention for feature matching, yielding a human-intuitive extract-and-match scheme. Such a match-aware encoder releases the overloaded decoder and makes the model highly efficient. Further, combining self- and cross-attention on multi-scale features in a hierarchical architecture improves matching robustness, particularly in low-texture indoor scenes or with less outdoor training data. Thanks to such a strategy, MatchFormer is a multi-win solution in efficiency, robustness, and precision. Compared to the previous best method in indoor pose estimation, our lite MatchFormer has only 45% GFLOPs, yet achieves a +1.3% precision gain and a 41% running speed boost. The large MatchFormer reaches state-of-the-art on four different benchmarks, including indoor pose estimation (ScanNet), outdoor pose estimation (MegaDepth), homography estimation and image matching (HPatch), and visual localization (InLoc).","PeriodicalId":87238,"journal":{"name":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2022-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"51","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2203.09645","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 51

Abstract

Local feature matching is a computationally intensive task at the subpixel level. While detector-based methods coupled with feature descriptors struggle in low-texture scenes, CNN-based methods with a sequential extract-to-match pipeline fail to make use of the matching capacity of the encoder and tend to overburden the decoder for matching. In contrast, we propose a novel hierarchical extract-and-match transformer, termed MatchFormer. Inside each stage of the hierarchical encoder, we interleave self-attention for feature extraction and cross-attention for feature matching, yielding a human-intuitive extract-and-match scheme. Such a match-aware encoder relieves the overloaded decoder and makes the model highly efficient. Further, combining self- and cross-attention on multi-scale features in a hierarchical architecture improves matching robustness, particularly in low-texture indoor scenes or with less outdoor training data. Thanks to this strategy, MatchFormer is a multi-win solution in efficiency, robustness, and precision. Compared to the previous best method in indoor pose estimation, our lite MatchFormer requires only 45% of the GFLOPs, yet achieves a +1.3% precision gain and a 41% running speed boost. The large MatchFormer reaches state of the art on four different benchmarks, including indoor pose estimation (ScanNet), outdoor pose estimation (MegaDepth), homography estimation and image matching (HPatches), and visual localization (InLoc).
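To make the extract-and-match idea concrete, below is a minimal PyTorch sketch of one encoder stage that interleaves self-attention (per-image feature extraction) with cross-attention (cross-image feature matching). This is an illustrative assumption, not the authors' implementation: the class name `InterleavedStage`, the block depth, the shared weights across the two images, and the use of plain `nn.MultiheadAttention` are placeholder choices (the paper operates on hierarchical multi-scale feature maps with an efficient attention design).

```python
# Minimal sketch of interleaved self-/cross-attention, assuming flattened
# (batch, tokens, dim) feature maps for the two images to be matched.
# Not the MatchFormer reference code; all module choices are illustrative.
import torch
import torch.nn as nn


class InterleavedStage(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4, depth: int = 2):
        super().__init__()
        # One (self-attention, cross-attention) pair per depth step;
        # weights are shared between the two images (siamese-style).
        self.self_blocks = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(depth)
        )
        self.cross_blocks = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(depth)
        )
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(2 * depth))

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor):
        for i, (sa, ca) in enumerate(zip(self.self_blocks, self.cross_blocks)):
            # Self-attention: each image attends to itself (extraction).
            feat_a = self.norms[2 * i](feat_a + sa(feat_a, feat_a, feat_a)[0])
            feat_b = self.norms[2 * i](feat_b + sa(feat_b, feat_b, feat_b)[0])
            # Cross-attention: each image queries the other (matching);
            # both updates read the pre-update features to stay symmetric.
            new_a = feat_a + ca(feat_a, feat_b, feat_b)[0]
            new_b = feat_b + ca(feat_b, feat_a, feat_a)[0]
            feat_a = self.norms[2 * i + 1](new_a)
            feat_b = self.norms[2 * i + 1](new_b)
        return feat_a, feat_b


# Usage: two 32x32 feature maps flattened to 1024 tokens of width 64.
stage = InterleavedStage(dim=64)
fa, fb = stage(torch.randn(1, 1024, 64), torch.randn(1, 1024, 64))
```

The design point the sketch illustrates is that cross-attention appears inside every encoder stage rather than only after it, so the encoder's features are already match-aware at each scale and the decoder is left with only lightweight matching work.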