VRTNet: Vector Rectifier Transformer for Two-View Correspondence Learning

IF 9.7; CAS Zone 1 (Computer Science); JCR Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS; IEEE Transactions on Multimedia; Pub Date: 2024-12-23; DOI: 10.1109/TMM.2024.3521696

Meng Yang;Jun Chen;Xin Tian;Longsheng Wei;Jiayi Ma

IEEE Transactions on Multimedia, vol. 27, pp. 515-530. Full text: https://ieeexplore.ieee.org/document/10812827/

Citation count: 0

Abstract

Finding reliable correspondences between two-view images and recovering the camera poses are key problems in photogrammetry and image signal processing. The multilayer perceptron (MLP) is widely used in two-view correspondence learning because it is good at learning from disordered sparse correspondences, but it is susceptible to dominant outliers and requires additional functional blocks to capture context information. A CNN can naturally extract local context information, but it cannot handle disordered data or extract global context and channel information. To overcome the shortcomings of MLPs and CNNs, we design a Transformer-based correspondence learning network named Vector Rectifier Transformer (VRTNet). The Transformer is an encoder-decoder structure that can handle disordered sparse correspondences and output sequences of arbitrary length. We therefore design two sub-Transformers in VRTNet to convert between disordered and ordered correspondences. Their self-attention and cross-attention mechanisms allow VRTNet to focus on the global context relations among all correspondences. To capture local context and channel information, we propose a rectifier network (comprising a CNN and a channel attention block) as the backbone of VRTNet, which avoids the complex design of additional blocks. The rectifier network corrects the errors of the ordered correspondences to obtain rectified correspondences. Finally, outliers are removed by comparing the original and rectified correspondences. VRTNet outperforms state-of-the-art methods on relative pose estimation, outlier removal, and image registration.
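The final step described above, removing outliers by comparing original and rectified correspondences, can be pictured with a small sketch. The abstract does not say how the comparison is made, so everything here is an assumption for illustration: the function name `remove_outliers`, the 4-tuple layout `(x1, y1, x2, y2)`, the Euclidean residual, and the fixed `threshold` are all hypothetical and are not taken from the paper.

```python
import math


def remove_outliers(original, rectified, threshold=0.1):
    """Keep correspondences whose rectified version stays close to the original.

    Each correspondence is a 4-tuple (x1, y1, x2, y2). The Euclidean residual
    and the fixed threshold are illustrative assumptions, not the paper's rule.
    """
    if len(original) != len(rectified):
        raise ValueError("original and rectified must have the same length")

    inlier_mask = []
    for orig_corr, rect_corr in zip(original, rectified):
        # Distance between a correspondence and its rectified counterpart.
        residual = math.dist(orig_corr, rect_corr)
        inlier_mask.append(residual <= threshold)

    # Return the surviving original correspondences plus the per-item mask.
    kept = [c for c, keep in zip(original, inlier_mask) if keep]
    return kept, inlier_mask


# Toy data (made up): the second correspondence is moved a long way by the
# hypothetical rectifier, so it is treated as an outlier.
original = [
    (0.10, 0.20, 0.12, 0.21),
    (0.50, 0.40, 0.90, 0.05),
    (0.30, 0.70, 0.31, 0.69),
]
rectified = [
    (0.10, 0.20, 0.12, 0.22),
    (0.50, 0.40, 0.52, 0.41),
    (0.30, 0.70, 0.31, 0.70),
]

kept, mask = remove_outliers(original, rectified)
print(mask)
```

In this toy setup a correspondence that the rectifier barely changes is kept, while one that needs a large correction is discarded. The actual network may score or classify correspondences differently.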


Source journal

IEEE Transactions on Multimedia (Engineering and Technology: Telecommunications)

CiteScore: 11.70

Self-citation rate: 11.00%

Articles published: 576

Review time: 5.5 months

Journal description: The IEEE Transactions on Multimedia covers diverse aspects of multimedia technology and applications, including circuits, networking, signal processing, systems, software, and systems integration. Its scope aligns with the Fields of Interest of its sponsors.
