{"title":"CLG-Net: Rethinking Local and Global Perception in Lightweight Two-View Correspondence Learning","authors":"Minjun Shen;Guobao Xiao;Changcai Yang;Junwen Guo;Lei Zhu","doi":"10.1109/TCSVT.2024.3457816","DOIUrl":null,"url":null,"abstract":"Correspondence learning aims to identify correct correspondences from the initial correspondence set and estimate camera pose between a pair of images. At present, Transformer-based methods have make notable progress in the correspondence learning task due to their powerful non-local information modeling capabilities. However, these methods seem to neglect local structures during feature aggregation from all query-key pairs, resulting in computational inefficiency and inaccurate correspondence identification. To address this issue, we propose a novel Context-aware Local and Global interaction Transformer (CLGFormer), a lightweight Transformer-based module with dual-branches that address local and global context perception in attention mechanisms. CLGFormer explores the relationship between neighborhood consistency observed in correspondences and context-aware weights appearing in vanilla attention and introduces an attention-style convolution operator. On top of that, CLGFormer also incorporates a cascaded operation that splits full features into multiple subsets and then feeds to the attention heads, which not only reduces computational costs but also enhances attention diversity. At last, we also introduce a feature recombination operate with high jointness and a lightweight channel attention module. The culmination of our efforts is the Context-aware Local and Global interaction Network (CLG-Net), which accurately estimates camera pose and identifies inliers. Through rigorous experiments, we demonstrate that our CLG-Net network outperforms existing state-of-the-art methods while exhibiting robust generalization capabilities across various scenarios. 
Code will be available at <uri>https://github.com/guobaoxiao/CLG</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 1","pages":"207-218"},"PeriodicalIF":11.1000,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems for Video Technology","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10678746/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Cited by: 0
Abstract
Correspondence learning aims to identify correct correspondences from an initial correspondence set and to estimate the camera pose between a pair of images. Transformer-based methods have recently made notable progress on this task thanks to their powerful non-local information modeling capabilities. However, these methods tend to neglect local structures when aggregating features over all query-key pairs, resulting in computational inefficiency and inaccurate correspondence identification. To address this issue, we propose the novel Context-aware Local and Global interaction Transformer (CLGFormer), a lightweight Transformer-based module with dual branches that handle local and global context perception in the attention mechanism. CLGFormer exploits the relationship between the neighborhood consistency observed in correspondences and the context-aware weights produced by vanilla attention, and introduces an attention-style convolution operator. In addition, CLGFormer incorporates a cascaded operation that splits the full features into multiple subsets and feeds them to the attention heads, which not only reduces computational cost but also enhances attention diversity. Finally, we introduce a feature recombination operation with high jointness and a lightweight channel attention module. The culmination of our efforts is the Context-aware Local and Global interaction Network (CLG-Net), which accurately estimates camera pose and identifies inliers. Rigorous experiments demonstrate that CLG-Net outperforms existing state-of-the-art methods while exhibiting robust generalization across various scenarios. Code will be available at https://github.com/guobaoxiao/CLG.
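To make the cascaded head-splitting idea concrete, here is a minimal NumPy sketch, not the authors' implementation: the channel dimension is split into per-head subsets, each head attends only over its subset, and (as an assumption about the "cascaded" wiring) each head's output is added to the next head's input to encourage attention diversity. The function name, the identity Q/K/V projections, and the additive cascade are all illustrative choices, not details taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cascaded_split_attention(X, num_heads=4):
    """Sketch of cascaded attention over channel subsets.

    X: (N, C) features for N correspondences; C must be divisible by num_heads.
    Each head sees a C // num_heads channel slice plus the previous head's
    output, so later heads refine what earlier heads computed.
    """
    N, C = X.shape
    d = C // num_heads
    outputs = []
    carry = np.zeros((N, d))
    for h in range(num_heads):
        # Channel subset for this head, cascaded with the previous head's output.
        sub = X[:, h * d:(h + 1) * d] + carry
        q, k, v = sub, sub, sub  # identity projections for this sketch
        attn = softmax(q @ k.T / np.sqrt(d), axis=-1)  # (N, N) attention weights
        carry = attn @ v                               # (N, d) head output
        outputs.append(carry)
    return np.concatenate(outputs, axis=-1)  # (N, C)
```

Because each head operates on d = C / num_heads channels, the per-head cost of the attention maps drops relative to a single full-width head, while the cascade keeps the heads from computing redundant responses.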
Journal Introduction
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.