Jianwen Song, Arcot Sowmya, Jien Kato, Changming Sun
DOI: 10.1016/j.imavis.2024.105252
Journal: Image and Vision Computing, Volume 151, Article 105252 (JCR Q2, Computer Science, Artificial Intelligence)
Publication date: 2024-09-04
Open-access PDF: https://www.sciencedirect.com/science/article/pii/S0262885624003573/pdfft?md5=f16b8e31aca64b2993c5abd2e28251d5&pid=1-s2.0-S0262885624003573-main.pdf
Efficient masked feature and group attention network for stereo image super-resolution
Current stereo image super-resolution methods do not fully exploit cross-view and intra-view information, resulting in limited performance. While vision transformers have shown great potential in super-resolution, their application to stereo image super-resolution is hindered by high computational demands and insufficient channel interaction. This paper introduces an efficient masked feature and group attention network for stereo image super-resolution (EMGSSR), designed to integrate the strengths of transformers into stereo super-resolution while addressing their inherent limitations. Specifically, an efficient masked feature block is proposed to extract local features from critical areas within images, guided by sparse masks. A group-weighted cross-attention module, consisting of group-weighted cross-view feature interactions along epipolar lines, is proposed to fully extract cross-view information from stereo images. Additionally, a group-weighted self-attention module, consisting of group-weighted self-attention feature extractions with different local windows, is proposed to effectively extract intra-view information from stereo images. Experimental results demonstrate that the proposed EMGSSR outperforms state-of-the-art methods at relatively low computational cost. EMGSSR offers a robust solution that effectively extracts cross-view and intra-view information for stereo image super-resolution, and points to a promising direction for future research in high-fidelity stereo image super-resolution. Source code will be released at https://github.com/jianwensong/EMGSSR.
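The cross-attention along epipolar lines mentioned in the abstract can be illustrated with a minimal sketch: for rectified stereo pairs, corresponding points lie on the same image row, so cross-view attention only needs to compare positions within each row rather than over the full image. The NumPy code below is a hypothetical simplification for illustration only; it uses plain single-head attention per row, without the group weighting or learned projections of the paper's actual module.

```python
import numpy as np

def epipolar_cross_attention(feat_left, feat_right):
    """Cross-view attention restricted to matching rows (epipolar lines).

    feat_left, feat_right: (H, W, C) feature maps from the two views.
    Returns left-view features augmented with right-view information.
    Simplified sketch: no grouping, no learned Q/K/V projections.
    """
    H, W, C = feat_left.shape
    out = np.empty_like(feat_left)
    scale = 1.0 / np.sqrt(C)
    for y in range(H):
        q = feat_left[y]              # (W, C) queries from the left view
        k = feat_right[y]             # (W, C) keys from the right view
        v = feat_right[y]             # (W, C) values from the right view
        logits = (q @ k.T) * scale    # (W, W) similarities along the row
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        attn = np.exp(logits)
        attn /= attn.sum(axis=1, keepdims=True)       # row-wise softmax
        out[y] = attn @ v             # aggregate right-view features
    return out
```

Restricting attention to a single row reduces the cost per query from O(HW) to O(W), which is one way the epipolar constraint keeps cross-view attention tractable.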
Journal description:
Image and Vision Computing has as its primary aim the provision of an effective medium of interchange for the results of high-quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real-world scenes. It seeks to foster a deeper understanding of the discipline by encouraging the quantitative comparison and performance evaluation of proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, and image databases.