Beyond Subspace Isolation: Many-to-Many Transformer for Light Field Image Super-Resolution

IF 9.7 · Zone 1, Computer Science · JCR Q1 (Computer Science, Information Systems) · IEEE Transactions on Multimedia · Pub Date: 2024-12-23 · DOI: 10.1109/TMM.2024.3521795

Zeke Zexi Hu;Xiaoming Chen;Vera Yuk Ying Chung;Yiran Shen

IEEE Transactions on Multimedia, vol. 27, pp. 1334-1348. Publication type: Journal Article. Open access: no. Full text: https://ieeexplore.ieee.org/document/10812790/

Citations: 0

Abstract

Effective extraction of spatial-angular features plays a crucial role in light field image super-resolution (LFSR), and the introduction of convolutions and Transformers has led to significant improvements in this area. Nevertheless, because of the large 4D data volume of light field (LF) images, many existing methods decompose the data into several lower-dimensional subspaces and apply Transformers to each subspace individually. As a side effect, these methods inadvertently restrict the self-attention mechanism to a One-to-One scheme that accesses only a limited subset of the LF data, explicitly preventing comprehensive optimization over all spatial and angular cues. In this paper, we identify this limitation as subspace isolation and introduce a novel Many-to-Many Transformer (M2MT) to address it. M2MT aggregates angular information in the spatial subspace before performing self-attention, which gives it complete access to all information across all sub-aperture images (SAIs) in a light field image. Consequently, M2MT can comprehensively capture long-range correlation dependencies. With M2MT as the foundational component, we develop a simple yet effective M2MT network for LFSR. Our experimental results demonstrate that M2MT achieves state-of-the-art performance across various public datasets and offers a favorable balance between model performance and efficiency, yielding higher-quality LFSR results with substantially lower memory and computation demands. We further conduct an in-depth analysis using local attribution maps (LAM) to obtain visual interpretability; the results validate that M2MT is empowered with a truly non-local context in both the spatial and angular subspaces, mitigating subspace isolation and acquiring effective spatial-angular representations.
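The contrast between the One-to-One scheme and the Many-to-Many idea can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the abstract does not say how angular information is aggregated, so the sketch assumes the simplest option, folding all sub-aperture views into the feature dimension of each spatial token. The function names, tensor layout `(A, H, W, C)`, and the identity query/key/value projections are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (assumption).

    tokens: (N, D) array. Returns an (N, D) array.
    """
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ tokens


def one_to_one_spatial(lf):
    """Spatial attention run inside each sub-aperture image separately.

    lf: (A, H, W, C) with A views. No information crosses between views.
    """
    A, H, W, C = lf.shape
    out = np.empty_like(lf)
    for a in range(A):
        out[a] = self_attention(lf[a].reshape(H * W, C)).reshape(H, W, C)
    return out


def many_to_many_spatial(lf):
    """Spatial attention where each token carries features from all views.

    Assumed aggregation: concatenate the A views along the feature axis,
    so one token per pixel location has A * C features.
    """
    A, H, W, C = lf.shape
    tokens = lf.transpose(1, 2, 0, 3).reshape(H * W, A * C)
    mixed = self_attention(tokens)
    return mixed.reshape(H, W, A, C).transpose(2, 0, 1, 3)


# Toy light field: 4 views (2x2 angular grid), 3x3 pixels, 2 channels.
rng = np.random.default_rng(0)
lf = rng.standard_normal((4, 3, 3, 2))

# Perturb only view 0 and measure how much the OTHER views' outputs move.
perturbed = lf.copy()
perturbed[0] += rng.standard_normal(lf[0].shape)

iso_delta = np.abs(one_to_one_spatial(perturbed) - one_to_one_spatial(lf))[1:].max()
m2m_delta = np.abs(many_to_many_spatial(perturbed) - many_to_many_spatial(lf))[1:].max()

print("one-to-one change in untouched views:  ", iso_delta)
print("many-to-many change in untouched views:", m2m_delta)
```

In the One-to-One variant, the outputs for the untouched views do not react to a change in view 0, which is the isolation the abstract describes. In the aggregated variant, the attention weights depend on every view, so the change propagates to all of them.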




Source journal

IEEE Transactions on Multimedia (Engineering & Technology: Telecommunications)

CiteScore: 11.70
Self-citation rate: 11.00%
Articles published: 576
Review time: 5.5 months

Journal description: The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.
