GlobalSR：通过可变形卷积注意和快速傅立叶卷积实现单幅图像超分辨率的全局上下文网络

IF 6 1区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Neural Networks Pub Date : 2024-08-31 DOI:10.1016/j.neunet.2024.106686

{"title":"GlobalSR：通过可变形卷积注意和快速傅立叶卷积实现单幅图像超分辨率的全局上下文网络","authors":"","doi":"10.1016/j.neunet.2024.106686","DOIUrl":null,"url":null,"abstract":"<div><p>Vision Transformer have achieved impressive performance in image super-resolution. However, they suffer from low inference speed mainly because of the quadratic complexity of multi-head self-attention (MHSA), which is the key to learning long-range dependencies. On the contrary, most CNN-based methods neglect the important effect of global contextual information, resulting in inaccurate and blurring details. If one can make the best of both Transformers and CNNs, it will achieve a better trade-off between image quality and inference speed. Based on this observation, firstly assume that the main factor affecting the performance in the Transformer-based SR models is the general architecture design, not the specific MHSA component. To verify this, some ablation studies are made by replacing MHSA with large kernel convolutions, alongside other essential module replacements. Surprisingly, the derived models achieve competitive performance. Therefore, a general architecture design GlobalSR is extracted by not specifying the core modules including blocks and domain embeddings of Transformer-based SR models. It also contains three practical guidelines for designing a lightweight SR network utilizing image-level global contextual information to reconstruct SR images. Following the guidelines, the blocks and domain embeddings of GlobalSR are instantiated via Deformable Convolution Attention Block (DCAB) and Fast Fourier Convolution Domain Embedding (FCDE), respectively. The instantiation of general architecture, termed GlobalSR-DF, proposes a DCA to extract the global contextual feature by utilizing Deformable Convolution and a Hadamard product as the attention map at the block level. Meanwhile, the FCDE utilizes the Fast Fourier to transform the input spatial feature into frequency space and then extract image-level global information from it by convolutions. Extensive experiments demonstrate that GlobalSR is the key part in achieving a superior trade-off between SR quality and efficiency. Specifically, our proposed GlobalSR-DF outperforms state-of-the-art CNN-based and ViT-based SISR models regarding accuracy-speed trade-offs with sharp and natural details.</p></div>","PeriodicalId":49763,"journal":{"name":"Neural Networks","volume":null,"pages":null},"PeriodicalIF":6.0000,"publicationDate":"2024-08-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0893608024006105/pdfft?md5=827cf63b9cd5ed18c3a60b975de54883&pid=1-s2.0-S0893608024006105-main.pdf","citationCount":"0","resultStr":"{\"title\":\"GlobalSR: Global context network for single image super-resolution via deformable convolution attention and fast Fourier convolution\",\"authors\":\"\",\"doi\":\"10.1016/j.neunet.2024.106686\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Vision Transformer have achieved impressive performance in image super-resolution. However, they suffer from low inference speed mainly because of the quadratic complexity of multi-head self-attention (MHSA), which is the key to learning long-range dependencies. On the contrary, most CNN-based methods neglect the important effect of global contextual information, resulting in inaccurate and blurring details. If one can make the best of both Transformers and CNNs, it will achieve a better trade-off between image quality and inference speed. Based on this observation, firstly assume that the main factor affecting the performance in the Transformer-based SR models is the general architecture design, not the specific MHSA component. To verify this, some ablation studies are made by replacing MHSA with large kernel convolutions, alongside other essential module replacements. Surprisingly, the derived models achieve competitive performance. Therefore, a general architecture design GlobalSR is extracted by not specifying the core modules including blocks and domain embeddings of Transformer-based SR models. It also contains three practical guidelines for designing a lightweight SR network utilizing image-level global contextual information to reconstruct SR images. Following the guidelines, the blocks and domain embeddings of GlobalSR are instantiated via Deformable Convolution Attention Block (DCAB) and Fast Fourier Convolution Domain Embedding (FCDE), respectively. The instantiation of general architecture, termed GlobalSR-DF, proposes a DCA to extract the global contextual feature by utilizing Deformable Convolution and a Hadamard product as the attention map at the block level. Meanwhile, the FCDE utilizes the Fast Fourier to transform the input spatial feature into frequency space and then extract image-level global information from it by convolutions. Extensive experiments demonstrate that GlobalSR is the key part in achieving a superior trade-off between SR quality and efficiency. Specifically, our proposed GlobalSR-DF outperforms state-of-the-art CNN-based and ViT-based SISR models regarding accuracy-speed trade-offs with sharp and natural details.</p></div>\",\"PeriodicalId\":49763,\"journal\":{\"name\":\"Neural Networks\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":6.0000,\"publicationDate\":\"2024-08-31\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.sciencedirect.com/science/article/pii/S0893608024006105/pdfft?md5=827cf63b9cd5ed18c3a60b975de54883&pid=1-s2.0-S0893608024006105-main.pdf\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Neural Networks\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0893608024006105\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neural Networks","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0893608024006105","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

摘要

视觉变换器在图像超分辨率方面取得了令人瞩目的成就。然而，它们的推理速度较低，主要原因是多头自注意（MHSA）的二次方复杂性，而多头自注意是学习长距离依赖关系的关键。相反，大多数基于 CNN 的方法忽视了全局上下文信息的重要作用，导致细节不准确和模糊。如果能同时利用变换器和 CNN，就能在图像质量和推理速度之间实现更好的权衡。基于这一观点，我们首先假设影响基于变换器的 SR 模型性能的主要因素是总体架构设计，而不是特定的 MHSA 组件。为了验证这一点，我们进行了一些消融研究，将 MHSA 替换为大内核卷积，同时替换了其他重要模块。出乎意料的是，衍生模型实现了具有竞争力的性能。因此，通过不指定核心模块（包括基于变换器的 SR 模型的模块和域嵌入），提取出了通用架构设计 GlobalSR。它还包含三个实用指南，用于设计利用图像级全局上下文信息重建 SR 图像的轻量级 SR 网络。根据指南，GlobalSR 的块和域嵌入分别通过可变形卷积注意块（DCAB）和快速傅立叶卷积域嵌入（FCDE）实例化。一般架构的实例化被称为 GlobalSR-DF，它通过利用可变形卷积和哈达玛乘积作为块级的注意图，提出了一种 DCA 来提取全局上下文特征。同时，FCDE 利用快速傅里叶将输入的空间特征转换到频率空间，然后通过卷积从中提取图像级的全局信息。广泛的实验证明，GlobalSR 是在 SR 质量和效率之间实现出色权衡的关键部分。具体来说，我们提出的 GlobalSR-DF 在精度-速度权衡方面优于最先进的基于 CNN 和基于 ViT 的 SISR 模型，细节清晰自然。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

GlobalSR: Global context network for single image super-resolution via deformable convolution attention and fast Fourier convolution

Vision Transformer have achieved impressive performance in image super-resolution. However, they suffer from low inference speed mainly because of the quadratic complexity of multi-head self-attention (MHSA), which is the key to learning long-range dependencies. On the contrary, most CNN-based methods neglect the important effect of global contextual information, resulting in inaccurate and blurring details. If one can make the best of both Transformers and CNNs, it will achieve a better trade-off between image quality and inference speed. Based on this observation, firstly assume that the main factor affecting the performance in the Transformer-based SR models is the general architecture design, not the specific MHSA component. To verify this, some ablation studies are made by replacing MHSA with large kernel convolutions, alongside other essential module replacements. Surprisingly, the derived models achieve competitive performance. Therefore, a general architecture design GlobalSR is extracted by not specifying the core modules including blocks and domain embeddings of Transformer-based SR models. It also contains three practical guidelines for designing a lightweight SR network utilizing image-level global contextual information to reconstruct SR images. Following the guidelines, the blocks and domain embeddings of GlobalSR are instantiated via Deformable Convolution Attention Block (DCAB) and Fast Fourier Convolution Domain Embedding (FCDE), respectively. The instantiation of general architecture, termed GlobalSR-DF, proposes a DCA to extract the global contextual feature by utilizing Deformable Convolution and a Hadamard product as the attention map at the block level. Meanwhile, the FCDE utilizes the Fast Fourier to transform the input spatial feature into frequency space and then extract image-level global information from it by convolutions. Extensive experiments demonstrate that GlobalSR is the key part in achieving a superior trade-off between SR quality and efficiency. Specifically, our proposed GlobalSR-DF outperforms state-of-the-art CNN-based and ViT-based SISR models regarding accuracy-speed trade-offs with sharp and natural details.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Neural Networks 工程技术-计算机：人工智能

CiteScore

13.90

自引率

7.70%

发文量

425

审稿时长

67 days

期刊介绍： Neural Networks is a platform that aims to foster an international community of scholars and practitioners interested in neural networks, deep learning, and other approaches to artificial intelligence and machine learning. Our journal invites submissions covering various aspects of neural networks research, from computational neuroscience and cognitive modeling to mathematical analyses and engineering applications. By providing a forum for interdisciplinary discussions between biology and technology, we aim to encourage the development of biologically-inspired artificial intelligence.