
Image and Vision Computing: Latest Publications

CPFSSR: Combined permuted self-attention and fast Fourier transform-based network for stereo image super-resolution
IF 4.2 | CAS Tier 3, Computer Science | Q2 in COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-04-01 | Epub Date: 2026-01-13 | DOI: 10.1016/j.imavis.2025.105870
Wenwu Luo, Jing Wu, Feng Huang, Yunxiang Li
The pursuit of high-fidelity stereo image super-resolution (SR) is paramount for 3D vision applications. However, existing Transformer-based methods often suffer from high computational complexity and limited effectiveness in capturing long-range cross-view dependencies. To address these dual challenges, we propose CPFSSR, a combined permuted self-attention and fast Fourier transform-based network for stereo image SR that couples a permuted Swin Fourier Transformer block (PSFTB) with a deep cross-attention module (DCAM). The PSFTB employs a permuted self-attention mechanism and fast Fourier convolution to achieve global receptive fields with linear computational complexity and to capture intra-view contextual details. For better fusion, the DCAM enables adaptive hierarchical interaction between views. In addition, we propose a spatial frequency reinforcement block (SFRB) that uses fast Fourier convolution to enhance the extraction of complex frequency information. Rigorous evaluation on benchmarks shows that CPFSSR sets a new state of the art, outperforming existing methods on average across the Flickr1024, Middlebury, KITTI2012, and KITTI2015 datasets. Visual assessments also confirm its superiority in reconstructing fine natural textures with minimal artifacts. The proposed method strikes a trade-off between parameter count and stereo image SR performance and is suitable for accurate high-resolution image reconstruction. The source code is available at https://github.com/Flt-Flag/CPFSSR.
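For readers unfamiliar with the fast Fourier convolution that gives the PSFTB and SFRB their global receptive field, a minimal PyTorch-style sketch of the spectral path follows. The class name SpectralConv, the 1x1 convolution, and the normalization choices are illustrative assumptions rather than the authors' released implementation (see the repository linked above for the real code).

```python
# Hedged sketch of a spectral (Fourier-domain) convolution: a pointwise
# convolution applied to the FFT of a feature map lets every output position
# see the whole input, at a cost roughly linear in the number of pixels.
import torch
import torch.nn as nn


class SpectralConv(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # real and imaginary parts are stacked along the channel axis
        self.conv = nn.Sequential(
            nn.Conv2d(2 * channels, 2 * channels, kernel_size=1),
            nn.BatchNorm2d(2 * channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        spec = torch.fft.rfft2(x, norm="ortho")          # (B, C, H, W//2+1), complex
        spec = torch.cat([spec.real, spec.imag], dim=1)  # (B, 2C, H, W//2+1)
        spec = self.conv(spec)
        real, imag = spec.chunk(2, dim=1)
        return torch.fft.irfft2(torch.complex(real, imag), s=(h, w), norm="ortho")


if __name__ == "__main__":
    feat = torch.randn(1, 64, 32, 96)        # one stereo-view feature map
    print(SpectralConv(64)(feat).shape)      # torch.Size([1, 64, 32, 96])
```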
{"title":"CPFSSR: Combined permuted self-attention and fast Fourier transform-based network for stereo image super-resolution","authors":"Wenwu Luo ,&nbsp;Jing Wu ,&nbsp;Feng Huang,&nbsp;Yunxiang Li","doi":"10.1016/j.imavis.2025.105870","DOIUrl":"10.1016/j.imavis.2025.105870","url":null,"abstract":"<div><div>The pursuit of high-fidelity stereo image super-resolution (SR) is paramount for 3D vision applications. However, existing Transformer-based methods often suffer from high computational complexity and limited effectiveness in capturing long-range cross-view dependencies. To address these issues, we propose a combined permuted self-attention and fast Fourier transform-based network for stereo image SR (CPFSSR), a novel network that combines a permuted Swin Fourier Transformer block (PSFTB) with a deep cross-attention module (DCAM) to tackle these dual challenges. The PSFTB employs a permuted self-attention mechanism and fast Fourier convolution to achieve global receptive fields with linear computational complexity, and captures intra-view contextual details. For better fusion, a DCAM enables adaptive hierarchical interaction between views. In addition, we propose a spatial frequency reinforcement block (SFRB) to enhance the extraction of complex frequency information using fast Fourier convolution. Rigorous evaluation of benchmarks shows that CPFSSR sets a new state-of-the-art, outperforming existing methods by an average on the Flickr1024, Middlebury, KITTI2012, and KITTI2015 datasets. Visual assessments also confirm its superiority in reconstructing fine natural textures with minimal artifacts. The proposed method achieves a trade-off between parametric and stereo image SR task performance and is suitable for accurate high-resolution image reconstruction. The source code is available at <span><span>https://github.com/Flt-Flag/CPFSSR</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"168 ","pages":"Article 105870"},"PeriodicalIF":4.2,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146174933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Integrating spatial features and dynamically learned temporal features via contrastive learning for video temporal grounding in LLM
IF 4.2 | CAS Tier 3, Computer Science | Q2 in COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-03-01 | Epub Date: 2026-01-05 | DOI: 10.1016/j.imavis.2026.105895
Peifu Wang, Yixiong Liang, Yigang Cen, Lihui Cen, Zhe Qu, Jingling Liu, Shichao Kan
Video temporal grounding (VTG) is crucial for fine-grained temporal understanding in vision-language tasks. While large vision-language models (LVLMs) have shown promising results through image–text alignment and video-instruction tuning, they represent videos as static sequences of sampled frames processed by image-based vision encoders, inherently limiting their capacity to capture dynamic and sequential information effectively and leading to suboptimal performance. To address this, we propose integrating spatial features with dynamically learned temporal features using contrastive learning. Temporal features are dynamically extracted by learning a set of temporal query tokens, which prompt temporal feature extraction via contrastive alignment between video sequences and their corresponding descriptions. On the other hand, VTG methods based on large language models are typically supervised solely through the language modeling loss, which is insufficient for effectively guiding such tasks. Thus, the VTG model in our method is trained with a temporal localization loss that combines mean squared error (MSE), intersection-over-union (IoU) of the temporal range, and cosine similarity of temporal embeddings, and that is designed to be applicable to large language models. Our experiments on benchmark datasets demonstrate the effectiveness of the proposed method.
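The temporal localization loss described above (MSE plus temporal IoU plus cosine similarity of embeddings) is concrete enough to sketch. Below is a hedged PyTorch version; the loss weights and the span parameterization (start/end times per sample) are assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F


def temporal_iou(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """pred, gt: (B, 2) tensors of (start, end) times; returns per-sample IoU."""
    inter = (torch.min(pred[:, 1], gt[:, 1]) - torch.max(pred[:, 0], gt[:, 0])).clamp(min=0)
    union = (pred[:, 1] - pred[:, 0]) + (gt[:, 1] - gt[:, 0]) - inter
    return inter / union.clamp(min=1e-6)


def localization_loss(pred_span, gt_span, pred_emb, gt_emb,
                      w_mse=1.0, w_iou=1.0, w_cos=1.0):
    """Combines MSE on the span, (1 - IoU) of the temporal range, and
    (1 - cosine similarity) of temporal embeddings, as listed in the abstract;
    the weights w_* are illustrative."""
    l_mse = F.mse_loss(pred_span, gt_span)
    l_iou = (1.0 - temporal_iou(pred_span, gt_span)).mean()
    l_cos = (1.0 - F.cosine_similarity(pred_emb, gt_emb, dim=-1)).mean()
    return w_mse * l_mse + w_iou * l_iou + w_cos * l_cos
```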
{"title":"Integrating spatial features and dynamically learned temporal features via contrastive learning for video temporal grounding in LLM","authors":"Peifu Wang ,&nbsp;Yixiong Liang ,&nbsp;Yigang Cen ,&nbsp;Lihui Cen ,&nbsp;Zhe Qu ,&nbsp;Jingling Liu ,&nbsp;Shichao Kan","doi":"10.1016/j.imavis.2026.105895","DOIUrl":"10.1016/j.imavis.2026.105895","url":null,"abstract":"<div><div>Video temporal grounding (VTG) is crucial for fine-grained temporal understanding in vision-language tasks. While large vision-language models (LVLMs) have shown promising results through image–text alignment and video-instruction tuning, they represent videos as static sequences of sampled frames processed by image-based vision encoders, inherently limiting their capacity to capture dynamic and sequential information effectively, leading to suboptimal performance. To address this, we propose integrating spatial features with dynamically learned temporal features using contrastive learning. Temporal features are dynamically extracted by learning a set of temporal query tokens, which prompt temporal feature extraction via contrastive alignment between video sequences and their corresponding descriptions. On the other hand, VTG based on large language models are always supervised solely through the language modeling loss, which is insufficient for effectively guiding such tasks. Thus, the VTG model in our method is trained with a temporal localization loss that combines mean squared error (MSE), intersection-over-union (IoU) of the temporal range, and cosine similarity of temporal embeddings, which is designed to be applicable to large language models. Our experiments on benchmark datasets demonstrate the effectiveness of the proposed method.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"167 ","pages":"Article 105895"},"PeriodicalIF":4.2,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145927624","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
DRM-YOLO: A YOLOv11-based structural optimization method for small object detection in UAV aerial imagery
IF 4.2 | CAS Tier 3, Computer Science | Q2 in COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-03-01 | Epub Date: 2025-12-30 | DOI: 10.1016/j.imavis.2025.105894
Hongbo Bi, Rui Dai, Fengyang Han, Cong Zhang
With the falling cost of UAVs and advances in automation, drones are increasingly applied in agriculture, inspection, and smart cities. However, small object detection remains difficult due to tiny targets, sparse features, and complex backgrounds. To tackle these challenges, this paper presents an improved small object detection framework for UAV imagery, optimized from the YOLOv11n architecture. First, the proposed MetaDWBlock integrates multi-branch depthwise separable convolutions with a lightweight MLP, and its hierarchical MetaDWStage enhances contextual and fine-grained feature modeling. Second, the Cross-scale Feature Fusion Module (CFFM) employs the CARAFE upsampling operator for precise fusion of shallow spatial and deep semantic features, improving multi-scale perception. Finally, a scale-, spatial-, and task-aware Dynamic Head with an added P2 branch forms a four-branch detection head, markedly boosting detection accuracy for tiny objects. Experimental results on the VisDrone2019 dataset demonstrate that the proposed DRM-YOLO model significantly outperforms the baseline YOLOv11n in small object detection tasks, achieving a 21.4% improvement in mAP@0.5 and a 13.1% improvement in mAP@0.5:0.95. These results fully validate the effectiveness and practical value of the proposed method in enhancing the accuracy and robustness of small object detection in UAV aerial imagery. The code and results of our method are available at https://github.com/DRdairuiDR/DRM--YOLO.
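As a rough illustration of the MetaDWBlock idea, multi-branch depthwise convolutions fused and followed by a lightweight MLP, here is a hedged PyTorch sketch. The branch kernel sizes (3x3 and 5x5), the MLP expansion ratio, and the residual connection are guesses for illustration; consult the linked repository for the actual design.

```python
import torch
import torch.nn as nn


class MetaDWBlockSketch(nn.Module):
    """Two depthwise branches fused by a pointwise conv, then a channel MLP."""

    def __init__(self, c: int, mlp_ratio: int = 2):
        super().__init__()
        self.dw3 = nn.Conv2d(c, c, 3, padding=1, groups=c)   # depthwise 3x3 branch
        self.dw5 = nn.Conv2d(c, c, 5, padding=2, groups=c)   # depthwise 5x5 branch
        self.pw = nn.Conv2d(2 * c, c, 1)                     # pointwise fusion
        self.mlp = nn.Sequential(                            # lightweight channel MLP
            nn.Conv2d(c, c * mlp_ratio, 1), nn.GELU(),
            nn.Conv2d(c * mlp_ratio, c, 1),
        )

    def forward(self, x):
        y = self.pw(torch.cat([self.dw3(x), self.dw5(x)], dim=1))
        return x + self.mlp(y)                               # residual connection


if __name__ == "__main__":
    print(MetaDWBlockSketch(32)(torch.randn(1, 32, 80, 80)).shape)  # (1, 32, 80, 80)
```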
{"title":"DRM-YOLO: A YOLOv11-based structural optimization method for small object detection in UAV aerial imagery","authors":"Hongbo Bi,&nbsp;Rui Dai,&nbsp;Fengyang Han,&nbsp;Cong Zhang","doi":"10.1016/j.imavis.2025.105894","DOIUrl":"10.1016/j.imavis.2025.105894","url":null,"abstract":"<div><div>With the falling cost of UAVs and advances in automation, drones are increasingly applied in agriculture, inspection, and smart cities. However, small object detection remains difficult due to tiny targets, sparse features, and complex backgrounds. To tackle these challenges, this paper presents an improved small object detection framework for UAV imagery, optimized from the YOLOv11n architecture. First, the proposed MetaDWBlock integrates multi-branch depthwise separable convolutions with a lightweight MLP, and its hierarchical MetaDWStage enhances contextual and fine-grained feature modeling. Second, the Cross-scale Feature Fusion Module (CFFM) employs the CARAFE upsampling operator for precise fusion of shallow spatial and deep semantic features, improving multi-scale perception. Finally, a scale-, spatial-, and task-aware Dynamic Head with an added P2 branch forms a four-branch detection head, markedly boosting detection accuracy for tiny objects. Experimental results on the VisDrone2019 dataset demonstrate that the proposed DRM-YOLO model significantly outperforms the baseline YOLOv11n in small object detection tasks, achieving a 21.4% improvement in [email protected] and a 13.1% improvement in [email protected]. These results fully validate the effectiveness and practical value of the proposed method in enhancing the accuracy and robustness of small object detection in UAV aerial imagery. The code and results of our method are available at <span><span>https://github.com/DRdairuiDR/DRM--YOLO</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"167 ","pages":"Article 105894"},"PeriodicalIF":4.2,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145885584","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
OCC-MLLM-CoT: Self-correction enhanced occlusion recognition with large language models via 3D-aware supervision, chain-of-thoughts guidance
IF 4.2 | CAS Tier 3, Computer Science | Q2 in COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-03-01 | Epub Date: 2025-12-24 | DOI: 10.1016/j.imavis.2025.105881
Chaoyi Wang, Fangzhou Meng, Jun Pei, Lijie Xia, Jianpo Liu, Xiaobing Yuan, Xinhan Di
Comprehending occluded objects remains an underexplored challenge for existing large-scale visual–language multi-modal models. Current state-of-the-art multi-modal large models struggle to provide satisfactory performance in comprehending occluded objects despite using universal visual encoders and supervised learning strategies. To address this limitation, we propose OCC-MLLM-CoT, a multi-modal large vision–language framework that integrates 3D-aware supervision with Chain-of-Thoughts reasoning. Our approach consists of three key components: (1) a comprehensive framework combining a large multi-modal vision–language model with a specialized 3D reconstruction expert model; (2) a multi-modal Chain-of-Thoughts mechanism trained through both supervised and reinforcement learning strategies, enabling the model to develop advanced reasoning and self-reflection capabilities; and (3) a novel large-scale dataset containing 110,000 samples of occluded objects held in hand, specifically designed for multi-modal chain-of-thoughts reasoning. Experimental evaluations demonstrate that our proposed method achieves an 11.14% improvement in decision score, increasing from 0.6412 to 0.7526 compared to state-of-the-art multi-modal large language models.
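The self-correction workflow the abstract outlines, an MLLM draft refined against cues from a 3D reconstruction expert, can be pictured with the hypothetical orchestration below. Every callable name (mllm, expert_3d) and the convergence test are invented placeholders; the actual method is trained with supervised and reinforcement learning rather than relying on this inference-time loop alone.

```python
def occluded_object_answer(image, question, mllm, expert_3d, max_rounds=2):
    """Hypothetical sketch: the MLLM drafts a chain-of-thought answer, a 3D
    reconstruction expert supplies a geometry cue for the occluded object,
    and the MLLM is asked to self-correct against that cue."""
    answer, rationale = mllm(image, question)
    for _ in range(max_rounds):
        geometry_cue = expert_3d(image)   # e.g. a text summary of the reconstructed shape
        revised, rationale = mllm(
            image,
            f"{question}\nGeometry cue: {geometry_cue}\n"
            f"Previous answer: {answer}. Revise if inconsistent with the geometry.",
        )
        if revised == answer:             # converged: no further self-correction needed
            break
        answer = revised
    return answer, rationale
```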
{"title":"OCC-MLLM-CoT: Self-correction enhanced occlusion recognition with large language models via 3D-aware supervision, chain-of-thoughts guidance","authors":"Chaoyi Wang ,&nbsp;Fangzhou Meng ,&nbsp;Jun Pei ,&nbsp;Lijie Xia ,&nbsp;Jianpo Liu ,&nbsp;Xiaobing Yuan ,&nbsp;Xinhan Di","doi":"10.1016/j.imavis.2025.105881","DOIUrl":"10.1016/j.imavis.2025.105881","url":null,"abstract":"<div><div>Comprehending occluded objects remains an underexplored challenge for existing large-scale visual–language multi-modal models. Current state-of-the-art multi-modal large models struggle to provide satisfactory performance in comprehending occluded objects despite using universal visual encoders and supervised learning strategies. To address this limitation, we propose OCC-MLLM-CoT, a multi-modal large vision–language framework that integrates 3D-aware supervision with Chain-of-Thoughts reasoning. Our approach consists of three key components: (1) a comprehensive framework combining a large multi-modal vision–language model with a specialized 3D reconstruction expert model; (2) a multi-modal Chain-of-Thoughts mechanism trained through both supervised and reinforcement learning strategies, enabling the model to develop advanced reasoning and self-reflection capabilities; and (3) a novel large-scale dataset containing 110,000 samples of occluded objects held in hand, specifically designed for multi-modal chain-of-thoughts reasoning. Experimental evaluations demonstrate that our proposed method achieves an 11.14% improvement in decision score, increasing from 0.6412 to 0.7526 compared to state-of-the-art multi-modal large language models.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"167 ","pages":"Article 105881"},"PeriodicalIF":4.2,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145885583","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
HDD-Unet: A Unet-based architecture for low-light image enhancement
IF 4.2 | CAS Tier 3, Computer Science | Q2 in COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-03-01 | Epub Date: 2025-12-24 | DOI: 10.1016/j.imavis.2025.105889
Elissavet Batziou, Konstantinos Ioannidis, Ioannis Patras, Stefanos Vrochidis, Ioannis Kompatsiaris
Low-light imaging has become a popular topic in image processing, and enhancing the quality of low-light images remains a significant challenge due to the difficulty of retaining colors, patterns, texture, and style when generating a normal-light image. Our objectives are, first, to better preserve texture regions during enhancement; second, to preserve colors via color histogram blocks; and finally, to improve image quality through dense denoising blocks. Our proposed framework, HDD-Unet, is a double Unet based on photorealistic style transfer for low-light image enhancement. The proposed method combines color histogram-based fusion, Haar wavelet pooling, dense denoising blocks, and a U-net backbone to enhance contrast, reduce noise, and improve the visibility of low-light images. Experimental results demonstrate that our proposed method outperforms existing methods on the PSNR and SSIM quantitative evaluation metrics, reaching or exceeding state-of-the-art accuracy with fewer resources. We also conduct an ablation study to investigate the impact of our approach on overexposed images, along with a systematic analysis of the late-fusion weighting parameters. Multiple experiments were conducted with artificial noise inserted to enable a more thorough comparison. The results show that the proposed framework accurately enhances images under various gamma corrections. The proposed method represents a significant advance in low-light image enhancement and has the potential to address several challenges associated with low-light imaging.
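Haar wavelet pooling, one of the named building blocks, has a standard closed form, so a small PyTorch sketch is given below. Whether HDD-Unet keeps all four subbands or only the low-frequency one is not stated in the abstract, so stacking them along the channel axis is an assumption.

```python
import torch
import torch.nn as nn


class HaarWaveletPool(nn.Module):
    """2x2 Haar decomposition used as pooling: halves spatial size and returns
    the LL, LH, HL, HH subbands stacked along channels (assumes even H and W)."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = x[:, :, 0::2, 0::2]   # top-left pixels
        b = x[:, :, 0::2, 1::2]   # top-right pixels
        c = x[:, :, 1::2, 0::2]   # bottom-left pixels
        d = x[:, :, 1::2, 1::2]   # bottom-right pixels
        ll = (a + b + c + d) / 2  # low-frequency approximation
        lh = (a + b - c - d) / 2  # vertical detail
        hl = (a - b + c - d) / 2  # horizontal detail
        hh = (a - b - c + d) / 2  # diagonal detail
        return torch.cat([ll, lh, hl, hh], dim=1)


if __name__ == "__main__":
    print(HaarWaveletPool()(torch.randn(1, 3, 256, 256)).shape)  # (1, 12, 128, 128)
```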
{"title":"HDD-Unet: A Unet-based architecture for low-light image enhancement","authors":"Elissavet Batziou ,&nbsp;Konstantinos Ioannidis ,&nbsp;Ioannis Patras ,&nbsp;Stefanos Vrochidis ,&nbsp;Ioannis Kompatsiaris","doi":"10.1016/j.imavis.2025.105889","DOIUrl":"10.1016/j.imavis.2025.105889","url":null,"abstract":"<div><div>Low-light imaging has become a popular topic in image processing, with the quality enhancement of low light images being as a significant challenge, due to the difficulty in retaining colors, patterns, texture and style when generating a normal light image. Our objectives are mainly to firstly better preserve texture regions in image enhancement, while, secondly, preserving colors via color histogram blocks and, finally, to enhance the quality of image through dense denoising blocks. Our proposed novel framework, namely HDD-Unet, is a double Unet based on photorealistic style transfer for low-light image enhancement. The proposed low-light image enhancement method combines color histogram-based fusion, Haar wavelet pooling, dense-denoising blocks and U-net as a backbone architecture to enhance the contrast, reduce noise, and improve the visibility of low light images. Experimental results demonstrate that our proposed method outperforms existing methods in terms of PSNR and SSIM quantitative evaluation metrics, reaching or outperforming state-of-the-art accuracy, but with less resources. We also conduct an ablation study to investigate the impact of our approach on overexposed images, and systematic analysis on the late fusion weighting parameters. Multiple experiments were conducted with artificial noise inserted to accomplish more efficient comparison. The results show that the proposed framework enhances accurately images with various gamma corrections. The proposed method represents a significant advance in the field of low light image enhancement and has the potential to address several challenges associated with low light imaging.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"167 ","pages":"Article 105889"},"PeriodicalIF":4.2,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145842588","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Hierarchical texture-aware image inpainting via contextual attention and multi-scale fusion
IF 4.2 | CAS Tier 3, Computer Science | Q2 in COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-03-01 | Epub Date: 2025-12-16 | DOI: 10.1016/j.imavis.2025.105875
Runing Li, Jiangyan Dai, Qibing Qin, Chengduan Wang, Yugen Yi, Jianzhong Wang
Image inpainting aims to restore missing regions in images with visually coherent and semantically plausible content. Although deep learning methods have achieved significant progress, current approaches still face challenges in handling large-area image inpainting tasks, often producing blurred textures or structurally inconsistent results. These limitations primarily stem from the insufficient exploitation of long-range dependencies and inadequate texture priors. To address these issues, we propose a novel two-stage image inpainting framework that integrates multi-directional texture priors with contextual information. In the first stage, we extract rich texture features from corrupted images using Gabor filters, which simulate human visual perception. These features are then fused to guide a texture inpainting network, where a Multi-Scale Dense Skip Connection (MSDSC) module is introduced to bridge semantic gaps across different feature levels. In the second stage, we design a hierarchical texture-aware guided image completion network that utilizes the repaired textures as auxiliary guidance. Specifically, a contextual attention module is incorporated to capture long-range spatial dependencies and enhance structural consistency. Extensive experiments conducted on three challenging benchmarks, namely CelebA-HQ, Places2, and Paris Street View, demonstrate that our method outperforms existing state-of-the-art approaches in both quantitative metrics and visual quality. The proposed framework significantly improves the realism and coherence of inpainting results, particularly for images with large missing regions or complex textures. The code is available at https://github.com/Runing-Lab/HTA2I.git.
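The first stage's texture priors come from Gabor filters, which are available directly in OpenCV. The snippet below shows one plausible way to build such multi-directional responses; the kernel size, sigma, wavelength, and number of orientations are illustrative parameters, not the authors' settings.

```python
import cv2
import numpy as np


def gabor_texture_priors(gray: np.ndarray, n_orientations: int = 4) -> np.ndarray:
    """Multi-directional texture responses from a small Gabor filter bank.
    gray: single-channel image; returns an (n_orientations, H, W) float32 stack."""
    responses = []
    for k in range(n_orientations):
        theta = k * np.pi / n_orientations            # evenly spaced orientations
        kernel = cv2.getGaborKernel(ksize=(15, 15), sigma=3.0, theta=theta,
                                    lambd=8.0, gamma=0.5, psi=0)
        responses.append(cv2.filter2D(gray, cv2.CV_32F, kernel))
    return np.stack(responses, axis=0)


# usage (hypothetical file name):
# priors = gabor_texture_priors(cv2.imread("corrupted.png", cv2.IMREAD_GRAYSCALE))
```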
{"title":"Hierarchical texture-aware image inpainting via contextual attention and multi-scale fusion","authors":"Runing Li ,&nbsp;Jiangyan Dai ,&nbsp;Qibing Qin ,&nbsp;Chengduan Wang ,&nbsp;Yugen Yi ,&nbsp;Jianzhong Wang","doi":"10.1016/j.imavis.2025.105875","DOIUrl":"10.1016/j.imavis.2025.105875","url":null,"abstract":"<div><div>Image inpainting aims to restore missing regions in images with visually coherent and semantically plausible content. Although deep learning methods have achieved significant progress, current approaches still face challenges in handling large-area image inpainting tasks, often producing blurred textures or structurally inconsistent results. These limitations primarily stem from the insufficient exploitation of long-range dependencies and inadequate texture priors. To address these issues, we propose a novel two-stage image inpainting framework that integrates multi-directional texture priors with contextual information. In the first stage, we extract rich texture features from corrupted images using Gabor filters, which simulate human visual perception. These features are then fused to guide a texture inpainting network, where a Multi-Scale Dense Skip Connection (MSDSC) module is introduced to bridge semantic gaps across different feature levels. In the second stage, we design a hierarchical texture-aware guided image completion network that utilizes the repaired textures as auxiliary guidance. Specifically, a contextual attention module is incorporated to capture long-range spatial dependencies and enhance structural consistency. Extensive experiments conducted on three challenging benchmarks, such as CelebA-HQ, Places2, and Paris Street View, demonstrate that our method outperforms existing state-of-the-art approaches in both quantitative metrics and visual quality. The proposed framework significantly improves the realism and coherence of inpainting results, particularly for images with large missing regions or complex textures. The code is available at <span><span>https://github.com/Runing-Lab/HTA2I.git</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"167 ","pages":"Article 105875"},"PeriodicalIF":4.2,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145842592","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
LoRA-empowered efficient diffusion for accurate fine-grained detail rendering in real-image cartoonization
IF 4.2 | CAS Tier 3, Computer Science | Q2 in COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-03-01 | Epub Date: 2026-01-06 | DOI: 10.1016/j.imavis.2026.105898
Mingjin Liu, Yien Li
Recent advances in generative models have enabled diverse applications, from text-to-image synthesis to artistic content creation. However, generating high-quality, domain-specific content — particularly for culturally unique styles like Chinese opera — remains challenging due to limited generalization on long-tail data and the high cost of fine-tuning with specialized datasets. To address these limitations, we propose DreamOpera, a novel framework for transforming real-world Chinese opera character photographs into stylized cartoon representations. Our approach leverages a two-step process: (1) feature extraction using a pre-trained encoder to capture key visual attributes (e.g., clothing, facial features), and (2) domain transformation via a LoRA-fine-tuned diffusion model trained on a small, unpaired dataset of cartoon-style opera images. This strategy bypasses the need for costly paired data while preserving fine-grained details. Experiments demonstrate that DreamOpera outperforms existing methods in generating high-fidelity, culturally nuanced artwork, offering practical value for cultural dissemination and digital art.
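The LoRA fine-tuning the abstract relies on boils down to adding a trainable low-rank update to frozen projection weights. A generic PyTorch sketch follows; the rank, scaling, and the choice to wrap nn.Linear layers (in practice the attention projections of the diffusion model) are standard-recipe assumptions rather than DreamOpera specifics.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen base projection plus a trainable low-rank update, W x + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # only the low-rank factors are tuned
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)               # start as an identity update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))


# usage: wrap an existing projection, e.g. lora_proj = LoRALinear(nn.Linear(768, 768), rank=8)
```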
{"title":"LoRA-empowered efficient diffusion for accurate fine-grained detail rendering in real-image cartoonization","authors":"Mingjin Liu ,&nbsp;Yien Li","doi":"10.1016/j.imavis.2026.105898","DOIUrl":"10.1016/j.imavis.2026.105898","url":null,"abstract":"<div><div>Recent advances in generative models have enabled diverse applications, from text-to-image synthesis to artistic content creation. However, generating high-quality, domain-specific content — particularly for culturally unique styles like Chinese opera — remains challenging due to limited generalization on long-tail data and the high cost of fine-tuning with specialized datasets. To address these limitations, we propose DreamOpera, a novel framework for transforming real-world Chinese opera character photographs into stylized cartoon representations. Our approach leverages a two-step process: (1) feature extraction using a pre-trained encoder to capture key visual attributes (e.g., clothing, facial features), and (2) domain transformation via a LoRA-fine-tuned diffusion model trained on a small, unpaired dataset of cartoon-style opera images. This strategy bypasses the need for costly paired data while preserving fine-grained details. Experiments demonstrate that DreamOpera outperforms existing methods in generating high-fidelity, culturally nuanced artwork, offering practical value for cultural dissemination and digital art.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"167 ","pages":"Article 105898"},"PeriodicalIF":4.2,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145978221","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
CNN-CECA: Underwater image enhancement via CNN-driven nonlinear curve estimation and channel-wise attention in multi-color spaces
IF 4.2 | CAS Tier 3, Computer Science | Q2 in COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-03-01 | Epub Date: 2026-01-26 | DOI: 10.1016/j.imavis.2026.105916
Imran Afzal, Guo Jichang, Fazeela Siddiqui, Muhammad Fahad
High-quality underwater images are essential for marine exploration, environmental monitoring, and scientific analysis. However, they are degraded by light attenuation, scattering, and wavelength-dependent absorption, which cause color shifts, low contrast, and detail loss. Furthermore, many existing deep learning techniques function as black boxes, offering limited interpretability and often generalizing poorly across diverse underwater conditions. To address this, we propose CNN-CECA, a novel deep learning framework whose core innovation is the hybrid integration of a convolutional backbone with physically-inspired, non-linear curve estimation across multiple color spaces. A lightweight CNN adjusts brightness, contrast, and color balance, and ResNet-50 guides the analysis of polynomial, sigmoid, and exponential curves in RGB, HSV, and CIELab, enabling both global and local adaptation. A key component is our novel Triple Channel-wise Attention (TCA) module, which fuses results across the three color spaces, dynamically allocating weights to recover natural colors and delicate structures. Post-processing with contrast stretching and edge sharpening adds final refinement while preserving efficiency for real-time use. Extensive experiments on synthetic and real-world datasets (e.g., UIEB, UCCS, EUVP, and NYU-v2) demonstrate superior quantitative scores and visually faithful restorations compared with traditional and state-of-the-art methods. Ablation studies verify the contributions of curve estimation and attention. This interpretable and adaptive approach offers a robust, scalable, and efficient solution for underwater image enhancement and is broadly applicable to vision tasks supporting autonomous platforms and human operators. The approach generalizes well across scenes and varying water conditions globally.
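The nonlinear curve estimation can be made concrete with the three curve families the abstract names. The NumPy sketch below applies them to a normalized channel; the exact parameterizations used by CNN-CECA (whose parameters are predicted by the CNN) may differ, so these forms are illustrative.

```python
import numpy as np


def polynomial_curve(x: np.ndarray, a: float) -> np.ndarray:
    """Quadratic adjustment curve x + a * x * (1 - x), a common curve-estimation form."""
    return x + a * x * (1.0 - x)


def sigmoid_curve(x: np.ndarray, gain: float, midpoint: float) -> np.ndarray:
    """S-shaped contrast remapping around a midpoint."""
    return 1.0 / (1.0 + np.exp(-gain * (x - midpoint)))


def exponential_curve(x: np.ndarray, gamma: float) -> np.ndarray:
    """Gamma-style exponential remapping."""
    return np.power(x, gamma)


# usage on a normalized channel in [0, 1]; the parameters would come from the CNN
channel = np.random.rand(64, 64).astype(np.float32)
enhanced = sigmoid_curve(polynomial_curve(channel, a=0.4), gain=8.0, midpoint=0.5)
```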
{"title":"CNN-CECA: Underwater image enhancement via CNN-driven nonlinear curve estimation and channel-wise attention in multi-color spaces","authors":"Imran Afzal,&nbsp;Guo Jichang,&nbsp;Fazeela Siddiqui,&nbsp;Muhammad Fahad","doi":"10.1016/j.imavis.2026.105916","DOIUrl":"10.1016/j.imavis.2026.105916","url":null,"abstract":"<div><div>High-quality underwater images are essential for marine exploration, environmental monitoring, and scientific analysis. However, they are degraded by light attenuation, scattering, and wavelength-dependent absorption, which cause color shifts, low contrast, and detail loss. Furthermore, many existing deep learning techniques function as black boxes, offering limited interpretability and often generalizing poorly across diverse underwater conditions. To address this, we propose CNN-CECA, a novel deep learning framework whose core innovation is the hybrid integration of a convolutional backbone with physically-inspired, non-linear curve estimation across multiple color spaces. A lightweight CNN adjusts brightness, contrast, and color balance, and ResNet-50 guides the analysis of polynomial, sigmoid, and exponential curves in RGB, HSV, and CIELab, enabling both global and local adaptation. A key component is our novel Triple Channel-wise Attention (TCA) module, which fuses results across the three color spaces, dynamically allocating weights to recover natural colors and delicate structures. Post-processing with contrast stretching and edge sharpening adds final refinement while preserving efficiency for real-time use. Extensive experiments on synthetic and real-world datasets (e.g., UIEB, UCCS, EUVP, and NYU-v2) demonstrate superior quantitative scores and visually faithful restorations compared with traditional and state-of-the-art methods. Ablation studies verify the contributions of curve estimation and attention. This interpretable and adaptive approach offers a robust, scalable, and efficient solution for underwater image enhancement and is broadly applicable to vision tasks supporting autonomous platforms and human operators. The approach generalizes well across scenes and varying water conditions globally.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"167 ","pages":"Article 105916"},"PeriodicalIF":4.2,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146078417","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Object-level semantic alignment for enhancing fidelity in text-to-image generation with diffusion models
IF 4.2 | CAS Tier 3, Computer Science | Q2 in COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-03-01 | Epub Date: 2026-01-29 | DOI: 10.1016/j.imavis.2026.105923
Wenna Liu, Na Tian, Youjia Shao, Wencang Zhao
Text-to-image diffusion models have achieved remarkable success in generating diverse images. However, these models still face the challenge of semantic misalignment when handling text prompts containing multiple entities and attributes. It leads to object omission and attribute confusion, which impacts image fidelity. Inspired by the object-oriented structure implicit in the text prompt, we treat the text prompt as an organic system composed of objects, attributes and their interrelationships, aiming to unveil the underlying logic and semantic connections. We propose an object-centered attention map alignment method guided by the text’s syntactic structure to address the aforementioned issues. Firstly, we dynamically integrate textual semantic information through syntactic parsing and attention mechanisms, ensuring the model fully understands the prompt’s content. Then, we leverage fine-grained semantic-guided entity mask generation to accurately locate the target objects and alleviate the issue of object omission. Finally, we design a novel object-centric dual-loss binding function. The positive loss reinforces the association between objects and their attributes, while the negative loss mitigates the interference of irrelevant information, ensuring precise matching between objects and their attributes. Extensive experiments on the ABC-6K and AnE datasets demonstrate that the generated images confirm the model’s ability to accurately produce the objects and their corresponding visual attributes, further validating the effectiveness and superiority of our method.
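One plausible reading of the object-centric dual-loss binding is sketched below: a positive term that concentrates an attribute token's cross-attention inside its object mask and a negative term that pushes an irrelevant token's attention out of it. The tensor shapes and normalization are assumptions made for illustration, not the paper's exact formulation.

```python
import torch


def dual_binding_loss(attr_attn: torch.Tensor,
                      obj_mask: torch.Tensor,
                      irrelevant_attn: torch.Tensor,
                      eps: float = 1e-6) -> torch.Tensor:
    """attr_attn, irrelevant_attn: (H, W) cross-attention maps for the attribute
    token and an unrelated token; obj_mask: (H, W) binary (0/1) float mask of the
    object. The positive term pulls the attribute's attention inside the object
    region; the negative term pushes irrelevant attention out of it."""
    attr_attn = attr_attn / (attr_attn.sum() + eps)
    irrelevant_attn = irrelevant_attn / (irrelevant_attn.sum() + eps)
    positive = 1.0 - (attr_attn * obj_mask).sum()
    negative = (irrelevant_attn * obj_mask).sum()
    return positive + negative


# usage with toy tensors
loss = dual_binding_loss(torch.rand(16, 16), (torch.rand(16, 16) > 0.5).float(), torch.rand(16, 16))
```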
{"title":"Object-level semantic alignment for enhancing fidelity in text-to-image generation with diffusion models","authors":"Wenna Liu ,&nbsp;Na Tian ,&nbsp;Youjia Shao ,&nbsp;Wencang Zhao","doi":"10.1016/j.imavis.2026.105923","DOIUrl":"10.1016/j.imavis.2026.105923","url":null,"abstract":"<div><div>Text-to-image diffusion models have achieved remarkable success in generating diverse images. However, these models still face the challenge of semantic misalignment when handling text prompts containing multiple entities and attributes. It leads to object omission and attribute confusion, which impacts image fidelity. Inspired by the object-oriented structure implicit in the text prompt, we treat the text prompt as an organic system composed of objects, attributes and their interrelationships, aiming to unveil the underlying logic and semantic connections. We propose an object-centered attention map alignment method guided by the text’s syntactic structure to address the aforementioned issues. Firstly, we dynamically integrate textual semantic information through syntactic parsing and attention mechanisms, ensuring the model fully understands the prompt’s content. Then, we leverage fine-grained semantic-guided entity mask generation to accurately locate the target objects and alleviate the issue of object omission. Finally, we design a novel object-centric dual-loss binding function. The positive loss reinforces the association between objects and their attributes, while the negative loss mitigates the interference of irrelevant information, ensuring precise matching between objects and their attributes. Extensive experiments on the ABC-6K and AnE datasets demonstrate that the generated images confirm the model’s ability to accurately produce the objects and their corresponding visual attributes, further validating the effectiveness and superiority of our method.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"167 ","pages":"Article 105923"},"PeriodicalIF":4.2,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146173380","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Dual-stage network combining transformer and hybrid convolutions for stereo image super-resolution
IF 4.2 | CAS Tier 3, Computer Science | Q2 in COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-03-01 | Epub Date: 2025-12-29 | DOI: 10.1016/j.imavis.2025.105892
Jintao Zeng, Aiwen Jiang, Feiqiang Liu
Stereo image super-resolution aims to recover high-resolution images from given low-resolution left- and right-view images. Its challenges lie in fully extracting features from each view and skillfully integrating information across views. Almost all current super-resolution models employ a single-stage strategy based on either a Transformer or a convolutional neural network (CNN). For highly nonlinear problems, a single-stage network may not achieve ideal performance at acceptable complexity. In this paper, we propose a dual-stage stereo image super-resolution network (DSSRNet) that integrates the complementary advantages of Transformers and convolutions. Specifically, we design a cross-stage attention module (CASM) to carry informative features between successive stages. Moreover, we utilize Fourier convolutions to efficiently model global and local features, which benefits the restoration of image details and texture. We have compared the proposed DSSRNet with several state-of-the-art methods on public benchmark datasets. Comprehensive experiments demonstrate that DSSRNet restores clear structural features and richer texture details, achieving leading performance on the PSNR, SSIM, and LPIPS metrics with an acceptable computation burden in the stereo image super-resolution field. Related source code and models will be released at https://github.com/Zjtao-lab/DSSRNet.
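The cross-stage attention module (CASM) can be pictured as stage-2 features querying stage-1 features so that informative detail from the first stage is carried into the second. The PyTorch sketch below is one such reading; the head count, normalization placement, and residual wiring are illustrative guesses, not the released DSSRNet code.

```python
import torch
import torch.nn as nn


class CrossStageAttention(nn.Module):
    """Stage-2 features attend to stage-1 features (one reading of a cross-stage
    attention module bridging two restoration stages)."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, stage2_feat: torch.Tensor, stage1_feat: torch.Tensor) -> torch.Tensor:
        # both inputs: (B, C, H, W) -> flatten to token sequences (B, H*W, C)
        b, c, h, w = stage2_feat.shape
        q = stage2_feat.flatten(2).transpose(1, 2)
        kv = stage1_feat.flatten(2).transpose(1, 2)
        out, _ = self.attn(self.norm(q), kv, kv)
        return (q + out).transpose(1, 2).reshape(b, c, h, w)  # residual, back to (B, C, H, W)


if __name__ == "__main__":
    s1, s2 = torch.randn(1, 64, 24, 24), torch.randn(1, 64, 24, 24)
    print(CrossStageAttention(64)(s2, s1).shape)  # torch.Size([1, 64, 24, 24])
```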
{"title":"Dual-stage network combining transformer and hybrid convolutions for stereo image super-resolution","authors":"Jintao Zeng ,&nbsp;Aiwen Jiang ,&nbsp;Feiqiang Liu","doi":"10.1016/j.imavis.2025.105892","DOIUrl":"10.1016/j.imavis.2025.105892","url":null,"abstract":"<div><div>Stereo image super-resolution aims to recover high-resolution image from given low-resolution left and right view images. Its challenges lie in fully feature extraction on each perspective and skillfully information integration from different perspectives. Among current methods, almost all super-resolution models employ single-stage strategy either based on transformer or convolution neural network(CNN). For highly nonlinear problems, single-stage network may not achieve very ideal performance with acceptable complexity. In this paper, we have proposed a dual-stage stereo image super-resolution network (DSSRNet) which integrates the complementary advantages of transformer and convolutions. Specifically, we design cross-stage attention module (CASM) to bridge informative feature transmission between successive stages. Moreover, we utilize fourier convolutions to efficiently model global and local features, benefiting restoring image details and texture. We have compared the proposed DSSRNet with several state-of-the-art methods on public benchmark datasets. The comprehensive experiments demonstrate that DSSRNet can restore clear structural features and richer texture details, achieving leading performance on PSNR, SSIM and LPIPS metrics with acceptable computation burden in stereo image super-resolution field. Related source codes and models will be released on <span><span>https://github.com/Zjtao-lab/DSSRNet</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"167 ","pages":"Article 105892"},"PeriodicalIF":4.2,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145885581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0