
Latest articles from Image and Vision Computing

OIDSty: One-shot identity-preserving face stylization
IF 4.2 | CAS Zone 3, Computer Science | Q2 Computer Science, Artificial Intelligence | Pub Date: 2026-03-01 | Epub Date: 2026-01-07 | DOI: 10.1016/j.imavis.2026.105899
Kairui Wang , Xinying Liu , Di Zhao , Xuelei Geng , Tian Xian , Yonghao Chang
In recent years, image generation techniques based on diffusion models have made significant progress in the field of facial stylization. However, existing methods still struggle to achieve high identity fidelity while maintaining strong stylistic expressiveness, particularly in balancing the geometric deformations introduced by stylization against the preservation of fine facial attributes (such as facial layout and pose). To address this issue, this paper proposes OIDSty, a novel single-sample facial stylization system. Its core innovation lies in decoupling the identity-preservation and style-injection tasks across distinct attention layers, achieved primarily through two key designs: (1) a High-Fidelity Identity Module, which combines strong semantic conditions with weak spatial conditions to guide the cross-attention layers, enabling precise retention of core identity and facial layout while still permitting stylized geometric deformations; and (2) a DINO-Style Texture Guidance Module, which introduces a DINO-based style loss into the self-attention layers to measure the feature difference between the ideal stylized output and the current output. This loss is integrated into the denoising sampling process, dynamically calibrating latent features through its gradients to ensure efficient and accurate transfer of stylized textures onto the target image. Extensive experiments demonstrate that OIDSty generates high-fidelity, stylistically distinct images across multiple styles. Compared to existing state-of-the-art methods, it shows significant advantages on all objective and subjective evaluation metrics without requiring complex parameter tuning.
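The gradient-based latent calibration described above can be illustrated with a toy numpy sketch. This is not the paper's implementation: `style_loss` is a quadratic stand-in for the DINO feature difference, and `guidance_scale` is an illustrative parameter.

```python
import numpy as np

# Toy stand-in for the DINO feature difference between the ideal stylized
# output and the current output (the real loss compares DINO features).
def style_loss(latent, style_target):
    return 0.5 * np.sum((latent - style_target) ** 2)

def style_loss_grad(latent, style_target):
    # Analytic gradient of the toy quadratic loss above.
    return latent - style_target

def guided_step(latent, style_target, guidance_scale=0.1):
    # After each denoising step, nudge the latent down the style-loss
    # gradient, mirroring how the loss is integrated into sampling.
    return latent - guidance_scale * style_loss_grad(latent, style_target)

latent = np.zeros(4)        # current latent features
style_target = np.ones(4)   # idealized style feature target
for _ in range(50):
    latent = guided_step(latent, style_target)
# The latent is progressively calibrated toward the style target.
```

The same pattern (loss on intermediate features, gradient step on the latent) underlies most guidance-based diffusion sampling.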
{"title":"OIDSty: One-shot identity-preserving face stylization","authors":"Kairui Wang ,&nbsp;Xinying Liu ,&nbsp;Di Zhao ,&nbsp;Xuelei Geng ,&nbsp;Tian Xian ,&nbsp;Yonghao Chang","doi":"10.1016/j.imavis.2026.105899","DOIUrl":"10.1016/j.imavis.2026.105899","url":null,"abstract":"<div><div>In recent years, image generation techniques based on diffusion models have made significant progress in the field of facial stylization. However, existing methods still face challenges in achieving high identity fidelity while maintaining strong stylistic expressiveness, particularly in balancing the geometric deformations introduced by stylization with the preservation of fine facial features (such as facial features and poses). To address this issue, this paper proposes a novel single-sample facial stylization system—OIDSty. Its core innovation lies in decoupling identity preservation and style injection tasks across distinct attention layers, primarily achieved through two key designs: (1) High-Fidelity Identity Module, which innovatively combines strong semantic conditions and weak spatial conditions to guide cross-attention layers. This design enables precise retention of core identity and facial layout features while permitting stylized geometric deformations; (2) The DINO-Style Texture Guidance Module introduces this loss function into the self-attention layer to compute the feature difference between the ideal stylized output and the current output. This loss is integrated into the denoising sampling process, dynamically calibrating latent features through gradients to ensure efficient and accurate transfer of stylized textures onto the target image. Extensive experimental results demonstrate that OIDSty generates high-fidelity, stylistically distinct images across multiple styles. 
Compared to existing state-of-the-art methods, our method exhibits significant advantages across all objective and subjective evaluation metrics without requiring complex parameter tuning.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"167 ","pages":"Article 105899"},"PeriodicalIF":4.2,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145978219","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 0
MMDehazeNet: Cross-modality attention with feature correction and multi-scale encoding for visible-infrared dehazing
IF 4.2 | CAS Zone 3, Computer Science | Q2 Computer Science, Artificial Intelligence | Pub Date: 2026-03-01 | Epub Date: 2026-01-13 | DOI: 10.1016/j.imavis.2026.105896
Liangliang Duan
Haze-induced image degradation significantly degrades visual quality and impairs the performance of outdoor computer vision systems. Traditional single-image dehazing methods suffer from inherent limitations in dense haze scenarios due to the ill-posed nature of the problem. Leveraging complementary information from visible (RGB) and near-infrared (NIR) modalities offers a robust solution, as NIR signals exhibit superior penetration through atmospheric particles. This paper presents MMDehazeNet, a novel end-to-end multimodal fusion network for visible-infrared image dehazing. Adopting a U-Net-based dual-encoder architecture, it jointly processes hazy RGB and NIR images, with three key innovations: (1) a Gated Cross-Modality Attention (GCMA) module for efficient multi-level fusion; (2) a Multimodal Feature Correction (MMFC) module with a learned gating mechanism for adaptive inter-modal alignment; and (3) Multi-Scale Convolutional Layers (MSCL) for multi-receptive field feature extraction. Three variants (i.e., MMDehazeNet-S, -B, -L) are proposed. Extensive evaluations on the AirSim-VID, EPFL, and FANVID datasets demonstrate that MMDehazeNet achieves state-of-the-art performance. Quantitative and qualitative comparisons validate its significant superiority over existing single- and multi-modal methods, particularly under challenging medium and dense haze conditions.
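The gated cross-modality fusion idea can be sketched roughly as follows. This is a minimal numpy illustration, not the GCMA module itself; the weight matrix `w_gate` and the single-matrix attention are invented for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(rgb, nir, w_gate):
    """rgb, nir: (tokens, dim) features from the two encoder branches."""
    # Cross-modality attention: RGB tokens query the NIR tokens.
    attn = softmax(rgb @ nir.T / np.sqrt(rgb.shape[-1]), axis=-1)
    attended = attn @ nir
    # A learned sigmoid gate decides, per channel, how much NIR
    # evidence to mix into the RGB stream.
    gate = 1.0 / (1.0 + np.exp(-(rgb @ w_gate)))
    return gate * attended + (1.0 - gate) * rgb

rng = np.random.default_rng(0)
rgb = rng.normal(size=(5, 8))
nir = rng.normal(size=(5, 8))
fused = gated_cross_attention(rgb, nir, w_gate=rng.normal(size=(8, 8)))
```

The gate keeps the fusion adaptive: where NIR adds little (e.g., haze-free regions), the output falls back toward the RGB features.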
{"title":"MMDehazeNet: Cross-modality attention with feature correction and multi-scale encoding for visible-infrared dehazing","authors":"Liangliang Duan","doi":"10.1016/j.imavis.2026.105896","DOIUrl":"10.1016/j.imavis.2026.105896","url":null,"abstract":"<div><div>Haze-induced image degradation significantly degrades visual quality and impairs the performance of outdoor computer vision systems. Traditional single-image dehazing methods suffer from inherent limitations in dense haze scenarios due to the ill-posed nature of the problem. Leveraging complementary information from visible (RGB) and near-infrared (NIR) modalities offers a robust solution, as NIR signals exhibit superior penetration through atmospheric particles. This paper presents MMDehazeNet, a novel end-to-end multimodal fusion network for visible-infrared image dehazing. Adopting a U-Net-based dual-encoder architecture, it jointly processes hazy RGB and NIR images, with three key innovations: (1) a Gated Cross-Modality Attention (GCMA) module for efficient multi-level fusion; (2) a Multimodal Feature Correction (MMFC) module with a learned gating mechanism for adaptive inter-modal alignment; and (3) Multi-Scale Convolutional Layers (MSCL) for multi-receptive field feature extraction. Three variants (i.e., MMDehazeNet-S, -B, -L) are proposed. Extensive evaluations on the AirSim-VID, EPFL, and FANVID datasets demonstrate that MMDehazeNet achieves state-of-the-art performance. 
Quantitative and qualitative comparisons validate its significant superiority over existing single- and multi-modal methods, particularly under challenging medium and dense haze conditions.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"167 ","pages":"Article 105896"},"PeriodicalIF":4.2,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145978220","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 0
MSTVQA: A multi-path dynamic perception method for video quality assessment
IF 4.2 | CAS Zone 3, Computer Science | Q2 Computer Science, Artificial Intelligence | Pub Date: 2026-03-01 | Epub Date: 2025-12-24 | DOI: 10.1016/j.imavis.2025.105891
Junwei Qi , Yingzhen Wang , Jingpeng Gao , Yichen Wu , Pujiang Liu
The proliferation of self-media and smart devices has led to uneven video quality on streaming platforms, creating an urgent need for effective automated video quality assessment (VQA) methods. However, most existing VQA methods fail to fully account for the dynamic adaptability of the human visual perception system and its synergistic mechanisms. In this study, we propose a novel multi-path sensing framework for VQA that enhances the model's progressive perception capability. Specifically, the complete video is divided into three perceptual levels: patch clips, a sampled frame stream, and inter-frame differences, with a balance factor assigning a different perceptual weight to each level. First, a patch sampling method reduces the model's input data while aligning temporal information, so that subtle motion features can be extracted from patch clips. Next, to further enhance the representation of local high-frequency details, a global variance-guided temporal-dimension attention mechanism and a spatial feature aggregation pool are used to accurately fit the sampled frame sequence. Finally, by embedding the feature-map differences between consecutive frames and exploiting the long-range spatio-temporal dependencies of the Transformer to model global dynamic evolution, the model achieves progressive interaction of cross-scale spatio-temporal information. In addition, an improved temporal hysteresis pool strengthens the ability to capture nonlinear dynamics in time-series data and more faithfully simulates subtle changes in human visual perception. Experimental results show that the proposed method outperforms existing no-reference VQA (NR-VQA) approaches across five in-the-wild datasets. In particular, it achieves outstanding performance on the CVD2014 dataset, which is the smallest in scale and contains the fewest scene variations, reaching a PLCC of 0.927 and an SRCC of 0.925.
These results clearly demonstrate the effectiveness and advantages of our method in the VQA task.
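The three-path weighting with a balance factor can be sketched as follows. This is a hypothetical numpy example: `balance_logits` and both helper names are illustrative, not from the paper.

```python
import numpy as np

def frame_diffs(frames):
    # Inter-frame differences: one of the three perceptual levels.
    return frames[1:] - frames[:-1]

def fuse_path_scores(patch_score, frame_score, diff_score, balance_logits):
    # A softmax over the balance factors yields perceptual weights that
    # sum to 1, one per path (patch clips, frame stream, differences).
    w = np.exp(balance_logits - np.max(balance_logits))
    w = w / w.sum()
    return float(w @ np.array([patch_score, frame_score, diff_score]))
```

With equal balance logits the three paths contribute equally, so the fused score reduces to a simple mean of the per-path scores.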
{"title":"MSTVQA: A multi-path dynamic perception method for video quality assessment","authors":"Junwei Qi ,&nbsp;Yingzhen Wang ,&nbsp;Jingpeng Gao ,&nbsp;Yichen Wu ,&nbsp;Pujiang Liu","doi":"10.1016/j.imavis.2025.105891","DOIUrl":"10.1016/j.imavis.2025.105891","url":null,"abstract":"<div><div>The proliferation of self-media and smart devices has led to uneven video quality on streaming platforms, so there is an urgent need for effective automated video quality assessment (VQA) methods. But most existing VQA methods fail to fully consider dynamic adaptability of the human visual perception system and its synergistic mechanism. In this study, we proposed a novel multi-path sensing framework for VQA to enhance the progressive sensing capability of the model. Specifically, the complete video has to be divided into three perceptual levels: patch clips, sampled frame stream, and inter-frame differences, a balance factor is used to give different levels of perceptual weights. Firstly, the purpose of defining a patch sampling method is to reduce the input data of the model while aligning temporal information, to extract subtle motion features in patch clips. After that, to further enhance the representation of local high-frequency details, the global variance-guided temporal dimension attention mechanism and spatial feature aggregation pool are used to accurately fit the sampling frame sequence. Finally, by embedding the feature map differences between consecutive frames and utilizing the long-term spatio-temporal dependence of Transformer to simulate the global dynamic evolution, the model achieves progressive interaction of cross scale spatio-temporal information. In addition, the improved temporal hysteresis pool enhances the ability to capture nonlinear dynamics in time series data and can more faithfully simulate subtle changes in the human visual perception system. 
Experimental results show that the proposed method outperforms existing no-reference VQA (NR-VQA) approaches across five in-the-wild datasets. In particular, it achieves outstanding performance on the CVD2014 dataset, which is the smallest in scale and contains the fewest scene variations, reaching a PLCC of 0.927 and an SRCC of 0.925. These results clearly demonstrate the effectiveness and advantages of our method in the VQA task.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"167 ","pages":"Article 105891"},"PeriodicalIF":4.2,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145842590","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 0
Enhancing UAV small target detection: A balanced accuracy-efficiency algorithm with tiered feature focus
IF 4.2 | CAS Zone 3, Computer Science | Q2 Computer Science, Artificial Intelligence | Pub Date: 2026-03-01 | Epub Date: 2026-01-10 | DOI: 10.1016/j.imavis.2026.105897
Hanwei Guo, Shugang Liu
Small target detection in unmanned aerial vehicle (UAV) imagery is crucial for both military and civilian applications. However, balancing detection performance, efficiency, and a lightweight architecture remains challenging. This paper introduces TF-DEIM-DFINE, a tiered focused small target detection model designed specifically for UAV tasks. We propose the Convolutional Gated-Visual Mamba (CG-VIM) module to enhance global dependency capture and local detail extraction through long-sequence modeling, along with the Half-Channel Single-Head Attention (HCSA) module for global modeling, which improves fine-grained representation while reducing computational redundancy. Additionally, our Tiered Focus-Feature Pyramid Network (TF-FPN) improves the representation of high-frequency information in multi-scale features without significantly increasing computational overhead. Experimental results on the VisDrone dataset demonstrate a 4.7% improvement in AP_M and a 5.8% improvement in AP, with a 37% reduction in parameter count and only a 6% increase in GFLOPs, while FPS remains unchanged. These results highlight TF-DEIM-DFINE's ability to improve detection accuracy while preserving a lightweight and efficient structure.
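One plausible reading of the half-channel idea, sketched in numpy, is to attend over only half of the channels and pass the rest through unchanged, roughly halving the attention cost. This is a guess at the mechanism, not the paper's HCSA definition, and the projection matrices are invented for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def half_channel_attention(x, wq, wk, wv):
    """x: (tokens, channels). Single-head attention on the first half
    of the channels; the second half is passed through untouched."""
    c = x.shape[1] // 2
    attn_in, passthrough = x[:, :c], x[:, c:]
    q, k, v = attn_in @ wq, attn_in @ wk, attn_in @ wv
    attn = softmax(q @ k.T / np.sqrt(c), axis=-1)
    return np.concatenate([attn @ v, passthrough], axis=1)

rng = np.random.default_rng(1)
x = rng.normal(size=(6, 8))
wq, wk, wv = (rng.normal(size=(4, 4)) for _ in range(3))
out = half_channel_attention(x, wq, wk, wv)
```

Because the query/key/value projections act on 4 channels instead of 8, the matrix multiplies shrink accordingly, which is the claimed source of the redundancy reduction.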
{"title":"Enhancing UAV small target detection: A balanced accuracy-efficiency algorithm with tiered feature focus","authors":"Hanwei Guo,&nbsp;Shugang Liu","doi":"10.1016/j.imavis.2026.105897","DOIUrl":"10.1016/j.imavis.2026.105897","url":null,"abstract":"<div><div>Small target detection in unmanned aerial vehicle (UAV) imagery is crucial for both military and civilian applications. However, achieving a balance between detection performance, efficiency, and lightweight architecture remains challenging. This paper introduces TF-DEIM-DFINE, a tiered focused small target detection model designed specifically for UAV tasks.We propose the Convolutional Gated-Visual Mamba (CG-VIM) module to enhance global dependency capture and local detail extraction through long sequence modeling, along with the Half-Channel Single-Head Attention (HCSA) module for global modeling, which improves fine-grained representation while reducing computational redundancy. Additionally, our Tiered Focus-Feature Pyramid Networks (TF-FPN) improve the representational capability of high-frequency information in multi-scale features without significantly increasing computational overhead. Experimental results on the VisDrone dataset demonstrate a 4.7% improvement in AP<span><math><msub><mrow></mrow><mrow><mtext>M</mtext></mrow></msub></math></span> and a 5.8% improvement in AP metrics, with a 37% reduction in parameter count and only a 6% increase in GFLOPs, maintaining unchanged FPS. 
These results highlight TF-DEIM-DFINE’s ability to improve detection accuracy while preserving a lightweight and efficient structure</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"167 ","pages":"Article 105897"},"PeriodicalIF":4.2,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145927627","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 0
PD-DDPM: Prior-driven diffusion model for single image dehazing
IF 4.2 | CAS Zone 3, Computer Science | Q2 Computer Science, Artificial Intelligence | Pub Date: 2026-03-01 | Epub Date: 2025-12-24 | DOI: 10.1016/j.imavis.2025.105888
Haoqin Sun, Jindong Xu, Jiaxin Gong, Yijie Wang
Haze significantly reduces the visual quality of images, particularly in dense atmospheric conditions, resulting in a substantial loss of perceptible structural and semantic information. This degradation negatively affects the performance of vision-based systems in critical applications such as autonomous navigation and intelligent surveillance. Consequently, single image dehazing has been recognized as a challenging inverse problem, aiming to restore clear images from hazy observations. Although significant progress has been made with existing dehazing approaches, the intrinsic mixing of haze-related features with unrelated image content often leads to distortions in color and detail preservation, limiting restoration accuracy. In recent years, the Denoising Diffusion Probabilistic Model (DDPM) has demonstrated excellent performance in image generation and restoration tasks. However, its effectiveness in single image dehazing remains constrained by irrelevant image content and by temporal redundancy during sampling. To address these limitations, we propose a diffusion model-based dehazing method that effectively recovers image content by integrating both local and global priors through differential convolution. Furthermore, the generative capability of DDPM is exploited to enhance image texture and fine details. To reduce temporal redundancy during the diffusion process, a noise addition strategy based on the Fibonacci Sequence is introduced, which significantly optimizes the sampling time and improves overall computational efficiency. Experimental validation shows that the proposed method requires only 1/5 to 1/6 of the time required by the linear noise addition method. Additionally, the overall network achieves excellent performance on both synthetic and real dehazing datasets.
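The Fibonacci-based timestep selection can be sketched in plain Python. This is an illustrative reconstruction, since the abstract does not specify the exact schedule: sampling only at Fibonacci-numbered timesteps visits far fewer steps than a linear schedule.

```python
def fibonacci_timesteps(t_max):
    # Collect Fibonacci numbers up to t_max, then sample from the
    # noisiest timestep down to the cleanest.
    a, b = 1, 2
    steps = [a]
    while b <= t_max:
        steps.append(b)
        a, b = b, a + b
    return steps[::-1]

# With t_max = 1000, only 15 of the 1000 timesteps are visited.
schedule = fibonacci_timesteps(1000)
```

Note the schedule is dense near step 1 and sparse near `t_max`, concentrating computation on the late, fine-detail stages of denoising.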
{"title":"PD-DDPM: Prior-driven diffusion model for single image dehazing","authors":"Haoqin Sun,&nbsp;Jindong Xu,&nbsp;Jiaxin Gong,&nbsp;Yijie Wang","doi":"10.1016/j.imavis.2025.105888","DOIUrl":"10.1016/j.imavis.2025.105888","url":null,"abstract":"<div><div>Haze significantly reduces the visual quality of images, particularly in dense atmospheric conditions, resulting in a substantial loss of perceptible structural and semantic information. This degradation negatively affects the performance of vision-based systems in critical applications such as autonomous navigation and intelligent surveillance. Consequently, single image dehazing has been recognized as a challenging inverse problem, aiming to restore clear images from hazy observations. Although significant progress has been made with existing dehazing approaches, the intrinsic mixing of haze-related features with unrelated image content often leads to distortions in color and detail preservation, limiting restoration accuracy. In recent years, Denoising Diffusion Probabilistic Model (DDPM) has demonstrated excellent performance in image generation and restoration tasks. However, the effectiveness of these methods in single image dehazing remains constrained by both irrelevant image content and temporal redundancy during sampling. To address these limitations, we propose a diffusion model-based dehazing method that effectively recovers image content by integrating both local and global priors through differential convolution. Furthermore, the generative capability of DDPM is exploited to enhance image texture and fine details. To reduce temporal redundancy during the diffusion process, a noise addition strategy based on the Fibonacci Sequence is introduced, which significantly optimizes the sampling time and improves overall computational efficiency. Experimental validation shows that the proposed method requires only 1/5 to 1/6 of the time required by the linear noise addition method. 
Additionally, the overall network achieves excellent performance in both synthetic and real dehazing datasets.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"167 ","pages":"Article 105888"},"PeriodicalIF":4.2,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145885580","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 0
KPTFusion: Knowledge Prior-based Task-Driven Multimodal Image Fusion
IF 4.2 | CAS Zone 3, Computer Science | Q2 Computer Science, Artificial Intelligence | Pub Date: 2026-03-01 | Epub Date: 2025-12-23 | DOI: 10.1016/j.imavis.2025.105886
Yubo Fu, Xia Ye, Xinyan Kong
Multimodal image fusion aims to generate fused images that are richer in information, more credible in content, and more useful in relevant downstream tasks. However, this task typically faces two major challenges. First, due to the lack of fusion ground truth, it is difficult to guide the model's parameters toward the optimal feature distribution without explicit supervision signals. Second, existing methods generally suffer from insufficient inter-modal feature interaction, limiting the network's ability to fully exploit the inherent complementarity of multimodal features. To address these issues, we propose the Knowledge Prior-based Task-Driven Multimodal Image Fusion (KPTFusion) framework. This framework introduces a knowledge prior that approximates the true distribution and sets corresponding task constraints for different downstream tasks, thereby guiding the network's fused output toward the target distribution. Specifically, we define the knowledge prior as the learning objective for the fusion distribution and further design a Task-Perception Constraint Module (TPCM) to guide the network toward the optimal distribution required for specific tasks. Additionally, to enhance inter-modal interactions, we embed a Dynamic Cross-Feature Module (DCA) within the network. This module uses a dual-stream attention mechanism to strengthen cross-modal feature interactions, ensuring the fused image fully preserves and integrates information from all modalities. Experimental results demonstrate that KPTFusion not only generates visually high-quality fusion outputs in infrared-visible and medical image fusion tasks but also achieves significant performance improvements in downstream tasks such as object detection and semantic segmentation based on the fusion results. This fully validates the effectiveness of its task-oriented fusion approach.
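The prior-plus-task objective can be caricatured as a weighted sum of a prior-matching term and a downstream task loss. This is a hedged sketch: the MSE prior term and the weight `lam` are assumptions for illustration, not the paper's actual TPCM formulation.

```python
import numpy as np

def task_driven_objective(fused, prior_target, task_loss, lam=0.5):
    # Prior term: pull the fused output toward the knowledge prior's
    # approximation of the target distribution (here, simple MSE).
    prior_term = float(np.mean((fused - prior_target) ** 2))
    # Task term: supervision from a downstream task, e.g. a detection
    # or segmentation loss computed on the fused image.
    return prior_term + lam * task_loss
```

The key property is that the fusion network receives gradients from both terms, so the output stays close to the prior while remaining useful for the chosen task.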
{"title":"KPTFusion: Knowledge Prior-based Task-Driven Multimodal Image Fusion","authors":"Yubo Fu,&nbsp;Xia Ye,&nbsp;Xinyan Kong","doi":"10.1016/j.imavis.2025.105886","DOIUrl":"10.1016/j.imavis.2025.105886","url":null,"abstract":"<div><div>Multimodal image fusion aims to generate fused images that are richer in information, more credible in content, and perform better in relevant downstream tasks. However, this task typically faces two major challenges: First, due to the lack of fusion ground truth, it is difficult to guide the model’s parameters to converge to the optimal feature distribution without explicit supervision signals; Second, existing methods generally suffer from insufficient intermodal feature interaction, limiting the network’s ability to fully exploit the inherent complementarity of multimodal features. To address these issues, we propose the Knowledge Prior-based Task-Driven Multimodal Image Fusion (KPTFusion) framework. This framework introduces a knowledge prior to approximate the true distribution and sets corresponding task constraints for different downstream tasks, thereby guiding the network’s fusion output to approximate the target distribution. Specifically, we define knowledge prior as the learning objective for the fusion distribution and further design a Task-Perception Constraint Module (TPCM) to guide the network toward the optimal distribution required for specific tasks. Additionally, to enhance intermodal interactions, we embed a Dynamic Cross-Feature Module (DCA) within the network. This module utilizes a dual-stream attention mechanism to strengthen cross-modal feature interactions, ensuring the fused image fully preserves and integrates information from all modalities. 
Experimental results demonstrate that KPTFusion not only generates visually high-quality fusion outputs in infrared-visible and medical image fusion tasks but also achieves significant performance improvements in downstream tasks such as object detection and semantic segmentation based on the fusion results. This fully validates the effectiveness of its task-oriented fusion approach.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"167 ","pages":"Article 105886"},"PeriodicalIF":4.2,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145842591","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 0
Disentangling co-occurrence with class-specific banks for Weakly Supervised Semantic Segmentation
IF 4.2 | CAS Zone 3, Computer Science | Q2 Computer Science, Artificial Intelligence | Pub Date: 2026-03-01 | Epub Date: 2026-01-12 | DOI: 10.1016/j.imavis.2025.105893
Hang Yao, Yuanchen Wu, Kequan Yang, Jide Li, Chao Yin, Zihang Li, Xiaoqiang Li
In Weakly Supervised Semantic Segmentation (WSSS), co-occurring objects often degrade the quality of Class Activation Maps (CAMs), ultimately compromising segmentation accuracy. Many recent WSSS methods leverage Contrastive Language-Image Pre-training (CLIP) by contrasting target-class images with text descriptions of background classes, thus providing additional supervision. However, these methods only rely on a shared background class set across all target classes, ignoring that each class has its own unique co-occurring objects. To resolve this limitation, this paper proposes a novel method that constructs semantically related class banks for each target class to disentangle co-occurring objects (dubbed DiCo). Specifically, DiCo first uses Large Language Models (LLMs) to generate semantically related class banks for each target class, which are further divided into negative and positive class banks to form contrastive pairs. The negative class banks include co-occurring objects related to the target class, while the positive class banks consist of the target class itself, along with its super-classes and sub-classes. By contrasting these negative and positive class banks with images through CLIP, DiCo disentangles target classes from co-occurring classes, simultaneously enhancing the semantic representations of the target class. Moreover, different classes have differential contributions to the disentanglement of co-occurring classes. DiCo introduces an adaptive weighting mechanism to adjust the contributions of co-occurring classes. Experimental results demonstrate that DiCo achieves superior performance compared to previous work on PASCAL VOC 2012 and MS COCO 2014.
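Contrasting an image against positive and negative class banks, with adaptive weighting of co-occurring classes, might look like the following toy numpy sketch. The vectors stand in for CLIP image/text embeddings, and the scoring rule is illustrative, not DiCo's actual loss.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def bank_contrast_score(img_emb, pos_bank, neg_bank):
    # Reward similarity to the positive bank (target class plus its
    # super-/sub-classes); penalize similarity to the negative bank
    # (co-occurring classes), weighting each negative adaptively by
    # how strongly it matches the image.
    pos = np.mean([cosine(img_emb, p) for p in pos_bank])
    neg_sims = np.array([cosine(img_emb, n) for n in neg_bank])
    w = np.exp(neg_sims) / np.exp(neg_sims).sum()
    return float(pos - w @ neg_sims)

target = np.array([1.0, 0.0])          # stands in for a target-class embedding
pos_bank = [np.array([1.0, 0.0])]
neg_bank = [np.array([0.0, 1.0]), np.array([0.0, -1.0])]  # co-occurring classes
score = bank_contrast_score(target, pos_bank, neg_bank)
```

An image that matches a co-occurring class instead of the target receives a lower score, which is the signal used to suppress co-occurrence in the CAMs.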
Image and Vision Computing, Volume 167, Article 105893 (2026).
Citations: 0
A video anomaly detection and classification method based on cross-modal feature alignment
IF 4.2 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date : 2026-03-01 Epub Date: 2025-12-20 DOI: 10.1016/j.imavis.2025.105874
Yan Fu, Ting Hou, Ou Ye, Gaolin Ye
Detecting anomalous behaviors in surveillance videos is crucial for enhancing public safety and industrial monitoring. However, existing methods typically only detect the presence of anomalies without identifying their specific types, making targeted responses difficult. Additionally, these methods fail to effectively capture the dynamic relationship between persistent and sudden anomalies in complex scenarios. To address these issues, we propose an innovative anomaly detection model based on a dual-branch architecture. This model uses a cross-modal alignment mechanism to explicitly associate visual features with semantic concepts, enabling it to discriminate based on interpretable semantic evidence, thereby significantly improving the accuracy of anomaly detection. Specifically, the coarse-grained branch introduces an additive dilated convolution pyramid collaborative module (ADCP) that uniquely replaces traditional large-scale matrix multiplication with additive operations. This module dynamically fuses temporal information at different time scales and avoids the over-mixing of anomaly types, maintaining long-term memory and stable information flow, allowing the model to flexibly capture the relationship between long-term trends and short-term fluctuations. We also design a dynamic smoothing enhancement module (DSE) that uses a weighted average mechanism with sliding windows of different sizes to dynamically integrate features in local periods, filtering out long-term noise and sudden fluctuations, aiding in more precise anomaly boundary detection. The fine-grained branch focuses on semantic information, converting raw text related to anomaly types into category labels and generating learnable prompt text features. By combining these with visual features, cosine similarity is computed to precisely identify anomaly types. Experimental results show significant improvements on the XD-Violence and UCF-Crime datasets.
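The DSE module's weighted multi-window smoothing lends itself to a short sketch. The following is a toy numpy rendering of that idea as we read it from the abstract, not the authors' code: per-frame anomaly scores are averaged with sliding windows of several sizes and the results fused by a weighted mean, so that both long-term noise and sudden spikes are attenuated. Window sizes and uniform weights are illustrative assumptions.

```python
import numpy as np

def dynamic_smooth(scores, windows=(3, 5, 9), weights=None):
    """Toy sketch of multi-window dynamic smoothing (DSE-style, assumed form).

    Each window size yields a moving average of the per-frame anomaly scores;
    the averages are fused with a weighted mean. Small windows preserve sudden
    fluctuations' locality, large windows suppress long-term noise.
    """
    scores = np.asarray(scores, dtype=float)
    if weights is None:
        weights = np.ones(len(windows)) / len(windows)  # uniform fusion weights
    smoothed = np.zeros_like(scores)
    for w, k in zip(weights, windows):
        kernel = np.ones(k) / k
        # mode="same" keeps sequence length; edges see zero-padded partial windows
        smoothed += w * np.convolve(scores, kernel, mode="same")
    return smoothed
```

A single-frame spike of height 1 is damped to roughly 0.21 under these defaults, while a constant score sequence passes through unchanged away from the edges, which is the boundary-sharpening behavior the module aims at.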
Image and Vision Computing, Volume 167, Article 105874 (2026).
Citations: 0
Multi-level global context fusion for camouflaged object detection
IF 4.2 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date : 2026-03-01 Epub Date: 2026-01-28 DOI: 10.1016/j.imavis.2026.105915
Baichuan Shen , Yan Dou , Yaolei Li , Wenjun Zhang , Xiaoyan Wang
Due to the low contrast between camouflaged objects and backgrounds, the diversity of object edge shapes, and occlusions in complex scenes, existing deep learning-based Camouflaged Object Detection (COD) methods still face significant challenges in achieving high-precision detection. These challenges include difficulties in extracting multi-scale detail features for small object detection, modeling global context in occluded scenarios, and accurately distinguishing the boundaries between objects and backgrounds in complex edge detection tasks. To address these issues, this paper proposes MGCF-Net (Multi-level Global Context Fusion Network), a novel approach that integrates multi-scale context learning and feature fusion. The method employs an improved Pyramid Vision Transformer (PVTv2) as the backbone, coupled with a Cross-Scale Self-Attention (CSSA) module and a Multi-scale Fusion Attention (MFA) module. A Guided Alignment Feature Module (GAFM) aligns multi-scale features, while a large-kernel convolution structure (SHRF) enhances the global context capture capability. Experimental results on several COD benchmark datasets show that the proposed method improves the structure measure, mean enhancement measure, and weighted F-measure by 2.2%, 2.1%, and 4.9%, respectively, over FEDER, the second-best overall performer, while reducing the mean absolute error (MAE) by 21.4%. It shows significant advantages in detection accuracy and generalization performance compared with several state-of-the-art (SOTA) methods.
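The multi-scale alignment step that GAFM performs can be illustrated in miniature. The snippet below is a deliberately simplified numpy sketch of generic multi-level feature fusion, assuming power-of-two scale ratios and nearest-neighbour upsampling; the actual module is a learned guided-alignment block, which this does not reproduce.

```python
import numpy as np

def upsample_nearest(feat, factor):
    # Nearest-neighbour upsampling of a (C, H, W) feature map.
    return feat.repeat(factor, axis=1).repeat(factor, axis=2)

def fuse_multiscale(feats):
    """Toy multi-level fusion: bring coarser pyramid levels (e.g. from a
    PVT-style backbone) up to the finest resolution, then average them.
    Assumes square maps whose sizes divide the finest resolution evenly.
    """
    target_h = max(f.shape[1] for f in feats)
    aligned = []
    for f in feats:
        factor = target_h // f.shape[1]
        aligned.append(upsample_nearest(f, factor))
    return np.mean(aligned, axis=0)
```

In a real network the averaging would be replaced by learned attention weights (the CSSA/MFA role), but the shape bookkeeping — every level resampled to a common resolution before fusion — is the same.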
Image and Vision Computing, Volume 167, Article 105915 (2026).
Citations: 0
Dynamic multi-scenario prompt learning with knowledge augmentation for image emotion analysis
IF 4.2 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date : 2026-03-01 Epub Date: 2025-12-20 DOI: 10.1016/j.imavis.2025.105884
Tan Chen, Guozeng Zhang, Yiwei Wei, Jialin Chen, Cheng Feng
Image emotion analysis (IEA) aims to identify and comprehend human emotional states from visual content, a task that has garnered significant attention due to the growing trend of expressing opinions online. Existing IEA approaches typically attempt to explore the emotional semantic space by generating prompts for images through fixed templates or randomly generated vectors. However, these methods neglect the diverse fine-grained emotions across scenes within the same emotional category, thereby limiting the nuanced expression of emotional semantics. Moreover, fine-grained emotional information is often abstract, and its quantity is unknown in advance, making its extraction particularly challenging. In light of this issue, we propose a novel approach, Dynamic Multi-Scenario Prompt Learning with Knowledge Augmentation (DMSP-KA). We first design a similarity-based selection mechanism (SSM) to construct fine-grained multi-scenario emotional knowledge for all emotional categories. Subsequently, we integrate the image’s intrinsic semantics with fine-grained emotional knowledge to generate a consistent emotional bias at the composite level, creating dynamic multi-scenario prompts (DMSP) for each instance. Additionally, we leverage predefined emotional texts to assist in building cross-modal semantic associations and enhancing emotional information fusion. Finally, we establish a caching mechanism (CM) based on the multi-scenario knowledge to improve the accuracy of single-emotion classification. Experimental results on four widely used emotion datasets demonstrate that our proposed method outperforms current state-of-the-art (SOTA) approaches, achieving accuracies of 80.68% on FI, 73.74% on EmotioRoI, 92.13% on TwitterI, and 88.72% on TwitterII.
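The similarity-based selection step (SSM) reduces, at its core, to ranking candidate concepts by embedding similarity. Below is a minimal numpy sketch of that ranking under our reading of the abstract; the embeddings, candidate names, and the fixed top-k cutoff are all hypothetical stand-ins for whatever encoder and threshold the paper actually uses.

```python
import numpy as np

def select_related(target_emb, cand_embs, cand_names, k=3):
    """Sketch of a similarity-based selection mechanism (SSM-style).

    Ranks candidate classes by cosine similarity to the target-category
    embedding and keeps the top-k as that category's scenario-specific
    knowledge bank. A real system would embed text with a pretrained
    encoder; here any consistent vectors work.
    """
    t = target_emb / np.linalg.norm(target_emb)
    c = cand_embs / np.linalg.norm(cand_embs, axis=1, keepdims=True)
    sims = c @ t
    order = np.argsort(-sims)[:k]  # indices of the k most similar candidates
    return [cand_names[i] for i in order]
```

The selected names would then seed the per-category banks from which the dynamic multi-scenario prompts are assembled; caching the resulting similarity scores per category is one plausible reading of the CM component.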
Image and Vision Computing, Volume 167, Article 105884 (2026).
Citations: 0