
Latest publications in Image and Vision Computing

MSTVQA: A multi-path dynamic perception method for video quality assessment
IF 4.2 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-12-24 · DOI: 10.1016/j.imavis.2025.105891
Junwei Qi, Yingzhen Wang, Jingpeng Gao, Yichen Wu, Pujiang Liu
The proliferation of self-media and smart devices has led to uneven video quality on streaming platforms, creating an urgent need for effective automated video quality assessment (VQA) methods. However, most existing VQA methods fail to fully consider the dynamic adaptability of the human visual perception system and its synergistic mechanisms. In this study, we propose a novel multi-path sensing framework for VQA to enhance the progressive sensing capability of the model. Specifically, the complete video is divided into three perceptual levels (patch clips, a sampled frame stream, and inter-frame differences), and a balance factor assigns a different perceptual weight to each level. First, a patch sampling method is defined to reduce the model's input data while aligning temporal information, so that subtle motion features can be extracted from the patch clips. Then, to further enhance the representation of local high-frequency details, a global variance-guided temporal-dimension attention mechanism and a spatial feature aggregation pool are used to accurately fit the sampled frame sequence. Finally, by embedding the feature-map differences between consecutive frames and exploiting the long-range spatio-temporal dependencies of the Transformer to model the global dynamic evolution, the model achieves progressive interaction of cross-scale spatio-temporal information. In addition, an improved temporal hysteresis pool strengthens the ability to capture nonlinear dynamics in time-series data and more faithfully simulates subtle changes in the human visual perception system. Experimental results show that the proposed method outperforms existing no-reference VQA (NR-VQA) approaches across five in-the-wild datasets. In particular, it achieves outstanding performance on the CVD2014 dataset, which is the smallest in scale and contains the fewest scene variations, reaching a PLCC of 0.927 and an SRCC of 0.925. These results clearly demonstrate the effectiveness and advantages of our method on the VQA task.
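For readers who want a concrete picture of the three-path split described above, the following sketch shows one plausible way to derive the patch clip, sampled frame stream, and inter-frame differences from a video tensor and to blend per-path quality scores with a balance factor. It is not the authors' implementation; the patch size, sampling stride, `balance` weights, and the stand-in scorers are all illustrative assumptions.

```python
# Minimal sketch of a three-path perceptual split with balance-factor fusion.
import torch

def three_path_views(video: torch.Tensor, patch: int = 32, stride: int = 4):
    """video: (T, C, H, W) float tensor."""
    # Path 1: spatially cropped patch clip, temporally aligned with the full video.
    patch_clip = video[:, :, :patch, :patch]
    # Path 2: temporally sampled frame stream.
    sampled_frames = video[::stride]
    # Path 3: inter-frame differences capturing short-term dynamics.
    frame_diffs = video[1:] - video[:-1]
    return patch_clip, sampled_frames, frame_diffs

def fuse_scores(scores, balance=(0.3, 0.4, 0.3)):
    """Weighted combination of the per-path quality predictions."""
    return sum(w * s for w, s in zip(balance, scores))

if __name__ == "__main__":
    vid = torch.rand(64, 3, 224, 224)
    views = three_path_views(vid)
    # Stand-in scorers: mean intensity per view instead of learned quality heads.
    scores = [v.mean() for v in views]
    print(fuse_scores(scores))
```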
Citations: 0
PD-DDPM: Prior-driven diffusion model for single image dehazing
IF 4.2 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-12-24 · DOI: 10.1016/j.imavis.2025.105888
Haoqin Sun, Jindong Xu, Jiaxin Gong, Yijie Wang
Haze significantly reduces the visual quality of images, particularly in dense atmospheric conditions, resulting in a substantial loss of perceptible structural and semantic information. This degradation negatively affects the performance of vision-based systems in critical applications such as autonomous navigation and intelligent surveillance. Consequently, single image dehazing has been recognized as a challenging inverse problem, aiming to restore clear images from hazy observations. Although significant progress has been made with existing dehazing approaches, the intrinsic mixing of haze-related features with unrelated image content often leads to distortions in color and detail preservation, limiting restoration accuracy. In recent years, the Denoising Diffusion Probabilistic Model (DDPM) has demonstrated excellent performance in image generation and restoration tasks. However, the effectiveness of such models in single image dehazing remains constrained by both irrelevant image content and temporal redundancy during sampling. To address these limitations, we propose a diffusion model-based dehazing method that effectively recovers image content by integrating both local and global priors through differential convolution. Furthermore, the generative capability of DDPM is exploited to enhance image texture and fine details. To reduce temporal redundancy during the diffusion process, a noise addition strategy based on the Fibonacci Sequence is introduced, which significantly shortens sampling time and improves overall computational efficiency. Experimental validation shows that the proposed method requires only 1/5 to 1/6 of the time required by the linear noise addition method. Additionally, the overall network achieves excellent performance on both synthetic and real dehazing datasets.
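The Fibonacci-based noise-addition strategy is the most self-contained idea in the abstract; the sketch below shows one way such a schedule could thin out the diffusion timesteps relative to a dense linear schedule. It is an assumption about what "based on the Fibonacci Sequence" could mean in practice, not the authors' code, and `T = 1000` is simply a typical DDPM setting.

```python
# Minimal sketch: select Fibonacci-spaced timesteps for reverse-diffusion sampling.
def fibonacci_timesteps(T: int = 1000):
    """Return a descending list of Fibonacci-spaced diffusion timesteps below T."""
    steps, a, b = [], 1, 1
    while a < T:
        steps.append(a)
        a, b = b, a + b
    return sorted(set(steps), reverse=True)

if __name__ == "__main__":
    schedule = fibonacci_timesteps(1000)
    print(schedule)       # e.g. [987, 610, 377, ..., 3, 2, 1]
    print(len(schedule))  # far fewer steps than a dense 1000-step linear schedule
```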
Citations: 0
KPTFusion: Knowledge Prior-based Task-Driven Multimodal Image Fusion
IF 4.2 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-12-23 · DOI: 10.1016/j.imavis.2025.105886
Yubo Fu, Xia Ye, Xinyan Kong
Multimodal image fusion aims to generate fused images that are richer in information, more credible in content, and perform better in relevant downstream tasks. However, this task typically faces two major challenges. First, because fusion ground truth is lacking, it is difficult to guide the model’s parameters to converge to the optimal feature distribution without explicit supervision signals. Second, existing methods generally suffer from insufficient intermodal feature interaction, limiting the network’s ability to fully exploit the inherent complementarity of multimodal features. To address these issues, we propose the Knowledge Prior-based Task-Driven Multimodal Image Fusion (KPTFusion) framework. This framework introduces a knowledge prior that approximates the true distribution and sets corresponding task constraints for different downstream tasks, thereby guiding the network’s fusion output toward the target distribution. Specifically, we define the knowledge prior as the learning objective for the fusion distribution and further design a Task-Perception Constraint Module (TPCM) to guide the network toward the optimal distribution required by specific tasks. Additionally, to enhance intermodal interactions, we embed a Dynamic Cross-Feature Module (DCA) within the network. This module utilizes a dual-stream attention mechanism to strengthen cross-modal feature interactions, ensuring the fused image fully preserves and integrates information from all modalities. Experimental results demonstrate that KPTFusion not only generates visually high-quality fusion outputs in infrared-visible and medical image fusion tasks but also achieves significant performance improvements in downstream tasks, such as object detection and semantic segmentation, based on the fusion results. This fully validates the effectiveness of its task-oriented fusion approach.
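As a rough illustration of the dual-stream attention idea attributed to the Dynamic Cross-Feature Module, the sketch below applies symmetric cross-attention between two modality token streams with residual connections. The class name, feature dimensions, and the assumption of token-shaped inputs are ours, not the paper's.

```python
# Minimal sketch of symmetric (dual-stream) cross-attention between two modalities.
import torch
import torch.nn as nn

class DualStreamCrossAttention(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.a_to_b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.b_to_a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor):
        # Each stream queries the other modality and keeps a residual path.
        a_enh, _ = self.a_to_b(feat_a, feat_b, feat_b)
        b_enh, _ = self.b_to_a(feat_b, feat_a, feat_a)
        return feat_a + a_enh, feat_b + b_enh

if __name__ == "__main__":
    ir = torch.rand(2, 196, 256)   # tokens from the infrared branch
    vis = torch.rand(2, 196, 256)  # tokens from the visible branch
    ir_out, vis_out = DualStreamCrossAttention()(ir, vis)
    print(ir_out.shape, vis_out.shape)
```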
Citations: 0
Attention and mask-guided context fusion network for camouflaged object detection
IF 4.2 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-12-22 · DOI: 10.1016/j.imavis.2025.105887
Qiuying Han, Shaohui Zhang, Peng Wang
Camouflaged Object Detection (COD) aims to accurately identify objects visually embedded in their surroundings, which is considerably more challenging than conventional object detection due to low contrast, complex backgrounds, and varying object scales. Although recent deep learning approaches have shown promising results, they often suffer from incomplete or inaccurate detections, primarily due to the inadequate exploitation of multi-scale contextual features and cross-level information. To address these limitations, we propose a novel architecture, termed Attention and Mask-guided Context Fusion Network (AMCFNet). The framework comprises two core modules: Attentional Multi-scale Context Aggregation (AMCA) and Mask-guided Cross-level Fusion (MCF). The AMCA module improves the semantic representation of features at different levels by merging both global and local context information through a bidirectional attention mechanism, which includes wavelet-based modulation of channel and spatial data. The MCF module leverages high-level mask priors to guide the fusion of semantic and spatial features, applying an attention-weighted mechanism to highlight object-related regions while minimizing background interference. Comprehensive tests on four well-known COD benchmark datasets show that AMCFNet outperforms existing methods, providing more accurate camouflaged object detection under various challenging conditions.
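To make the mask-guided fusion idea concrete, here is a minimal sketch in which a coarse high-level mask prior reweights low-level features before they are concatenated with upsampled semantic features. It is not the paper's MCF module; the channel sizes and the single merge convolution are illustrative.

```python
# Minimal sketch of mask-guided cross-level fusion with a coarse mask prior.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskGuidedFusion(nn.Module):
    def __init__(self, low_ch: int, high_ch: int, out_ch: int):
        super().__init__()
        self.merge = nn.Conv2d(low_ch + high_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, low_feat, high_feat, coarse_mask):
        # Upsample semantic features and the mask prior to the low-level resolution.
        size = low_feat.shape[-2:]
        high_up = F.interpolate(high_feat, size=size, mode="bilinear", align_corners=False)
        mask_up = torch.sigmoid(F.interpolate(coarse_mask, size=size, mode="bilinear", align_corners=False))
        # Highlight object-related regions, suppress background response.
        low_weighted = low_feat * mask_up
        return self.merge(torch.cat([low_weighted, high_up], dim=1))

if __name__ == "__main__":
    low = torch.rand(1, 64, 88, 88)
    high = torch.rand(1, 256, 22, 22)
    mask = torch.rand(1, 1, 22, 22)
    print(MaskGuidedFusion(64, 256, 64)(low, high, mask).shape)
```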
Citations: 0
CMALDD-PTAF: Cross-modal adversarial learning for deepfake detection by leveraging pre-trained models and cross-attention fusion
IF 4.2 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-12-20 · DOI: 10.1016/j.imavis.2025.105885
Yuanfan Jin, Yongfang Wang
The emergence of novel deepfake algorithms capable of generating highly realistic manipulated audio–visual content has sparked significant public concern regarding the authenticity and trustworthiness of digital media. This concern has driven the development of multimodal deepfake detection methods. In this paper, we present a novel two-stage multimodal detection framework that harnesses pre-trained audio–visual speech recognition models and cross-attention fusion to achieve state-of-the-art performance with efficient cross-domain adversarial training. Our approach consists of two stages. In the first stage, we utilize a pre-trained audio–visual representation learning model from the speech recognition domain to extract unimodal features; comprehensive analysis confirms the efficacy of these features for deepfake detection. In the second stage, we propose a specialized cross-modality fusion module to integrate the unimodal features for multimodal deepfake detection. Furthermore, we utilize a transformer model for final classification and implement an adversarial learning strategy to enhance the robustness of the model. Our proposed method achieves 98.9% accuracy and 99.6% AUC on the multimodal deepfake detection benchmark FakeAVCeleb, outperforming the latest multimodal detector NPVForensics by 0.57 percentage points in AUC, while maintaining low training cost and a relatively simple architecture.
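The second stage described above (fusing frozen unimodal features with cross-attention and classifying with a Transformer) can be outlined as follows. This is a hedged sketch under our own assumptions about token shapes and layer sizes, not the paper's architecture, and it omits the adversarial training strategy.

```python
# Minimal sketch of cross-attention fusion of audio/visual tokens plus a
# Transformer classification head for real-vs-fake prediction.
import torch
import torch.nn as nn

class AVFusionClassifier(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8, layers: int = 2):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        enc_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.cls = nn.Linear(dim, 2)  # real vs. fake logits

    def forward(self, video_tokens, audio_tokens):
        # Visual tokens attend to audio tokens to expose cross-modal mismatches.
        fused, _ = self.cross(video_tokens, audio_tokens, audio_tokens)
        encoded = self.encoder(fused)
        return self.cls(encoded.mean(dim=1))

if __name__ == "__main__":
    v = torch.rand(4, 50, 512)   # e.g. one token per video frame
    a = torch.rand(4, 100, 512)  # e.g. one token per audio chunk
    print(AVFusionClassifier()(v, a).shape)  # (4, 2)
```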
Citations: 0
A video anomaly detection and classification method based on cross-modal feature alignment
IF 4.2 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-12-20 · DOI: 10.1016/j.imavis.2025.105874
Yan Fu, Ting Hou, Ou Ye, Gaolin Ye
Detecting anomalous behaviors in surveillance videos is crucial for enhancing public safety and industrial monitoring. However, existing methods typically only detect the presence of anomalies without identifying their specific types, making targeted responses difficult. Additionally, these methods fail to effectively capture the dynamic relationship between persistent and sudden anomalies in complex scenarios. To address these issues, we propose an innovative anomaly detection model based on a dual-branch architecture. This model uses a cross-modal alignment mechanism to explicitly associate visual features with semantic concepts, enabling it to discriminate based on interpretable semantic evidence and thereby significantly improving the accuracy of anomaly detection. Specifically, the coarse-grained branch introduces an additive dilated convolution pyramid collaborative module (ADCP) that uniquely replaces traditional large-scale matrix multiplication with additive operations. This module dynamically fuses temporal information at different time scales and avoids over-mixing anomaly types, maintaining long-term memory and a stable information flow, which allows the model to flexibly capture the relationship between long-term trends and short-term fluctuations. We also design a dynamic smoothing enhancement module (DSE) that uses a weighted-average mechanism with sliding windows of different sizes to dynamically integrate features over local periods, filtering out long-term noise and sudden fluctuations and aiding more precise anomaly-boundary detection. The fine-grained branch focuses on semantic information, converting raw text related to anomaly types into category labels and generating learnable prompt-text features. By combining these with visual features, cosine similarity is computed to precisely identify anomaly types. Experimental results show significant improvements on the XD-Violence and UCF-Crime datasets.
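The dynamic smoothing enhancement idea (blending weighted averages over sliding windows of different sizes) can be illustrated with plain NumPy, as in the sketch below. The window sizes and blend weights are illustrative assumptions rather than the paper's settings, and real per-frame anomaly scores would come from the detection branch rather than random numbers.

```python
# Minimal sketch of multi-window weighted-average smoothing of anomaly scores.
import numpy as np

def multi_window_smooth(scores: np.ndarray, windows=(3, 7, 15), weights=(0.5, 0.3, 0.2)):
    """scores: 1-D array of per-frame anomaly scores."""
    smoothed = np.zeros_like(scores, dtype=float)
    for w, alpha in zip(windows, weights):
        kernel = np.ones(w) / w
        # 'same' keeps the temporal length so the blended curves stay aligned.
        smoothed += alpha * np.convolve(scores, kernel, mode="same")
    return smoothed

if __name__ == "__main__":
    raw = np.random.rand(200)
    raw[120:130] += 2.0  # a short burst that should survive smoothing
    print(multi_window_smooth(raw)[115:135].round(2))
```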
Citations: 0
Dynamic multi-scenario prompt learning with knowledge augmentation for image emotion analysis
IF 4.2 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-12-20 · DOI: 10.1016/j.imavis.2025.105884
Tan Chen, Guozeng Zhang, Yiwei Wei, Jialin Chen, Cheng Feng
Image emotion analysis (IEA) aims to identify and comprehend human emotional states from visual content, and it has garnered significant attention due to the growing trend of expressing opinions online. Existing IEA approaches typically attempt to explore the emotional semantic space by generating prompts for images through fixed templates or randomly generated vectors. However, these methods neglect the diverse fine-grained emotions across scenes within the same emotional category, thereby limiting the nuanced expression of emotional semantics. Moreover, fine-grained emotional information is often abstract, and its quantity remains unknown, making its extraction particularly challenging. In light of this issue, we propose a novel approach, Dynamic Multi-Scenario Prompt Learning with Knowledge Augmentation (DMSP-KA). We first design a similarity-based selection mechanism (SSM) to construct fine-grained multi-scenario emotional knowledge for all emotional categories. Subsequently, we integrate the image’s intrinsic semantics with fine-grained emotional knowledge to generate a consistent emotional bias at the composite level, creating dynamic multi-scenario prompts (DMSP) for each instance. Additionally, we leverage predefined emotional texts to help build cross-modal semantic associations and enhance emotional information fusion. Finally, we establish a caching mechanism (CM) based on the multi-scenario knowledge to improve the accuracy of single-emotion classification. Experimental results on four widely used emotion datasets demonstrate that our proposed method outperforms current state-of-the-art (SOTA) approaches, achieving accuracies of 80.68% on FI, 73.74% on EmotionROI, 92.13% on TwitterI, and 88.72% on TwitterII.
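A similarity-based selection mechanism of the kind described above can be reduced to a few lines: given an image embedding and a bank of pre-encoded scenario-knowledge embeddings, keep the top-k entries by cosine similarity. The random embeddings, bank size, and `k` below are stand-ins, not the paper's SSM.

```python
# Minimal sketch of cosine-similarity-based selection of scenario knowledge.
import torch
import torch.nn.functional as F

def select_scenarios(image_emb: torch.Tensor, knowledge_bank: torch.Tensor, k: int = 3):
    """image_emb: (D,), knowledge_bank: (N, D); returns indices of the top-k entries."""
    sims = F.cosine_similarity(image_emb.unsqueeze(0), knowledge_bank, dim=-1)
    return sims.topk(k).indices

if __name__ == "__main__":
    img = torch.randn(512)
    bank = torch.randn(40, 512)  # fine-grained scenario descriptions, pre-encoded
    print(select_scenarios(img, bank))
```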
Citations: 0
Cross-level fusion network for two-stage polyp segmentation via integrity learning
IF 4.2 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-12-19 · DOI: 10.1016/j.imavis.2025.105883
Junzhuo Liu, Dorit Merhof, Zhixiang Wang
Colorectal cancer is one of the most prevalent and lethal forms of cancer. The automated detection, segmentation, and classification of early polyp tissue in endoscopy images of the colorectum have demonstrated impressive potential for improving clinical diagnostic accuracy, avoiding missed detections, and reducing the incidence of colorectal cancer in the population. However, most existing studies fail to consider the potential of information fusion between different deep neural network layers or to optimize model complexity, resulting in poor clinical utility. To address these limitations, the concept of integrity learning is introduced, which divides polyp segmentation into two stages for progressive completion, and a lightweight cross-level fusion network, IC-FusionNet, is proposed to accurately segment polyps in endoscopy images. In the first stage, the Context Fusion Module (CFM) aggregates information from neighboring encoder branches and the current level to achieve macro-integrity learning. In the second stage, polyp detail information from shallower layers is aggregated with deeper high-dimensional semantic information so that complementary information across layers is mutually enhanced. IC-FusionNet is evaluated on five polyp segmentation benchmark datasets using eight evaluation metrics. It achieves mDice scores of 0.908 and 0.925 on the Kvasir and CVC-ClinicDB datasets, respectively, along with mIoU scores of 0.851 and 0.973. On three external polyp segmentation test datasets, the model obtains an average mDice of 0.788 and an average mIoU of 0.712. Compared with existing methods, IC-FusionNet achieves superior or near-optimal performance on most evaluation metrics. Moreover, IC-FusionNet contains only 3.84 M parameters and 0.76 G MACs, a reduction of 9.22% in parameter count and 74.15% in computational complexity compared with recent lightweight segmentation networks.
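Since the evaluation relies on mDice and mIoU, the sketch below shows the standard per-image Dice and IoU formulas for binary masks; the paper's exact evaluation pipeline may differ, for example in how scores are averaged over images.

```python
# Minimal sketch of per-image Dice and IoU for binary segmentation masks.
import numpy as np

def dice_iou(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7):
    """pred, target: binary masks of equal shape."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    dice = (2 * inter + eps) / (pred.sum() + target.sum() + eps)
    iou = (inter + eps) / (np.logical_or(pred, target).sum() + eps)
    return dice, iou

if __name__ == "__main__":
    p = np.zeros((64, 64)); p[10:40, 10:40] = 1
    t = np.zeros((64, 64)); t[15:45, 15:45] = 1
    print(dice_iou(p, t))
```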
Citations: 0
Enhancing Zero-Shot Object-Goal Visual Navigation with target context and appearance awareness
IF 4.2 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-12-18 · DOI: 10.1016/j.imavis.2025.105873
Yu Fu, Lichun Wang, Tong Bie, Shuang Li, Tong Gao, Baocai Yin
The Zero-Shot Object-Goal Visual Navigation (ZSON) task focuses on leveraging target semantic information to transfer an end-to-end navigation policy learned on seen classes to unseen classes. In most ZSON methods, the target label serves as the primary source of semantic information. However, learning navigation policies from this single semantic clue limits their transferability. Inspired by the associations humans make with object labels when searching for targets, we propose the Dual Target Awareness Network (DTAN), which expands the label semantics to target context and target appearance, providing more target clues for navigation policy learning. Using a Large Language Model (LLM), DTAN first infers the target context and target attributes from the target label. The target context is encoded and then interacts with the observation to obtain a context-aware feature. The target attributes are used to generate the target appearance, which then interacts with the observation to obtain an appearance-aware feature. By fusing the two kinds of features, a target-aware feature is obtained and fed into the policy network to make action decisions. Experimental results demonstrate that DTAN outperforms the state-of-the-art ZSON method by 6.9% in Success Rate (SR) and 3.1% in Success weighted by Path Length (SPL) for unseen targets on the AI2-THOR simulator. Experiments conducted on the RoboTHOR and Habitat (MP3D) simulators further demonstrate the scalability of DTAN to larger-scale and more realistic scenes.
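The reported gains are in Success Rate (SR) and Success weighted by Path Length (SPL); the sketch below computes both from per-episode records using the standard SPL definition. The episode tuples are illustrative values, not results from the paper.

```python
# Minimal sketch of the SR and SPL navigation metrics.
def sr_spl(episodes):
    """episodes: list of (success, shortest_path_length, actual_path_length)."""
    n = len(episodes)
    sr = sum(s for s, _, _ in episodes) / n
    # SPL discounts each success by how much longer the agent's path was
    # compared to the shortest path to the goal.
    spl = sum(s * (l_opt / max(l_act, l_opt)) for s, l_opt, l_act in episodes) / n
    return sr, spl

if __name__ == "__main__":
    eps = [(1, 5.0, 7.5), (0, 4.0, 9.0), (1, 6.0, 6.0)]
    print(sr_spl(eps))
```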
Citations: 0
MSBC-Segformer: An automatic segmentation model of clinical target volume and organs at risk in CT images for radiotherapy after breast-conserving surgery
IF 4.2 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-12-17 · DOI: 10.1016/j.imavis.2025.105878
Yadi Gao, Qian Sun, Lan Ye, Chengliang Li, Peipei Dang, Min Han
In radiotherapy after breast-conserving surgery (BCS) for breast cancer (BC), the clinical target volume (CTV) and organs at risk (OARs) on CT images are mainly delineated manually, layer by layer, by radiation oncologists (ROs), a time-consuming process prone to variability due to differences in clinical experience and inter- and intra-observer variation. To address this, we developed a new automatic delineation model for medical CT images, specifically for computer-assisted medical detection and diagnosis. CT scans of 100 patients who underwent BCS and radiotherapy were collected. These data were used to create, train, and validate a new deep-learning (DL) model, MSBC-Segformer (Multi-Scale Boundary-Constrained Segmentation Model Based on Transformer), proposed to automatically segment the CTV and OARs. The Dice Similarity Coefficient (DSC) and the 95th-percentile Hausdorff Distance (95HD) were used to evaluate the effectiveness of the proposed model. As a result, the MSBC-Segformer model can delineate the CTV and OARs accurately and efficiently for BC patients undergoing radiotherapy after BCS, outperforming junior doctors and almost all other existing CNN models while reducing the instability of segmentation results caused by observer differences, thereby significantly enhancing clinical efficiency. Moreover, evaluation by three ROs revealed no significant difference between the model and manual delineation by senior doctors (p>0.98 for CTV and p>0.59 for OARs). The model significantly reduced segmentation time, averaging only 12.53 s per patient.
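The 95th-percentile Hausdorff Distance (95HD) used for evaluation can be sketched as below for two boundary point sets; this follows the common symmetric definition and is not necessarily the exact toolkit the authors used.

```python
# Minimal sketch of the 95th-percentile Hausdorff Distance between two contours.
import numpy as np
from scipy.spatial.distance import cdist

def hd95(points_a: np.ndarray, points_b: np.ndarray):
    """points_a, points_b: (N, 2) and (M, 2) arrays of boundary coordinates."""
    d = cdist(points_a, points_b)   # pairwise Euclidean distances
    a_to_b = d.min(axis=1)          # each point in A to its nearest point in B
    b_to_a = d.min(axis=0)          # each point in B to its nearest point in A
    return np.percentile(np.concatenate([a_to_b, b_to_a]), 95)

if __name__ == "__main__":
    a = np.random.rand(100, 2) * 50
    b = a + np.random.randn(100, 2)  # slightly perturbed contour
    print(hd95(a, b))
```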
Citations: 0