
Image and Vision Computing: Latest Articles

W-MambaFuse: A wavelet decomposition and adaptive state-space modeling approach for anatomical and functional image fusion
IF 4.2 | CAS Tier 3, Computer Science | Q2, Computer Science, Artificial Intelligence | Pub Date: 2026-01-01 | Epub Date: 2025-11-08 | DOI: 10.1016/j.imavis.2025.105796
Bowen Zhong , Shijie Li , Xuan Deng , Zheng Li
Anatomical-functional image fusion plays a critical role in a variety of medical and biological applications. Current convolutional neural network-based fusion algorithms are constrained by their limited receptive fields, impeding the effective modeling of long-range dependencies in medical images. While transformer-based architectures possess global modeling capabilities, they face computational challenges due to the quadratic complexity of their self-attention mechanisms. To address these limitations, we propose a network based on wavelet-domain decomposition and an adaptive selectively structured state space model, termed W-MambaFuse, for anatomical and functional image fusion. Specifically, the network first applies a wavelet transform to enlarge the receptive field of the convolutional layers, facilitating the capture of low-frequency structural outlines and high-frequency textural primitives. Furthermore, we develop an adaptive gated fusion module, referred to as CNN-Mamba Gated (MCG), which leverages the dynamic modeling capability of state space models and the local feature extraction strengths of convolutional neural networks. This design facilitates the effective extraction of both intra-modal and inter-modal features, thereby enhancing multimodal image fusion. Experimental results on benchmark datasets demonstrate that W-MambaFuse consistently outperforms pure CNN-based models, transformer-based models, and CNN-transformer hybrid approaches in terms of both visual quality and quantitative evaluations. Our code is publicly available at https://github.com/Bowen-Zhong/W-Mamba.
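As a rough illustration of two ingredients named in this abstract, the sketch below implements a single-level Haar wavelet split into low- and high-frequency sub-bands and a sigmoid-gated fusion of a convolutional (local) branch with a global-context branch in PyTorch. The global branch here is a simple pooled-context placeholder standing in for the Mamba state-space block, and all module names and sizes are illustrative assumptions rather than the published W-MambaFuse design.

```python
# Minimal sketch, assuming PyTorch; the "global" branch is a placeholder for
# the Mamba/SSM block described in the paper, not a reproduction of it.
import torch
import torch.nn as nn
import torch.nn.functional as F


def haar_dwt2(x: torch.Tensor):
    """Single-level 2D Haar transform: returns (LL, LH, HL, HH) sub-bands."""
    a = x[..., 0::2, 0::2]
    b = x[..., 0::2, 1::2]
    c = x[..., 1::2, 0::2]
    d = x[..., 1::2, 1::2]
    ll = (a + b + c + d) / 2   # low-frequency structural outline
    lh = (a - b + c - d) / 2   # high-frequency detail sub-bands
    hl = (a + b - c - d) / 2
    hh = (a - b - c + d) / 2
    return ll, lh, hl, hh


class GatedLocalGlobalFusion(nn.Module):
    """Sigmoid-gated fusion of a local (conv) branch and a global-context
    branch, illustrating the gating pattern of an MCG-style module."""

    def __init__(self, channels: int):
        super().__init__()
        self.local = nn.Conv2d(channels, channels, 3, padding=1)
        self.global_ctx = nn.Conv2d(channels, channels, 1)  # placeholder for the SSM branch
        self.gate = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x):
        local = self.local(x)
        ctx = self.global_ctx(F.adaptive_avg_pool2d(x, 1)).expand_as(x)
        g = torch.sigmoid(self.gate(torch.cat([local, ctx], dim=1)))
        return g * local + (1 - g) * ctx


if __name__ == "__main__":
    img = torch.randn(1, 1, 64, 64)        # e.g. one grayscale slice
    ll, lh, hl, hh = haar_dwt2(img)
    fused = GatedLocalGlobalFusion(1)(ll)
    print(ll.shape, fused.shape)           # both torch.Size([1, 1, 32, 32])
```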
{"title":"W-MambaFuse: A wavelet decomposition and adaptive state-space modeling approach for anatomical and functional image fusion","authors":"Bowen Zhong ,&nbsp;Shijie Li ,&nbsp;Xuan Deng ,&nbsp;Zheng Li","doi":"10.1016/j.imavis.2025.105796","DOIUrl":"10.1016/j.imavis.2025.105796","url":null,"abstract":"<div><div>Anatomical-functional image fusion plays a critical role in a variety of medical and biological applications. Current convolutional neural network-based fusion algorithms are constrained by their limited receptive fields, impeding the effective modeling of long-range dependencies in medical images. While transformer-based architectures possess global modeling capabilities, they face computational challenges due to the quadratic complexity of their self-attention mechanisms. To address these limitations, we propose a network based on wavelet-domain decomposition and an adaptive selectively structured state space model, termed as W-MambaFuse, for anatomical and functional image fusion. Specifically, the network first applies a wavelet transform to enlarge the receptive field of the convolutional layers, facilitating the capture of low-frequency structural outlines and high-frequency textural primitives. Furthermore, we develop an adaptive gated fusion module, referred to as CNN-Mamba Gated (MCG), which leverages the dynamic modeling capability of state space models and the local feature extraction strengths of convolutional neural networks. This design facilitates the effective extraction of both intra-modal and inter-modal features, thereby enhancing multimodal image fusion. Experimental results on benchmark datasets demonstrate that W-MambaFuse consistently outperforms pure CNN-based models, transformer-based models, and CNN-transformer hybrid approaches in terms of both visual quality and quantitative evaluations. Our code is publicly available at <span><span>https://github.com/Bowen-Zhong/W-Mamba</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105796"},"PeriodicalIF":4.2,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145521232","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A human layout consistency framework for image-based virtual try-on
IF 4.2 | CAS Tier 3, Computer Science | Q2, Computer Science, Artificial Intelligence | Pub Date: 2026-01-01 | Epub Date: 2025-11-19 | DOI: 10.1016/j.imavis.2025.105831
Rong Huang , Zhicheng Wang , Hao Liu , Aihua Dong
Image-based virtual try-on, commonly framed as a generative image-to-image translation task, has garnered significant research interest because it eliminates the need for costly 3D scanning devices. In this field, image inpainting and cycle-consistency have been the dominant frameworks, but they still face challenges in cross-attribute adaptation and parameter sharing between try-on networks. This paper proposes a new framework, termed human layout consistency, based on the intuitive insight that a high-quality try-on result should align with a coherent human layout. Under the proposed framework, a try-on network is equipped with an upstream Human Layout Generator (HLG) and a downstream Human Layout Parser (HLP). The former generates an expected human layout as if the person were wearing the selected target garment, while the latter extracts an actual human layout parsed from the try-on result. The supervisory signals, free from ground-truth image pairs, are constructed by assessing the consistencies between the expected and actual human layouts. We design a dual-phase training strategy, first warming up HLG and HLP, then training the try-on network by incorporating the supervisory signals based on human layout consistency. On this basis, the proposed framework enables arbitrary selection of target garments during training, thereby endowing the try-on network with cross-attribute adaptation. Moreover, the proposed framework operates with a single try-on network, rather than two physically separate ones, thereby avoiding the parameter-sharing issue. We conducted both qualitative and quantitative experiments on the benchmark VITON dataset. Experimental results demonstrate that our proposal can generate high-quality try-on results, outperforming baselines by a margin of 0.75% to 10.58%. Ablation and visualization results further reveal that the proposed method exhibits superior adaptability to cross-attribute translations, showcasing its potential for practical application.
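To make the layout-consistency supervision concrete, here is a minimal sketch of how an expected layout (from an HLG-like module) could supervise the layout parsed from the try-on output (by an HLP-like module) without ground-truth image pairs. The use of a soft per-pixel cross-entropy is an assumption made for illustration; the paper's exact consistency terms and training details are not reproduced.

```python
# Minimal sketch, assuming PyTorch; HLG/HLP internals are omitted and the
# soft cross-entropy is an illustrative stand-in for the paper's loss.
import torch


def layout_consistency_loss(expected_logits: torch.Tensor,
                            parsed_logits: torch.Tensor) -> torch.Tensor:
    """Both tensors are (B, K, H, W) layout logits over K body/garment classes;
    the expected layout acts as a soft target for the parsed layout."""
    expected = expected_logits.softmax(dim=1)           # soft pseudo-target from HLG
    log_parsed = parsed_logits.log_softmax(dim=1)       # parsed layout from HLP
    return -(expected * log_parsed).sum(dim=1).mean()   # per-pixel CE, averaged

# Dual-phase idea from the abstract, in comment form:
#   phase 1: warm up HLG and HLP on their own objectives
#   phase 2: train the try-on network with
#            layout_consistency_loss(HLG(person, garment), HLP(tryon_output))

if __name__ == "__main__":
    b, k, h, w = 2, 7, 64, 48
    loss = layout_consistency_loss(torch.randn(b, k, h, w), torch.randn(b, k, h, w))
    print(float(loss))
```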
{"title":"A human layout consistency framework for image-based virtual try-on","authors":"Rong Huang ,&nbsp;Zhicheng Wang ,&nbsp;Hao Liu ,&nbsp;Aihua Dong","doi":"10.1016/j.imavis.2025.105831","DOIUrl":"10.1016/j.imavis.2025.105831","url":null,"abstract":"<div><div>Image-based virtual try-on, commonly framed as a generative image-to-image translation task, has garnered significant research interest due to its elimination of the need for costly 3D scanning devices. In this field, image inpainting and cycle-consistency have been the dominant frameworks, but they still face challenges in cross-attribute adaptation and parameter sharing between try-on networks. This paper proposes a new framework, termed human layout consistency, based on the intuitive insight that a high-quality try-on result should align with a coherent human layout. Under the proposed framework, a try-on network is equipped with an upstream Human Layout Generator (HLG) and a downstream Human Layout Parser (HLP). The former generates an expected human layout as if the person were wearing the selected target garment, while the latter extracts an actual human layout parsed from the try-on result. The supervisory signals, free from the ground-truth image pairs, are constructed by assessing the consistencies between the expected and actual human layouts. We design a dual-phase training strategy, first warming up HLG and HLP, then training try-on network by incorporating the supervisory signals based on human layout consistency. On this basis, the proposed framework enables arbitrary selection of target garments during training, thereby endowing the try-on network with the cross-attribute adaptation. Moreover, the proposed framework operates with a single try-on network, rather than two physically separate ones, thereby avoiding the parameter-sharing issue. We conducted both qualitative and quantitative experiments on the benchmark VITON dataset. Experimental results demonstrate that our proposal can generate high-quality try-on results, outperforming baselines by a margin of 0.75% to 10.58%. Ablation and visualization results further reveal that the proposed method exhibits superior adaptability to cross-attribute translations, showcasing its potential for practical application.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105831"},"PeriodicalIF":4.2,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145579501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Advanced fusion of IoT and AI technologies for smart environments: Enhancing environmental perception and mobility solutions for visually impaired individuals
IF 4.2 | CAS Tier 3, Computer Science | Q2, Computer Science, Artificial Intelligence | Pub Date: 2026-01-01 | Epub Date: 2025-11-19 | DOI: 10.1016/j.imavis.2025.105827
Nouf Nawar Alotaibi , Mrim M. Alnfiai , Mona Mohammed Alnahari , Salma Mohsen M. Alnefaie , Faiz Abdullah Alotaibi

Objective

To develop a robust proposed model that integrates multiple sensor modalities to enhance environmental perception and mobility for visually impaired individuals, improving their autonomy and safety in both indoor and outdoor settings.

Methods

The proposed system utilizes advanced IoT and AI technologies, integrating data from proximity, ambient light, and motion sensors through recursive Bayesian filtering, kernel-based fusion algorithms, and probabilistic graphical models. A comprehensive dataset was collected across diverse environments to train and evaluate the model's accuracy in real-time environmental context estimation and motion activity detection. This study employed a multidisciplinary approach, integrating the Internet of Things (IoT) and Artificial Intelligence (AI), to develop a proposed model for assisting visually impaired individuals. The study was conducted over six months (April 2024 to September 2024) in Saudi Arabia, utilizing resources from Najran University. Data collection involved deploying IoT devices across various indoor and outdoor environments, including residential areas, commercial spaces, and urban streets, to ensure diversity and real-world applicability. The system utilized proximity sensors, ambient light sensors, and motion detectors to gather data under different lighting, weather, and dynamic conditions. Recursive Bayesian filtering, kernel-based fusion algorithms, and probabilistic graphical models were employed to process the sensor inputs and provide real-time environmental context and motion detection. The study followed a rigorous training and validation process using the collected dataset, ensuring reliability and scalability across diverse scenarios. Ethical considerations were adhered to throughout the project, with no direct interaction with human subjects.
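The recursive Bayesian filtering mentioned above can be illustrated with a small discrete filter that maintains a belief over environmental contexts and updates it with per-sensor likelihoods. The two-state context set, transition matrix, and likelihood values below are invented for illustration and are not parameters from the study.

```python
# Minimal sketch of a discrete recursive Bayes filter; all numbers are
# illustrative assumptions, not values from the paper.
import numpy as np

STATES = ["indoor", "outdoor"]
TRANSITION = np.array([[0.9, 0.1],      # P(next state | current state), rows sum to 1
                       [0.1, 0.9]])

def bayes_update(belief: np.ndarray, likelihoods: np.ndarray) -> np.ndarray:
    """One predict-update cycle: belief and likelihoods are length-2 vectors."""
    predicted = TRANSITION.T @ belief    # prediction step
    posterior = likelihoods * predicted  # measurement update from sensor likelihoods
    return posterior / posterior.sum()   # normalise

if __name__ == "__main__":
    belief = np.array([0.5, 0.5])
    # e.g. bright ambient light, then an outdoor-like motion pattern
    for sensor_likelihood in (np.array([0.3, 0.7]), np.array([0.2, 0.8])):
        belief = bayes_update(belief, sensor_likelihood)
    print(dict(zip(STATES, belief.round(3))))
```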

Results

The proposed model demonstrated an accuracy of 85% in predicting environmental context and 82% in motion detection, achieving precision and F1-scores of 88% and 85%, respectively. Real-time implementation provided reliable, dynamic feedback on environmental changes and motion activities, significantly enhancing situational awareness.

Conclusion

The proposed model effectively combines sensor data to deliver real-time, context-aware assistance for visually impaired individuals, improving their ability to navigate complex environments. The system offers a significant advancement in assistive technology and holds promise for broader applications with further enhancements.
{"title":"Advanced fusion of IoT and AI technologies for smart environments: Enhancing environmental perception and mobility solutions for visually impaired individuals","authors":"Nouf Nawar Alotaibi ,&nbsp;Mrim M. Alnfiai ,&nbsp;Mona Mohammed Alnahari ,&nbsp;Salma Mohsen M. Alnefaie ,&nbsp;Faiz Abdullah Alotaibi","doi":"10.1016/j.imavis.2025.105827","DOIUrl":"10.1016/j.imavis.2025.105827","url":null,"abstract":"<div><h3>Objective</h3><div>To develop a robust proposed model that integrates multiple sensor modalities to enhance environmental perception and mobility for visually impaired individuals, improving their autonomy and safety in both indoor and outdoor settings.</div></div><div><h3>Methods</h3><div>The proposed system utilizes advanced IoT and AI technologies, integrating data from proximity, ambient light, and motion sensors through recursive Bayesian filtering, kernel-based fusion algorithms, and probabilistic graphical models. A comprehensive dataset was collected across diverse environments to train and evaluate the model's accuracy in real-time environmental context estimation and motion activity detection. This study employed a multidisciplinary approach, integrating the Internet of Things (IoT) and Artificial Intelligence (AI), to develop a proposed model for assisting visually impaired individuals. The study was conducted over six months (April 2024 to September 2024) in Saudi Arabia, utilizing resources from Najran University. Data collection involved deploying IoT devices across various indoor and outdoor environments, including residential areas, commercial spaces, and urban streets, to ensure diversity and real-world applicability. The system utilized proximity sensors, ambient light sensors, and motion detectors to gather data under different lighting, weather, and dynamic conditions. Recursive Bayesian filtering, kernel-based fusion algorithms, and probabilistic graphical models were employed to process the sensor inputs and provide real-time environmental context and motion detection. The study followed a rigorous training and validation process using the collected dataset, ensuring reliability and scalability across diverse scenarios. Ethical considerations were adhered to throughout the project, with no direct interaction with human subjects.</div></div><div><h3>Results</h3><div>The Proposed model demonstrated an accuracy of 85% in predicting environmental context and 82% in motion detection, achieving precision and F1-scores of 88% and 85%, respectively. Real-time implementation provided reliable, dynamic feedback on environmental changes and motion activities, significantly enhancing situational awareness.</div></div><div><h3>Conclusion</h3><div>The Proposed model effectively combines sensor data to deliver real-time, context-aware assistance for visually impaired individuals, improving their ability to navigate complex environments. 
The system offers a significant advancement in assistive technology and holds promise for broader applications with further enhancements.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105827"},"PeriodicalIF":4.2,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145579498","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
LET-CViT: A low-light enhanced two-stream CNN and vision transformer for Deepfake detection
IF 4.2 | CAS Tier 3, Computer Science | Q2, Computer Science, Artificial Intelligence | Pub Date: 2026-01-01 | Epub Date: 2025-11-20 | DOI: 10.1016/j.imavis.2025.105828
Gaoming Yang , Yifan Song , Xiangyu Yang , Ji Zhang
With the development of generative technologies, fake faces have become increasingly realistic. Unknown forgery methods and complex generation environments make Deepfake detection challenging. While existing detectors can identify most forged images under normal lighting conditions, their performance deteriorates in different lighting environments, especially under low-light conditions. In this paper, to address the challenges of forged face detection in low-light environments, we present a novel Low-light Enhanced Two-stream CNN and Vision Transformer (LET-CViT) framework, which contains our improved ReLU-CBAM Depthwise Separable Convolution (RC-DSC) block and Dynamic Sigmoid-Gated Multi-Head Attention (DSG-MHA) block. At the same time, the LET-CViT incorporates two innovative modules, namely Low-light Enhancement with Denoising (LED) and Wavelet Transform high-frequency Fusion (WTF). Specifically, the LED module improves low-light image quality and helps capture forged textures through light enhancement and directional denoising. Subsequently, the proposed WTF module captures multi-scale features and focuses on high-frequency information by repeatedly fusing the high-frequency sub-bands obtained from a discrete wavelet transform, while reducing the interference of low-frequency information. Extensive experiments on several datasets show that our framework is able to reliably detect forged videos under low-light conditions. The AUCs for the unseen DeeperForensics-1.0 and DFD datasets reach 95.73% and 95.24%, respectively, significantly outperforming other mainstream models. The code for reproducing our results is publicly available here: https://github.com/SYF-code/LET-CViT.
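As a rough illustration of what a dynamic sigmoid-gated multi-head attention block could look like, the sketch below applies standard multi-head self-attention and modulates each head with a dynamically predicted sigmoid gate. The gating design and dimensions are assumptions made for illustration; the actual DSG-MHA block in LET-CViT may differ.

```python
# Minimal sketch, assuming PyTorch; the per-head gating scheme is an
# illustrative assumption, not the published DSG-MHA design.
import torch
import torch.nn as nn

class SigmoidGatedMHA(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Linear(dim, heads)   # one dynamic gate per head
        self.heads = heads

    def forward(self, x):                    # x: (B, N, dim) token sequence
        out, _ = self.attn(x, x, x)
        g = torch.sigmoid(self.gate(x.mean(dim=1)))   # (B, heads) gates from mean token
        b, n, d = out.shape
        out = out.view(b, n, self.heads, d // self.heads)
        out = out * g[:, None, :, None]                # scale each head
        return out.reshape(b, n, d) + x                # gated attention + residual

if __name__ == "__main__":
    tokens = torch.randn(2, 196, 64)         # e.g. 14x14 patches, 64-dim embeddings
    print(SigmoidGatedMHA(64)(tokens).shape)  # torch.Size([2, 196, 64])
```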
{"title":"LET-CViT: A low-light enhanced two-stream CNN and vision transformer for Deepfake detection","authors":"Gaoming Yang ,&nbsp;Yifan Song ,&nbsp;Xiangyu Yang ,&nbsp;Ji Zhang","doi":"10.1016/j.imavis.2025.105828","DOIUrl":"10.1016/j.imavis.2025.105828","url":null,"abstract":"<div><div>With the development of generative technologies, fake faces have become increasingly realistic. Unknown forgery methods and complex generation environments make Deepfake detection challenging. While existing detectors can identify most forged images under normal lighting conditions, their performance deteriorates in different lighting environments, especially under low-light conditions. In this paper, to address the challenges of forged face detection performance in low-light environments, we present a novel Low-light Enhanced Two-stream CNN and Vision Transformer (LET-CViT) framework, which contains our improved ReLU-CBAM Depthwise Separable Convolution (RC-DSC) block and Dynamic Sigmoid-Gated Multi-Head Attention (DSG-MHA) block. At the same time, the LET-CViT incorporates two innovative modules, namely Low-light Enhancement with Denoising (LED) and Wavelet Transform high-frequency Fusion (WTF). Specifically, the premier LED module is capable of improving low-light image quality and capturing fake textures with light enhancement technology and directional denoising. Subsequently, the proposed WTF module captures multi-scale features and focuses on high-frequency information by multiple fusions of high-frequency sub-bands after discrete wavelet transformation, while reducing the interference of low-frequency information. Extensive experiments on several datasets show that our framework is able to reliably detect forged videos under low-light conditions. The AUCs for the unseen DeeperForensics-1.0 and DFD datasets reach 95.73% and 95.24% respectively, significantly outperforming other mainstream models. The code for reproducing our results is publicly available here: <span><span>https://github.com/SYF-code/LET-CViT</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105828"},"PeriodicalIF":4.2,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145624272","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
LSBE-Net: Semantic segmentation of large-scale point cloud scenes via local boundary feature and spatial attention aggregation
IF 4.2 | CAS Tier 3, Computer Science | Q2, Computer Science, Artificial Intelligence | Pub Date: 2026-01-01 | Epub Date: 2025-11-03 | DOI: 10.1016/j.imavis.2025.105798
Hailang Wang, Keke Duan, Mingzi Zhang, Li Ma
3D point cloud semantic segmentation plays a pivotal role in comprehending 3D scenes and facilitating environmental perception. Existing studies predominantly emphasize the extraction of local geometric structures, but they often overlook the incorporation of local boundary cues and long-range spatial relationships. This limitation hampers precise delineation of object boundaries and impairs the distinction of long-distance instances. To address these challenges, we propose LSBE-Net, a novel segmentation algorithm designed to extract local boundary features and integrate spatial context features. The Local Surface Representation (LSR) module is introduced to capture local geometric shapes by encoding both surface and positional features, thereby providing critical structural information. The Local Boundary Enhancement (LBE) module extracts boundary features and fuses them with geometric and semantic features through a transformer mechanism within local neighborhoods, enabling the learning of contextual relationships and refinement of boundary delineation. These features are aggregated through the Spatial Encoding Attention (SEA) module, which facilitates the learning of long-range dependencies and spatial relationships across the point cloud. The proposed LSBE-Net is extensively evaluated on three large-scale benchmark datasets: S3DIS, Toronto3D, and Semantic3D. Our method achieves competitive mean Intersection over Union (mIoU) scores of 66.1%, 82.3%, and 78.0%, respectively, demonstrating its effectiveness and robustness in diverse real-world scenarios.
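Local-neighborhood encoding of the kind used by LSR-style modules typically starts from k-nearest-neighbor gathering with relative-position features; the sketch below shows that common first step. The specific feature set (offset, center, distance) is an assumption made for illustration and not the exact encoding used in LSBE-Net.

```python
# Minimal sketch, assuming PyTorch; the feature layout is illustrative.
import torch

def knn_relative_features(xyz: torch.Tensor, k: int = 16) -> torch.Tensor:
    """xyz: (N, 3) point coordinates -> (N, k, 7) local features:
    [neighbour offset (3), center point (3), distance (1)]."""
    dists = torch.cdist(xyz, xyz)                      # (N, N) pairwise distances
    idx = dists.topk(k, largest=False).indices         # (N, k) nearest neighbours
    neighbours = xyz[idx]                              # (N, k, 3)
    offset = neighbours - xyz[:, None, :]              # relative positions
    dist = offset.norm(dim=-1, keepdim=True)           # (N, k, 1)
    center = xyz[:, None, :].expand(-1, k, -1)         # (N, k, 3)
    return torch.cat([offset, center, dist], dim=-1)   # ready for a shared MLP

if __name__ == "__main__":
    pts = torch.rand(1024, 3)
    print(knn_relative_features(pts).shape)             # torch.Size([1024, 16, 7])
```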
{"title":"LSBE-Net: Semantic segmentation of large-scale point cloud scenes via local boundary feature and spatial attention aggregation","authors":"Hailang Wang,&nbsp;Keke Duan,&nbsp;Mingzi Zhang,&nbsp;Li Ma","doi":"10.1016/j.imavis.2025.105798","DOIUrl":"10.1016/j.imavis.2025.105798","url":null,"abstract":"<div><div>3D point cloud semantic segmentation plays a pivotal role in comprehending 3D scenes and facilitating environmental perception. Existing studies predominantly emphasize the extraction of local geometric structures, but they often overlook the incorporation of local boundary cues and long-range spatial relationships. This limitation hampers precise delineation of object boundaries and impairs the distinction of long distance instances. To address these challenges, we propose LSBE-Net, a novel segmentation algorithm designed to extract local boundary features and integrate spatial context features. The Local Surface Representation (LSR) module is introduced to capture local geometric shapes by encoding both surface and positional features, thereby providing critical structural information. The Local Boundary Enhancement (LBE) module extracts boundary features and fuses them with geometric and semantic features through a transformer mechanism within local neighborhoods, enabling the learning of contextual relationships and refinement of boundary delineation. These features are aggregated through the Spatial Encoding Attention (SEA) module, which facilitates the learning of long-range dependencies and spatial relationship across the point cloud. The proposed LSBE-Net is extensively evaluated on three large-scale benchmark datasets: S3DIS, Toronto3D, and Semantic3D. Our method achieves competitive mean Intersection over Union (mIoU) scores of 66.1%, 82.3%, and 78.0%, respectively, demonstrating its effectiveness and robustness in diverse real-world scenarios.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105798"},"PeriodicalIF":4.2,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145468939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
CI-TransCNN: A class imbalance hybrid CNN-Transformer Network for facial attribute recognition
IF 4.2 | CAS Tier 3, Computer Science | Q2, Computer Science, Artificial Intelligence | Pub Date: 2026-01-01 | Epub Date: 2025-11-10 | DOI: 10.1016/j.imavis.2025.105823
Yanfei Liu , Youchang Shi , Yufei Long , Miaosen Xu , Junhua Chen , Yuanqian Li , Hao Wen
Recent facial attribute recognition (FAR) methods often struggle to capture global dependencies and are further challenged by severe class imbalance, large intra-class variations, and high inter-class similarity, ultimately limiting their overall performance. To address these challenges, we propose a network combining CNN and Transformer, termed Class Imbalance Transformer-CNN (CI-TransCNN), for facial attribute recognition, which mainly consists of a TransCNN backbone and a Dual Attention Feature Fusion (DAFF) module. In TransCNN, we incorporate a Structure Self-Attention (StructSA) to improve the utilization of structural patterns in images and propose an Inverted Residual Convolutional GLU (IRC-GLU) to enhance model robustness. This design enables TransCNN to effectively capture multi-level and multi-scale features while integrating both global and local information. DAFF is presented to fuse the features extracted from TransCNN to further improve the feature’s discriminability by using spatial attention and channel attention. Moreover, a Class-Imbalance Binary Cross-Entropy (CIBCE) loss is proposed to improve the model performance on datasets with class imbalance, large intra-class variation, and high inter-class similarity. Experimental results on the CelebA and LFWA datasets show that our method effectively addresses issues such as class imbalance and achieves superior performance compared to existing state-of-the-art CNN- and Transformer-based FAR approaches.
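The class-imbalance idea behind a CIBCE-style loss can be illustrated with a binary cross-entropy whose positive terms are re-weighted by each attribute's positive frequency, so that rarer attributes receive larger weights. The weighting scheme below is an assumed stand-in, not the paper's exact formulation.

```python
# Minimal sketch, assuming PyTorch; the inverse-frequency weighting is an
# illustrative choice, not the published CIBCE loss.
import torch
import torch.nn.functional as F

def imbalance_weighted_bce(logits: torch.Tensor,
                           targets: torch.Tensor,
                           pos_freq: torch.Tensor) -> torch.Tensor:
    """logits/targets: (B, A) for A binary attributes; pos_freq: (A,) in (0, 1)."""
    pos_weight = (1.0 - pos_freq) / pos_freq.clamp(min=1e-6)  # rarer => heavier
    return F.binary_cross_entropy_with_logits(logits, targets, pos_weight=pos_weight)

if __name__ == "__main__":
    logits = torch.randn(8, 40)                  # e.g. 40 CelebA-style attributes
    targets = (torch.rand(8, 40) > 0.8).float()  # sparse positives
    freq = torch.full((40,), 0.2)                # assumed 20% positive rate per attribute
    print(float(imbalance_weighted_bce(logits, targets, freq)))
```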
{"title":"CI-TransCNN: A class imbalance hybrid CNN-Transformer Network for facial attribute recognition","authors":"Yanfei Liu ,&nbsp;Youchang Shi ,&nbsp;Yufei Long ,&nbsp;Miaosen Xu ,&nbsp;Junhua Chen ,&nbsp;Yuanqian Li ,&nbsp;Hao Wen","doi":"10.1016/j.imavis.2025.105823","DOIUrl":"10.1016/j.imavis.2025.105823","url":null,"abstract":"<div><div>Recent facial attribute recognition (FAR) methods often struggle to capture global dependencies and are further challenged by severe class imbalance, large intra-class variations, and high inter-class similarity, ultimately limiting their overall performance. To address these challenges, we propose a network combining CNN and Transformer, termed Class Imbalance Transformer-CNN (CI-TransCNN), for facial attribute recognition, which mainly consists of a TransCNN backbone and a Dual Attention Feature Fusion (DAFF) module. In TransCNN, we incorporate a Structure Self-Attention (StructSA) to improve the utilization of structural patterns in images and propose an Inverted Residual Convolutional GLU (IRC-GLU) to enhance model robustness. This design enables TransCNN to effectively capture multi-level and multi-scale features while integrating both global and local information. DAFF is presented to fuse the features extracted from TransCNN to further improve the feature’s discriminability by using spatial attention and channel attention. Moreover, a Class-Imbalance Binary Cross-Entropy (CIBCE) loss is proposed to improve the model performance on datasets with class imbalance, large intra-class variation, and high inter-class similarity. Experimental results on the CelebA and LFWA datasets show that our method effectively addresses issues such as class imbalance and achieves superior performance compared to existing state-of-the-art CNN- and Transformer-based FAR approaches.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105823"},"PeriodicalIF":4.2,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145520878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Semantic-assisted unpaired image dehazing
IF 4.2 | CAS Tier 3, Computer Science | Q2, Computer Science, Artificial Intelligence | Pub Date: 2026-01-01 | Epub Date: 2025-11-06 | DOI: 10.1016/j.imavis.2025.105818
Yang Yang, Lei Zhang, Ke Pang, Tongtong Chen, Xiaodong Yue
Recently, a series of innovative unpaired image dehazing techniques has been introduced. These methods relieve the pressure of collecting paired data, yet they typically overlook the integration of semantic information, which is essential for a more comprehensive dehazing process. Our research aims to bridge this gap by proposing a novel method that fully integrates feature information into unpaired image dehazing. Specifically, we propose a semantic information-guided feature enhancement and fusion block, which selectively fuses the refined features guided by the semantic result layer and semantic feature layer based on the uncertainty of semantic information. Besides, our method adopts semantic information to guide the generation of haze in the training process. This approach results in the creation of a more diverse set of hazy images, which in turn enhances the dehazing performance. Furthermore, in terms of the loss function, we introduce a loss term that constrains the semantic information entropy of the dehazing results. This constraint ensures that the dehazed images not only achieve clarity but also retain semantic accuracy and integrity. Extensive experiments validate our superiority over other methods and the effectiveness of our designs. The code is available at .
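The entropy constraint on the dehazing results can be sketched as a regularizer that penalizes the per-pixel entropy of a segmentation head's predictions on the dehazed image, encouraging semantically confident regions. The segmentation head below is an arbitrary placeholder, and the exact loss term used in the paper may differ.

```python
# Minimal sketch, assuming PyTorch; the 1x1-conv "segmenter" is a placeholder.
import torch
import torch.nn as nn

def semantic_entropy_loss(seg_logits: torch.Tensor) -> torch.Tensor:
    """seg_logits: (B, K, H, W) from a segmentation head applied to the dehazed image."""
    p = seg_logits.softmax(dim=1)
    entropy = -(p * torch.log(p.clamp(min=1e-8))).sum(dim=1)  # (B, H, W) per-pixel entropy
    return entropy.mean()

if __name__ == "__main__":
    seg_head = nn.Conv2d(3, 19, 1)           # placeholder segmentation head, 19 classes
    dehazed = torch.rand(2, 3, 128, 128)      # output of a dehazing network
    print(float(semantic_entropy_loss(seg_head(dehazed))))
```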
{"title":"Semantic-assisted unpaired image dehazing","authors":"Yang Yang,&nbsp;Lei Zhang,&nbsp;Ke Pang,&nbsp;Tongtong Chen,&nbsp;Xiaodong Yue","doi":"10.1016/j.imavis.2025.105818","DOIUrl":"10.1016/j.imavis.2025.105818","url":null,"abstract":"<div><div>Recently, a series of innovative unpaired image dehazing techniques have been introduced, they have relieved pressure from collecting paired data, yet these methods typically overlook the integration of semantic information, which is essential for a more comprehensive dehazing process. Our research aims to bridge this gap by proposing a novel method that fully integrates feature information into unpaired image dehazing. Specifically, we propose a semantic information-guided feature enhancement and fusion block, which selectively fuses the refined features guided by the semantic result layer and semantic feature layer based on the uncertainty of semantic information. Besides, our method adopts semantic information to guide the generation of haze in the training process. This approach results in the creation of a more diverse set of hazy images, which in turn enhances the dehazing performance. Furthermore, in terms of the loss function, we introduce a loss term that constrains the semantic information entropy of the dehazing results. This constraint ensures that the dehazed images not only achieve clarity but also retain semantic accuracy and integrity. Extensive experiments validate our superiority over other methods and the effectiveness of our designs. The code is available at .</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105818"},"PeriodicalIF":4.2,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145528664","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Single stage weakly supervised semantic segmentation via enhanced patch affinity
IF 4.2 | CAS Tier 3, Computer Science | Q2, Computer Science, Artificial Intelligence | Pub Date: 2025-12-01 | Epub Date: 2025-10-15 | DOI: 10.1016/j.imavis.2025.105791
Jingjie Jiang , Yuhui Zheng , Guoqing Zhang
Weakly supervised semantic segmentation (WSSS) with image-level labels typically employs class activation maps (CAMs) to generate pseudo-labels. Existing WSSS methods, whether based on CNN or Transformer frameworks, predominantly adopt multi-stage pipelines that entail stage-wise training and disparate strategies, resulting in complex inter-stage interactions. Furthermore, prior approaches frequently optimize CAMs directly via patch affinity in Vision Transformer (ViT), a computationally intensive process that may lead to excessive background activation and blurred object boundaries. To address these limitations, we propose a single-stage WSSS method called SSEPA (Single Stage WSSS with Enhanced Patch Affinity), which integrates end-to-end optimization of initial CAMs by patch affinity. To further enhance patch affinity in attention maps, we propose the Adaptive Layer Attention Fusion (ALAF) module. ALAF assesses the importance of attention from different depth layers by assigning weights and fusing them through dynamic weight vectors. Experiments on the PASCAL VOC and MS COCO datasets show that our method can significantly improve the quality of CAM and segmentation models. Compared to previous single-stage methods, SSEPA exhibits lower misclassification probability and produces more precise object boundaries, fully verifying the effectiveness of our approach.
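Refining a CAM with ViT patch affinity generally amounts to propagating patch-level class scores through a row-normalized patch-to-patch affinity matrix. The sketch below shows that operation with a uniform average over layers, where ALAF would instead learn dynamic per-layer weights; all shapes and the number of propagation steps are illustrative assumptions.

```python
# Minimal sketch, assuming PyTorch; uniform layer averaging stands in for the
# learned dynamic weighting described in the abstract.
import torch

def refine_cam_with_affinity(cam: torch.Tensor,
                             attn_layers: torch.Tensor,
                             iterations: int = 2) -> torch.Tensor:
    """cam: (N, C) patch-level class scores; attn_layers: (L, N, N) head-averaged
    attention maps from L transformer layers."""
    affinity = attn_layers.mean(dim=0)                     # fuse layers (uniform here)
    affinity = affinity / affinity.sum(dim=-1, keepdim=True)
    for _ in range(iterations):                            # random-walk style propagation
        cam = affinity @ cam
    return cam

if __name__ == "__main__":
    n_patches, n_classes, n_layers = 196, 20, 4
    attn = torch.rand(n_layers, n_patches, n_patches)
    cam = torch.rand(n_patches, n_classes)
    print(refine_cam_with_affinity(cam, attn).shape)       # torch.Size([196, 20])
```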
{"title":"Single stage weakly supervised semantic segmentation via enhanced patch affinity","authors":"Jingjie Jiang ,&nbsp;Yuhui Zheng ,&nbsp;Guoqing Zhang","doi":"10.1016/j.imavis.2025.105791","DOIUrl":"10.1016/j.imavis.2025.105791","url":null,"abstract":"<div><div>Weakly supervised semantic segmentation (WSSS) with image-level labels typically employs class activation maps (CAMs) to generate pseudo-labels. Existing WSSS methods, whether based on CNN or Transformer frameworks, predominantly adopt multi-stage pipelines that entail stage-wise training and disparate strategies, resulting in complex inter-stage interactions. Furthermore, prior approaches frequently optimize CAMs directly via patch affinity in Vision Transformer (ViT), a computationally intensive process and may lead to excessive background activation and blurred object boundaries. To address these limitations, we propose a single-stage WSSS method called SSEPA (Single Stage WSSS with Enhanced Patch Affinity), which integrates end-to-end optimization of initial CAMs by patch affinity. To further enhance patch affinity in attention maps, we propose the Adaptive Layer Attention Fusion (ALAF) module. ALAF assesses the importance of attention from different depth layers by assigning weights and fusing them through dynamic weight vectors. Experiments on the PASCAL VOC and MS COCO datasets show that our method can significantly improve the quality of CAM and segmentation models. Compared to previous single-stage methods, SSEPA exhibits lower misclassification probability and produces more precise object boundaries, fully verifying the effectiveness of our approach.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"164 ","pages":"Article 105791"},"PeriodicalIF":4.2,"publicationDate":"2025-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145419296","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Simultaneous acquisition of geometry and material for translucent objects
IF 4.2 | CAS Tier 3, Computer Science | Q2, Computer Science, Artificial Intelligence | Pub Date: 2025-12-01 | Epub Date: 2025-10-24 | DOI: 10.1016/j.imavis.2025.105793
Chenhao Li , Trung Thanh Ngo , Hajime Nagahara
Reconstructing the geometry and material properties of translucent objects from images is a challenging problem due to the complex light propagation of translucent media and the inherent ambiguity of inverse rendering. Therefore, previous works often make the assumption that the objects are opaque or use a simplified model to describe translucent objects, which significantly affects the reconstruction quality and limits the downstream tasks such as relighting or material editing. We present a novel framework that tackles this challenge through a combination of physically grounded and data-driven strategies. At the core of our approach is a hybrid rendering supervision scheme that fuses a differentiable physical renderer with a learned neural renderer to guide reconstruction. To further enhance supervision, we introduce an augmented loss tailored to the neural renderer. Our system takes as input a flash/no-flash image pair, enabling it to disambiguate complex light propagation that happens inside translucent objects. We train our model on a large-scale synthetic dataset of 117 K scenes and evaluate across both synthetic benchmarks and real-world captures. To mitigate the domain gap between synthetic and real data, we contribute a new real-world dataset with ground-truth surface normals and fine-tune our model accordingly. Extensive experiments validate the robustness and accuracy of our method across diverse scenarios.
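The hybrid rendering supervision can be summarized as a weighted photometric loss that compares both the physical renderer's output and the neural renderer's output against the captured image. The sketch below shows only that combination; both renderers are replaced by placeholders and the equal weights are arbitrary illustrative choices.

```python
# Minimal sketch, assuming PyTorch; the renderers and weights are placeholders.
import torch
import torch.nn.functional as F

def hybrid_render_loss(pred_physical: torch.Tensor,
                       pred_neural: torch.Tensor,
                       target: torch.Tensor,
                       w_phys: float = 0.5,
                       w_neural: float = 0.5) -> torch.Tensor:
    """Combine photometric supervision from a physical and a neural renderer."""
    return w_phys * F.l1_loss(pred_physical, target) + \
           w_neural * F.l1_loss(pred_neural, target)

if __name__ == "__main__":
    target = torch.rand(1, 3, 64, 64)    # e.g. the captured flash image
    print(float(hybrid_render_loss(torch.rand_like(target),
                                   torch.rand_like(target), target)))
```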
{"title":"Simultaneous acquisition of geometry and material for translucent objects","authors":"Chenhao Li ,&nbsp;Trung Thanh Ngo ,&nbsp;Hajime Nagahara","doi":"10.1016/j.imavis.2025.105793","DOIUrl":"10.1016/j.imavis.2025.105793","url":null,"abstract":"<div><div>Reconstructing the geometry and material properties of translucent objects from images is a challenging problem due to the complex light propagation of translucent media and the inherent ambiguity of inverse rendering. Therefore, previous works often make the assumption that the objects are opaque or use a simplified model to describe translucent objects, which significantly affects the reconstruction quality and limits the downstream tasks such as relighting or material editing. We present a novel framework that tackles this challenge through a combination of physically grounded and data-driven strategies. At the core of our approach is a hybrid rendering supervision scheme that fuses a differentiable physical renderer with a learned neural renderer to guide reconstruction. To further enhance supervision, we introduce an augmented loss tailored to the neural renderer. Our system takes as input a flash/no-flash image pair, enabling it to disambiguate complex light propagation that happens inside translucent objects. We train our model on a large-scale synthetic dataset of 117 K scenes and evaluate across both synthetic benchmarks and real-world captures. To mitigate the domain gap between synthetic and real data, we contribute a new real-world dataset with ground-truth surface normals and fine-tune our model accordingly. Extensive experiments validate the robustness and accuracy of our method across diverse scenarios.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"164 ","pages":"Article 105793"},"PeriodicalIF":4.2,"publicationDate":"2025-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145366055","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
SPM-CyViT: A self-supervised pre-trained cycle-consistent vision transformer with multi-branch for contrast-enhanced CT synthesis
IF 4.2 | CAS Tier 3, Computer Science | Q2, Computer Science, Artificial Intelligence | Pub Date: 2025-12-01 | Epub Date: 2025-10-31 | DOI: 10.1016/j.imavis.2025.105802
Hongwei Yang , Wen Zeng , Ke Chen , Zhan Hua , Yan Zhuang , Lin Han , Guoliang Liao , Yiteng Zhang , Hanyu Li , Zhenlin Li , Jiangli Lin
Contrast-enhanced computed tomography (CECT) is crucial for assessing vascular anatomy and pathology. However, the use of iodine contrast medium poses risks, including anaphylactic shock and acute kidney injury. To address this, we propose SPM-CyViT, a self-supervised pre-trained, multi-branch, cycle-consistent vision transformer that synthesizes high-quality virtual CECT from non-contrast CT (NCCT). Its generator employs a parallel encoding approach, combining vision transformer blocks with convolutional downsampling layers. Their encoded outputs are fused through a tailored cross-attention module, producing feature maps with multi-scale complementary properties. Employing masked reconstruction, the ViT global encoder enables self-supervised pre-training on diverse unlabeled CT slices. This overcomes fixed-dataset limitations and significantly improves generalization. Additionally, the model features a multi-branch decoder-discriminator design tailored to specific labels. It incorporates 40 keV monoenergetic enhanced CT (MonoE) as an auxiliary label to optimize contrast-sensitive regions. Results from the dual-center internal test set demonstrate that SPM-CyViT outperforms existing CECT synthesis models across all quantitative metrics. Furthermore, based on real CECT as a benchmark, three radiologists awarded SPM-CyViT an average clinical evaluation score of 4.21/5.00 across multiple perspectives. Additionally, SPM-CyViT exhibits robust generalization on the external test set, achieving a mean CNR of 10.96 for synthesized CECT, nearing the 12.38 value of real CECT, collectively underscoring its clinical application potential.
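The contrast-to-noise ratio (CNR) quoted in the results is commonly defined as the absolute difference between the mean intensity of an enhanced region of interest and a background region, divided by the background standard deviation. The exact variant used in the paper is not stated here, so treat the formula below as an assumption.

```python
# Minimal sketch of a common CNR definition; masks and intensities are synthetic.
import numpy as np

def cnr(image: np.ndarray, roi_mask: np.ndarray, bg_mask: np.ndarray) -> float:
    """CNR = |mean(ROI) - mean(background)| / std(background)."""
    roi = image[roi_mask]
    bg = image[bg_mask]
    return float(abs(roi.mean() - bg.mean()) / (bg.std() + 1e-8))

if __name__ == "__main__":
    img = np.random.normal(40, 5, (128, 128))            # synthetic "CT" slice
    img[32:64, 32:64] += 120                              # enhanced vessel-like region
    roi = np.zeros_like(img, dtype=bool); roi[32:64, 32:64] = True
    bg = np.zeros_like(img, dtype=bool);  bg[80:120, 80:120] = True
    print(round(cnr(img, roi, bg), 2))
```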
{"title":"SPM-CyViT: A self-supervised pre-trained cycle-consistent vision transformer with multi-branch for contrast-enhanced CT synthesis","authors":"Hongwei Yang ,&nbsp;Wen Zeng ,&nbsp;Ke Chen ,&nbsp;Zhan Hua ,&nbsp;Yan Zhuang ,&nbsp;Lin Han ,&nbsp;Guoliang Liao ,&nbsp;Yiteng Zhang ,&nbsp;Hanyu Li ,&nbsp;Zhenlin Li ,&nbsp;Jiangli Lin","doi":"10.1016/j.imavis.2025.105802","DOIUrl":"10.1016/j.imavis.2025.105802","url":null,"abstract":"<div><div>Contrast-enhanced computed tomography (CECT) is crucial for assessing vascular anatomy and pathology. However, the use of iodine contrast medium poses risks, including anaphylactic shock and acute kidney injury. To address this, we propose SPM-CyViT, a self-supervised pre-trained, multi-branch, cycle-consistent vision transformer that synthesizes high-quality virtual CECT from non-contrast CT (NCCT). Its generator employs a parallel encoding approach, combining vision transformer blocks with convolutional downsampling layers. Their encoded outputs are fused through a tailored cross-attention module, producing feature maps with multi-scale complementary properties. Employing masked reconstruction, the ViT global encoder enables self-supervised pre-training on diverse unlabeled CT slices. This overcomes fixed-dataset limitations and significantly improves generalization. Additionally, the model features a multi-branch decoder-discriminator design tailored to specific labels. It incorporates 40 keV monoenergetic enhanced CT (MonoE) as an auxiliary label to optimize contrast-sensitive regions. Results from the dual-center internal test set demonstrate that SPM-CyViT outperforms existing CECT synthesis models across all quantitative metrics. Furthermore, based on real CECT as a benchmark, three radiologists awarded SPM-CyViT an average clinical evaluation score of 4.215.00 across multiple perspectives. Additionally, SPM-CyViT exhibits robust generalization on the external test set, achieving a mean CNR of 10.96 for synthesized CECT, nearing the 12.38 value of real CECT, collectively underscoring its clinical application potential.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"164 ","pages":"Article 105802"},"PeriodicalIF":4.2,"publicationDate":"2025-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145467697","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0