W-MambaFuse: A wavelet decomposition and adaptive state-space modeling approach for anatomical and functional image fusion
Pub Date: 2026-01-01 | Epub Date: 2025-11-08 | DOI: 10.1016/j.imavis.2025.105796
Bowen Zhong , Shijie Li , Xuan Deng , Zheng Li
Anatomical-functional image fusion plays a critical role in a variety of medical and biological applications. Current convolutional neural network-based fusion algorithms are constrained by their limited receptive fields, impeding the effective modeling of long-range dependencies in medical images. While transformer-based architectures possess global modeling capabilities, they face computational challenges due to the quadratic complexity of their self-attention mechanisms. To address these limitations, we propose a network based on wavelet-domain decomposition and an adaptive selective structured state space model, termed W-MambaFuse, for anatomical and functional image fusion. Specifically, the network first applies a wavelet transform to enlarge the receptive field of the convolutional layers, facilitating the capture of low-frequency structural outlines and high-frequency textural primitives. Furthermore, we develop an adaptive gated fusion module, referred to as CNN-Mamba Gated (MCG), which leverages the dynamic modeling capability of state space models and the local feature extraction strengths of convolutional neural networks. This design facilitates the effective extraction of both intra-modal and inter-modal features, thereby enhancing multimodal image fusion. Experimental results on benchmark datasets demonstrate that W-MambaFuse consistently outperforms pure CNN-based models, transformer-based models, and CNN-transformer hybrid approaches in terms of both visual quality and quantitative evaluations. Our code is publicly available at https://github.com/Bowen-Zhong/W-Mamba.
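The abstract names two ingredients: a wavelet step that trades spatial resolution for a larger effective receptive field, and a gated fusion of a convolutional branch with a state-space (Mamba-style) branch. The sketch below is a rough, hedged illustration in PyTorch; the Haar decomposition and the gating form are assumptions for illustration, not the paper's actual MCG design.

```python
import torch
import torch.nn as nn

def haar_dwt2d(x):
    """Single-level 2D Haar decomposition of a feature map x: (B, C, H, W).
    Returns the low-frequency band (LL) and three high-frequency bands (LH, HL, HH),
    each at half resolution, which is how one wavelet step enlarges the effective
    receptive field of the convolutions that follow."""
    a = x[..., 0::2, 0::2]
    b = x[..., 0::2, 1::2]
    c = x[..., 1::2, 0::2]
    d = x[..., 1::2, 1::2]
    ll = (a + b + c + d) / 2
    lh = (a - b + c - d) / 2
    hl = (a + b - c - d) / 2
    hh = (a - b - c + d) / 2
    return ll, lh, hl, hh

class GatedBranchFusion(nn.Module):
    """Hypothetical gated fusion of a convolutional (local) branch and a
    state-space (global) branch; a per-pixel gate decides how much of each
    branch to keep. The paper's MCG module is not reproduced here."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(2 * channels, channels, kernel_size=1),
                                  nn.Sigmoid())

    def forward(self, local_feat, global_feat):
        g = self.gate(torch.cat([local_feat, global_feat], dim=1))
        return g * local_feat + (1 - g) * global_feat
```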
{"title":"W-MambaFuse: A wavelet decomposition and adaptive state-space modeling approach for anatomical and functional image fusion","authors":"Bowen Zhong , Shijie Li , Xuan Deng , Zheng Li","doi":"10.1016/j.imavis.2025.105796","DOIUrl":"10.1016/j.imavis.2025.105796","url":null,"abstract":"<div><div>Anatomical-functional image fusion plays a critical role in a variety of medical and biological applications. Current convolutional neural network-based fusion algorithms are constrained by their limited receptive fields, impeding the effective modeling of long-range dependencies in medical images. While transformer-based architectures possess global modeling capabilities, they face computational challenges due to the quadratic complexity of their self-attention mechanisms. To address these limitations, we propose a network based on wavelet-domain decomposition and an adaptive selectively structured state space model, termed as W-MambaFuse, for anatomical and functional image fusion. Specifically, the network first applies a wavelet transform to enlarge the receptive field of the convolutional layers, facilitating the capture of low-frequency structural outlines and high-frequency textural primitives. Furthermore, we develop an adaptive gated fusion module, referred to as CNN-Mamba Gated (MCG), which leverages the dynamic modeling capability of state space models and the local feature extraction strengths of convolutional neural networks. This design facilitates the effective extraction of both intra-modal and inter-modal features, thereby enhancing multimodal image fusion. Experimental results on benchmark datasets demonstrate that W-MambaFuse consistently outperforms pure CNN-based models, transformer-based models, and CNN-transformer hybrid approaches in terms of both visual quality and quantitative evaluations. Our code is publicly available at <span><span>https://github.com/Bowen-Zhong/W-Mamba</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105796"},"PeriodicalIF":4.2,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145521232","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A human layout consistency framework for image-based virtual try-on
Pub Date: 2026-01-01 | Epub Date: 2025-11-19 | DOI: 10.1016/j.imavis.2025.105831
Rong Huang , Zhicheng Wang , Hao Liu , Aihua Dong
Image-based virtual try-on, commonly framed as a generative image-to-image translation task, has garnered significant research interest because it eliminates the need for costly 3D scanning devices. In this field, image inpainting and cycle-consistency have been the dominant frameworks, but they still face challenges in cross-attribute adaptation and parameter sharing between try-on networks. This paper proposes a new framework, termed human layout consistency, based on the intuitive insight that a high-quality try-on result should align with a coherent human layout. Under the proposed framework, a try-on network is equipped with an upstream Human Layout Generator (HLG) and a downstream Human Layout Parser (HLP). The former generates an expected human layout as if the person were wearing the selected target garment, while the latter extracts the actual human layout parsed from the try-on result. The supervisory signals, free from ground-truth image pairs, are constructed by assessing the consistency between the expected and actual human layouts. We design a dual-phase training strategy, first warming up HLG and HLP, then training the try-on network by incorporating the supervisory signals based on human layout consistency. On this basis, the proposed framework enables arbitrary selection of target garments during training, thereby endowing the try-on network with cross-attribute adaptation. Moreover, the proposed framework operates with a single try-on network, rather than two physically separate ones, thereby avoiding the parameter-sharing issue. We conducted both qualitative and quantitative experiments on the benchmark VITON dataset. Experimental results demonstrate that our proposal generates high-quality try-on results, outperforming baselines by margins of 0.75% to 10.58%. Ablation and visualization results further reveal that the proposed method exhibits superior adaptability to cross-attribute translations, showcasing its potential for practical application.
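One plausible form of the layout-consistency supervision is a divergence between the expected layout predicted by HLG and the layout parsed from the try-on result by HLP. The sketch below is a hypothetical illustration only; the paper's exact loss is not specified in the abstract.

```python
import torch.nn.functional as F

def layout_consistency_loss(expected_logits, actual_logits):
    """Penalize disagreement between the expected human layout (from HLG) and
    the layout parsed from the try-on result (by HLP).
    Both tensors: (B, K, H, W) per-pixel scores over K layout classes."""
    expected = F.softmax(expected_logits, dim=1).detach()   # treat the HLG output as a soft target
    log_actual = F.log_softmax(actual_logits, dim=1)
    return F.kl_div(log_actual, expected, reduction="batchmean")
```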
{"title":"A human layout consistency framework for image-based virtual try-on","authors":"Rong Huang , Zhicheng Wang , Hao Liu , Aihua Dong","doi":"10.1016/j.imavis.2025.105831","DOIUrl":"10.1016/j.imavis.2025.105831","url":null,"abstract":"<div><div>Image-based virtual try-on, commonly framed as a generative image-to-image translation task, has garnered significant research interest due to its elimination of the need for costly 3D scanning devices. In this field, image inpainting and cycle-consistency have been the dominant frameworks, but they still face challenges in cross-attribute adaptation and parameter sharing between try-on networks. This paper proposes a new framework, termed human layout consistency, based on the intuitive insight that a high-quality try-on result should align with a coherent human layout. Under the proposed framework, a try-on network is equipped with an upstream Human Layout Generator (HLG) and a downstream Human Layout Parser (HLP). The former generates an expected human layout as if the person were wearing the selected target garment, while the latter extracts an actual human layout parsed from the try-on result. The supervisory signals, free from the ground-truth image pairs, are constructed by assessing the consistencies between the expected and actual human layouts. We design a dual-phase training strategy, first warming up HLG and HLP, then training try-on network by incorporating the supervisory signals based on human layout consistency. On this basis, the proposed framework enables arbitrary selection of target garments during training, thereby endowing the try-on network with the cross-attribute adaptation. Moreover, the proposed framework operates with a single try-on network, rather than two physically separate ones, thereby avoiding the parameter-sharing issue. We conducted both qualitative and quantitative experiments on the benchmark VITON dataset. Experimental results demonstrate that our proposal can generate high-quality try-on results, outperforming baselines by a margin of 0.75% to 10.58%. Ablation and visualization results further reveal that the proposed method exhibits superior adaptability to cross-attribute translations, showcasing its potential for practical application.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105831"},"PeriodicalIF":4.2,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145579501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Advanced fusion of IoT and AI technologies for smart environments: Enhancing environmental perception and mobility solutions for visually impaired individuals
Pub Date: 2026-01-01 | Epub Date: 2025-11-19 | DOI: 10.1016/j.imavis.2025.105827
Nouf Nawar Alotaibi , Mrim M. Alnfiai , Mona Mohammed Alnahari , Salma Mohsen M. Alnefaie , Faiz Abdullah Alotaibi
Objective
To develop a robust model that integrates multiple sensor modalities to enhance environmental perception and mobility for visually impaired individuals, improving their autonomy and safety in both indoor and outdoor settings.
Methods
The proposed system utilizes advanced IoT and AI technologies, integrating data from proximity, ambient-light, and motion sensors through recursive Bayesian filtering, kernel-based fusion algorithms, and probabilistic graphical models. The study followed a multidisciplinary approach combining the Internet of Things (IoT) and Artificial Intelligence (AI) and was conducted over six months (April 2024 to September 2024) in Saudi Arabia, using resources from Najran University. Data collection involved deploying IoT devices across diverse indoor and outdoor environments, including residential areas, commercial spaces, and urban streets, under varying lighting, weather, and dynamic conditions to ensure real-world applicability. The resulting dataset was used to train and evaluate the model's accuracy in real-time environmental context estimation and motion activity detection, following a rigorous training and validation process to ensure reliability and scalability across diverse scenarios. Ethical considerations were adhered to throughout the project, with no direct interaction with human subjects.
Results
The proposed model demonstrated an accuracy of 85% in predicting environmental context and 82% in motion detection, achieving precision and F1-scores of 88% and 85%, respectively. Real-time implementation provided reliable, dynamic feedback on environmental changes and motion activities, significantly enhancing situational awareness.
Conclusion
The proposed model effectively combines sensor data to deliver real-time, context-aware assistance for visually impaired individuals, improving their ability to navigate complex environments. The system offers a significant advancement in assistive technology and holds promise for broader applications with further enhancements.
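The Methods above rely on recursive Bayesian filtering to fuse the proximity, ambient-light, and motion streams into a context estimate. A minimal discrete-state illustration of one such update follows; the context classes and sensor likelihood values are purely hypothetical, not taken from the study.

```python
import numpy as np

def bayes_update(prior, likelihood):
    """One recursive Bayesian filtering step over discrete context states."""
    posterior = prior * likelihood
    return posterior / posterior.sum()

# Three hypothetical contexts: indoor, outdoor, busy street.
belief = np.array([1 / 3, 1 / 3, 1 / 3])      # uniform prior over contexts
sensor_likelihoods = [
    np.array([0.7, 0.2, 0.1]),   # proximity sensor: P(reading | context), illustrative values
    np.array([0.2, 0.5, 0.3]),   # ambient-light sensor, illustrative values
    np.array([0.1, 0.3, 0.6]),   # motion sensor, illustrative values
]
for lik in sensor_likelihoods:   # fuse sensors by chaining updates (assumes conditional independence)
    belief = bayes_update(belief, lik)
print(belief)                    # posterior over contexts after fusing all three sensors
```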
{"title":"Advanced fusion of IoT and AI technologies for smart environments: Enhancing environmental perception and mobility solutions for visually impaired individuals","authors":"Nouf Nawar Alotaibi , Mrim M. Alnfiai , Mona Mohammed Alnahari , Salma Mohsen M. Alnefaie , Faiz Abdullah Alotaibi","doi":"10.1016/j.imavis.2025.105827","DOIUrl":"10.1016/j.imavis.2025.105827","url":null,"abstract":"<div><h3>Objective</h3><div>To develop a robust proposed model that integrates multiple sensor modalities to enhance environmental perception and mobility for visually impaired individuals, improving their autonomy and safety in both indoor and outdoor settings.</div></div><div><h3>Methods</h3><div>The proposed system utilizes advanced IoT and AI technologies, integrating data from proximity, ambient light, and motion sensors through recursive Bayesian filtering, kernel-based fusion algorithms, and probabilistic graphical models. A comprehensive dataset was collected across diverse environments to train and evaluate the model's accuracy in real-time environmental context estimation and motion activity detection. This study employed a multidisciplinary approach, integrating the Internet of Things (IoT) and Artificial Intelligence (AI), to develop a proposed model for assisting visually impaired individuals. The study was conducted over six months (April 2024 to September 2024) in Saudi Arabia, utilizing resources from Najran University. Data collection involved deploying IoT devices across various indoor and outdoor environments, including residential areas, commercial spaces, and urban streets, to ensure diversity and real-world applicability. The system utilized proximity sensors, ambient light sensors, and motion detectors to gather data under different lighting, weather, and dynamic conditions. Recursive Bayesian filtering, kernel-based fusion algorithms, and probabilistic graphical models were employed to process the sensor inputs and provide real-time environmental context and motion detection. The study followed a rigorous training and validation process using the collected dataset, ensuring reliability and scalability across diverse scenarios. Ethical considerations were adhered to throughout the project, with no direct interaction with human subjects.</div></div><div><h3>Results</h3><div>The Proposed model demonstrated an accuracy of 85% in predicting environmental context and 82% in motion detection, achieving precision and F1-scores of 88% and 85%, respectively. Real-time implementation provided reliable, dynamic feedback on environmental changes and motion activities, significantly enhancing situational awareness.</div></div><div><h3>Conclusion</h3><div>The Proposed model effectively combines sensor data to deliver real-time, context-aware assistance for visually impaired individuals, improving their ability to navigate complex environments. 
The system offers a significant advancement in assistive technology and holds promise for broader applications with further enhancements.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105827"},"PeriodicalIF":4.2,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145579498","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LET-CViT: A low-light enhanced two-stream CNN and vision transformer for Deepfake detection
Pub Date: 2026-01-01 | Epub Date: 2025-11-20 | DOI: 10.1016/j.imavis.2025.105828
Gaoming Yang , Yifan Song , Xiangyu Yang , Ji Zhang
With the development of generative technologies, fake faces have become increasingly realistic. Unknown forgery methods and complex generation environments make Deepfake detection challenging. While existing detectors can identify most forged images under normal lighting conditions, their performance deteriorates in different lighting environments, especially under low-light conditions. In this paper, to address the challenge of forged face detection in low-light environments, we present a novel Low-light Enhanced Two-stream CNN and Vision Transformer (LET-CViT) framework, which contains our improved ReLU-CBAM Depthwise Separable Convolution (RC-DSC) block and Dynamic Sigmoid-Gated Multi-Head Attention (DSG-MHA) block. In addition, LET-CViT incorporates two innovative modules, namely Low-light Enhancement with Denoising (LED) and Wavelet Transform high-frequency Fusion (WTF). Specifically, the LED module improves low-light image quality and captures fake textures through light enhancement and directional denoising. The WTF module then captures multi-scale features and focuses on high-frequency information by repeatedly fusing the high-frequency sub-bands obtained after discrete wavelet transformation, while reducing interference from low-frequency information. Extensive experiments on several datasets show that our framework reliably detects forged videos under low-light conditions. The AUCs for the unseen DeeperForensics-1.0 and DFD datasets reach 95.73% and 95.24%, respectively, significantly outperforming other mainstream models. The code for reproducing our results is publicly available at https://github.com/SYF-code/LET-CViT.
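Of the blocks named above, the dynamic sigmoid gating admits a compact illustration: a standard multi-head self-attention whose update is rescaled by an input-dependent sigmoid gate. This is a hedged sketch of the gating idea only, not the paper's DSG-MHA block.

```python
import torch
import torch.nn as nn

class SigmoidGatedMHA(nn.Module):
    """Multi-head self-attention whose output is modulated by a learned,
    input-dependent sigmoid gate (illustrative sketch only)."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, x):                      # x: (B, N, dim) token sequence
        attn_out, _ = self.attn(x, x, x)
        return x + self.gate(x) * attn_out     # gate dynamically scales the attention update
```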
{"title":"LET-CViT: A low-light enhanced two-stream CNN and vision transformer for Deepfake detection","authors":"Gaoming Yang , Yifan Song , Xiangyu Yang , Ji Zhang","doi":"10.1016/j.imavis.2025.105828","DOIUrl":"10.1016/j.imavis.2025.105828","url":null,"abstract":"<div><div>With the development of generative technologies, fake faces have become increasingly realistic. Unknown forgery methods and complex generation environments make Deepfake detection challenging. While existing detectors can identify most forged images under normal lighting conditions, their performance deteriorates in different lighting environments, especially under low-light conditions. In this paper, to address the challenges of forged face detection performance in low-light environments, we present a novel Low-light Enhanced Two-stream CNN and Vision Transformer (LET-CViT) framework, which contains our improved ReLU-CBAM Depthwise Separable Convolution (RC-DSC) block and Dynamic Sigmoid-Gated Multi-Head Attention (DSG-MHA) block. At the same time, the LET-CViT incorporates two innovative modules, namely Low-light Enhancement with Denoising (LED) and Wavelet Transform high-frequency Fusion (WTF). Specifically, the premier LED module is capable of improving low-light image quality and capturing fake textures with light enhancement technology and directional denoising. Subsequently, the proposed WTF module captures multi-scale features and focuses on high-frequency information by multiple fusions of high-frequency sub-bands after discrete wavelet transformation, while reducing the interference of low-frequency information. Extensive experiments on several datasets show that our framework is able to reliably detect forged videos under low-light conditions. The AUCs for the unseen DeeperForensics-1.0 and DFD datasets reach 95.73% and 95.24% respectively, significantly outperforming other mainstream models. The code for reproducing our results is publicly available here: <span><span>https://github.com/SYF-code/LET-CViT</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105828"},"PeriodicalIF":4.2,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145624272","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LSBE-Net: Semantic segmentation of large-scale point cloud scenes via local boundary feature and spatial attention aggregation
Pub Date: 2026-01-01 | Epub Date: 2025-11-03 | DOI: 10.1016/j.imavis.2025.105798
Hailang Wang, Keke Duan, Mingzi Zhang, Li Ma
3D point cloud semantic segmentation plays a pivotal role in comprehending 3D scenes and facilitating environmental perception. Existing studies predominantly emphasize the extraction of local geometric structures, but they often overlook the incorporation of local boundary cues and long-range spatial relationships. This limitation hampers precise delineation of object boundaries and impairs the distinction of long-distance instances. To address these challenges, we propose LSBE-Net, a novel segmentation algorithm designed to extract local boundary features and integrate spatial context features. The Local Surface Representation (LSR) module is introduced to capture local geometric shapes by encoding both surface and positional features, thereby providing critical structural information. The Local Boundary Enhancement (LBE) module extracts boundary features and fuses them with geometric and semantic features through a transformer mechanism within local neighborhoods, enabling the learning of contextual relationships and refinement of boundary delineation. These features are aggregated through the Spatial Encoding Attention (SEA) module, which facilitates the learning of long-range dependencies and spatial relationships across the point cloud. The proposed LSBE-Net is extensively evaluated on three large-scale benchmark datasets: S3DIS, Toronto3D, and Semantic3D. Our method achieves competitive mean Intersection over Union (mIoU) scores of 66.1%, 82.3%, and 78.0%, respectively, demonstrating its effectiveness and robustness in diverse real-world scenarios.
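For reference, the mIoU figures quoted above are computed per class and averaged. A minimal NumPy version of the metric is shown below; it is a generic implementation, not the authors' evaluation code.

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean Intersection-over-Union for point-wise semantic labels.
    pred, gt: (N,) integer class labels for N points."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return float(np.mean(ious))
```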
{"title":"LSBE-Net: Semantic segmentation of large-scale point cloud scenes via local boundary feature and spatial attention aggregation","authors":"Hailang Wang, Keke Duan, Mingzi Zhang, Li Ma","doi":"10.1016/j.imavis.2025.105798","DOIUrl":"10.1016/j.imavis.2025.105798","url":null,"abstract":"<div><div>3D point cloud semantic segmentation plays a pivotal role in comprehending 3D scenes and facilitating environmental perception. Existing studies predominantly emphasize the extraction of local geometric structures, but they often overlook the incorporation of local boundary cues and long-range spatial relationships. This limitation hampers precise delineation of object boundaries and impairs the distinction of long distance instances. To address these challenges, we propose LSBE-Net, a novel segmentation algorithm designed to extract local boundary features and integrate spatial context features. The Local Surface Representation (LSR) module is introduced to capture local geometric shapes by encoding both surface and positional features, thereby providing critical structural information. The Local Boundary Enhancement (LBE) module extracts boundary features and fuses them with geometric and semantic features through a transformer mechanism within local neighborhoods, enabling the learning of contextual relationships and refinement of boundary delineation. These features are aggregated through the Spatial Encoding Attention (SEA) module, which facilitates the learning of long-range dependencies and spatial relationship across the point cloud. The proposed LSBE-Net is extensively evaluated on three large-scale benchmark datasets: S3DIS, Toronto3D, and Semantic3D. Our method achieves competitive mean Intersection over Union (mIoU) scores of 66.1%, 82.3%, and 78.0%, respectively, demonstrating its effectiveness and robustness in diverse real-world scenarios.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105798"},"PeriodicalIF":4.2,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145468939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CI-TransCNN: A class imbalance hybrid CNN-Transformer Network for facial attribute recognition
Pub Date: 2026-01-01 | Epub Date: 2025-11-10 | DOI: 10.1016/j.imavis.2025.105823
Yanfei Liu , Youchang Shi , Yufei Long , Miaosen Xu , Junhua Chen , Yuanqian Li , Hao Wen
Recent facial attribute recognition (FAR) methods often struggle to capture global dependencies and are further challenged by severe class imbalance, large intra-class variations, and high inter-class similarity, ultimately limiting their overall performance. To address these challenges, we propose a network combining CNN and Transformer, termed Class Imbalance Transformer-CNN (CI-TransCNN), for facial attribute recognition, which mainly consists of a TransCNN backbone and a Dual Attention Feature Fusion (DAFF) module. In TransCNN, we incorporate a Structure Self-Attention (StructSA) to improve the utilization of structural patterns in images and propose an Inverted Residual Convolutional GLU (IRC-GLU) to enhance model robustness. This design enables TransCNN to effectively capture multi-level and multi-scale features while integrating both global and local information. DAFF fuses the features extracted from TransCNN, further improving feature discriminability through spatial and channel attention. Moreover, a Class-Imbalance Binary Cross-Entropy (CIBCE) loss is proposed to improve model performance on datasets with class imbalance, large intra-class variation, and high inter-class similarity. Experimental results on the CelebA and LFWA datasets show that our method effectively addresses issues such as class imbalance and achieves superior performance compared to existing state-of-the-art CNN- and Transformer-based FAR approaches.
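The exact form of the CIBCE loss is not given in the abstract. One common way to counter attribute imbalance, shown here only as a hypothetical sketch, is to up-weight rare positive labels inside a multi-label binary cross-entropy.

```python
import torch
import torch.nn.functional as F

def class_balanced_bce(logits, targets, pos_freq):
    """Multi-label BCE with per-attribute positive weighting (illustrative only,
    not the paper's CIBCE loss).
    logits, targets: (B, A) attribute scores and 0/1 labels.
    pos_freq: (A,) fraction of positive samples per attribute in the training set."""
    pos_weight = (1.0 - pos_freq) / pos_freq.clamp(min=1e-6)   # rarer positives get larger weight
    return F.binary_cross_entropy_with_logits(logits, targets.float(), pos_weight=pos_weight)
```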
{"title":"CI-TransCNN: A class imbalance hybrid CNN-Transformer Network for facial attribute recognition","authors":"Yanfei Liu , Youchang Shi , Yufei Long , Miaosen Xu , Junhua Chen , Yuanqian Li , Hao Wen","doi":"10.1016/j.imavis.2025.105823","DOIUrl":"10.1016/j.imavis.2025.105823","url":null,"abstract":"<div><div>Recent facial attribute recognition (FAR) methods often struggle to capture global dependencies and are further challenged by severe class imbalance, large intra-class variations, and high inter-class similarity, ultimately limiting their overall performance. To address these challenges, we propose a network combining CNN and Transformer, termed Class Imbalance Transformer-CNN (CI-TransCNN), for facial attribute recognition, which mainly consists of a TransCNN backbone and a Dual Attention Feature Fusion (DAFF) module. In TransCNN, we incorporate a Structure Self-Attention (StructSA) to improve the utilization of structural patterns in images and propose an Inverted Residual Convolutional GLU (IRC-GLU) to enhance model robustness. This design enables TransCNN to effectively capture multi-level and multi-scale features while integrating both global and local information. DAFF is presented to fuse the features extracted from TransCNN to further improve the feature’s discriminability by using spatial attention and channel attention. Moreover, a Class-Imbalance Binary Cross-Entropy (CIBCE) loss is proposed to improve the model performance on datasets with class imbalance, large intra-class variation, and high inter-class similarity. Experimental results on the CelebA and LFWA datasets show that our method effectively addresses issues such as class imbalance and achieves superior performance compared to existing state-of-the-art CNN- and Transformer-based FAR approaches.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105823"},"PeriodicalIF":4.2,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145520878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Semantic-assisted unpaired image dehazing
Pub Date: 2026-01-01 | Epub Date: 2025-11-06 | DOI: 10.1016/j.imavis.2025.105818
Yang Yang, Lei Zhang, Ke Pang, Tongtong Chen, Xiaodong Yue
Recently, a series of innovative unpaired image dehazing techniques have been introduced. While they relieve the pressure of collecting paired data, these methods typically overlook the integration of semantic information, which is essential for a more comprehensive dehazing process. Our research aims to bridge this gap by proposing a novel method that fully integrates semantic information into unpaired image dehazing. Specifically, we propose a semantic information-guided feature enhancement and fusion block, which selectively fuses the refined features guided by the semantic result layer and semantic feature layer based on the uncertainty of semantic information. In addition, our method adopts semantic information to guide the generation of haze during training. This results in a more diverse set of hazy images, which in turn enhances dehazing performance. Furthermore, in terms of the loss function, we introduce a loss term that constrains the semantic information entropy of the dehazing results. This constraint ensures that the dehazed images not only achieve clarity but also retain semantic accuracy and integrity. Extensive experiments validate our superiority over other methods and the effectiveness of our designs. The code is available at .
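The entropy constraint mentioned above can be written, in one hedged form, as the mean per-pixel entropy of semantic predictions on the dehazed output; the paper's exact term may differ from this sketch.

```python
import torch
import torch.nn.functional as F

def semantic_entropy_loss(seg_logits):
    """Mean per-pixel entropy of semantic predictions on the dehazed image.
    seg_logits: (B, K, H, W); minimizing this encourages confident, semantically
    consistent predictions (illustrative form only)."""
    p = F.softmax(seg_logits, dim=1)
    entropy = -(p * torch.log(p.clamp(min=1e-8))).sum(dim=1)   # (B, H, W)
    return entropy.mean()
```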
{"title":"Semantic-assisted unpaired image dehazing","authors":"Yang Yang, Lei Zhang, Ke Pang, Tongtong Chen, Xiaodong Yue","doi":"10.1016/j.imavis.2025.105818","DOIUrl":"10.1016/j.imavis.2025.105818","url":null,"abstract":"<div><div>Recently, a series of innovative unpaired image dehazing techniques have been introduced, they have relieved pressure from collecting paired data, yet these methods typically overlook the integration of semantic information, which is essential for a more comprehensive dehazing process. Our research aims to bridge this gap by proposing a novel method that fully integrates feature information into unpaired image dehazing. Specifically, we propose a semantic information-guided feature enhancement and fusion block, which selectively fuses the refined features guided by the semantic result layer and semantic feature layer based on the uncertainty of semantic information. Besides, our method adopts semantic information to guide the generation of haze in the training process. This approach results in the creation of a more diverse set of hazy images, which in turn enhances the dehazing performance. Furthermore, in terms of the loss function, we introduce a loss term that constrains the semantic information entropy of the dehazing results. This constraint ensures that the dehazed images not only achieve clarity but also retain semantic accuracy and integrity. Extensive experiments validate our superiority over other methods and the effectiveness of our designs. The code is available at .</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105818"},"PeriodicalIF":4.2,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145528664","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Single stage weakly supervised semantic segmentation via enhanced patch affinity
Pub Date: 2025-12-01 | Epub Date: 2025-10-15 | DOI: 10.1016/j.imavis.2025.105791
Jingjie Jiang , Yuhui Zheng , Guoqing Zhang
Weakly supervised semantic segmentation (WSSS) with image-level labels typically employs class activation maps (CAMs) to generate pseudo-labels. Existing WSSS methods, whether based on CNN or Transformer frameworks, predominantly adopt multi-stage pipelines that entail stage-wise training and disparate strategies, resulting in complex inter-stage interactions. Furthermore, prior approaches frequently optimize CAMs directly via patch affinity in the Vision Transformer (ViT), a computationally intensive process that may lead to excessive background activation and blurred object boundaries. To address these limitations, we propose a single-stage WSSS method called SSEPA (Single Stage WSSS with Enhanced Patch Affinity), which integrates end-to-end optimization of the initial CAMs via patch affinity. To further enhance patch affinity in attention maps, we propose the Adaptive Layer Attention Fusion (ALAF) module. ALAF assesses the importance of attention from layers at different depths by assigning weights and fusing them through dynamic weight vectors. Experiments on the PASCAL VOC and MS COCO datasets show that our method significantly improves the quality of CAMs and segmentation models. Compared to previous single-stage methods, SSEPA exhibits lower misclassification probability and produces more precise object boundaries, fully verifying the effectiveness of our approach.
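A rough sketch of the layer-fusion idea is given below: attention maps from several ViT depths are combined with input-dependent weights derived from per-layer statistics. The real ALAF architecture is not specified in the abstract, so this is an illustrative assumption.

```python
import torch
import torch.nn as nn

class AdaptiveLayerAttentionFusion(nn.Module):
    """Fuse patch-affinity (attention) maps from several ViT depths with
    input-dependent weights (hypothetical sketch, not the paper's module)."""
    def __init__(self, num_layers):
        super().__init__()
        self.score = nn.Linear(num_layers, num_layers)

    def forward(self, attn_maps):                 # attn_maps: (B, L, N, N), L = number of layers
        stats = attn_maps.mean(dim=(2, 3))        # per-layer summary statistic, (B, L)
        weights = torch.softmax(self.score(stats), dim=-1)           # dynamic weight vector
        return (weights[:, :, None, None] * attn_maps).sum(dim=1)    # fused affinity, (B, N, N)
```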
{"title":"Single stage weakly supervised semantic segmentation via enhanced patch affinity","authors":"Jingjie Jiang , Yuhui Zheng , Guoqing Zhang","doi":"10.1016/j.imavis.2025.105791","DOIUrl":"10.1016/j.imavis.2025.105791","url":null,"abstract":"<div><div>Weakly supervised semantic segmentation (WSSS) with image-level labels typically employs class activation maps (CAMs) to generate pseudo-labels. Existing WSSS methods, whether based on CNN or Transformer frameworks, predominantly adopt multi-stage pipelines that entail stage-wise training and disparate strategies, resulting in complex inter-stage interactions. Furthermore, prior approaches frequently optimize CAMs directly via patch affinity in Vision Transformer (ViT), a computationally intensive process and may lead to excessive background activation and blurred object boundaries. To address these limitations, we propose a single-stage WSSS method called SSEPA (Single Stage WSSS with Enhanced Patch Affinity), which integrates end-to-end optimization of initial CAMs by patch affinity. To further enhance patch affinity in attention maps, we propose the Adaptive Layer Attention Fusion (ALAF) module. ALAF assesses the importance of attention from different depth layers by assigning weights and fusing them through dynamic weight vectors. Experiments on the PASCAL VOC and MS COCO datasets show that our method can significantly improve the quality of CAM and segmentation models. Compared to previous single-stage methods, SSEPA exhibits lower misclassification probability and produces more precise object boundaries, fully verifying the effectiveness of our approach.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"164 ","pages":"Article 105791"},"PeriodicalIF":4.2,"publicationDate":"2025-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145419296","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Simultaneous acquisition of geometry and material for translucent objects
Pub Date: 2025-12-01 | Epub Date: 2025-10-24 | DOI: 10.1016/j.imavis.2025.105793
Chenhao Li , Trung Thanh Ngo , Hajime Nagahara
Reconstructing the geometry and material properties of translucent objects from images is a challenging problem due to the complex light propagation of translucent media and the inherent ambiguity of inverse rendering. Consequently, previous works often assume that objects are opaque or use a simplified model to describe translucent objects, which significantly affects reconstruction quality and limits downstream tasks such as relighting or material editing. We present a novel framework that tackles this challenge through a combination of physically grounded and data-driven strategies. At the core of our approach is a hybrid rendering supervision scheme that fuses a differentiable physical renderer with a learned neural renderer to guide reconstruction. To further enhance supervision, we introduce an augmented loss tailored to the neural renderer. Our system takes as input a flash/no-flash image pair, enabling it to disambiguate the complex light propagation that happens inside translucent objects. We train our model on a large-scale synthetic dataset of 117K scenes and evaluate across both synthetic benchmarks and real-world captures. To mitigate the domain gap between synthetic and real data, we contribute a new real-world dataset with ground-truth surface normals and fine-tune our model accordingly. Extensive experiments validate the robustness and accuracy of our method across diverse scenarios.
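In spirit, the hybrid supervision combines reconstruction errors from both renderers. The sketch below is purely illustrative: the renderer callables, the weights, and the choice of L1 error are assumptions, and the paper's augmented loss term is not reproduced here.

```python
import torch.nn.functional as F

def hybrid_rendering_loss(pred_params, target_img, physical_render, neural_render,
                          w_phys=1.0, w_neural=1.0):
    """Combine supervision from a differentiable physical renderer and a learned
    neural renderer (illustrative only). `physical_render` and `neural_render`
    are placeholder callables mapping predicted geometry/material parameters to
    an image with the same shape as `target_img`."""
    loss_phys = F.l1_loss(physical_render(pred_params), target_img)
    loss_neural = F.l1_loss(neural_render(pred_params), target_img)
    return w_phys * loss_phys + w_neural * loss_neural
```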
{"title":"Simultaneous acquisition of geometry and material for translucent objects","authors":"Chenhao Li , Trung Thanh Ngo , Hajime Nagahara","doi":"10.1016/j.imavis.2025.105793","DOIUrl":"10.1016/j.imavis.2025.105793","url":null,"abstract":"<div><div>Reconstructing the geometry and material properties of translucent objects from images is a challenging problem due to the complex light propagation of translucent media and the inherent ambiguity of inverse rendering. Therefore, previous works often make the assumption that the objects are opaque or use a simplified model to describe translucent objects, which significantly affects the reconstruction quality and limits the downstream tasks such as relighting or material editing. We present a novel framework that tackles this challenge through a combination of physically grounded and data-driven strategies. At the core of our approach is a hybrid rendering supervision scheme that fuses a differentiable physical renderer with a learned neural renderer to guide reconstruction. To further enhance supervision, we introduce an augmented loss tailored to the neural renderer. Our system takes as input a flash/no-flash image pair, enabling it to disambiguate complex light propagation that happens inside translucent objects. We train our model on a large-scale synthetic dataset of 117 K scenes and evaluate across both synthetic benchmarks and real-world captures. To mitigate the domain gap between synthetic and real data, we contribute a new real-world dataset with ground-truth surface normals and fine-tune our model accordingly. Extensive experiments validate the robustness and accuracy of our method across diverse scenarios.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"164 ","pages":"Article 105793"},"PeriodicalIF":4.2,"publicationDate":"2025-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145366055","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SPM-CyViT: A self-supervised pre-trained cycle-consistent vision transformer with multi-branch for contrast-enhanced CT synthesis
Pub Date: 2025-12-01 | Epub Date: 2025-10-31 | DOI: 10.1016/j.imavis.2025.105802
Hongwei Yang , Wen Zeng , Ke Chen , Zhan Hua , Yan Zhuang , Lin Han , Guoliang Liao , Yiteng Zhang , Hanyu Li , Zhenlin Li , Jiangli Lin
Contrast-enhanced computed tomography (CECT) is crucial for assessing vascular anatomy and pathology. However, the use of iodine contrast medium poses risks, including anaphylactic shock and acute kidney injury. To address this, we propose SPM-CyViT, a self-supervised pre-trained, multi-branch, cycle-consistent vision transformer that synthesizes high-quality virtual CECT from non-contrast CT (NCCT). Its generator employs a parallel encoding approach, combining vision transformer blocks with convolutional downsampling layers. Their encoded outputs are fused through a tailored cross-attention module, producing feature maps with multi-scale complementary properties. Employing masked reconstruction, the ViT global encoder enables self-supervised pre-training on diverse unlabeled CT slices. This overcomes fixed-dataset limitations and significantly improves generalization. Additionally, the model features a multi-branch decoder-discriminator design tailored to specific labels. It incorporates 40 keV monoenergetic enhanced CT (MonoE) as an auxiliary label to optimize contrast-sensitive regions. Results from the dual-center internal test set demonstrate that SPM-CyViT outperforms existing CECT synthesis models across all quantitative metrics. Furthermore, with real CECT as the benchmark, three radiologists awarded SPM-CyViT an average clinical evaluation score of 4.21/5.00 across multiple perspectives. Additionally, SPM-CyViT exhibits robust generalization on the external test set, achieving a mean CNR of 10.96 for synthesized CECT, nearing the 12.38 of real CECT, collectively underscoring its clinical application potential.
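For context, the contrast-to-noise ratio (CNR) quoted above is commonly computed from a signal region and a background region. One standard definition is sketched below; the paper's exact ROI protocol is not stated here, so treat this as a generic illustration.

```python
import numpy as np

def contrast_to_noise_ratio(roi_signal, roi_background):
    """CNR = |mean(signal ROI) - mean(background ROI)| / std(background ROI).
    roi_signal, roi_background: arrays of pixel intensities (illustrative definition)."""
    return abs(roi_signal.mean() - roi_background.mean()) / (roi_background.std() + 1e-8)
```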
{"title":"SPM-CyViT: A self-supervised pre-trained cycle-consistent vision transformer with multi-branch for contrast-enhanced CT synthesis","authors":"Hongwei Yang , Wen Zeng , Ke Chen , Zhan Hua , Yan Zhuang , Lin Han , Guoliang Liao , Yiteng Zhang , Hanyu Li , Zhenlin Li , Jiangli Lin","doi":"10.1016/j.imavis.2025.105802","DOIUrl":"10.1016/j.imavis.2025.105802","url":null,"abstract":"<div><div>Contrast-enhanced computed tomography (CECT) is crucial for assessing vascular anatomy and pathology. However, the use of iodine contrast medium poses risks, including anaphylactic shock and acute kidney injury. To address this, we propose SPM-CyViT, a self-supervised pre-trained, multi-branch, cycle-consistent vision transformer that synthesizes high-quality virtual CECT from non-contrast CT (NCCT). Its generator employs a parallel encoding approach, combining vision transformer blocks with convolutional downsampling layers. Their encoded outputs are fused through a tailored cross-attention module, producing feature maps with multi-scale complementary properties. Employing masked reconstruction, the ViT global encoder enables self-supervised pre-training on diverse unlabeled CT slices. This overcomes fixed-dataset limitations and significantly improves generalization. Additionally, the model features a multi-branch decoder-discriminator design tailored to specific labels. It incorporates 40 keV monoenergetic enhanced CT (MonoE) as an auxiliary label to optimize contrast-sensitive regions. Results from the dual-center internal test set demonstrate that SPM-CyViT outperforms existing CECT synthesis models across all quantitative metrics. Furthermore, based on real CECT as a benchmark, three radiologists awarded SPM-CyViT an average clinical evaluation score of 4.215.00 across multiple perspectives. Additionally, SPM-CyViT exhibits robust generalization on the external test set, achieving a mean CNR of 10.96 for synthesized CECT, nearing the 12.38 value of real CECT, collectively underscoring its clinical application potential.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"164 ","pages":"Article 105802"},"PeriodicalIF":4.2,"publicationDate":"2025-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145467697","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}