Pub Date: 2025-11-19 | DOI: 10.1016/j.imavis.2025.105819
Wujin Li, Qian Xing, Wei He, Longyuan Guo, Jianhui Wu, Minzhi Zhao, Siyuan Chen
Single-image dehazing plays a critical role in various autonomous vision systems. Early methods relied on hand-crafted optimization techniques, whereas recent approaches leverage deep neural networks trained on synthetic data, owing to the scarcity of real-world paired datasets. However, this often results in domain bias when applied to outdoor scenes. In this paper, we present BSA-Dehaze, an unsupervised single-image dehazing framework that integrates a Multi-Scale Bitemporal Fusion Module (MBFM) and a Size-Aware Decoder (SA-Decoder). The method operates without requiring ground-truth images. Our method reformulates dehazing as a haze-to-clear image translation task. BSA-Dehaze incorporates a novel Encoder-SA-Decoder built with ResNet blocks, designed to better preserve image details and edge sharpness. To enhance feature fusion and training efficiency, we introduce the MBFM. A multi-scale discriminator (MSD) is proposed, along with Hinge Loss and Dynamic Block-wise Contrastive Loss, to improve training stability and emphasize challenging samples. Ablation studies verify the contribution of each component. Experimental results on SOTS outdoor, BeDDE, and a real-world dataset demonstrate that our method surpasses existing approaches in both performance and efficiency, despite being trained on significantly less data.
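For readers unfamiliar with the hinge adversarial objective cited above, the snippet below is a minimal, generic PyTorch sketch of hinge losses for a discriminator and generator (it is not the authors' implementation, and the tensor names are illustrative):

import torch
import torch.nn.functional as F

def hinge_d_loss(real_logits: torch.Tensor, fake_logits: torch.Tensor) -> torch.Tensor:
    # Discriminator hinge loss: push real logits above +1 and fake logits below -1.
    return F.relu(1.0 - real_logits).mean() + F.relu(1.0 + fake_logits).mean()

def hinge_g_loss(fake_logits: torch.Tensor) -> torch.Tensor:
    # Generator hinge loss: raise the discriminator's score on generated (dehazed) images.
    return -fake_logits.mean()

# Toy usage with random logits standing in for multi-scale discriminator outputs.
real, fake = torch.randn(8, 1), torch.randn(8, 1)
print(hinge_d_loss(real, fake).item(), hinge_g_loss(fake).item())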
{"title":"BSA-Dehaze: Multi-Scale Bitemporal Fusion and Size-Aware Decoder for Unsupervised Image Dehazing","authors":"Wujin Li , Qian Xing , Wei He , Longyuan Guo , Jianhui Wu , Minzhi Zhao , Siyuan Chen","doi":"10.1016/j.imavis.2025.105819","DOIUrl":"10.1016/j.imavis.2025.105819","url":null,"abstract":"<div><div>Single-image dehazing plays a critical role in various autonomous vision systems. Early methods relied on hand-crafted optimization techniques, whereas recent approaches leverage deep neural networks trained on synthetic data, owing to the scarcity of real-world paired datasets. However, this often results in domain bias when applied to outdoor scenes. In this paper, we present BSA-Dehaze, an unsupervised single-image dehazing framework that integrates a Multi-Scale Bitemporal Fusion Module (MBFM) and a Size-Aware Decoder (SA-Decoder). The method operates without requiring ground-truth images. Our method reformulates dehazing as a haze-to-clear image translation task. BSA-Dehaze incorporates a novel Encoder-SA-Decoder built with ResNet blocks, designed to better preserve image details and edge sharpness. To enhance feature fusion and training efficiency, we introduce the MBFM. A multi-scale discriminator (MSD) is proposed, along with Hinge Loss and Dynamic Block-wise Contrastive Loss, to improve training stability and emphasize challenging samples. Ablation studies verify the contribution of each component. Experimental results on SOTS outdoor, BeDDE, and a real-world dataset demonstrate that our method surpasses existing approaches in both performance and efficiency, despite being trained on significantly less data.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"166 ","pages":"Article 105819"},"PeriodicalIF":4.2,"publicationDate":"2025-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145600398","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-19 | DOI: 10.1016/j.imavis.2025.105832
Xiaoyan Kui, Zijie Fan, Zexin Ji, Qinsong Li, Chengtao Liu, Weixin Si, Beiji Zou
Magnetic resonance imaging (MRI) reconstruction is a fundamental task aimed at recovering high-quality images from undersampled or low-quality MRI data. This process enhances diagnostic accuracy and optimizes clinical applications. In recent years, deep learning-based MRI reconstruction has made significant progress. Advancements include single-modality feature extraction using different network architectures, the integration of multimodal information, and the adoption of unsupervised or semi-supervised learning strategies. However, despite extensive research, MRI reconstruction remains a challenging problem that has yet to be fully resolved. This survey provides a systematic review of MRI reconstruction methods, covering key aspects such as data acquisition and preprocessing, publicly available datasets, single and multi-modal reconstruction models, training strategies, and evaluation metrics based on image reconstruction and downstream tasks. Additionally, we analyze the major challenges in this field and explore potential future directions.
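To make the notion of reconstructing from undersampled data concrete, the following NumPy sketch simulates Cartesian undersampling of k-space and the zero-filled inverse-FFT baseline that learned reconstruction methods aim to improve on (the phantom and sampling mask are invented for illustration):

import numpy as np

rng = np.random.default_rng(0)
image = np.zeros((128, 128)); image[32:96, 48:80] = 1.0    # toy "anatomy" phantom

kspace = np.fft.fftshift(np.fft.fft2(image))               # fully sampled k-space

# Keep the central 16 phase-encode lines plus a random 25% of the remaining lines.
mask = np.zeros(128, dtype=bool)
mask[56:72] = True
mask |= rng.random(128) < 0.25
undersampled = kspace * mask[:, None]                      # zero out unsampled rows

zero_filled = np.abs(np.fft.ifft2(np.fft.ifftshift(undersampled)))
print("sampled fraction:", mask.mean(), "recon error:", np.abs(zero_filled - image).mean())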
{"title":"A comprehensive survey on magnetic resonance image reconstruction","authors":"Xiaoyan Kui , Zijie Fan , Zexin Ji , Qinsong Li , Chengtao Liu , Weixin Si , Beiji Zou","doi":"10.1016/j.imavis.2025.105832","DOIUrl":"10.1016/j.imavis.2025.105832","url":null,"abstract":"<div><div>Magnetic resonance imaging (MRI) reconstruction is a fundamental task aimed at recovering high-quality images from undersampled or low-quality MRI data. This process enhances diagnostic accuracy and optimizes clinical applications. In recent years, deep learning-based MRI reconstruction has made significant progress. Advancements include single-modality feature extraction using different network architectures, the integration of multimodal information, and the adoption of unsupervised or semi-supervised learning strategies. However, despite extensive research, MRI reconstruction remains a challenging problem that has yet to be fully resolved. This survey provides a systematic review of MRI reconstruction methods, covering key aspects such as data acquisition and preprocessing, publicly available datasets, single and multi-modal reconstruction models, training strategies, and evaluation metrics based on image reconstruction and downstream tasks. Additionally, we analyze the major challenges in this field and explore potential future directions.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105832"},"PeriodicalIF":4.2,"publicationDate":"2025-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145579500","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-19 | DOI: 10.1016/j.imavis.2025.105834
Qianhua Hu, Liantao Wang
Real-time detection in UAV-captured imagery remains a formidable challenge, primarily owing to the inherent tension between high detection performance and strict computational economy. To address this dilemma, we introduce HF-D-FINE, a novel object-detection paradigm that builds upon the D-FINE architecture and comprises three effective innovations. The HF Hybrid Encoder alleviates the loss of fine-grained detail by selectively injecting high-resolution cues from the backbone’s feature pyramid into the encoder, thereby enriching the representation of minute instances. Complementarily, the CAF module performs cross-scale feature fusion by integrating channel-attentive mechanisms and dynamic upsampling, enabling more expressive interactions between multi-level semantics and spatial cues. Finally, Outer-SNWD builds on the advantages of Shape-IoU and NWD by introducing an aspect-ratio consistency penalty factor and auxiliary boxes, making it better suited to tiny object detection. Collectively, these components substantially elevate tiny object detection accuracy while preserving low computational overhead. Extensive experiments on the widely adopted aerial benchmarks VisDrone, AI-TOD, and UAVDT demonstrate that HF-D-FINE achieves superior accuracy with only a tiny increase in FLOPs. On the VisDrone dataset, AP improves by 3.2% and AP50 by 4.3% over D-FINE-S, confirming its efficacy and superiority for tiny object detection in UAV imagery.
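The NWD component mentioned above presumably follows the normalized Gaussian Wasserstein distance commonly used for tiny-object detection, in which each box (cx, cy, w, h) is modeled as a 2-D Gaussian and the closed-form Wasserstein distance is mapped through an exponential; a minimal sketch under that assumption (the constant C is a dataset-dependent hyperparameter, not a value from this paper) is:

import math

def nwd(box_a, box_b, C=12.8):
    # Boxes as (cx, cy, w, h); model each as N([cx, cy], diag((w/2)^2, (h/2)^2)).
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Closed-form 2-Wasserstein distance between the two axis-aligned Gaussians.
    w2_sq = (ax - bx) ** 2 + (ay - by) ** 2 + (aw / 2 - bw / 2) ** 2 + (ah / 2 - bh / 2) ** 2
    return math.exp(-math.sqrt(w2_sq) / C)

# Two nearly overlapping 8x8 "tiny" boxes still receive a smooth, non-zero similarity.
print(nwd((100, 100, 8, 8), (103, 101, 8, 8)))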
{"title":"HF-D-FINE: High-resolution features enhanced D-FINE for tiny object detection in UAV image","authors":"Qianhua Hu, Liantao Wang","doi":"10.1016/j.imavis.2025.105834","DOIUrl":"10.1016/j.imavis.2025.105834","url":null,"abstract":"<div><div>Real-time detection in UAV-captured imagery remains a formidable challenge, primarily attributed to the inherent tension between high detection performance and strict computational economy. To address this dilemma, we introduce HF-D-FINE, a novel object-detection paradigm that builds upon the D-FINE architecture and comprises three effective innovations. The HF Hybrid Encoder alleviates the loss of fine-grained detail by selectively injecting high-resolution cues from the backbone’s feature pyramid into the encoder, thereby enriching the representation of minute instances. Complementarily, the CAF module performs cross-scale feature fusion by integrating channel-attentive mechanisms and dynamic upsampling, enabling more expressive interactions between multi-level semantics and spatial cues. Finally, Outer-SNWD introduces aspect ratio consistency penalty factor and auxiliary boxes based on the advantages of Shape-IoU and NWD, making it more suitable for tiny object detection tasks. Collectively, these components substantially elevate tiny object detection accuracy while preserving low computational overhead. Extensive experiments on the widely-adopted aerial benchmarks VisDrone, AI-TOD, and UAVDT demonstrate that HF-D-FINE achieves superior accuracy with a tiny increase in FLOPs. In the VisDrone dataset, the AP value is increased by 3.2% compared with D-FINE-S, and the AP<sub>50</sub> value is increased by 4.3%, confirming its efficacy and superiority for tiny object detection in UAV image.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105834"},"PeriodicalIF":4.2,"publicationDate":"2025-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145579497","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-19 | DOI: 10.1016/j.imavis.2025.105827
Nouf Nawar Alotaibi, Mrim M. Alnfiai, Mona Mohammed Alnahari, Salma Mohsen M. Alnefaie, Faiz Abdullah Alotaibi
Objective
To develop a robust model that integrates multiple sensor modalities to enhance environmental perception and mobility for visually impaired individuals, improving their autonomy and safety in both indoor and outdoor settings.
Methods
The proposed system utilizes advanced IoT and AI technologies, integrating data from proximity, ambient light, and motion sensors through recursive Bayesian filtering, kernel-based fusion algorithms, and probabilistic graphical models to provide real-time environmental context estimation and motion activity detection. A comprehensive dataset was collected across diverse environments to train and evaluate the model's accuracy. The study was conducted over six months (April 2024 to September 2024) in Saudi Arabia, utilizing resources from Najran University. Data collection involved deploying IoT devices across various indoor and outdoor environments, including residential areas, commercial spaces, and urban streets, under different lighting, weather, and dynamic conditions, to ensure diversity and real-world applicability. The study followed a rigorous training and validation process using the collected dataset, ensuring reliability and scalability across diverse scenarios. Ethical considerations were adhered to throughout the project, with no direct interaction with human subjects.
Results
The proposed model demonstrated an accuracy of 85% in predicting environmental context and 82% in motion detection, achieving precision and F1-scores of 88% and 85%, respectively. Real-time implementation provided reliable, dynamic feedback on environmental changes and motion activities, significantly enhancing situational awareness.
Conclusion
The proposed model effectively combines sensor data to deliver real-time, context-aware assistance for visually impaired individuals, improving their ability to navigate complex environments. The system offers a significant advancement in assistive technology and holds promise for broader applications with further enhancements.
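As a concrete illustration of the recursive Bayesian filtering mentioned in the Methods above, the sketch below fuses a stream of noisy proximity readings into a discrete belief over whether an obstacle is near; the transition and sensor models are invented for illustration and are not taken from the study:

import numpy as np

states = ["near", "far"]
belief = np.array([0.5, 0.5])                       # prior over obstacle distance
transition = np.array([[0.8, 0.2],                  # rows: current state, columns: next state
                       [0.3, 0.7]])
# P(sensor reports "close" | state): the proximity sensor is imperfect.
likelihood_close = np.array([0.9, 0.2])

def bayes_update(belief, saw_close: bool):
    predicted = transition.T @ belief               # predict step (scene dynamics)
    like = likelihood_close if saw_close else 1.0 - likelihood_close
    posterior = like * predicted                    # measurement update
    return posterior / posterior.sum()              # renormalize

for reading in [True, True, False, True]:           # stream of proximity readings
    belief = bayes_update(belief, reading)
    print(dict(zip(states, belief.round(3))))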
{"title":"Advanced fusion of IoT and AI technologies for smart environments: Enhancing environmental perception and mobility solutions for visually impaired individuals","authors":"Nouf Nawar Alotaibi , Mrim M. Alnfiai , Mona Mohammed Alnahari , Salma Mohsen M. Alnefaie , Faiz Abdullah Alotaibi","doi":"10.1016/j.imavis.2025.105827","DOIUrl":"10.1016/j.imavis.2025.105827","url":null,"abstract":"<div><h3>Objective</h3><div>To develop a robust proposed model that integrates multiple sensor modalities to enhance environmental perception and mobility for visually impaired individuals, improving their autonomy and safety in both indoor and outdoor settings.</div></div><div><h3>Methods</h3><div>The proposed system utilizes advanced IoT and AI technologies, integrating data from proximity, ambient light, and motion sensors through recursive Bayesian filtering, kernel-based fusion algorithms, and probabilistic graphical models. A comprehensive dataset was collected across diverse environments to train and evaluate the model's accuracy in real-time environmental context estimation and motion activity detection. This study employed a multidisciplinary approach, integrating the Internet of Things (IoT) and Artificial Intelligence (AI), to develop a proposed model for assisting visually impaired individuals. The study was conducted over six months (April 2024 to September 2024) in Saudi Arabia, utilizing resources from Najran University. Data collection involved deploying IoT devices across various indoor and outdoor environments, including residential areas, commercial spaces, and urban streets, to ensure diversity and real-world applicability. The system utilized proximity sensors, ambient light sensors, and motion detectors to gather data under different lighting, weather, and dynamic conditions. Recursive Bayesian filtering, kernel-based fusion algorithms, and probabilistic graphical models were employed to process the sensor inputs and provide real-time environmental context and motion detection. The study followed a rigorous training and validation process using the collected dataset, ensuring reliability and scalability across diverse scenarios. Ethical considerations were adhered to throughout the project, with no direct interaction with human subjects.</div></div><div><h3>Results</h3><div>The Proposed model demonstrated an accuracy of 85% in predicting environmental context and 82% in motion detection, achieving precision and F1-scores of 88% and 85%, respectively. Real-time implementation provided reliable, dynamic feedback on environmental changes and motion activities, significantly enhancing situational awareness.</div></div><div><h3>Conclusion</h3><div>The Proposed model effectively combines sensor data to deliver real-time, context-aware assistance for visually impaired individuals, improving their ability to navigate complex environments. 
The system offers a significant advancement in assistive technology and holds promise for broader applications with further enhancements.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105827"},"PeriodicalIF":4.2,"publicationDate":"2025-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145579498","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-19 | DOI: 10.1016/j.imavis.2025.105831
Rong Huang, Zhicheng Wang, Hao Liu, Aihua Dong
Image-based virtual try-on, commonly framed as a generative image-to-image translation task, has garnered significant research interest because it eliminates the need for costly 3D scanning devices. In this field, image inpainting and cycle-consistency have been the dominant frameworks, but they still face challenges in cross-attribute adaptation and parameter sharing between try-on networks. This paper proposes a new framework, termed human layout consistency, based on the intuitive insight that a high-quality try-on result should align with a coherent human layout. Under the proposed framework, a try-on network is equipped with an upstream Human Layout Generator (HLG) and a downstream Human Layout Parser (HLP). The former generates an expected human layout as if the person were wearing the selected target garment, while the latter extracts the actual human layout parsed from the try-on result. The supervisory signals, free from ground-truth image pairs, are constructed by assessing the consistency between the expected and actual human layouts. We design a dual-phase training strategy: first warming up the HLG and HLP, then training the try-on network by incorporating the supervisory signals based on human layout consistency. On this basis, the proposed framework enables arbitrary selection of target garments during training, thereby endowing the try-on network with cross-attribute adaptation. Moreover, the proposed framework operates with a single try-on network, rather than two physically separate ones, thereby avoiding the parameter-sharing issue. We conducted both qualitative and quantitative experiments on the benchmark VITON dataset. Experimental results demonstrate that our proposal can generate high-quality try-on results, outperforming baselines by margins of 0.75% to 10.58%. Ablation and visualization results further reveal that the proposed method exhibits superior adaptability to cross-attribute translations, showcasing its potential for practical application.
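One plausible form of the layout-consistency supervisory signal described above, assuming the expected layout from HLG is a per-pixel class distribution and the parsed layout from HLP is per-pixel logits, is a soft pixel-wise cross-entropy; the sketch below is illustrative and not the paper's implementation (shapes and names are hypothetical):

import torch
import torch.nn.functional as F

def layout_consistency_loss(expected_layout: torch.Tensor, parsed_logits: torch.Tensor) -> torch.Tensor:
    # expected_layout: (B, C, H, W) soft class maps from the upstream generator (HLG).
    # parsed_logits:   (B, C, H, W) raw scores from the downstream parser (HLP).
    log_probs = F.log_softmax(parsed_logits, dim=1)
    # Pixel-wise cross-entropy against the expected (soft) layout, no ground-truth pair needed.
    return -(expected_layout * log_probs).sum(dim=1).mean()

# Toy shapes: batch of 2, 7 body-part/garment classes, 64x48 layout maps.
expected = torch.softmax(torch.randn(2, 7, 64, 48), dim=1)
parsed = torch.randn(2, 7, 64, 48)
print(layout_consistency_loss(expected, parsed).item())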
{"title":"A human layout consistency framework for image-based virtual try-on","authors":"Rong Huang , Zhicheng Wang , Hao Liu , Aihua Dong","doi":"10.1016/j.imavis.2025.105831","DOIUrl":"10.1016/j.imavis.2025.105831","url":null,"abstract":"<div><div>Image-based virtual try-on, commonly framed as a generative image-to-image translation task, has garnered significant research interest due to its elimination of the need for costly 3D scanning devices. In this field, image inpainting and cycle-consistency have been the dominant frameworks, but they still face challenges in cross-attribute adaptation and parameter sharing between try-on networks. This paper proposes a new framework, termed human layout consistency, based on the intuitive insight that a high-quality try-on result should align with a coherent human layout. Under the proposed framework, a try-on network is equipped with an upstream Human Layout Generator (HLG) and a downstream Human Layout Parser (HLP). The former generates an expected human layout as if the person were wearing the selected target garment, while the latter extracts an actual human layout parsed from the try-on result. The supervisory signals, free from the ground-truth image pairs, are constructed by assessing the consistencies between the expected and actual human layouts. We design a dual-phase training strategy, first warming up HLG and HLP, then training try-on network by incorporating the supervisory signals based on human layout consistency. On this basis, the proposed framework enables arbitrary selection of target garments during training, thereby endowing the try-on network with the cross-attribute adaptation. Moreover, the proposed framework operates with a single try-on network, rather than two physically separate ones, thereby avoiding the parameter-sharing issue. We conducted both qualitative and quantitative experiments on the benchmark VITON dataset. Experimental results demonstrate that our proposal can generate high-quality try-on results, outperforming baselines by a margin of 0.75% to 10.58%. Ablation and visualization results further reveal that the proposed method exhibits superior adaptability to cross-attribute translations, showcasing its potential for practical application.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105831"},"PeriodicalIF":4.2,"publicationDate":"2025-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145579501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-19 | DOI: 10.1016/j.imavis.2025.105835
Jingyu Wu, Fuming Sun, Haojie Li, Mingyu Lu
Most existing RGB-D salient object detection methods use convolution operations to design complex fusion modules for cross-modal information fusion. How to correctly integrate RGB and depth features into multi-modal features is important for salient object detection (SOD). Differences between the modal features seriously hinder salient object detection models from achieving better performance. To address these issues, we design a multi-modal cooperative fusion network (MCFNet) for RGB-D SOD. Firstly, we propose an edge feature refinement module to remove interference information from shallow features and improve the edge accuracy of SOD. Secondly, a depth optimization module is designed to correct erroneous estimates in the depth maps, which effectively reduces the impact of noise and improves the performance of the model. Finally, we construct a progressive fusion module that gradually integrates RGB and depth features in a layered manner to achieve efficient fusion of cross-modal features. Experimental results on six datasets show that our MCFNet performs better than other state-of-the-art (SOTA) methods, providing new ideas for salient object detection tasks.
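As a rough illustration of gated cross-modal fusion of the kind the abstract describes, the sketch below blends RGB and depth feature maps with a learned channel gate; it is a generic example, not MCFNet's progressive fusion module, and the layer sizes are arbitrary:

import torch
import torch.nn as nn

class GatedRGBDFusion(nn.Module):
    # Fuse RGB and depth feature maps of equal shape with a learned per-channel gate.
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),                        # per-pixel, per-channel weight in [0, 1]
        )
        self.refine = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, rgb_feat, depth_feat):
        g = self.gate(torch.cat([rgb_feat, depth_feat], dim=1))
        fused = g * rgb_feat + (1 - g) * depth_feat   # weighted blend of the two modalities
        return self.refine(fused)

fusion = GatedRGBDFusion(64)
out = fusion(torch.randn(1, 64, 56, 56), torch.randn(1, 64, 56, 56))
print(out.shape)   # torch.Size([1, 64, 56, 56])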
{"title":"Multi-modal cooperative fusion network for dual-stream RGB-D salient object detection","authors":"Jingyu Wu , Fuming Sun , Haojie Li , Mingyu Lu","doi":"10.1016/j.imavis.2025.105835","DOIUrl":"10.1016/j.imavis.2025.105835","url":null,"abstract":"<div><div>Most existing RGB-D salient object detection tasks use convolution operations to design complex fusion modules for cross-modal information fusion. How to correctly integrate RGB and depth features into multi-modal features is important to salient object detection (SOD). Due to the differences between different modal features, the salient object detection model is seriously hindered in achieving better performance. To address the issues mentioned above, we design a multi-modal cooperative fusion network (MCFNet) to achieve RGB-D SOD. Firstly, we propose an edge feature refinement module to remove interference information in shallow features and improve the edge accuracy of SOD. Secondly, a depth optimization module is designed to optimize erroneous estimates in the depth maps, which effectively reduces the impact of noise and improves the performance of the model. Finally, we construct a progressive fusion module that gradually integrates RGB and depth features in a layered manner to achieve an efficient fusion of cross-modal features. Experimental results on six datasets show that our MCFNet performs better than other state-of-the-art (SOTA) methods, which provide new ideas for salient object detection tasks.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"166 ","pages":"Article 105835"},"PeriodicalIF":4.2,"publicationDate":"2025-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145685264","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-17 | DOI: 10.1016/j.imavis.2025.105824
Niu Guo, Yi Liu, Pengcheng Zhang, Jiaqi Kang, Zhiguo Gui, Lei Wang
Current polyp segmentation methods predominantly rely on either standalone Convolutional Neural Networks (CNNs) or Transformer architectures, which exhibit inherent limitations in balancing global–local contextual relationships and preserving high-frequency structural details. To address these challenges, this study proposes a Cross-domain Frequency-enhanced Pyramid Vision Transformer Segmentation Network (CFE-PVTSeg). In the encoder, the network achieves hierarchical feature enhancement by integrating Transformer encoders with wavelet transforms: it separately extracts multi-scale spatial features (based on Pyramid Vision Transformer) and frequency-domain features (based on Discrete Wavelet Transform), reinforcing high-frequency components through a cross-domain fusion mechanism. Simultaneously, deformable convolutions with enhanced adaptability are combined with regular convolutions for stability to aggregate boundary-sensitive features that accommodate the irregular morphological variations of polyps. In the decoder, an innovative Multi-Scale Feature Uncertainty Enhancement (MS-FUE) module is designed, which leverages an uncertainty map derived from the encoder to adaptively weight and refine upsampled features, thereby effectively suppressing uncertain components while enhancing the propagation of reliable information. Finally, through a multi-level fusion strategy, the model outputs refined features that deeply integrate high-level semantics with low-level spatial details. Extensive experiments on five public benchmark datasets (Kvasir-SEG, CVC-ClinicDB, CVC-ColonDB, ETIS, and CVC-300) demonstrate that CFE-PVTSeg achieves superior robustness and segmentation accuracy compared to existing methods when handling challenging scenarios such as scale variations and blurred boundaries. Ablation studies further validate the effectiveness of both the proposed cross-domain enhanced encoder and the uncertainty-driven decoder, particularly in suppressing feature noise and improving morphological adaptability to polyps with heterogeneous appearance characteristics.
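To show the kind of input a wavelet branch consumes, here is a minimal one-level Haar decomposition that splits an image into a low-frequency approximation and three high-frequency detail subbands; this is a generic DWT sketch, not the paper's exact implementation:

import numpy as np

def haar_dwt2(x: np.ndarray):
    # One-level 2-D Haar transform of an H x W array (H, W even).
    # Returns (LL, LH, HL, HH); the detail bands carry the high-frequency structure.
    a = x[0::2, 0::2]; b = x[0::2, 1::2]; c = x[1::2, 0::2]; d = x[1::2, 1::2]
    ll = (a + b + c + d) / 4.0      # approximation (low-pass in both directions)
    lh = (a - b + c - d) / 4.0      # column-difference detail
    hl = (a + b - c - d) / 4.0      # row-difference detail
    hh = (a - b - c + d) / 4.0      # diagonal detail
    return ll, lh, hl, hh

img = np.zeros((64, 64)); img[:, 31:] = 1.0          # a vertical edge
ll, lh, hl, hh = haar_dwt2(img)
# The column-difference band responds to the vertical edge; the diagonal band stays flat.
print(ll.shape, np.abs(lh).max(), np.abs(hh).max())  # (32, 32) 0.5 0.0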
{"title":"CFE-PVTSeg:Cross-domain frequency-enhanced pyramid vision transformer segmentation network","authors":"Niu Guo, Yi Liu, Pengcheng Zhang, Jiaqi Kang, Zhiguo Gui, Lei Wang","doi":"10.1016/j.imavis.2025.105824","DOIUrl":"10.1016/j.imavis.2025.105824","url":null,"abstract":"<div><div>Current polyp segmentation methods predominantly rely on either standalone Convolutional Neural Networks (CNNs) or Transformer architectures, which exhibit inherent limitations in balancing global–local contextual relationships and preserving high-frequency structural details. To address these challenges, this study proposes a Cross-domain Frequency-enhanced Pyramid Vision Transformer Segmentation Network (CFE-PVTSeg). In the encoder, the network achieves hierarchical feature enhancement by integrating Transformer encoders with wavelet transforms: it separately extracts multi-scale spatial features (based on Pyramid Vision Transformer) and frequency-domain features (based on Discrete Wavelet Transform), reinforcing high-frequency components through a cross-domain fusion mechanism. Simultaneously, deformable convolutions with enhanced adaptability are combined with regular convolutions for stability to aggregate boundary-sensitive features that accommodate the irregular morphological variations of polyps. In the decoder, an innovative Multi-Scale Feature Uncertainty Enhancement (MS-FUE) module is designed, which leverages an uncertainty map derived from the encoder to adaptively weight and refine upsampled features, thereby effectively suppressing uncertain components while enhancing the propagation of reliable information. Finally, through a multi-level fusion strategy, the model outputs refined features that deeply integrate high-level semantics with low-level spatial details. Extensive experiments on five public benchmark datasets (Kvasir-SEG, CVC-ClinicDB, CVC-ColonDB, ETIS, and CVC-300) demonstrate that CFE-PVTSeg achieves superior robustness and segmentation accuracy compared to existing methods when handling challenging scenarios such as scale variations and blurred boundaries. Ablation studies further validate the effectiveness of both the proposed cross-domain enhanced encoder and the uncertainty-driven decoder, particularly in suppressing feature noise and improving morphological adaptability to polyps with heterogeneous appearance characteristics.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105824"},"PeriodicalIF":4.2,"publicationDate":"2025-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145579499","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-17 | DOI: 10.1016/j.imavis.2025.105829
Bin Hu, Bencheng Liao, Jiyang Qi, Shusheng Yang, Wenyu Liu
Transformers are revolutionizing the landscape of artificial intelligence, unifying the architecture for natural language processing, computer vision, and more. In this paper, we explore how far a Transformer-based architecture can go for object detection, a fundamental task in computer vision that is applicable across a range of engineering applications. We found that introducing an early detector can improve the performance of detection transformers, allowing them to know where to focus. To this end, we propose a novel attention-map-to-feature-map auxiliary loss and a novel local bipartite matching strategy to obtain, at no extra cost, a BEtter early detector for high-performance detection TRansformer (BETR). On the COCO dataset, BETR adds no more than 6 million parameters to the Swin Transformer backbone, achieving the highest AP and lowest latency among existing fully Transformer-based detectors across different model scales. As a Transformer detector, BETR also demonstrates, for the first time, accuracy, speed, and parameter count on par with the previous state-of-the-art CNN-based GFLV2 framework.
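The bipartite matching underlying DETR-style detectors (and the local variant proposed here) assigns each ground-truth box to exactly one query by minimizing a pairwise cost; below is a generic sketch using SciPy's Hungarian solver with an invented cost combining classification probability and an L1 box distance (the cost weights are illustrative, not the paper's):

import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
num_queries, num_gt = 6, 3
cls_prob = rng.random((num_queries, num_gt))             # predicted prob. of each GT's class
pred_boxes = rng.random((num_queries, 4))                # (cx, cy, w, h) normalized to [0, 1]
gt_boxes = rng.random((num_gt, 4))

l1 = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)   # pairwise box L1 distance
cost = -cls_prob + 5.0 * l1                              # lower cost = better match

query_idx, gt_idx = linear_sum_assignment(cost)          # Hungarian one-to-one matching
for q, g in zip(query_idx, gt_idx):
    print(f"query {q} -> ground-truth {g}, cost {cost[q, g]:.3f}")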
{"title":"Better early detector for high-performance detection transformer","authors":"Bin Hu , Bencheng Liao , Jiyang Qi , Shusheng Yang , Wenyu Liu","doi":"10.1016/j.imavis.2025.105829","DOIUrl":"10.1016/j.imavis.2025.105829","url":null,"abstract":"<div><div>Transformers are revolutionizing the landscape of artificial intelligence, unifying the architecture for natural language processing, computer vision, and more. In this paper, we explore how far a Transformer-based architecture can go for object detection - a fundamental task in computer vision and applicable across a range of engineering applications. We found that introducing an early detector can improve the performance of detection transformers, allowing them to know where to focus. To this end, we propose a novel attention map to feature map auxiliary loss and a novel local bipartite matching strategy to cost-freely get a BEtter early detector for high-performance detection TRansformer (BETR). On the COCO dataset, BETR adds no more than 6 million parameters to the Swin Transformer backbone, achieving the highest AP and latency among existing fully Transformer-based detectors across different model scales. As a Transformer detector, BETR also demonstrates accuracy, speed, and parameters on par with previous state-of-the-art CNN-based GFLV2 framework for the first time.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105829"},"PeriodicalIF":4.2,"publicationDate":"2025-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145579502","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-15 | DOI: 10.1016/j.imavis.2025.105825
Ankit Kumar Titoriya, Maheshwari Prasad Singh, Amit Kumar Singh
Few-shot learning (FSL) has emerged as a promising solution to address the challenge of limited annotated data in medical image classification. However, traditional FSL methods often extract features from only one convolutional layer. This limits their ability to capture detailed spatial, semantic, and contextual information, which is important for accurate classification in complex medical scenarios. To overcome these limitations, this study introduces MedSetFeat++, an improved set feature learning framework with enhanced attention mechanisms, tailored for few-shot medical image classification. It extends the SetFeat architecture by incorporating several key innovations. It uses a multi-head attention mechanism with projections at multiple scales for the query, key, and value, allowing for more detailed feature interactions across different levels. It also includes learnable positional embeddings to preserve spatial information. An adaptive head gating method is added to dynamically control the flow of attention. Additionally, a Convolutional Block Attention Module (CBAM)-based attention module is used to improve focus on the most relevant regions in the data. To evaluate the performance and generalization of MedSetFeat++, extensive experiments were conducted using three different medical imaging datasets: HAM10000, BreakHis at 400× magnification, and Kvasir. Under a 2-way 10-shot 15-query setting, the model achieves 92.17% accuracy on HAM10000, 70.89% on BreakHis, and 73.46% on Kvasir. The proposed model outperforms state-of-the-art methods in multiple 2-way classification tasks under 1-shot, 5-shot, and 10-shot settings. These results establish MedSetFeat++ as a strong and adaptable framework for improving performance in few-shot medical image classification.
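A compact sketch of two ingredients named above, multi-head self-attention over a set of feature tokens plus a learnable positional embedding, is given below; it is a generic PyTorch illustration rather than the MedSetFeat++ code, and all dimensions are arbitrary:

import torch
import torch.nn as nn

class SetAttentionBlock(nn.Module):
    def __init__(self, dim: int = 128, num_tokens: int = 49, num_heads: int = 4):
        super().__init__()
        # Learnable positional embedding keeps track of where each token came from.
        self.pos = nn.Parameter(torch.zeros(1, num_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # tokens: (B, num_tokens, dim)
        x = tokens + self.pos
        attended, _ = self.attn(x, x, x)          # self-attention across the token set
        return self.norm(tokens + attended)       # residual connection + layer norm

block = SetAttentionBlock()
feats = torch.randn(2, 49, 128)                   # e.g. a 7x7 feature map flattened to 49 tokens
print(block(feats).shape)                         # torch.Size([2, 49, 128])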
{"title":"MedSetFeat++: An attention-enriched set feature framework for few-shot medical image classification","authors":"Ankit Kumar Titoriya, Maheshwari Prasad Singh, Amit Kumar Singh","doi":"10.1016/j.imavis.2025.105825","DOIUrl":"10.1016/j.imavis.2025.105825","url":null,"abstract":"<div><div>Few-shot learning (FSL) has emerged as a promising solution to address the challenge of limited annotated data in medical image classification. However, traditional FSL methods often extract features from only one convolutional layer. This limits their ability to capture detailed spatial, semantic, and contextual information, which is important for accurate classification in complex medical scenarios. To overcome these limitations, this study introduces MedSetFeat++, an improved set feature learning framework with enhanced attention mechanisms, tailored for few-shot medical image classification. It extends the SetFeat architecture by incorporating several key innovations. It uses a multi-head attention mechanism with projections at multiple scales for the query, key, and value, allowing for more detailed feature interactions across different levels. It also includes learnable positional embeddings to preserve spatial information. An adaptive head gating method is added to control the flow of attention in a dynamic way. Additionally, a Convolutional Block Attention Module (CBAM) based attention module is used to improve focus on the most relevant regions in the data. To evaluate the performance and generalization of MedSetFeat++, extensive experiments were conducted using three different medical imaging datasets: HAM10000, BreakHis at 400<span><math><mo>×</mo></math></span> magnification, and Kvasir. Under a 2-way 10-shot 15-query setting, the model achieves 92.17% accuracy on HAM10000, 70.89% on BreakHis, and 73.46% on Kvasir. The proposed model outperforms state-of-the-art methods in multiple 2-way classification tasks under 1-shot, 5-shot, and 10-shot settings. These results establish MedSetFeat++ as a strong and adaptable framework for improving performance in few-shot medical image classification.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105825"},"PeriodicalIF":4.2,"publicationDate":"2025-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145579503","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-12 | DOI: 10.1016/j.imavis.2025.105826
Vinayak S. Nageli, Arshad Jamal, Puneet Goyal, Rama Krishna Sai S Gorthi
Object Tracking and Re-Identification (Re-ID) in maritime environments using drone video streams presents significant challenges, especially in search and rescue operations. These challenges mainly arise from the small size of objects at high drone altitudes, sudden movements of the drone’s gimbal, and limited appearance diversity among objects. Frequent occlusion under these challenging conditions makes Re-ID difficult in long-term tracking.
In this work, we present a novel framework, Maritime Object Tracking with Spatial–Temporal and Metadata-based modeling (MOT-STM), designed for robust tracking and re-identification of maritime objects in challenging environments. The proposed framework adapts multi-resolution spatial feature extraction using a Cross-Stage Partial with Full-Stage (C2FDark) backbone combined with temporal modeling via a Video Swin Transformer (VST), enabling effective spatio-temporal representation. This design enhances detection and significantly improves tracking performance in the maritime domain.
We also propose a metadata-driven Re-ID module named Metadata-Assisted Re-ID (MARe-ID), which leverages the drone’s metadata, such as Global Positioning System (GPS) coordinates, altitude, and camera orientation, to enhance long-term tracking. Unlike traditional appearance-based Re-ID, MARe-ID remains effective even in scenarios with limited visual diversity among the tracked objects and is generic enough to be integrated into any state-of-the-art (SotA) multi-object tracking framework as a Re-ID module.
Through extensive experiments on the challenging SeaDronesSee dataset, we demonstrate that MOT-STM significantly outperforms existing methods in maritime object tracking. Our approach achieves state-of-the-art performance, attaining a HOTA score of 70.14% and an IDF1 score of 88.70%, demonstrating the effectiveness and robustness of the proposed MOT-STM framework.
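As a hedged illustration of how drone metadata can supplement appearance for re-identification, the sketch below combines a cosine distance between appearance embeddings with a GPS-derived geographic distance and rejects physically implausible matches; the gating threshold and weights are invented and do not reflect MARe-ID's actual rule:

import math
import numpy as np

def haversine_m(lat1, lon1, lat2, lon2):
    # Great-circle distance in meters between two GPS fixes.
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def reid_cost(emb_a, emb_b, gps_a, gps_b, max_travel_m=150.0, w_geo=0.3):
    # Cosine distance between appearance embeddings, gated and weighted by GPS distance.
    app = 1.0 - np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b))
    geo = haversine_m(*gps_a, *gps_b)
    if geo > max_travel_m:          # physically implausible re-appearance: reject outright
        return math.inf
    return (1 - w_geo) * app + w_geo * (geo / max_travel_m)

a, b = np.random.rand(128), np.random.rand(128)
print(reid_cost(a, b, (54.3300, 10.1400), (54.3305, 10.1402)))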
{"title":"MOT-STM: Maritime Object Tracking: A Spatial-Temporal and Metadata-based approach","authors":"Vinayak S. Nageli , Arshad Jamal , Puneet Goyal , Rama Krishna Sai S Gorthi","doi":"10.1016/j.imavis.2025.105826","DOIUrl":"10.1016/j.imavis.2025.105826","url":null,"abstract":"<div><div>Object Tracking and Re-Identification (Re-ID) in maritime environments using drone video streams presents significant challenges, especially in search and rescue operations. These challenges mainly arise from the small size of objects from high drone altitudes, sudden movements of the drone’s gimbal and limited appearance diversity of objects. The frequent occlusion in these challenging conditions makes Re-ID difficult in long-term tracking.</div><div>In this work, we present a novel framework, Maritime Object Tracking with Spatial–Temporal and Metadata-based modeling (MOT-STM), designed for robust tracking and re-identification of maritime objects in challenging environments. The proposed framework adapts multi-resolution spatial feature extraction using Cross-Stage Partial with Full-Stage (C2FDark) backbone combined with temporal modeling via Video Swin Transformer (VST), enabling effective spatio-temporal representation. This design enhances detection and significantly improves tracking performance in the maritime domain.</div><div>We also propose a metadata-driven Re-ID module named Metadata-Assisted Re-ID (MARe-ID), which leverages drone’s metadata such as Global Positioning System (GPS) coordinates, altitude and camera orientation to enhance long-term tracking. Unlike traditional appearance-based Re-ID, MARe-ID remains effective even in scenarios with limited visual diversity among the tracked objects and is generic enough to be integrated into any State-of-the-Art (SotA) multi-object tracking framework as a Re-ID module.</div><div>Through extensive experiments on the challenging SeaDronesSee dataset, we demonstrate that MOT-STM significantly outperforms existing methods in maritime object tracking. Our approach achieves a state-of-the-art performance attaining a HOTA score of 70.14% and an IDF1 score of 88.70%, showing the effectiveness and robustness of the proposed MOT-STM framework.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105826"},"PeriodicalIF":4.2,"publicationDate":"2025-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145520880","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}