Pub Date: 2025-11-19, DOI: 10.1016/j.imavis.2025.105834
Qianhua Hu, Liantao Wang
Real-time detection in UAV-captured imagery remains a formidable challenge, primarily owing to the inherent tension between high detection performance and strict computational economy. To address this dilemma, we introduce HF-D-FINE, a novel object-detection paradigm that builds upon the D-FINE architecture and comprises three effective innovations. The HF Hybrid Encoder alleviates the loss of fine-grained detail by selectively injecting high-resolution cues from the backbone's feature pyramid into the encoder, thereby enriching the representation of minute instances. Complementarily, the CAF module performs cross-scale feature fusion by integrating channel-attention mechanisms and dynamic upsampling, enabling more expressive interactions between multi-level semantics and spatial cues. Finally, Outer-SNWD introduces an aspect-ratio consistency penalty factor and auxiliary boxes, combining the advantages of Shape-IoU and NWD to make the loss better suited to tiny object detection. Collectively, these components substantially elevate tiny object detection accuracy while preserving low computational overhead. Extensive experiments on the widely adopted aerial benchmarks VisDrone, AI-TOD, and UAVDT demonstrate that HF-D-FINE achieves superior accuracy with only a marginal increase in FLOPs. On the VisDrone dataset, AP improves by 3.2% and AP50 by 4.3% over D-FINE-S, confirming its efficacy and superiority for tiny object detection in UAV imagery.
{"title":"HF-D-FINE: High-resolution features enhanced D-FINE for tiny object detection in UAV image","authors":"Qianhua Hu, Liantao Wang","doi":"10.1016/j.imavis.2025.105834","DOIUrl":"10.1016/j.imavis.2025.105834","url":null,"abstract":"<div><div>Real-time detection in UAV-captured imagery remains a formidable challenge, primarily attributed to the inherent tension between high detection performance and strict computational economy. To address this dilemma, we introduce HF-D-FINE, a novel object-detection paradigm that builds upon the D-FINE architecture and comprises three effective innovations. The HF Hybrid Encoder alleviates the loss of fine-grained detail by selectively injecting high-resolution cues from the backbone’s feature pyramid into the encoder, thereby enriching the representation of minute instances. Complementarily, the CAF module performs cross-scale feature fusion by integrating channel-attentive mechanisms and dynamic upsampling, enabling more expressive interactions between multi-level semantics and spatial cues. Finally, Outer-SNWD introduces aspect ratio consistency penalty factor and auxiliary boxes based on the advantages of Shape-IoU and NWD, making it more suitable for tiny object detection tasks. Collectively, these components substantially elevate tiny object detection accuracy while preserving low computational overhead. Extensive experiments on the widely-adopted aerial benchmarks VisDrone, AI-TOD, and UAVDT demonstrate that HF-D-FINE achieves superior accuracy with a tiny increase in FLOPs. In the VisDrone dataset, the AP value is increased by 3.2% compared with D-FINE-S, and the AP<sub>50</sub> value is increased by 4.3%, confirming its efficacy and superiority for tiny object detection in UAV image.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105834"},"PeriodicalIF":4.2,"publicationDate":"2025-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145579497","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-19, DOI: 10.1016/j.imavis.2025.105827
Nouf Nawar Alotaibi , Mrim M. Alnfiai , Mona Mohammed Alnahari , Salma Mohsen M. Alnefaie , Faiz Abdullah Alotaibi
Objective
To develop a robust model that integrates multiple sensor modalities to enhance environmental perception and mobility for visually impaired individuals, improving their autonomy and safety in both indoor and outdoor settings.
Methods
The proposed system combines Internet of Things (IoT) and Artificial Intelligence (AI) technologies, integrating data from proximity, ambient light, and motion sensors through recursive Bayesian filtering, kernel-based fusion algorithms, and probabilistic graphical models. A comprehensive dataset was collected across diverse environments to train and evaluate the model's accuracy in real-time environmental context estimation and motion activity detection. The study was conducted over six months (April 2024 to September 2024) in Saudi Arabia, using resources from Najran University. Data collection involved deploying IoT devices across various indoor and outdoor environments, including residential areas, commercial spaces, and urban streets, to ensure diversity and real-world applicability. Proximity sensors, ambient light sensors, and motion detectors gathered data under different lighting, weather, and dynamic conditions, and the sensor inputs were processed with the filtering, fusion, and graphical-model techniques above to provide real-time environmental context and motion detection. Training and validation on the collected dataset followed a rigorous protocol to ensure reliability and scalability across diverse scenarios. Ethical guidelines were adhered to throughout the project, which involved no direct interaction with human subjects.
Results
The proposed model demonstrated an accuracy of 85% in predicting environmental context and 82% in motion detection, with a precision of 88% and an F1-score of 85%. Real-time implementation provided reliable, dynamic feedback on environmental changes and motion activities, significantly enhancing situational awareness.
Conclusion
The proposed model effectively combines sensor data to deliver real-time, context-aware assistance for visually impaired individuals, improving their ability to navigate complex environments. The system represents a significant advance in assistive technology and holds promise for broader applications with further enhancements.
{"title":"Advanced fusion of IoT and AI technologies for smart environments: Enhancing environmental perception and mobility solutions for visually impaired individuals","authors":"Nouf Nawar Alotaibi , Mrim M. Alnfiai , Mona Mohammed Alnahari , Salma Mohsen M. Alnefaie , Faiz Abdullah Alotaibi","doi":"10.1016/j.imavis.2025.105827","DOIUrl":"10.1016/j.imavis.2025.105827","url":null,"abstract":"<div><h3>Objective</h3><div>To develop a robust proposed model that integrates multiple sensor modalities to enhance environmental perception and mobility for visually impaired individuals, improving their autonomy and safety in both indoor and outdoor settings.</div></div><div><h3>Methods</h3><div>The proposed system utilizes advanced IoT and AI technologies, integrating data from proximity, ambient light, and motion sensors through recursive Bayesian filtering, kernel-based fusion algorithms, and probabilistic graphical models. A comprehensive dataset was collected across diverse environments to train and evaluate the model's accuracy in real-time environmental context estimation and motion activity detection. This study employed a multidisciplinary approach, integrating the Internet of Things (IoT) and Artificial Intelligence (AI), to develop a proposed model for assisting visually impaired individuals. The study was conducted over six months (April 2024 to September 2024) in Saudi Arabia, utilizing resources from Najran University. Data collection involved deploying IoT devices across various indoor and outdoor environments, including residential areas, commercial spaces, and urban streets, to ensure diversity and real-world applicability. The system utilized proximity sensors, ambient light sensors, and motion detectors to gather data under different lighting, weather, and dynamic conditions. Recursive Bayesian filtering, kernel-based fusion algorithms, and probabilistic graphical models were employed to process the sensor inputs and provide real-time environmental context and motion detection. The study followed a rigorous training and validation process using the collected dataset, ensuring reliability and scalability across diverse scenarios. Ethical considerations were adhered to throughout the project, with no direct interaction with human subjects.</div></div><div><h3>Results</h3><div>The Proposed model demonstrated an accuracy of 85% in predicting environmental context and 82% in motion detection, achieving precision and F1-scores of 88% and 85%, respectively. Real-time implementation provided reliable, dynamic feedback on environmental changes and motion activities, significantly enhancing situational awareness.</div></div><div><h3>Conclusion</h3><div>The Proposed model effectively combines sensor data to deliver real-time, context-aware assistance for visually impaired individuals, improving their ability to navigate complex environments. 
The system offers a significant advancement in assistive technology and holds promise for broader applications with further enhancements.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105827"},"PeriodicalIF":4.2,"publicationDate":"2025-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145579498","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
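As an illustration of the recursive Bayesian filtering step described in the Methods above, the following sketch maintains a discrete belief over environmental contexts and updates it from noisy sensor readings. The context set, transition matrix, and likelihood model are hypothetical placeholders, not the authors' calibrated values.

```python
# Minimal recursive Bayesian filter over discrete environmental contexts.
import numpy as np

CONTEXTS = ["indoor", "outdoor", "street_crossing"]          # hypothetical states
TRANSITION = np.array([[0.90, 0.08, 0.02],                    # P(next | current)
                       [0.08, 0.82, 0.10],
                       [0.05, 0.15, 0.80]])

def likelihood(sensor):
    """Toy likelihood P(sensor | context) from ambient light (lux) and proximity (m)."""
    lux, prox = sensor["lux"], sensor["proximity_m"]
    indoor = np.exp(-lux / 500.0) * (0.9 if prox < 3.0 else 0.4)
    outdoor = (1.0 - np.exp(-lux / 500.0)) * 0.7
    crossing = (1.0 - np.exp(-lux / 500.0)) * (0.8 if prox < 5.0 else 0.2)
    return np.array([indoor, outdoor, crossing])

def bayes_update(belief, sensor):
    """One predict-update cycle: propagate the belief, then reweight by evidence."""
    predicted = TRANSITION.T @ belief
    posterior = likelihood(sensor) * predicted
    return posterior / posterior.sum()

belief = np.ones(3) / 3                                        # uniform prior
for reading in [{"lux": 120, "proximity_m": 1.5}, {"lux": 9000, "proximity_m": 8.0}]:
    belief = bayes_update(belief, reading)
    print(dict(zip(CONTEXTS, belief.round(3))))
```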
Pub Date: 2025-11-19, DOI: 10.1016/j.imavis.2025.105831
Rong Huang , Zhicheng Wang , Hao Liu , Aihua Dong
Image-based virtual try-on, commonly framed as a generative image-to-image translation task, has garnered significant research interest because it eliminates the need for costly 3D scanning devices. In this field, image inpainting and cycle-consistency have been the dominant frameworks, but they still face challenges in cross-attribute adaptation and parameter sharing between try-on networks. This paper proposes a new framework, termed human layout consistency, based on the intuitive insight that a high-quality try-on result should align with a coherent human layout. Under the proposed framework, a try-on network is equipped with an upstream Human Layout Generator (HLG) and a downstream Human Layout Parser (HLP). The former generates the expected human layout as if the person were wearing the selected target garment, while the latter extracts the actual human layout parsed from the try-on result. The supervisory signals, which require no ground-truth image pairs, are constructed by assessing the consistency between the expected and actual human layouts. We design a dual-phase training strategy that first warms up the HLG and HLP and then trains the try-on network with the supervisory signals derived from human layout consistency. On this basis, the proposed framework allows arbitrary selection of target garments during training, thereby endowing the try-on network with cross-attribute adaptation. Moreover, the framework operates with a single try-on network rather than two physically separate ones, thereby avoiding the parameter-sharing issue. We conducted both qualitative and quantitative experiments on the benchmark VITON dataset. Experimental results demonstrate that our proposal generates high-quality try-on results, outperforming baselines by margins of 0.75% to 10.58%. Ablation and visualization results further reveal that the proposed method exhibits superior adaptability to cross-attribute translations, showcasing its potential for practical application.
{"title":"A human layout consistency framework for image-based virtual try-on","authors":"Rong Huang , Zhicheng Wang , Hao Liu , Aihua Dong","doi":"10.1016/j.imavis.2025.105831","DOIUrl":"10.1016/j.imavis.2025.105831","url":null,"abstract":"<div><div>Image-based virtual try-on, commonly framed as a generative image-to-image translation task, has garnered significant research interest due to its elimination of the need for costly 3D scanning devices. In this field, image inpainting and cycle-consistency have been the dominant frameworks, but they still face challenges in cross-attribute adaptation and parameter sharing between try-on networks. This paper proposes a new framework, termed human layout consistency, based on the intuitive insight that a high-quality try-on result should align with a coherent human layout. Under the proposed framework, a try-on network is equipped with an upstream Human Layout Generator (HLG) and a downstream Human Layout Parser (HLP). The former generates an expected human layout as if the person were wearing the selected target garment, while the latter extracts an actual human layout parsed from the try-on result. The supervisory signals, free from the ground-truth image pairs, are constructed by assessing the consistencies between the expected and actual human layouts. We design a dual-phase training strategy, first warming up HLG and HLP, then training try-on network by incorporating the supervisory signals based on human layout consistency. On this basis, the proposed framework enables arbitrary selection of target garments during training, thereby endowing the try-on network with the cross-attribute adaptation. Moreover, the proposed framework operates with a single try-on network, rather than two physically separate ones, thereby avoiding the parameter-sharing issue. We conducted both qualitative and quantitative experiments on the benchmark VITON dataset. Experimental results demonstrate that our proposal can generate high-quality try-on results, outperforming baselines by a margin of 0.75% to 10.58%. Ablation and visualization results further reveal that the proposed method exhibits superior adaptability to cross-attribute translations, showcasing its potential for practical application.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105831"},"PeriodicalIF":4.2,"publicationDate":"2025-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145579501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-19, DOI: 10.1016/j.imavis.2025.105835
Jingyu Wu , Fuming Sun , Haojie Li , Mingyu Lu
Most existing RGB-D salient object detection methods rely on convolution operations to design complex fusion modules for cross-modal information fusion. Correctly integrating RGB and depth features into multi-modal features is important for salient object detection (SOD). Discrepancies between the features of different modalities, however, severely hinder SOD models from achieving better performance. To address these issues, we design a multi-modal cooperative fusion network (MCFNet) for RGB-D SOD. Firstly, we propose an edge feature refinement module to remove interference in shallow features and improve the edge accuracy of SOD. Secondly, a depth optimization module is designed to correct erroneous estimates in the depth maps, which effectively reduces the impact of noise and improves model performance. Finally, we construct a progressive fusion module that gradually integrates RGB and depth features in a layered manner to achieve efficient fusion of cross-modal features. Experimental results on six datasets show that MCFNet outperforms other state-of-the-art (SOTA) methods, providing new ideas for salient object detection.
{"title":"Multi-modal cooperative fusion network for dual-stream RGB-D salient object detection","authors":"Jingyu Wu , Fuming Sun , Haojie Li , Mingyu Lu","doi":"10.1016/j.imavis.2025.105835","DOIUrl":"10.1016/j.imavis.2025.105835","url":null,"abstract":"<div><div>Most existing RGB-D salient object detection tasks use convolution operations to design complex fusion modules for cross-modal information fusion. How to correctly integrate RGB and depth features into multi-modal features is important to salient object detection (SOD). Due to the differences between different modal features, the salient object detection model is seriously hindered in achieving better performance. To address the issues mentioned above, we design a multi-modal cooperative fusion network (MCFNet) to achieve RGB-D SOD. Firstly, we propose an edge feature refinement module to remove interference information in shallow features and improve the edge accuracy of SOD. Secondly, a depth optimization module is designed to optimize erroneous estimates in the depth maps, which effectively reduces the impact of noise and improves the performance of the model. Finally, we construct a progressive fusion module that gradually integrates RGB and depth features in a layered manner to achieve an efficient fusion of cross-modal features. Experimental results on six datasets show that our MCFNet performs better than other state-of-the-art (SOTA) methods, which provide new ideas for salient object detection tasks.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"166 ","pages":"Article 105835"},"PeriodicalIF":4.2,"publicationDate":"2025-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145685264","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-17, DOI: 10.1016/j.imavis.2025.105824
Niu Guo, Yi Liu, Pengcheng Zhang, Jiaqi Kang, Zhiguo Gui, Lei Wang
Current polyp segmentation methods predominantly rely on either standalone Convolutional Neural Networks (CNNs) or Transformer architectures, which exhibit inherent limitations in balancing global–local contextual relationships and preserving high-frequency structural details. To address these challenges, this study proposes a Cross-domain Frequency-enhanced Pyramid Vision Transformer Segmentation Network (CFE-PVTSeg). In the encoder, the network achieves hierarchical feature enhancement by integrating Transformer encoders with wavelet transforms: it separately extracts multi-scale spatial features (based on Pyramid Vision Transformer) and frequency-domain features (based on Discrete Wavelet Transform), reinforcing high-frequency components through a cross-domain fusion mechanism. Simultaneously, deformable convolutions with enhanced adaptability are combined with regular convolutions for stability to aggregate boundary-sensitive features that accommodate the irregular morphological variations of polyps. In the decoder, an innovative Multi-Scale Feature Uncertainty Enhancement (MS-FUE) module is designed, which leverages an uncertainty map derived from the encoder to adaptively weight and refine upsampled features, thereby effectively suppressing uncertain components while enhancing the propagation of reliable information. Finally, through a multi-level fusion strategy, the model outputs refined features that deeply integrate high-level semantics with low-level spatial details. Extensive experiments on five public benchmark datasets (Kvasir-SEG, CVC-ClinicDB, CVC-ColonDB, ETIS, and CVC-300) demonstrate that CFE-PVTSeg achieves superior robustness and segmentation accuracy compared to existing methods when handling challenging scenarios such as scale variations and blurred boundaries. Ablation studies further validate the effectiveness of both the proposed cross-domain enhanced encoder and the uncertainty-driven decoder, particularly in suppressing feature noise and improving morphological adaptability to polyps with heterogeneous appearance characteristics.
{"title":"CFE-PVTSeg:Cross-domain frequency-enhanced pyramid vision transformer segmentation network","authors":"Niu Guo, Yi Liu, Pengcheng Zhang, Jiaqi Kang, Zhiguo Gui, Lei Wang","doi":"10.1016/j.imavis.2025.105824","DOIUrl":"10.1016/j.imavis.2025.105824","url":null,"abstract":"<div><div>Current polyp segmentation methods predominantly rely on either standalone Convolutional Neural Networks (CNNs) or Transformer architectures, which exhibit inherent limitations in balancing global–local contextual relationships and preserving high-frequency structural details. To address these challenges, this study proposes a Cross-domain Frequency-enhanced Pyramid Vision Transformer Segmentation Network (CFE-PVTSeg). In the encoder, the network achieves hierarchical feature enhancement by integrating Transformer encoders with wavelet transforms: it separately extracts multi-scale spatial features (based on Pyramid Vision Transformer) and frequency-domain features (based on Discrete Wavelet Transform), reinforcing high-frequency components through a cross-domain fusion mechanism. Simultaneously, deformable convolutions with enhanced adaptability are combined with regular convolutions for stability to aggregate boundary-sensitive features that accommodate the irregular morphological variations of polyps. In the decoder, an innovative Multi-Scale Feature Uncertainty Enhancement (MS-FUE) module is designed, which leverages an uncertainty map derived from the encoder to adaptively weight and refine upsampled features, thereby effectively suppressing uncertain components while enhancing the propagation of reliable information. Finally, through a multi-level fusion strategy, the model outputs refined features that deeply integrate high-level semantics with low-level spatial details. Extensive experiments on five public benchmark datasets (Kvasir-SEG, CVC-ClinicDB, CVC-ColonDB, ETIS, and CVC-300) demonstrate that CFE-PVTSeg achieves superior robustness and segmentation accuracy compared to existing methods when handling challenging scenarios such as scale variations and blurred boundaries. Ablation studies further validate the effectiveness of both the proposed cross-domain enhanced encoder and the uncertainty-driven decoder, particularly in suppressing feature noise and improving morphological adaptability to polyps with heterogeneous appearance characteristics.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105824"},"PeriodicalIF":4.2,"publicationDate":"2025-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145579499","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-17, DOI: 10.1016/j.imavis.2025.105829
Bin Hu , Bencheng Liao , Jiyang Qi , Shusheng Yang , Wenyu Liu
Transformers are revolutionizing the landscape of artificial intelligence, unifying architectures across natural language processing, computer vision, and beyond. In this paper, we explore how far a Transformer-based architecture can go for object detection, a fundamental computer vision task with applications across a range of engineering domains. We found that introducing an early detector can improve the performance of detection transformers by letting them know where to focus. To this end, we propose a novel attention-map-to-feature-map auxiliary loss and a novel local bipartite matching strategy to obtain, at no extra cost, a BEtter early detector for high-performance detection TRansformer (BETR). On the COCO dataset, BETR adds no more than 6 million parameters to the Swin Transformer backbone and achieves the best AP-latency trade-off among existing fully Transformer-based detectors across different model scales. As a Transformer detector, BETR also demonstrates, for the first time, accuracy, speed, and parameter counts on par with the previous state-of-the-art CNN-based GFLV2 framework.
{"title":"Better early detector for high-performance detection transformer","authors":"Bin Hu , Bencheng Liao , Jiyang Qi , Shusheng Yang , Wenyu Liu","doi":"10.1016/j.imavis.2025.105829","DOIUrl":"10.1016/j.imavis.2025.105829","url":null,"abstract":"<div><div>Transformers are revolutionizing the landscape of artificial intelligence, unifying the architecture for natural language processing, computer vision, and more. In this paper, we explore how far a Transformer-based architecture can go for object detection - a fundamental task in computer vision and applicable across a range of engineering applications. We found that introducing an early detector can improve the performance of detection transformers, allowing them to know where to focus. To this end, we propose a novel attention map to feature map auxiliary loss and a novel local bipartite matching strategy to cost-freely get a BEtter early detector for high-performance detection TRansformer (BETR). On the COCO dataset, BETR adds no more than 6 million parameters to the Swin Transformer backbone, achieving the highest AP and latency among existing fully Transformer-based detectors across different model scales. As a Transformer detector, BETR also demonstrates accuracy, speed, and parameters on par with previous state-of-the-art CNN-based GFLV2 framework for the first time.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105829"},"PeriodicalIF":4.2,"publicationDate":"2025-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145579502","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-15, DOI: 10.1016/j.imavis.2025.105825
Ankit Kumar Titoriya, Maheshwari Prasad Singh, Amit Kumar Singh
Few-shot learning (FSL) has emerged as a promising solution to the challenge of limited annotated data in medical image classification. However, traditional FSL methods often extract features from only one convolutional layer, which limits their ability to capture the detailed spatial, semantic, and contextual information that is important for accurate classification in complex medical scenarios. To overcome these limitations, this study introduces MedSetFeat++, an improved set feature learning framework with enhanced attention mechanisms, tailored for few-shot medical image classification. It extends the SetFeat architecture with several key innovations. It uses a multi-head attention mechanism with query, key, and value projections at multiple scales, allowing more detailed feature interactions across different levels, and includes learnable positional embeddings to preserve spatial information. An adaptive head gating method is added to dynamically control the flow of attention. Additionally, a Convolutional Block Attention Module (CBAM)-based attention block is used to improve focus on the most relevant regions of the data. To evaluate the performance and generalization of MedSetFeat++, extensive experiments were conducted on three medical imaging datasets: HAM10000, BreakHis at 400× magnification, and Kvasir. Under a 2-way 10-shot 15-query setting, the model achieves 92.17% accuracy on HAM10000, 70.89% on BreakHis, and 73.46% on Kvasir. The proposed model outperforms state-of-the-art methods in multiple 2-way classification tasks under 1-shot, 5-shot, and 10-shot settings. These results establish MedSetFeat++ as a strong and adaptable framework for few-shot medical image classification.
{"title":"MedSetFeat++: An attention-enriched set feature framework for few-shot medical image classification","authors":"Ankit Kumar Titoriya, Maheshwari Prasad Singh, Amit Kumar Singh","doi":"10.1016/j.imavis.2025.105825","DOIUrl":"10.1016/j.imavis.2025.105825","url":null,"abstract":"<div><div>Few-shot learning (FSL) has emerged as a promising solution to address the challenge of limited annotated data in medical image classification. However, traditional FSL methods often extract features from only one convolutional layer. This limits their ability to capture detailed spatial, semantic, and contextual information, which is important for accurate classification in complex medical scenarios. To overcome these limitations, this study introduces MedSetFeat++, an improved set feature learning framework with enhanced attention mechanisms, tailored for few-shot medical image classification. It extends the SetFeat architecture by incorporating several key innovations. It uses a multi-head attention mechanism with projections at multiple scales for the query, key, and value, allowing for more detailed feature interactions across different levels. It also includes learnable positional embeddings to preserve spatial information. An adaptive head gating method is added to control the flow of attention in a dynamic way. Additionally, a Convolutional Block Attention Module (CBAM) based attention module is used to improve focus on the most relevant regions in the data. To evaluate the performance and generalization of MedSetFeat++, extensive experiments were conducted using three different medical imaging datasets: HAM10000, BreakHis at 400<span><math><mo>×</mo></math></span> magnification, and Kvasir. Under a 2-way 10-shot 15-query setting, the model achieves 92.17% accuracy on HAM10000, 70.89% on BreakHis, and 73.46% on Kvasir. The proposed model outperforms state-of-the-art methods in multiple 2-way classification tasks under 1-shot, 5-shot, and 10-shot settings. These results establish MedSetFeat++ as a strong and adaptable framework for improving performance in few-shot medical image classification.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105825"},"PeriodicalIF":4.2,"publicationDate":"2025-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145579503","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-12, DOI: 10.1016/j.imavis.2025.105826
Vinayak S. Nageli , Arshad Jamal , Puneet Goyal , Rama Krishna Sai S Gorthi
Object Tracking and Re-Identification (Re-ID) in maritime environments using drone video streams presents significant challenges, especially in search and rescue operations. These challenges mainly arise from the small apparent size of objects at high drone altitudes, sudden movements of the drone's gimbal, and the limited appearance diversity of objects. Frequent occlusion under these conditions makes Re-ID difficult in long-term tracking.
In this work, we present a novel framework, Maritime Object Tracking with Spatial–Temporal and Metadata-based modeling (MOT-STM), designed for robust tracking and re-identification of maritime objects in challenging environments. The proposed framework adopts multi-resolution spatial feature extraction using a Cross-Stage Partial with Full-Stage (C2FDark) backbone combined with temporal modeling via a Video Swin Transformer (VST), enabling effective spatio-temporal representation. This design enhances detection and significantly improves tracking performance in the maritime domain.
We also propose a metadata-driven Re-ID module named Metadata-Assisted Re-ID (MARe-ID), which leverages the drone's metadata, such as Global Positioning System (GPS) coordinates, altitude, and camera orientation, to enhance long-term tracking. Unlike traditional appearance-based Re-ID, MARe-ID remains effective even in scenarios with limited visual diversity among the tracked objects and is generic enough to be integrated as a Re-ID module into any state-of-the-art (SotA) multi-object tracking framework.
Through extensive experiments on the challenging SeaDronesSee dataset, we demonstrate that MOT-STM significantly outperforms existing methods in maritime object tracking. Our approach achieves state-of-the-art performance, attaining a HOTA score of 70.14% and an IDF1 score of 88.70%, demonstrating the effectiveness and robustness of the proposed MOT-STM framework.
{"title":"MOT-STM: Maritime Object Tracking: A Spatial-Temporal and Metadata-based approach","authors":"Vinayak S. Nageli , Arshad Jamal , Puneet Goyal , Rama Krishna Sai S Gorthi","doi":"10.1016/j.imavis.2025.105826","DOIUrl":"10.1016/j.imavis.2025.105826","url":null,"abstract":"<div><div>Object Tracking and Re-Identification (Re-ID) in maritime environments using drone video streams presents significant challenges, especially in search and rescue operations. These challenges mainly arise from the small size of objects from high drone altitudes, sudden movements of the drone’s gimbal and limited appearance diversity of objects. The frequent occlusion in these challenging conditions makes Re-ID difficult in long-term tracking.</div><div>In this work, we present a novel framework, Maritime Object Tracking with Spatial–Temporal and Metadata-based modeling (MOT-STM), designed for robust tracking and re-identification of maritime objects in challenging environments. The proposed framework adapts multi-resolution spatial feature extraction using Cross-Stage Partial with Full-Stage (C2FDark) backbone combined with temporal modeling via Video Swin Transformer (VST), enabling effective spatio-temporal representation. This design enhances detection and significantly improves tracking performance in the maritime domain.</div><div>We also propose a metadata-driven Re-ID module named Metadata-Assisted Re-ID (MARe-ID), which leverages drone’s metadata such as Global Positioning System (GPS) coordinates, altitude and camera orientation to enhance long-term tracking. Unlike traditional appearance-based Re-ID, MARe-ID remains effective even in scenarios with limited visual diversity among the tracked objects and is generic enough to be integrated into any State-of-the-Art (SotA) multi-object tracking framework as a Re-ID module.</div><div>Through extensive experiments on the challenging SeaDronesSee dataset, we demonstrate that MOT-STM significantly outperforms existing methods in maritime object tracking. Our approach achieves a state-of-the-art performance attaining a HOTA score of 70.14% and an IDF1 score of 88.70%, showing the effectiveness and robustness of the proposed MOT-STM framework.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105826"},"PeriodicalIF":4.2,"publicationDate":"2025-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145520880","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-12, DOI: 10.1016/j.imavis.2025.105807
Sotirios Papadopoulos , Emmanouil Patsiouras , Konstantinos Ioannidis , Stefanos Vrochidis , Ioannis Kompatsiaris , Ioannis Patras
Object localization is a fundamental task in computer vision that traditionally requires labeled datasets for accurate results. Recent progress in self-supervised learning has enabled unsupervised object localization, reducing reliance on manual annotations. Unlike supervised encoders, which depend on annotated training data, self-supervised encoders learn semantic representations directly from large collections of unlabeled images. This makes them the natural foundation for unsupervised object localization, as they capture object-relevant features while eliminating the need for costly manual labels. These encoders produce semantically coherent patch embeddings. Grouping these embeddings reveals sets of patches that correspond to objects in an image. These patch sets can be converted into object masks or bounding boxes, enabling tasks such as single-object discovery, multi-object detection, and instance segmentation. By applying off-line mask clustering or using pre-trained vision-language models, unsupervised localization methods can assign semantic labels to discovered objects. This transforms initially class-agnostic objects (objects without class labels) into class-aware ones (objects with class labels), aligning these tasks with their supervised counterparts. This paper provides a structured review of unsupervised object localization methods in both class-agnostic and class-aware settings. In contrast, previous surveys have focused only on class-agnostic localization. We discuss state-of-the-art object discovery strategies based on self-supervised features and provide a detailed comparison of experimental results across a wide range of tasks, datasets, and evaluation metrics.
{"title":"Unsupervised Object Localization driven by self-supervised foundation models: A comprehensive review","authors":"Sotirios Papadopoulos , Emmanouil Patsiouras , Konstantinos Ioannidis , Stefanos Vrochidis , Ioannis Kompatsiaris , Ioannis Patras","doi":"10.1016/j.imavis.2025.105807","DOIUrl":"10.1016/j.imavis.2025.105807","url":null,"abstract":"<div><div>Object localization is a fundamental task in computer vision that traditionally requires labeled datasets for accurate results. Recent progress in self-supervised learning has enabled unsupervised object localization, reducing reliance on manual annotations. Unlike supervised encoders, which depend on annotated training data, self-supervised encoders learn semantic representations directly from large collections of unlabeled images. This makes them the natural foundation for unsupervised object localization, as they capture object-relevant features while eliminating the need for costly manual labels. These encoders produce semantically coherent patch embeddings. Grouping these embeddings reveals sets of patches that correspond to objects in an image. These patch sets can be converted into object masks or bounding boxes, enabling tasks such as single-object discovery, multi-object detection, and instance segmentation. By applying off-line mask clustering or using pre-trained vision-language models, unsupervised localization methods can assign semantic labels to discovered objects. This transforms initially class-agnostic objects (objects without class labels) into class-aware ones (objects with class labels), aligning these tasks with their supervised counterparts. This paper provides a structured review of unsupervised object localization methods in both class-agnostic and class-aware settings. In contrast, previous surveys have focused only on class-agnostic localization. We discuss state-of-the-art object discovery strategies based on self-supervised features and provide a detailed comparison of experimental results across a wide range of tasks, datasets, and evaluation metrics.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105807"},"PeriodicalIF":4.2,"publicationDate":"2025-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145528590","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-10, DOI: 10.1016/j.imavis.2025.105823
Yanfei Liu , Youchang Shi , Yufei Long , Miaosen Xu , Junhua Chen , Yuanqian Li , Hao Wen
Recent facial attribute recognition (FAR) methods often struggle to capture global dependencies and are further challenged by severe class imbalance, large intra-class variations, and high inter-class similarity, ultimately limiting their overall performance. To address these challenges, we propose a hybrid CNN-Transformer network, termed Class Imbalance Transformer-CNN (CI-TransCNN), for facial attribute recognition, which mainly consists of a TransCNN backbone and a Dual Attention Feature Fusion (DAFF) module. In TransCNN, we incorporate Structure Self-Attention (StructSA) to better exploit structural patterns in images and propose an Inverted Residual Convolutional GLU (IRC-GLU) to enhance model robustness. This design enables TransCNN to effectively capture multi-level, multi-scale features while integrating both global and local information. DAFF fuses the features extracted by TransCNN, further improving their discriminability through spatial and channel attention. Moreover, a Class-Imbalance Binary Cross-Entropy (CIBCE) loss is proposed to improve model performance on datasets with class imbalance, large intra-class variation, and high inter-class similarity. Experimental results on the CelebA and LFWA datasets show that our method effectively addresses these issues and achieves superior performance compared with existing state-of-the-art CNN- and Transformer-based FAR approaches.
{"title":"CI-TransCNN: A class imbalance hybrid CNN-Transformer Network for facial attribute recognition","authors":"Yanfei Liu , Youchang Shi , Yufei Long , Miaosen Xu , Junhua Chen , Yuanqian Li , Hao Wen","doi":"10.1016/j.imavis.2025.105823","DOIUrl":"10.1016/j.imavis.2025.105823","url":null,"abstract":"<div><div>Recent facial attribute recognition (FAR) methods often struggle to capture global dependencies and are further challenged by severe class imbalance, large intra-class variations, and high inter-class similarity, ultimately limiting their overall performance. To address these challenges, we propose a network combining CNN and Transformer, termed Class Imbalance Transformer-CNN (CI-TransCNN), for facial attribute recognition, which mainly consists of a TransCNN backbone and a Dual Attention Feature Fusion (DAFF) module. In TransCNN, we incorporate a Structure Self-Attention (StructSA) to improve the utilization of structural patterns in images and propose an Inverted Residual Convolutional GLU (IRC-GLU) to enhance model robustness. This design enables TransCNN to effectively capture multi-level and multi-scale features while integrating both global and local information. DAFF is presented to fuse the features extracted from TransCNN to further improve the feature’s discriminability by using spatial attention and channel attention. Moreover, a Class-Imbalance Binary Cross-Entropy (CIBCE) loss is proposed to improve the model performance on datasets with class imbalance, large intra-class variation, and high inter-class similarity. Experimental results on the CelebA and LFWA datasets show that our method effectively addresses issues such as class imbalance and achieves superior performance compared to existing state-of-the-art CNN- and Transformer-based FAR approaches.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105823"},"PeriodicalIF":4.2,"publicationDate":"2025-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145520878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}