Efficient 6DoF pose estimation for multi-instance objects from a single image
Pub Date: 2026-02-01 | Epub Date: 2025-12-16 | DOI: 10.1016/j.imavis.2025.105882
Wen-Nung Lie, Lee Aing
Estimating 6-degrees-of-freedom (6DoF) poses for multiple objects from a single image, and making this practical in industry, is difficult because several metrics, such as accuracy, speed, and complexity, must be traded off. This study adopts a fast bottom-up approach that estimates poses for multi-instance objects in an image simultaneously. We design a convolutional neural network with simple end-to-end training that outputs four feature maps: an error mask, a semantic mask, a center vector map, and a 6D coordinate map (6DCM). In particular, the 6DCM provides rear-side 3D object point-cloud information that is invisible from the camera's viewpoint. This enriches the shape information of the target objects, which is used to construct each instance's 2D-3D correspondences for pose parameter estimation. Experimental results show that our proposed bottom-up approach is fast, processing a single image containing 7 objects at 25 frames per second with accuracy competitive with top-down methods.
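To make the correspondence-building step concrete, here is a minimal sketch of how a per-pixel 6D coordinate map could be turned into a pose via PnP. The function name, the (H, W, 6) front/rear layout, and the use of OpenCV's RANSAC PnP solver are illustrative assumptions, not the paper's actual interface.

```python
# Hypothetical sketch: recovering a pose from a predicted 6D coordinate map.
import numpy as np
import cv2

def pose_from_6dcm(coord_map, instance_mask, K):
    """coord_map: (H, W, 6) per-pixel front/rear 3D model coordinates (assumed
    layout); instance_mask: (H, W) bool mask of one instance; K: 3x3 intrinsics."""
    ys, xs = np.nonzero(instance_mask)
    pix = np.stack([xs, ys], axis=1).astype(np.float32)   # 2D pixel locations
    front = coord_map[ys, xs, :3].astype(np.float32)      # visible surface points
    rear = coord_map[ys, xs, 3:].astype(np.float32)       # occluded surface points
    # Each pixel contributes two 2D-3D correspondences (front and rear points
    # project to the same pixel), which densifies the PnP problem.
    pts2d = np.concatenate([pix, pix], axis=0)
    pts3d = np.concatenate([front, rear], axis=0)
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(pts3d, pts2d, K, None)
    return (rvec, tvec) if ok else None
```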
HEL-Net: Heterogeneous Ensemble Learning for comprehensive diabetic retinopathy multi-lesion segmentation via Mamba-UNet
Pub Date: 2026-02-01 | Epub Date: 2025-12-16 | DOI: 10.1016/j.imavis.2025.105879
Lingyu Wu, Haiying Xia, Shuxiang Song, Yang Lan
Diabetic Retinopathy (DR) is the leading cause of blindness in adults with diabetes. Early automated detection of DR lesions is crucial for preventing vision loss and assisting ophthalmologists in treatment. However, accurately segmenting multiple types of DR lesions is challenging due to their wide diversity in size, shape, and location, as well as the conflict in feature modeling between local details and long-range dependencies. To address these issues, we propose a novel Heterogeneous Ensemble Learning Network (HEL-Net) specifically designed for four-lesion segmentation. HEL-Net comprises two ensemble stages: the first stage uses Mamba-UNet to generate coarse multi-lesion predictions, which serve as contextual priors for the second stage, forming a multi-perspective lesion navigation strategy. The second stage employs a heterogeneous structure, integrating specialized networks (Mamba-UNet and U-Net) tailored to different lesion characteristics: Mamba-UNet excels at capturing large lesions by modeling long-range dependencies, while U-Net focuses on small lesions with significant local features. The heterogeneous ensemble framework leverages their complementary strengths to promote comprehensive lesion feature learning. Extensive quantitative and qualitative evaluations on two public datasets (IDRiD and DDR) demonstrate that HEL-Net achieves competitive performance compared to state-of-the-art methods, reaching an mAUPR of 69.52%, mDice of 67.40%, and mIoU of 51.99% on the IDRiD dataset.
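The two-stage wiring can be summarized in a few lines. The sketch below is schematic: the submodules, the concatenation of coarse maps as priors, and the learned per-class blend are assumptions about the general recipe, not the authors' exact design.

```python
# Schematic sketch of a two-stage heterogeneous ensemble as described above.
import torch
import torch.nn as nn

class TwoStageEnsemble(nn.Module):
    def __init__(self, stage1, mamba_branch, unet_branch, n_lesions=4):
        super().__init__()
        self.stage1 = stage1              # coarse multi-lesion segmenter
        self.mamba_branch = mamba_branch  # long-range branch (large lesions)
        self.unet_branch = unet_branch    # local-detail branch (small lesions)
        # learned per-class weighting between the two heterogeneous branches
        self.alpha = nn.Parameter(torch.zeros(n_lesions))

    def forward(self, image):
        coarse = self.stage1(image)                          # (B, C, H, W) logits
        prior = torch.cat([image, coarse.sigmoid()], dim=1)  # coarse maps as prior
        a = self.alpha.sigmoid().view(1, -1, 1, 1)
        return a * self.mamba_branch(prior) + (1 - a) * self.unet_branch(prior)
```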
LCFusion: Infrared and visible image fusion network based on local contour enhancement
Pub Date: 2026-02-01 | Epub Date: 2025-12-03 | DOI: 10.1016/j.imavis.2025.105856
Yitong Yang, Lei Zhu, Xinyang Yao, Hua Wang, Yang Pan, Bo Zhang
Infrared and visible image fusion aims to generate integrated representations that preserve both salient thermal targets from the infrared modality and high-resolution textural details from the visible modality. However, existing methods face two core challenges. First, high-frequency noise in visible images, such as sensor noise and nonuniform illumination artifacts, is often tightly coupled with useful textures; traditional fusion paradigms readily amplify this noise while enhancing details, leading to structural distortion and visual graininess in the fused results. Second, mainstream approaches rely predominantly on simple aggregation operations such as feature concatenation or linear weighting, lacking deep modeling of cross-modal semantic correlations; this prevents adaptive interaction and collaborative enhancement of complementary information between modalities, creating a significant trade-off between target saliency and detail preservation. To address these challenges, we propose a dual-branch fusion network based on local contour enhancement. Specifically, it distinguishes and enhances meaningful contour details in a learnable manner while suppressing meaningless noise, purifying the detail information used for fusion at its source. Cross-attention weights are computed from the feature representations extracted by the two modal branches, enabling a feature selection mechanism that supports dynamic cross-modal interaction between infrared and visible information. We evaluate our method against 11 state-of-the-art deep learning-based fusion approaches on four benchmark datasets using both subjective assessments and objective metrics, and the results demonstrate superior performance on public datasets. Furthermore, YOLOv12-based detection tests show that our method achieves higher confidence scores and better overall detection performance than other fusion techniques.
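As a rough illustration of the cross-attention feature selection described above, the following sketch lets visible-branch tokens query infrared-branch tokens; the token layout, head count, and residual wiring are assumptions, not the paper's exact module.

```python
# Minimal cross-modal attention sketch between infrared and visible features.
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feat_ir, feat_vis):
        """feat_ir, feat_vis: (B, N, C) token sequences from the two branches.
        Visible tokens query infrared tokens, so the attention weights decide
        which infrared information is injected into the visible stream."""
        fused, _ = self.attn(query=feat_vis, key=feat_ir, value=feat_ir)
        return feat_vis + fused  # residual keeps the original visible details
```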
Privacy-aware knowledge distillation for retinal scans de-identification through adversarial perturbations
Pub Date: 2026-02-01 | Epub Date: 2025-11-26 | DOI: 10.1016/j.imavis.2025.105849
Keramat Allah Ghaffary, Mohsen Ebrahimi Moghaddam
The rapid growth of retinal scan datasets over the past several years has played a pivotal role in advancing deep learning approaches for retinopathy detection. However, because the retinal vessel pattern is an accurate biometric identifier, sharing this sensitive data raises critical ethical and legal concerns about individual re-identification and privacy violations. To mitigate these concerns, we propose a knowledge distillation-based approach that disrupts identity recognition models by applying imperceptible adversarial perturbations to retinal scans while preserving their utility for medical purposes. The approach uses a multi-objective loss function whose terms guide the generator part of the student network to learn the perturbation needed to balance these conflicting goals. Extensive experiments and explainability analyses are conducted on three public fundus datasets using five deep architectures for identity recognition and retinopathy detection. With an average identity fooling rate of 94.67% and an average retinopathy detection accuracy of 97.52% across multiple unseen models, the proposed approach outperforms state-of-the-art methods in balancing medical utility against patient privacy.
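A minimal sketch of such a multi-objective perturbation loss is given below. The three terms (identity confusion, retinopathy utility, imperceptibility) follow the general recipe described above, but their exact forms and weights are assumptions, not the authors' formulation.

```python
# Sketch of a multi-objective loss balancing privacy against medical utility.
import torch
import torch.nn.functional as F

def privacy_utility_loss(x, x_adv, id_logits, dr_logits, dr_labels,
                         w_id=1.0, w_dr=1.0, w_pix=10.0):
    # (1) push the identity model toward a uniform (maximally uncertain) output
    id_log_probs = F.log_softmax(id_logits, dim=1)
    uniform = torch.full_like(id_log_probs, 1.0 / id_logits.size(1))
    loss_id = F.kl_div(id_log_probs, uniform, reduction="batchmean")
    # (2) keep retinopathy predictions correct on the perturbed scan
    loss_dr = F.cross_entropy(dr_logits, dr_labels)
    # (3) keep the perturbation imperceptible
    loss_pix = F.mse_loss(x_adv, x)
    return w_id * loss_id + w_dr * loss_dr + w_pix * loss_pix
```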
Gaussian landmarks tracking-based real-time splatting reconstruction model
Pub Date: 2026-02-01 | DOI: 10.1016/j.imavis.2025.105869
Donglin Zhu, Zhongli Wang, Xiaoyang Fan, Miao Chen, Jiuyu Chen
Real-time, high-quality scene reconstruction remains a critical challenge for robotics applications. 3D Gaussian Splatting (3DGS) demonstrates remarkable capabilities in scene rendering. However, its integration with SLAM systems confronts two critical limitations: (1) slow pose tracking caused by rendering the full frame multiple times, and (2) susceptibility to environmental variations such as illumination changes and motion blur. To alleviate these issues, this paper proposes a Gaussian landmark-based real-time reconstruction framework, GLT-SLAM, which is composed of a ray casting-driven tracking module, a multi-modal keyframe selector, and an incremental geometric-photometric mapping module. To avoid redundant rendering computations, the tracking module achieves efficient 3D-2D correspondence by encoding rays emitted from Gaussian landmarks and fusing attention scores. Furthermore, to enhance the framework's robustness under complex environmental conditions, the keyframe selector balances multiple influencing factors, including image quality, tracking uncertainty, information entropy, and feature overlap ratios. Finally, to achieve a compact map representation, the mapping module adds only Gaussian primitives of points, lines, and planes, and performs global map optimization through joint photometric-geometric constraints. Experimental results on the Replica, TUM RGB-D, and BJTU datasets show that the proposed method runs at over 30 Hz on an NVIDIA RTX 3090, a 19% efficiency gain over the fastest Photo-SLAM method, while significantly outperforming other baseline methods in both localization and mapping accuracy. The source code will be available on GitHub.
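As a toy illustration of a multi-factor keyframe decision like the one described above, the sketch below combines the four factors named in the abstract into a single score; the weights and threshold are placeholder assumptions.

```python
# Illustrative multi-factor keyframe score; factors come from the abstract,
# while the linear weighting and decision rule are assumptions.
def keyframe_score(image_quality, tracking_uncertainty, entropy, overlap,
                   weights=(0.3, 0.3, 0.2, 0.2)):
    """All inputs normalized to [0, 1]. High quality and entropy argue for
    keeping a frame; high overlap with existing keyframes argues against it."""
    w_q, w_u, w_e, w_o = weights
    return (w_q * image_quality
            + w_u * tracking_uncertainty   # uncertain pose -> anchor a keyframe
            + w_e * entropy                # information-rich view
            + w_o * (1.0 - overlap))       # low overlap -> new scene content

def select_keyframe(factors, threshold=0.6):
    return keyframe_score(*factors) >= threshold
```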
PCNet3D++: A pillar-based cascaded 3D object detection model with an enhanced 2D backbone
Pub Date: 2026-02-01 | Epub Date: 2025-11-29 | DOI: 10.1016/j.imavis.2025.105854
Thurimerla Prasanth, Ram Prasad Padhy, B. Sivaselvan
Autonomous Vehicles (AVs) depend on sophisticated perception systems, a vital component of intelligent transportation, to ensure safe and smooth navigation. Perception enables real-time analysis and understanding of the environment for effective decision-making. Among perception tasks, 3D object detection (3D-OD) is crucial because it accurately determines the 3D geometry and spatial position of surrounding objects. The commonly used modalities for 3D-OD are camera, LiDAR, and sensor fusion. In this work, we propose a LiDAR-based 3D-OD approach using point cloud data that achieves strong performance while maintaining computational efficiency. The approach uses pillar-based LiDAR processing and only 2D convolutions, which keeps the model pipeline simple and efficient. We propose a Cascaded Convolutional Backbone (CCB) integrated with 1 × 1 convolutions to improve detection accuracy, and combine it with fast pillar-based encoding. The reduced complexity makes the model well-suited for real-time AV navigation. We evaluated our model on the official KITTI test server, where it achieves solid results on the 3D and Bird's Eye View (BEV) detection benchmarks for the car and cyclist classes; the results are featured on the official KITTI leaderboard.
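To illustrate the kind of 2D backbone block this implies, here is a sketch of a residual block that pairs a 3 × 3 convolution with a 1 × 1 channel-mixing convolution; the layer sizes and the exact cascade wiring are assumptions, not the published CCB.

```python
# Sketch of a cascaded 2D backbone built from residual blocks with 1x1 convs.
import torch
import torch.nn as nn

class CascadedBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn3 = nn.BatchNorm2d(channels)
        # 1x1 convolution mixes channels cheaply before the residual sum
        self.conv1 = nn.Conv2d(channels, channels, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        y = self.act(self.bn3(self.conv3(x)))
        y = self.bn1(self.conv1(y))
        return self.act(x + y)

# cascade stages by stacking blocks; the input is the pillar pseudo-image
backbone = nn.Sequential(*[CascadedBlock(64) for _ in range(4)])
out = backbone(torch.randn(1, 64, 248, 216))  # example pseudo-image shape
```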
Distinct Polyp Generator Network for polyp segmentation
Pub Date: 2026-02-01 | Epub Date: 2025-11-29 | DOI: 10.1016/j.imavis.2025.105847
Huan Wan, Jing Ai, Jing Liu, Xin Wei, Jinshan Zeng, Jianyi Wan
Accurate polyp segmentation from colonoscopy images is crucial for diagnosing and treating colorectal diseases. Although many automatic polyp segmentation models have been proposed and have made good progress, they still suffer from under-segmentation or over-segmentation caused by two characteristics of colonoscopy images: blurred boundaries and widely varying polyp sizes. To address these problems, we propose a novel model, the Distinct Polyp Generator Network (DPG-Net), for polyp segmentation. DPG-Net introduces a Feature Progressive Enhancement Module (FPEM) and a Dynamical Aggregation Module (DAM). The FPEM enhances polyps and polyp boundaries by jointly utilizing boundary information and global prior information, while the DAM integrates all decoding features according to their individual traits to detect polyps of various sizes, yielding accurate segmentation results. Extensive experiments on five widely used datasets demonstrate that DPG-Net is superior to state-of-the-art models. To evaluate cross-domain generalization, we also apply DPG-Net to the skin lesion segmentation task, where it again achieves strong performance, verifying its generalizability.
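The following sketch shows one way to aggregate multi-scale decoder features with per-level learned gates, in the spirit of the DAM described above; the gating design is an assumption rather than the paper's module.

```python
# Illustrative dynamic aggregation of multi-scale decoder features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicAggregation(nn.Module):
    def __init__(self, channels, n_levels=3):
        super().__init__()
        # one scalar gate per decoder level, predicted from the feature itself
        self.gates = nn.ModuleList(
            [nn.Sequential(nn.AdaptiveAvgPool2d(1),
                           nn.Conv2d(channels, 1, 1)) for _ in range(n_levels)])

    def forward(self, feats):
        """feats: list of (B, C, Hi, Wi) decoder features at different scales."""
        size = feats[0].shape[-2:]
        weighted = []
        for f, gate in zip(feats, self.gates):
            w = torch.sigmoid(gate(f))  # (B, 1, 1, 1) per-level importance
            f = F.interpolate(f, size=size, mode="bilinear", align_corners=False)
            weighted.append(w * f)
        return torch.stack(weighted, dim=0).sum(dim=0)
```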
Combining short-term and long-term memory for robust visual tracking
Pub Date: 2026-02-01 | Epub Date: 2025-11-29 | DOI: 10.1016/j.imavis.2025.105850
Zifan Rui, Xiaoxiao Wang, Yiteng Yang, Guang Han
In visual object tracking, addressing challenges such as target appearance deformation and occlusion has attracted increasing attention. To this end, this paper proposes CSLMTrack, a multiple-memory tracking model that more comprehensively reflects the human memory mechanism. It contains short-term and long-term memory modules, as well as a novel feed-forward network, TFFN, for temporal information aggregation. A dynamic memory update strategy covering memory, information transfer, recall, and forgetting processes is also designed, which effectively avoids memory explosion while integrating memory elements into the tracking network. Extensive experiments on multiple challenging benchmarks demonstrate that CSLMTrack achieves impressive performance, on par with state-of-the-art trackers.
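A toy sketch of a bounded dual-memory update loop of this kind is shown below; the promotion rule, capacities, and confidence threshold are placeholder assumptions, not the CSLMTrack design.

```python
# Bounded short-term/long-term memory with memory, transfer, recall, forgetting.
from collections import deque

class DualMemory:
    def __init__(self, short_cap=5, long_cap=20, promote_conf=0.8):
        self.short = deque(maxlen=short_cap)  # recent appearance features
        self.long = deque(maxlen=long_cap)    # stable, high-confidence features
        self.promote_conf = promote_conf

    def update(self, feature, confidence):
        self.short.append((feature, confidence))     # memory
        if confidence >= self.promote_conf:
            self.long.append((feature, confidence))  # information transfer
        # bounded deques drop the oldest entries automatically -> forgetting

    def recall(self):
        # recall: expose both recent and stable templates to the tracker head
        return [f for f, _ in self.short] + [f for f, _ in self.long]
```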
A deformable registration framework for brain MR images based on a dual-channel fusion strategy using GMamba
Pub Date: 2026-02-01 | Epub Date: 2025-12-07 | DOI: 10.1016/j.imavis.2025.105868
Liwei Deng, Songyu Chen, Xin Yang, Sijuan Huang, Jing Wang
Medical image registration has important applications in medical image analysis. Although deep learning-based registration methods are widely recognized, there is still room to improve existing algorithms given the complex physiological structure of brain images. In this paper, we aim to propose a deformable medical image registration method that is highly accurate and capable of handling complex physiological structures. We therefore propose DFMNet, a dual-channel fusion method based on GMamba, to achieve accurate brain MRI registration. Compared with state-of-the-art networks such as TransMorph, DFMNet has a dual-channel network structure with different fusion strategies. We propose the GMamba block to efficiently capture long-range dependencies in the moving and fixed image features, and a context extraction channel to enhance the texture structure of the image content. In addition, we design a weighted fusion block so that the features of the two channels can be fused efficiently. Extensive experiments on three public brain datasets demonstrate the effectiveness of DFMNet: it outperforms multiple current state-of-the-art deformable registration methods in the structural registration of brain images.
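As a rough sketch of a learned weighted fusion of the two channels, the block below predicts per-voxel softmax weights from the concatenated features; the gating design (and the 3D convolution, assuming volumetric MRI features) is an assumption, not the paper's exact block.

```python
# Learned per-voxel weighting between the two feature channels.
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # predict two per-voxel weight maps from the concatenated channels
        self.gate = nn.Conv3d(2 * channels, 2, kernel_size=3, padding=1)

    def forward(self, feat_mamba, feat_context):
        """feat_mamba: long-range-dependency channel; feat_context: context
        extraction channel. Both (B, C, D, H, W)."""
        w = torch.softmax(self.gate(torch.cat([feat_mamba, feat_context], dim=1)),
                          dim=1)
        return w[:, :1] * feat_mamba + w[:, 1:] * feat_context
```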
PASS: Peer-agreement based sample selection for training with instance dependent noisy labels
Pub Date: 2026-02-01 | Epub Date: 2025-12-16 | DOI: 10.1016/j.imavis.2025.105877
Arpit Garg, Cuong Nguyen, Rafael Felix, Thanh-Toan Do, Gustavo Carneiro
Deep learning encounters significant challenges in the form of noisy-label samples, which can cause trained models to overfit. A primary challenge for learning-with-noisy-labels (LNL) techniques is differentiating hard samples (clean-label samples near the decision boundary) from instance-dependent noisy (IDN) label samples so that the two can be treated differently during training. Existing methodologies for identifying IDN samples, including the small-loss hypothesis and feature-based selection, have demonstrated limited efficacy in dealing with real-world label noise. We present Peer-Agreement-based Sample Selection (PASS), a novel approach that utilises three classifiers: a consensus-driven agreement between two models accurately differentiates clean from noisy IDN samples to train the third model. In contrast to current techniques, PASS is specifically designed to address the complexities of IDN, where noise patterns are correlated with instance features. Our approach integrates seamlessly with existing LNL algorithms to enhance the accuracy of detecting both noisy and clean samples. Comprehensive experiments on simulated benchmarks (CIFAR-100 and Red mini-ImageNet) and real-world datasets (Animal-10N, CIFAR-N, Clothing1M, and mini-WebVision) demonstrate that PASS substantially improves the performance of multiple state-of-the-art methods, achieving superior classification accuracy, particularly in scenarios with high noise levels.
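The core selection rule can be stated compactly. The sketch below partitions a batch by whether the two peer classifiers agree with each other and with the given label; it follows the consensus idea summarized above, though PASS's actual criterion may differ in detail.

```python
# Conceptual peer-agreement split for a batch of possibly noisy labels.
import torch

def peer_agreement_split(logits_a, logits_b, labels):
    """logits_a, logits_b: predictions of the two peer classifiers;
    labels: possibly noisy labels. Returns masks used to train the third model."""
    pred_a = logits_a.argmax(dim=1)
    pred_b = logits_b.argmax(dim=1)
    peers_agree = pred_a == pred_b
    # peers agree with each other AND with the given label -> treat as clean
    clean = peers_agree & (pred_a == labels)
    # peers agree with each other but contradict the label -> likely noisy
    noisy = peers_agree & (pred_a != labels)
    # peers disagree -> ambiguous (e.g., hard samples near the boundary)
    ambiguous = ~peers_agree
    return clean, noisy, ambiguous
```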