SEAGNet: Spatial–Epipolar–Angular–Global feature learning for light field super-resolution
Pub Date: 2025-12-13 | DOI: 10.1016/j.imavis.2025.105866
Xingzheng Wang, Haotian Zhang, Yuhang Lin, Yuanbo Huang, Jiahao Lin
In light field (LF) image super-resolution (SR), comprehensive learning of LF information is crucial for accurately recovering image details. Because the 4D LF structure is complex, current methods typically use specialized convolutions and modules to extract different LF characteristics (such as spatial, angular, and EPI features) separately before combining them. However, these methods emphasize local LF information at the expense of global 4D LF features, which limits further improvement. To overcome this issue, we propose a simple yet effective Global Feature Extraction Module (GFEM), which extracts global information from the entire 4D light field by exploiting these features jointly. We also introduce a Progressive Angular Feature Extractor (PAFE), which gradually expands the feature extraction region to ensure that features are extracted across different angular views. In addition, we design a Spatial Gated Feed-forward Network (SGFN) to replace the standard feed-forward network in the Transformer, yielding our Intra-Transformer architecture, which optimizes feature flow and enhances local detail extraction. Extensive experiments on several public datasets show that our method outperforms existing state-of-the-art approaches.
{"title":"SEAGNet: Spatial–Epipolar–Angular–Global feature learning for light field super-resolution","authors":"Xingzheng Wang, Haotian Zhang, Yuhang Lin, Yuanbo Huang, Jiahao Lin","doi":"10.1016/j.imavis.2025.105866","DOIUrl":"10.1016/j.imavis.2025.105866","url":null,"abstract":"<div><div>In light field (LF) image super-resolution (SR), comprehensive learning of LF information is crucial for accurately recovering image details. Because 4D LF structures are so complex, current methods usually use special convolutions and modules to separately extract different LF characteristics (like spatial, angular, and EPI features) before combining them. But these methods focus too much on local LF information and not enough on global 4D LF features. This makes it hard for them to get better. To overcome this issue, we suggest a straightforward yet effective Global Feature Extraction Module (GFEM). This module extracts the global information from the entire 4D light field. Our method does this by using all of these features together. We also introduce a tool called the Progressive Angular Feature Extractor (PAFE), which gradually expands the area that extracts features to make sure it can extract features at different angles. We also designed a Spatial Gated Feed-forward Network (SGFN) to replace the standard feed-forward network in Transformer. This has resulted in our new Intra-Transformer architecture, which optimizes feature flow and enhances local detail extraction. We did a lot of experiments on different public datasets, and these showed that our method is better than other methods that are currently available.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"166 ","pages":"Article 105866"},"PeriodicalIF":4.2,"publicationDate":"2025-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145790255","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ShadowMamba: State-space model with boundary-region selective scan for shadow removal
Pub Date: 2025-12-12 | DOI: 10.1016/j.imavis.2025.105872
Xiujin Zhu, Chee-Onn Chow, Joon Huang Chuah
Image shadow removal is a typical low-level vision task, as shadows introduce abrupt local brightness variations that degrade the performance of downstream tasks. Due to the quadratic complexity of Transformers, many existing methods adopt local attention to balance accuracy and efficiency. However, restricting attention to local windows prevents true long-range dependency modeling and limits shadow removal performance. Recently, Mamba has shown strong ability in vision tasks by achieving global modeling with linear complexity. Despite this advantage, existing scanning mechanisms in the Mamba architecture are not suitable for shadow removal because they ignore the semantic continuity within the same region. To address this, a boundary-region selective scanning mechanism is proposed that captures local details while enhancing continuity among semantically related pixels, effectively improving shadow removal performance. In addition, a shadow mask denoising preprocessing method is introduced to improve the accuracy of the scanning mechanism and further enhance the data quality. Based on this, this paper presents ShadowMamba, the first Mamba-based model for shadow removal. Experimental results show that the proposed method outperforms existing mainstream approaches on the AISTD, ISTD, SRD, and WSRD+ datasets, and demonstrates good generalization ability in cross-dataset testing on USR and SBU. Meanwhile, the model also has significant advantages in parameter efficiency and computational complexity. Code is available at: https://github.com/ZHUXIUJINChris/ShadowMamba.
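To illustrate what a boundary-region selective ordering might look like, the toy function below partitions pixels into lit interior, shadow interior, and boundary bands from a binary shadow mask and emits a scan order that keeps semantically related pixels contiguous. It is a sketch only; the morphological boundary definition and the region ordering are our assumptions, not ShadowMamba's actual scanning mechanism.

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def boundary_region_scan_order(shadow_mask):
    """Return a pixel ordering that groups semantically related pixels:
    lit interior, shadow interior, then the boundary band.
    shadow_mask: (H, W) boolean array (True = shadow). Illustrative only."""
    boundary = binary_dilation(shadow_mask) & ~binary_erosion(shadow_mask)
    shadow_in = shadow_mask & ~boundary
    lit_in = ~shadow_mask & ~boundary
    order = []
    for region in (lit_in, shadow_in, boundary):
        ys, xs = np.nonzero(region)
        order.extend(zip(ys.tolist(), xs.tolist()))   # raster order within each region
    return order

mask = np.zeros((6, 6), dtype=bool)
mask[2:5, 2:5] = True
print(boundary_region_scan_order(mask)[:5])
```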
{"title":"ShadowMamba: State-space model with boundary-region selective scan for shadow removal","authors":"Xiujin Zhu , Chee-Onn Chow , Joon Huang Chuah","doi":"10.1016/j.imavis.2025.105872","DOIUrl":"10.1016/j.imavis.2025.105872","url":null,"abstract":"<div><div>Image shadow removal is a typical low-level vision task, as shadows introduce abrupt local brightness variations that degrade the performance of downstream tasks. Due to the quadratic complexity of Transformers, many existing methods adopt local attention to balance accuracy and efficiency. However, restricting attention to local windows prevents true long-range dependency modeling and limits shadow removal performance. Recently, Mamba has shown strong ability in vision tasks by achieving global modeling with linear complexity. Despite this advantage, existing scanning mechanisms in the Mamba architecture are not suitable for shadow removal because they ignore the semantic continuity within the same region. To address this, a boundary-region selective scanning mechanism is proposed that captures local details while enhancing continuity among semantically related pixels, effectively improving shadow removal performance. In addition, a shadow mask denoising preprocessing method is introduced to improve the accuracy of the scanning mechanism and further enhance the data quality. Based on this, this paper presents ShadowMamba, the first Mamba-based model for shadow removal. Experimental results show that the proposed method outperforms existing mainstream approaches on the AISTD, ISTD, SRD, and WSRD+ datasets, and demonstrates good generalization ability in cross-dataset testing on USR and SBU. Meanwhile, the model also has significant advantages in parameter efficiency and computational complexity. Code is available at: <span><span>https://github.com/ZHUXIUJINChris/ShadowMamba</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"166 ","pages":"Article 105872"},"PeriodicalIF":4.2,"publicationDate":"2025-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145737140","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LoGA-Attack: Local geometry-aware adversarial attack on 3D point clouds
Pub Date: 2025-12-11 | DOI: 10.1016/j.imavis.2025.105871
Jia Yuan, Jun Chen, Chongshou Li, Pedro Alonso, Xinke Li, Tianrui Li
Adversarial attacks on 3D point clouds are increasingly critical for safety-sensitive domains such as autonomous driving. Most existing methods ignore local geometric structure, yielding perturbations that harm imperceptibility and geometric consistency. We introduce the Local Geometry-Aware adversarial attack (LoGA-Attack), which exploits topological and geometric cues to craft refined perturbations. A Neighborhood Centrality (NC) score partitions points into contour and flat point sets. Contour points receive gradient-based iterative updates to maximize attack strength, while flat points use an Optimal Neighborhood-based Attack (ONA) that projects gradients onto the most consistent local geometric direction. Experiments on ModelNet40 and ScanObjectNN show higher attack success with lower perceptual distortion, demonstrating superior performance and strong transferability. Our code is available at: https://github.com/yuanjiachn/LoGA-Attack.
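As a rough illustration of the contour/flat split, the snippet below scores each point by its distance to the centroid of its k nearest neighbours and thresholds the scores. The exact Neighborhood Centrality definition and the split ratio used in LoGA-Attack are not reproduced here, so treat both as assumptions.

```python
import numpy as np

def neighborhood_centrality(points, k=16):
    """Toy centrality score: distance of each point from the centroid of its
    k nearest neighbours (assumed definition, for illustration).
    points: (N, 3) array."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)   # (N, N) squared distances
    knn = np.argsort(d2, axis=1)[:, 1:k + 1]                        # skip the point itself
    centroids = points[knn].mean(axis=1)                            # (N, 3)
    return np.linalg.norm(points - centroids, axis=1)

def split_contour_flat(points, k=16, ratio=0.2):
    """Top `ratio` of points by centrality -> contour set, the rest -> flat set."""
    nc = neighborhood_centrality(points, k)
    thresh = np.quantile(nc, 1.0 - ratio)
    contour = nc >= thresh
    return np.nonzero(contour)[0], np.nonzero(~contour)[0]

pts = np.random.rand(1024, 3).astype(np.float32)
contour_idx, flat_idx = split_contour_flat(pts)
print(len(contour_idx), len(flat_idx))
```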
{"title":"LoGA-Attack: Local geometry-aware adversarial attack on 3D point clouds","authors":"Jia Yuan , Jun Chen , Chongshou Li , Pedro Alonso , Xinke Li , Tianrui Li","doi":"10.1016/j.imavis.2025.105871","DOIUrl":"10.1016/j.imavis.2025.105871","url":null,"abstract":"<div><div>Adversarial attacks on 3D point clouds are increasingly critical for safety-sensitive domains like autonomous driving. Most existing methods ignore local geometric structure, yielding perturbations that harm imperceptibility and geometric consistency. We introduce local geometry-aware adversarial attack (LoGA-Attack), a local geometry-aware approach that exploits topological and geometric cues to craft refined perturbations. A Neighborhood Centrality (NC) score partitions points into contour and flat points sets. Contour points receive gradient-based iterative updates to maximize attack strength, while flat points use an Optimal Neighborhood-based Attack (ONA) that projects gradients onto the most consistent local geometric direction. Experiments on ModelNet40 and ScanObjectNN show higher attack success with lower perceptual distortion, demonstrating superior performance and strong transferability. Our code is available at: <span><span>https://github.com/yuanjiachn/LoGA-Attack</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"166 ","pages":"Article 105871"},"PeriodicalIF":4.2,"publicationDate":"2025-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145790259","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gaussian landmarks tracking-based real-time splatting reconstruction model
Pub Date: 2025-12-08 | DOI: 10.1016/j.imavis.2025.105869
Donglin Zhu, Zhongli Wang, Xiaoyang Fan, Miao Chen, Jiuyu Chen
Real-time and high-quality scene reconstruction remains a critical challenge for robotics applications. 3D Gaussian Splatting (3DGS) demonstrates remarkable capabilities in scene rendering. However, its integration with SLAM systems faces two critical limitations: (1) slow pose tracking caused by rendering the full frame multiple times, and (2) susceptibility to environmental variations such as illumination changes and motion blur. To alleviate these issues, this paper proposes a Gaussian landmark-based real-time reconstruction framework, GLT-SLAM, which consists of a ray casting-driven tracking module, a multi-modal keyframe selector, and an incremental geometric–photometric mapping module. To avoid redundant rendering computations, the tracking module achieves efficient 3D-2D correspondence by encoding rays emitted from Gaussian landmarks and fusing attention scores. Furthermore, to enhance the framework's robustness against complex environmental conditions, the keyframe selector balances multiple influencing factors, including image quality, tracking uncertainty, information entropy, and feature overlap ratios. Finally, to achieve a compact map representation, the mapping module adds only Gaussian primitives of points, lines, and planes, and performs global map optimization through joint photometric–geometric constraints. Experimental results on the Replica, TUM RGB-D, and BJTU datasets demonstrate that the proposed method achieves a real-time processing rate of over 30 Hz on a platform with an NVIDIA RTX 3090, a 19% higher efficiency than the fastest Photo-SLAM method, while significantly outperforming other baseline methods in both localization and mapping accuracy. The source code will be available on GitHub.
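The following toy function suggests how the multi-modal keyframe cues listed above could be combined into a single score; the weighting, the normalisation, and the entropy helper are illustrative assumptions rather than GLT-SLAM's actual selection rule.

```python
import numpy as np

def image_entropy(gray):
    """Shannon entropy of an 8-bit grayscale frame, normalised to [0, 1]."""
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum()) / 8.0              # log2(256) = 8 is the maximum

def keyframe_score(sharpness, track_uncertainty, entropy, overlap,
                   weights=(0.3, 0.3, 0.2, 0.2)):
    """Combine the four cues into one score; all inputs assumed in [0, 1]."""
    w1, w2, w3, w4 = weights
    # Sharp, information-rich frames with uncertain tracking and little overlap
    # with existing keyframes argue for inserting a new keyframe.
    return w1 * sharpness + w2 * track_uncertainty + w3 * entropy + w4 * (1.0 - overlap)

frame = (np.random.rand(120, 160) * 255).astype(np.uint8)
print(keyframe_score(sharpness=0.8, track_uncertainty=0.4,
                     entropy=image_entropy(frame), overlap=0.6))
```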
{"title":"Gaussian landmarks tracking-based real-time splatting reconstruction model","authors":"Donglin Zhu, Zhongli Wang, Xiaoyang Fan, Miao Chen, Jiuyu Chen","doi":"10.1016/j.imavis.2025.105869","DOIUrl":"10.1016/j.imavis.2025.105869","url":null,"abstract":"<div><div>Real-time and high-quality scene reconstruction remains a critical challenge for robotics applications. 3D Gaussian Splatting (3DGS) demonstrates remarkable capabilities in scene rendering. However, its integration with SLAM systems confronts two critical limitations: (1) slow pose tracking caused by full-frame rendering multiple times, and (2) susceptibility to environmental variations such as illumination variations and motion blur. To alleviate these issues, this paper proposes gaussian landmarks-based real-time reconstruction framework — GLT-SLAM, which composes of a ray casting-driven tracking module, a multi-modal keyframe selector, and an incremental geometric–photometric mapping module. To avoid redundant rendering computations, the tracking module achieves efficient 3D-2D correspondence by encoding Gaussian landmark-emitted rays and fusing attention scores. Furthermore, to enhance the framework’s robustness against complex environmental conditions, the keyframe selector balances multiple influencing factors including image quality, tracking uncertainty, information entropy, and feature overlap ratios. Finally, to achieve a compact map representation, the mapping module adds only Gaussian primitives of points, lines, and planes, and performs global map optimization through joint photometric–geometric constraints. Experimental results on the Replica, TUM RGB-D, and BJTU datasets demonstrate that the proposed method achieves a real-time processing rate of over 30 Hz on a platform with an NVIDIA RTX 3090, demonstrating a 19% higher efficiency than the fastest Photo-SLAM method while significantly outperforming other baseline methods in both localization and mapping accuracy. The source code will be available on GitHub.<span><span><sup>1</sup></span></span></div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"166 ","pages":"Article 105869"},"PeriodicalIF":4.2,"publicationDate":"2025-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145737142","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A deformable registration framework for brain MR images based on a dual-channel fusion strategy using GMamba
Pub Date: 2025-12-07 | DOI: 10.1016/j.imavis.2025.105868
Liwei Deng, Songyu Chen, Xin Yang, Sijuan Huang, Jing Wang
Medical image registration has important applications in medical image analysis. Although deep learning-based registration methods are widely recognized, there is still room for performance improvement in existing algorithms due to the complex physiological structure of brain images. In this paper, we aim to propose a deformable medical image registration method that is highly accurate and capable of handling complex physiological structures. To this end, we propose DFMNet, a dual-channel fusion method based on GMamba, to achieve accurate brain MRI registration. Compared with state-of-the-art networks such as TransMorph, DFMNet has a dual-channel network structure with different fusion strategies. We propose the GMamba block to efficiently capture long-range dependencies in moving and fixed image features. Meanwhile, we propose a context extraction channel to enhance the texture structure of the image content. In addition, we design a weighted fusion block so that features from the two channels can be fused efficiently. Extensive experiments on three public brain datasets demonstrate the effectiveness of DFMNet, showing that it outperforms multiple current state-of-the-art deformable registration methods in structural registration of brain images.
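A weighted fusion block of the kind described above can be sketched as a learned voxel-wise gate over the two feature streams; the gating design below (a 3D convolution followed by a sigmoid) is our assumption, not DFMNet's exact block.

```python
import torch
import torch.nn as nn

class WeightedFusionBlock(nn.Module):
    """Illustrative weighted fusion of two feature streams (e.g. a GMamba channel
    and a context extraction channel) for volumetric brain MR features."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv3d(2 * channels, channels, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, feat_a, feat_b):                        # both: (B, C, D, H, W)
        w = self.gate(torch.cat([feat_a, feat_b], dim=1))     # per-voxel weight in [0, 1]
        return w * feat_a + (1.0 - w) * feat_b                # convex combination of streams

a = torch.randn(1, 16, 8, 32, 32)
b = torch.randn(1, 16, 8, 32, 32)
print(WeightedFusionBlock(16)(a, b).shape)                    # torch.Size([1, 16, 8, 32, 32])
```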
{"title":"A deformable registration framework for brain MR images based on a dual-channel fusion strategy using GMamba","authors":"Liwei Deng , Songyu Chen , Xin Yang , Sijuan Huang , Jing Wang","doi":"10.1016/j.imavis.2025.105868","DOIUrl":"10.1016/j.imavis.2025.105868","url":null,"abstract":"<div><div>Medical image registration has important applications in medical image analysis. Although deep learning-based registration methods are widely recognized, there is still performance improvement space for existing algorithms due to the complex physiological structure of brain images. In this paper, we aim to propose a deformable medical image registration method that is highly accurate and capable of handling complex physiological structures. Therefore, we propose DFMNet, a dual-channel fusion method based on GMamba, to achieve accurate brain MRI image registration. Compared with state-of-the-art networks like TransMorph, DFMNet has a dual-channel network structure with different fusion strategies. We propose the GMamba block to efficiently capture the remote dependencies in moving and fixed image features. Meanwhile, we propose a context extraction channel to enhance the texture structure of the image content. In addition, we designed a weighted fusion block to help the features of the two channels can be fused efficiently. Extensive experiments on three public brain datasets demonstrate the effectiveness of DFMNet. The experimental results demonstrate that DFMNet outperforms multiple current state-of-the-art deformable registration methods in structural registration of brain images.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"166 ","pages":"Article 105868"},"PeriodicalIF":4.2,"publicationDate":"2025-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145737137","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LCFusion: Infrared and visible image fusion network based on local contour enhancement
Pub Date: 2025-12-03 | DOI: 10.1016/j.imavis.2025.105856
Yitong Yang, Lei Zhu, Xinyang Yao, Hua Wang, Yang Pan, Bo Zhang
Infrared and visible light image fusion aims to generate integrated representations that synergistically preserve salient thermal targets in the infrared modality and high-resolution textural details in the visible light modality. However, existing methods face two core challenges: First, high-frequency noise in visible images, such as sensor noise and nonuniform illumination artifacts, is often highly coupled with effective textures. Traditional fusion paradigms readily amplify noise interference while enhancing details, leading to structural distortion and visual graininess in fusion results. Second, mainstream approaches predominantly rely on simple aggregation operations like feature stitching or linear weighting, lacking deep modeling of cross-modal semantic correlations. This prevents adaptive interaction and collaborative enhancement of complementary information between modalities, creating a significant trade-off between target saliency and detail preservation. To address these challenges, we propose a dual-branch fusion network based on local contour enhancement. Specifically, it distinguishes and enhances meaningful contour details in a learnable manner while suppressing meaningless noise, thereby purifying the detail information used for fusion at its source. Cross-attention weights are computed based on feature representations extracted from different modal branches, enabling a feature selection mechanism that facilitates dynamic cross-modal interaction between infrared and visible light information. We evaluate our method against 11 state-of-the-art deep learning-based fusion approaches across four benchmark datasets using both subjective assessments and objective metrics. The experimental results demonstrate superior performance on public datasets. Furthermore, YOLOv12-based detection tests reveal that our method achieves higher confidence scores and better overall detection performance compared to other fusion techniques.
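To illustrate cross-modal interaction driven by attention weights rather than concatenation or linear weighting, the sketch below lets visible-light tokens query the infrared stream; the single-direction layout and the dimensions are assumptions, not LCFusion's actual module.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Sketch of cross-attention between visible and infrared feature maps."""
    def __init__(self, channels=64, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, vis, ir):                  # (B, C, H, W) each
        b, c, h, w = vis.shape
        q = vis.flatten(2).transpose(1, 2)       # (B, HW, C) queries from the visible branch
        kv = ir.flatten(2).transpose(1, 2)       # keys/values from the infrared branch
        fused, _ = self.attn(self.norm(q), self.norm(kv), self.norm(kv))
        return (q + fused).transpose(1, 2).reshape(b, c, h, w)

vis = torch.randn(1, 64, 32, 32)
ir = torch.randn(1, 64, 32, 32)
print(CrossModalAttention()(vis, ir).shape)      # torch.Size([1, 64, 32, 32])
```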
{"title":"LCFusion: Infrared and visible image fusion network based on local contour enhancement","authors":"Yitong Yang , Lei Zhu , Xinyang Yao , Hua Wang , Yang Pan , Bo Zhang","doi":"10.1016/j.imavis.2025.105856","DOIUrl":"10.1016/j.imavis.2025.105856","url":null,"abstract":"<div><div>Infrared and visible light image fusion aims to generate integrated representations that synergistically preserve salient thermal targets in the infrared modality and high-resolution textural details in the visible light modality. However, existing methods face two core challenges: First, high-frequency noise in visible images, such as sensor noise and nonuniform illumination artifacts, is often highly coupled with effective textures. Traditional fusion paradigms readily amplify noise interference while enhancing details, leading to structural distortion and visual graininess in fusion results. Second, mainstream approaches predominantly rely on simple aggregation operations like feature stitching or linear weighting, lacking deep modeling of cross-modal semantic correlations. This prevents adaptive interaction and collaborative enhancement of complementary information between modalities, creating a significant trade-off between target saliency and detail preservation. To address these challenges, we propose a dual-branch fusion network based on local contour enhancement. Specifically, it distinguishes and enhances meaningful contour details in a learnable manner while suppressing meaningless noise, thereby purifying the detail information used for fusion at its source. Cross-attention weights are computed based on feature representations extracted from different modal branches, enabling a feature selection mechanism that facilitates dynamic cross-modal interaction between infrared and visible light information. We evaluate our method against 11 state-of-the-art deep learning-based fusion approaches across four benchmark datasets using both subjective assessments and objective metrics. The experimental results demonstrate superior performance on public datasets. Furthermore, YOLOv12-based detection tests reveal that our method achieves higher confidence scores and better overall detection performance compared to other fusion techniques.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"166 ","pages":"Article 105856"},"PeriodicalIF":4.2,"publicationDate":"2025-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145685263","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
WAM-Net: Wavelet-Based Adaptive Multi-scale Fusion Network for fine-grained action recognition
Pub Date: 2025-12-03 | DOI: 10.1016/j.imavis.2025.105855
Jirui Di, Zhengping Hu, Hehao Zhang, Qiming Zhang, Zhe Sun
Fine-grained actions often lack scene prior information, making strong temporal modeling particularly important. Since these actions primarily rely on subtle and localized motion differences, single-scale features are often insufficient to capture their complexity. In contrast, multi-scale features not only capture fine-grained patterns but also contain rich rhythmic information, which is crucial for modeling temporal dependencies. However, existing methods for processing multi-scale features suffer from two major limitations: they often rely on naive downsampling operations for scale alignment, causing significant structural information loss, and they treat features from different layers equally, without fully exploiting the complementary strengths across hierarchical levels. To address these issues, we propose a novel Wavelet-Based Adaptive Multi-scale Fusion Network (WAM-Net), which consists of three key components: (1) a Wavelet-based Fusion Module (WFM) that achieves feature alignment through wavelet reconstruction, avoiding the structural degradation typically introduced by direct downsampling, (2) an Adaptive Feature Selection Module (AFSM) that dynamically selects and fuses two levels of features based on global information, enabling the network to leverage their complementary advantages, and (3) a Duration Context Encoder (DCE) that extracts temporal duration representations from the overall video length to guide global dependency modeling. Extensive experiments on Diving48, FineGym, and Kinetics-400 demonstrate that our approach consistently outperforms existing state-of-the-art methods.
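As a small illustration of scale alignment through wavelet reconstruction rather than plain resampling, the function below upsamples a coarse map with an inverse DWT while borrowing the fine map's detail bands. It assumes the PyWavelets package and a Haar basis; this is our reading of the idea, and the actual WFM design may differ.

```python
import numpy as np
import pywt

def wavelet_align(coarse, fine, wavelet="haar"):
    """Align a coarse feature map to the fine resolution via inverse DWT,
    reusing the fine map's detail coefficients so structure is preserved.
    coarse: (H/2, W/2), fine: (H, W). Illustrative only."""
    _, details = pywt.dwt2(fine, wavelet)             # (cH, cV, cD) from the fine scale
    aligned = pywt.idwt2((coarse, details), wavelet)  # reconstruct at the fine resolution
    return aligned

fine = np.random.rand(32, 32).astype(np.float32)
coarse = np.random.rand(16, 16).astype(np.float32)
print(wavelet_align(coarse, fine).shape)              # (32, 32)
```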
{"title":"WAM-Net: Wavelet-Based Adaptive Multi-scale Fusion Network for fine-grained action recognition","authors":"Jirui Di, Zhengping Hu, Hehao Zhang, Qiming Zhang, Zhe Sun","doi":"10.1016/j.imavis.2025.105855","DOIUrl":"10.1016/j.imavis.2025.105855","url":null,"abstract":"<div><div>Fine-grained actions often lack scene prior information, making strong temporal modeling particularly important. Since these actions primarily rely on subtle and localized motion differences, single-scale features are often insufficient to capture their complexity. In contrast, multi-scale features not only capture fine-grained patterns but also contain rich rhythmic information, which is crucial for modeling temporal dependencies. However, existing methods for processing multi-scale features suffer from two major limitations: they often rely on naive downsampling operations for scale alignment, causing significant structural information loss, and they treat features from different layers equally, without fully exploiting the complementary strengths across hierarchical levels. To address these issues, we propose a novel Wavelet-Based Adaptive Multi-scale Fusion Network (WAM-Net), which consists of three key components: (1) a Wavelet-based Fusion Module (WFM) that achieves feature alignment through wavelet reconstruction, avoiding the structural degradation typically introduced by direct downsampling, (2) an Adaptive Feature Selection Module (AFSM) that dynamically selects and fuses two levels of features based on global information, enabling the network to leverage their complementary advantages, and (3) a Duration Context Encoder (DCE) that extracts temporal duration representations from the overall video length to guide global dependency modeling. Extensive experiments on Diving48, FineGym, and Kinetics-400 demonstrate that our approach consistently outperforms existing state-of-the-art methods.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"166 ","pages":"Article 105855"},"PeriodicalIF":4.2,"publicationDate":"2025-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145685265","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Enhancing skin cancer classification with Soft Attention and genetic algorithm-optimized ensemble learning
Pub Date: 2025-12-03 | DOI: 10.1016/j.imavis.2025.105848
Vibhav Ranjan, Kuldeep Chaurasia, Jagendra Singh
Skin cancer detection is a critical task in dermatology, where early diagnosis can significantly improve patient outcomes. In this work, we propose a novel approach for skin cancer classification that combines three deep learning models—InceptionResNetV2 with Soft Attention (SA), ResNet50V2 with SA, and DenseNet201—optimized using a Genetic Algorithm (GA) to find the best ensemble weights. The approach integrates several key innovations: Sigmoid Focal Cross-entropy Loss to address class imbalance, Mish activation for improved gradient flow, and Cosine Annealing learning rate scheduling for enhanced convergence. The GA-based optimization fine-tunes the ensemble weights to maximize classification performance, especially for challenging skin cancer types like melanoma. Experimental results on the HAM10000 dataset demonstrate the effectiveness of the proposed ensemble model, achieving superior accuracy and precision compared to individual models. This work offers a robust framework for skin cancer detection, combining state-of-the-art deep learning techniques with an optimization strategy.
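The GA-based weight search can be illustrated with a small NumPy genetic algorithm that evolves convex combination weights over the three models' validation probabilities; the population size, mutation scheme, and fitness function here are placeholders, not the paper's settings.

```python
import numpy as np

def ga_ensemble_weights(probs, labels, pop=30, gens=50, mut=0.1, seed=0):
    """Toy GA searching convex weights for averaging per-model class probabilities.
    probs: (n_models, n_samples, n_classes), labels: (n_samples,)."""
    rng = np.random.default_rng(seed)
    n_models = probs.shape[0]

    def fitness(w):
        fused = np.tensordot(w, probs, axes=1)                # (n_samples, n_classes)
        return (fused.argmax(-1) == labels).mean()            # validation accuracy

    population = rng.random((pop, n_models))
    population /= population.sum(1, keepdims=True)            # keep weights on the simplex
    for _ in range(gens):
        scores = np.array([fitness(w) for w in population])
        parents = population[np.argsort(scores)[-pop // 2:]]  # selection
        children = []
        while len(children) < pop - len(parents):
            a, b = parents[rng.integers(len(parents), size=2)]
            child = (a + b) / 2 + mut * rng.normal(size=n_models)   # crossover + mutation
            children.append(np.clip(child, 1e-6, None))
        population = np.vstack([parents, children])
        population /= population.sum(1, keepdims=True)
    return population[np.argmax([fitness(w) for w in population])]

# Fake validation predictions from three models over 200 samples, 7 classes (HAM10000-like).
probs = np.random.dirichlet(np.ones(7), size=(3, 200))
labels = np.random.randint(0, 7, size=200)
print(ga_ensemble_weights(probs, labels))
```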
{"title":"Enhancing skin cancer classification with Soft Attention and genetic algorithm-optimized ensemble learning","authors":"Vibhav Ranjan, Kuldeep Chaurasia, Jagendra Singh","doi":"10.1016/j.imavis.2025.105848","DOIUrl":"10.1016/j.imavis.2025.105848","url":null,"abstract":"<div><div>Skin cancer detection is a critical task in dermatology, where early diagnosis can significantly improve patient outcomes. In this work, we propose a novel approach for skin cancer classification that combines three deep learning models—InceptionResNetV2 with Soft Attention (SA), ResNet50V2 with SA, and DenseNet201—optimized using a Genetic Algorithm (GA) to find the best ensemble weights. The approach integrates several key innovations: Sigmoid Focal Cross-entropy Loss to address class imbalance, Mish activation for improved gradient flow, and Cosine Annealing learning rate scheduling for enhanced convergence. The GA-based optimization fine-tunes the ensemble weights to maximize classification performance, especially for challenging skin cancer types like melanoma. Experimental results on the HAM10000 dataset demonstrate the effectiveness of the proposed ensemble model, achieving superior accuracy and precision compared to individual models. This work offers a robust framework for skin cancer detection, combining state-of-the-art deep learning techniques with an optimization strategy.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"166 ","pages":"Article 105848"},"PeriodicalIF":4.2,"publicationDate":"2025-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145737138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On the relevance of patch-based extraction methods for monocular depth estimation
Pub Date: 2025-12-03 | DOI: 10.1016/j.imavis.2025.105857
Pasquale Coscia, Antonio Fusillo, Angelo Genovese, Vincenzo Piuri, Fabio Scotti
Scene geometry estimation from images plays a key role in robotics, augmented reality, and autonomous systems. In particular, Monocular Depth Estimation (MDE) focuses on predicting depth from a single RGB image, avoiding the need for expensive sensors. State-of-the-art approaches use deep learning models for MDE while processing images as a whole, exploiting their spatial information sub-optimally. A recent research direction focuses on smaller image patches, as depth information varies across different regions of an image. This approach reduces model complexity and improves performance by capturing finer spatial details. From this perspective, we propose a novel warp patch-based extraction method which corrects perspective camera distortions, and employ it in tailored training and inference pipelines. Our experimental results show that our patch-based approach outperforms its full-image-trained counterpart and the classical crop patch-based extraction. With our technique, we obtain general performance enhancements over recent state-of-the-art models. Code is available at https://github.com/AntonioFusillo/PatchMDE.
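For context, the snippet below shows the classical crop-patch inference baseline that the paper improves on: the image is split into overlapping crops, a depth predictor runs per crop, and overlapping predictions are averaged. The proposed warp-based extraction, which corrects perspective distortion, is not reproduced here; the helper names and patch sizes are our placeholders.

```python
import numpy as np

def _positions(length, patch, stride):
    pos = list(range(0, length - patch + 1, stride))
    if pos[-1] != length - patch:                 # make sure the image border is covered
        pos.append(length - patch)
    return pos

def patchwise_depth(image, predict, patch=128, stride=96):
    """Split the image into overlapping crops, predict depth per crop, and
    average the overlaps back into a full-resolution map."""
    h, w, _ = image.shape
    depth = np.zeros((h, w))
    weight = np.zeros((h, w))
    for y in _positions(h, patch, stride):
        for x in _positions(w, patch, stride):
            crop = image[y:y + patch, x:x + patch]
            depth[y:y + patch, x:x + patch] += predict(crop)
            weight[y:y + patch, x:x + patch] += 1.0
    return depth / weight

img = np.random.rand(256, 320, 3)
fake_model = lambda p: p.mean(axis=2)             # stand-in for a monocular depth network
print(patchwise_depth(img, fake_model).shape)     # (256, 320)
```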
{"title":"On the relevance of patch-based extraction methods for monocular depth estimation","authors":"Pasquale Coscia, Antonio Fusillo, Angelo Genovese, Vincenzo Piuri, Fabio Scotti","doi":"10.1016/j.imavis.2025.105857","DOIUrl":"10.1016/j.imavis.2025.105857","url":null,"abstract":"<div><div>Scene geometry estimation from images plays a key role in robotics, augmented reality, and autonomous systems. In particular, Monocular Depth Estimation (MDE) focuses on predicting depth using a single RGB image, avoiding the need for expensive sensors. State-of-the-art approaches use deep learning models for MDE while processing images as a whole, sub-optimally exploiting their spatial information. A recent research direction focuses on smaller image patches, as depth information varies across different regions of an image. This approach reduces model complexity and improves performance by capturing finer spatial details. From this perspective, we propose a novel warp patch-based extraction method which corrects perspective camera distortions, and employ it in tailored training and inference pipelines. Our experimental results show that our patch-based approach outperforms its full-image-trained counterpart and the classical crop patch-based extraction. With our technique, we obtain a general performance enhancements over recent state-of-the-art models. Code is available at <span><span>https://github.com/AntonioFusillo/PatchMDE</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"166 ","pages":"Article 105857"},"PeriodicalIF":4.2,"publicationDate":"2025-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145737139","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A robust and secure video recovery scheme with deep compressive sensing
Pub Date: 2025-12-01 | DOI: 10.1016/j.imavis.2025.105853
Jagannath Sethi, Jaydeb Bhaumik, Ananda S. Chowdhury
In this paper, we propose a secure, high-quality video recovery scheme that can be useful for diverse applications such as telemedicine and cloud-based surveillance. Our solution consists of deep learning-based video Compressive Sensing (CS) followed by a strategy for encrypting the compressed video. We split a video into a number of Groups Of Pictures (GOPs), where each GOP consists of both keyframes and non-keyframes. The proposed video CS method uses a convolutional neural network (CNN) with a Structural Similarity Index Measure (SSIM) based loss function. Our recovery process has two stages. In the initial recovery stage, a CNN is employed to make efficient use of spatial redundancy. In the deep recovery stage, non-keyframes are compensated by utilizing both keyframes and neighboring non-keyframes: keyframes use multilevel feature compensation, and neighboring non-keyframes use single-level feature compensation. Additionally, we propose an unpredictable and complex chaotic map with a broader chaotic range, termed the Sine Symbolic Chaotic Map (SSCM). For encrypting the compressed features, we suggest a secure encryption scheme consisting of four operations: Forward Diffusion, Substitution, Backward Diffusion, and XORing with an SSCM-based chaotic sequence. Through extensive experimentation, we establish the efficacy of our combined solution over (i) several state-of-the-art image and video CS methods and (ii) a number of video encryption techniques.
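The encryption stage can be sketched as the four named operations applied to the compressed feature bytes. Since the SSCM definition is not given in the abstract, a plain sine map stands in for the chaotic sequence; all parameters are placeholders and decryption is omitted.

```python
import numpy as np

def sine_keystream(n, x0=0.6137, mu=0.99, burn=200):
    """Stand-in chaotic keystream from a plain sine map (not the paper's SSCM)."""
    x, out = x0, np.empty(n, dtype=np.uint8)
    for _ in range(burn):                           # discard transients
        x = mu * np.sin(np.pi * x)
    for i in range(n):
        x = mu * np.sin(np.pi * x)
        out[i] = int(abs(x) * 1e6) % 256
    return out

def encrypt_features(feat_bytes, key_x0=0.6137):
    """Toy version of the four-stage scheme: forward diffusion, substitution,
    backward diffusion, then XOR with the chaotic sequence."""
    p = feat_bytes.astype(np.uint16)
    ks = sine_keystream(p.size, x0=key_x0)
    c = p.copy()
    for i in range(1, c.size):                      # 1) forward diffusion
        c[i] = (c[i] + c[i - 1]) % 256
    sbox = np.argsort(sine_keystream(256, x0=key_x0 / 2)).astype(np.uint16)
    c = sbox[c]                                     # 2) substitution via keystream-derived S-box
    for i in range(c.size - 2, -1, -1):             # 3) backward diffusion
        c[i] = (c[i] + c[i + 1]) % 256
    return c.astype(np.uint8) ^ ks                  # 4) XOR with the chaotic keystream

feats = (np.random.rand(1024) * 255).astype(np.uint8)   # stand-in compressed measurements
print(encrypt_features(feats)[:8])
```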
{"title":"A robust and secure video recovery scheme with deep compressive sensing","authors":"Jagannath Sethi , Jaydeb Bhaumik , Ananda S. Chowdhury","doi":"10.1016/j.imavis.2025.105853","DOIUrl":"10.1016/j.imavis.2025.105853","url":null,"abstract":"<div><div>In this paper, we propose a secure high quality video recovery scheme which can be useful for diverse applications like telemedicine and cloud-based surveillance. Our solution consists of deep learning-based video Compressive Sensing (CS) followed by a strategy for encrypting the compressed video. We split a video into a number of Groups Of Pictures (GOPs), where, each GOP consists of both keyframes and non-keyframes. The proposed video CS method uses a convolutional neural network (CNN) with a Structural Similarity Index Measure (SSIM) based loss function. Our recovery process has two stages. In the initial recovery stage, CNN is employed to make efficient use of spatial redundancy. In the deep recovery stage, non-keyframes are compensated by utilizing both keyframes and neighboring non-keyframes. Keyframes use multilevel feature compensation, and neighboring non-keyframes use single-level feature compensation. Additionally, we propose an unpredictable and complex chaotic map, with a broader chaotic range, termed as Sine Symbolic Chaotic Map (SSCM). For encrypting compressed features, we suggest a secure encryption scheme consisting of four operations: Forward Diffusion, Substitution, Backward Diffusion, and XORing with SSCM based chaotic sequence. Through extensive experimentation, we establish the efficacy of our combined solution over i) several state-of-the-art image and video CS methods, and ii) a number of video encryption techniques.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"166 ","pages":"Article 105853"},"PeriodicalIF":4.2,"publicationDate":"2025-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145685309","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}