Statistic temporal checking and spatial consistency based 3D size reconstruction of multiple objects from indoor monocular videos
Pub Date: 2026-03-01 | Epub Date: 2025-12-27 | DOI: 10.1016/j.imavis.2025.105890
Ziyue Wang, Xina Cheng, Takeshi Ikenaga
Reconstructing accurate 3D sizes of multiple objects from indoor monocular videos has become a significant topic for robotics, smart homes, and wireless signal analysis. However, existing monocular reconstruction pipelines often focus on surface or 3D bounding box reconstruction, yielding unreliable size estimates under occlusion, missing depth, and incomplete visibility. To accurately reconstruct the real size of objects of different shapes under complex indoor conditions, this work proposes a statistic temporal checking module combined with depth layering and spatial consistency checking for accurate object size reconstruction. First, statistic temporal checking removes outliers around the object region by tracking how frequently semantic feature points fall in the foreground versus the background. Second, depth layering provides a depth prior that sharpens object boundaries and increases 3D reconstruction accuracy. Then, a semantic-guided spatial consistency checking module infers the hidden or occluded parts of objects by exploiting category-specific priors and spatial consistency. The inferred complete object boundaries are enclosed using surface fitting and volumetric filling, producing a final volumetric occupancy estimate for each individual object. Extensive experiments demonstrate that the proposed method achieves an error rate of 0.3137, approximately 0.5641 lower than the average.
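A hypothetical sketch of the statistic temporal checking idea described above: a tracked feature point is kept only if, across the video, it falls inside the object's semantic mask often enough. The function name, data layout, and the 0.7 threshold are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def temporal_outlier_filter(points_per_frame, masks, keep_ratio=0.7):
    """points_per_frame: list of (N, 2) int arrays of tracked pixel coords
    (same N points tracked in every frame). masks: list of (H, W) bool
    semantic masks. Returns a boolean keep-flag per point."""
    n_frames = len(points_per_frame)
    n_points = points_per_frame[0].shape[0]
    fg_counts = np.zeros(n_points)
    for pts, mask in zip(points_per_frame, masks):
        xs, ys = pts[:, 0], pts[:, 1]
        inside = (xs >= 0) & (xs < mask.shape[1]) & (ys >= 0) & (ys < mask.shape[0])
        fg = np.zeros(n_points, dtype=bool)
        fg[inside] = mask[ys[inside], xs[inside]]   # point lies in foreground mask
        fg_counts += fg
    fg_prob = fg_counts / n_frames                  # empirical foreground probability
    return fg_prob >= keep_ratio                    # drop statistically unstable points
```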
{"title":"Statistic temporal checking and spatial consistency based 3D size reconstruction of multiple objects from indoor monocular videos","authors":"Ziyue Wang , Xina Cheng , Takeshi Ikenaga","doi":"10.1016/j.imavis.2025.105890","DOIUrl":"10.1016/j.imavis.2025.105890","url":null,"abstract":"<div><div>Reconstructing accurate 3D sizes of multiple objects from indoor monocular videos has gradually become a significant topic for robotics, smart homes, and wireless signal analysis. However, existing monocular reconstruction pipelines often focus on the surface or 3D bounding box reconstruction of objects, making unreliable size estimation due to occlusion, missing depth, and incomplete visibility. To accurately reconstruct the real size of objects in different shapes under complex indoor conditions, this work proposes statistic checking module with depth layering and spatial consistency checking for accurate object size reconstruction. First, by checking the frequency of feature points from the semantic information, statistic temporal checking is used to remove outliers around object region by checking the probability of foreground and background region. Secondly, depth layering provides depth prior, which helps to enhance the boundary of objects and increases the 3D reconstruction accuracy. Then, a semantic-guided spatial consistency checking module infers the hidden or occluded parts of objects by exploiting category-specific priors and spatial consistency. The inferred complete object boundaries are enclosed using surface fitting and volumetric filling, resulting in final volumetric occupancy estimates for each individual object. Extensive experiments demonstrate that the proposed method achieves 0.3137 error rate, which is approximately 0.5641 lower than the average.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"167 ","pages":"Article 105890"},"PeriodicalIF":4.2,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145885582","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MoMIL: Multi-order enhanced multiple instance learning for computational pathology
Pub Date: 2026-03-01 | Epub Date: 2026-01-28 | DOI: 10.1016/j.imavis.2026.105918
Yuqi Zhang, Xiaoqian Zhang, Jiakai Wang, Baoyu Liang, Yuancheng Yang, Chao Tong
Computational pathology (CPath) has significantly advanced the clinical practice of pathology. Despite this progress, Multiple Instance Learning (MIL), a promising paradigm within CPath, continues to face challenges, especially structural fixation and incomplete information utilization. To address these limitations, we propose a novel MIL framework named Multi-order MIL (MoMIL). Our framework uses the SSD model to perform long-sequence modeling on multi-order WSI patches and combines lightweight feature fusion to exploit feature information more comprehensively. The framework supports the fusion of a broad range of features and is highly flexible, allowing expansion to meet specific usage requirements. Additionally, we introduce a sequence transformation method designed specifically for WSIs; it adapts to different WSI sizes and captures additional feature expression, enabling more effective exploitation of sequential cues. Extensive experiments demonstrate that MoMIL surpasses state-of-the-art MIL methods, with improvements of up to 0.027 AUC for cancer sub-typing. We conducted extensive experiments on three downstream tasks across five datasets, achieving improvements in all performance metrics. The code is available at https://github.com/YuqiZhang-Buaa/MoMIL.
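A toy sketch of the multi-order fusion idea: instance features extracted at several "orders" (e.g. patch scales) are each run through a long-sequence model, then combined by a lightweight learned gate. A GRU stands in here for the paper's SSD (Mamba-2) backbone purely for illustration; all class names and sizes are assumptions.

```python
import torch
import torch.nn as nn

class MultiOrderFusion(nn.Module):
    def __init__(self, dim=256, n_orders=3):
        super().__init__()
        # one sequence model per order (GRU as an SSD stand-in)
        self.seq_models = nn.ModuleList(
            [nn.GRU(dim, dim, batch_first=True) for _ in range(n_orders)])
        self.gate = nn.Linear(dim * n_orders, n_orders)  # lightweight fusion weights
        self.head = nn.Linear(dim, 2)                    # e.g. cancer sub-type logits

    def forward(self, order_seqs):
        # order_seqs: list of (B, L_k, dim) instance sequences, one per order
        pooled = []
        for seq, m in zip(order_seqs, self.seq_models):
            out, _ = m(seq)
            pooled.append(out.mean(dim=1))               # bag-level summary per order
        w = torch.softmax(self.gate(torch.cat(pooled, dim=-1)), dim=-1)
        fused = sum(w[:, k:k + 1] * pooled[k] for k in range(len(pooled)))
        return self.head(fused)

logits = MultiOrderFusion()([torch.randn(2, 50, 256) for _ in range(3)])
```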
{"title":"MoMIL: Multi-order enhanced multiple instance learning for computational pathology","authors":"Yuqi Zhang , Xiaoqian Zhang , Jiakai Wang , Baoyu Liang , Yuancheng Yang , Chao Tong","doi":"10.1016/j.imavis.2026.105918","DOIUrl":"10.1016/j.imavis.2026.105918","url":null,"abstract":"<div><div>Computational pathology (CPath) has significantly advanced the clinical practice of pathology. Despite the progress made, Multiple Instance Learning (MIL), a promising paradigm within CPath, continues to face challenges, especially those related to structural fixation and incomplete information utilization. To address these limitations, we propose a novel MIL framework named Multi-order MIL (MoMIL). Our framework utilizes the SSD model to perform long-sequence modeling on multi-order WSI patches and combines lightweight feature fusion to achieve more comprehensive feature information utilization. This framework supports the fusion of a broader range of features and is highly flexible, allowing for expansion based on specific usage requirements. Additionally, we introduce a sequence transformation method specifically designed for WSIs. This method is not only adaptable to different WSI sizes but also captures additional feature expression, resulting in a more effective exploitation of sequential cues. Extensive experiments demonstrate that MoMIL surpasses state-of-the-art MIL methods, up to 0.027 AUC improvements for cancer sub-typing. We conducted extensive experiments on three downstream tasks with a total of five datasets, achieving improvements in all performance metrics. The code is available at <span><span>https://github.com/YuqiZhang-Buaa/MoMIL</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"167 ","pages":"Article 105918"},"PeriodicalIF":4.2,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146078416","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MSTPFormer: Mamba-driven spatiotemporal bidirectional dual-stream parallel transformer for 3D human pose estimation
Pub Date: 2026-03-01 | Epub Date: 2026-01-14 | DOI: 10.1016/j.imavis.2026.105912
Tiandi Peng, Yanmin Luo, Jiancong Liang, Gonggeng Lin
Monocular 3D human pose estimation from video sequences requires effectively capturing both spatial and temporal information. However, ensuring long-term temporal consistency while maintaining accurate local motion remains a major challenge. In this paper, we present MSTPFormer, a dual-branch framework that separately models global temporal dynamics and local spatial representations for robust spatio-temporal learning. To model global motion, we design two modules based on the state space model (SSM) mechanism of Mamba. The Spatial Scan Block (S-Scan) applies a bidirectional spatial scanning strategy to form closed-loop joint interactions, enhancing local motion chain representation. The Temporal Scan Block (T-Scan) constructs joint-specific temporal channels along the sequence, enabling individualized motion trajectory modeling for each of the 17 joints. For local modeling, we design a Transformer branch that refines spatial features within each frame, enhancing the expressiveness of joint-level details. This dual-branch design enables effective decoupling and fusion of global-local and spatial-temporal cues. Experiments on Human3.6M and MPI-INF-3DHP demonstrate that MSTPFormer achieves state-of-the-art performance, with P1 errors of 37.6 mm on Human3.6M and 13.6 mm on MPI-INF-3DHP.
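A minimal sketch of a bidirectional joint scan in the spirit of S-Scan: joint features are processed in skeleton order and in the reverse order, and the two passes are summed to form a closed-loop interaction. The GRU is an illustrative placeholder for the Mamba-style SSM, and the flat joint ordering is an assumption.

```python
import torch
import torch.nn as nn

class BidirectionalJointScan(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.fwd = nn.GRU(dim, dim, batch_first=True)
        self.bwd = nn.GRU(dim, dim, batch_first=True)

    def forward(self, x):                      # x: (B, 17, dim) joints of one frame
        f, _ = self.fwd(x)                     # scan joints in skeleton order
        b, _ = self.bwd(torch.flip(x, dims=[1]))   # scan in reverse order
        return f + torch.flip(b, dims=[1])     # fuse both directions, closing the loop

out = BidirectionalJointScan()(torch.randn(8, 17, 64))
```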
{"title":"MSTPFormer: Mamba-driven spatiotemporal bidirectional dual-stream parallel transformer for 3D human pose estimation","authors":"Tiandi Peng , Yanmin Luo , Jiancong Liang , Gonggeng Lin","doi":"10.1016/j.imavis.2026.105912","DOIUrl":"10.1016/j.imavis.2026.105912","url":null,"abstract":"<div><div>Monocular 3D human pose estimation from video sequences requires effectively capturing both spatial and temporal information. However, ensuring long-term temporal consistency while maintaining accurate local motion remains a major challenge. In this paper, we present MSTPFormer, a dual-branch framework that separately models global temporal dynamics and local spatial representations for robust spatio-temporal learning. To model global motion, We design two modules based on the state space mechanism (SSM) of Mamba. The Spatial Scan Block (S-Scan) applies a bidirectional spatial scanning strategy to form closed-loop joint interactions, enhancing local motion chain representation. The Temporal Scan Block (T-Scan) constructs joint-specific temporal channels along the sequence, enabling individualized motion trajectory modeling for each of the 17 joints. For local modeling, we design a Transformer branch to refine spatial features within each frame, thereby enhancing the expressiveness of joint-level details. This dual-branch design enables effective decoupling and fusion of global–local and spatial–temporal cues. Experiments on Human3.6M and MPI-INF-3DHP demonstrate that MSTPFormer achieves state-of-the-art performance, with P1 errors of 37.6 mm on Human3.6M and 13.6 mm on MPI-INF-3DHP.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"167 ","pages":"Article 105912"},"PeriodicalIF":4.2,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146023143","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Long-FAS: Cross-domain face anti-spoofing with long text guidance
Pub Date: 2026-03-01 | DOI: 10.1016/j.imavis.2026.105901
Jianwen Zhang, Jianfeng Zhang, Dedong Yang, Rongtao Li, Ziyang Li
Recent studies have demonstrated that using natural language as a supervisory signal can enhance face anti-spoofing (FAS) performance; however, these methods still fall short in fully addressing long-text inputs and fine-grained information. To mitigate these limitations, we leverage MiniGPT-4 to generate detailed long-form textual descriptions of facial features for input images, and propose a novel framework, Long-FAS, which extracts textual and visual information through a dual-branch architecture. Specifically, we incorporate positional encoding for knowledge retention to enable the learning of effective feature representations from long texts, and employ principal component analysis (PCA) matching to capture essential attribute information while prioritizing critical attributes. Furthermore, matching visual and textual features at both coarse and fine granularities enhances the model’s ability to handle both long and short texts, empowering it to learn robust discriminative cues from facial images. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art counterparts.
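An illustrative sketch of the PCA-matching idea: long-text embeddings are reduced to their leading principal directions, and visual features are scored against the compressed attribute space by cosine similarity. The dimensions and the use of scikit-learn are assumptions made for clarity, not the paper's exact pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
text_feats = rng.normal(size=(200, 512))    # embeddings of long descriptions
vis_feats = rng.normal(size=(16, 512))      # per-image visual features

pca = PCA(n_components=32).fit(text_feats)  # keep dominant attribute axes
t = pca.transform(text_feats)
v = pca.transform(vis_feats)                # project visual feats into same space

t = t / np.linalg.norm(t, axis=1, keepdims=True)
v = v / np.linalg.norm(v, axis=1, keepdims=True)
scores = v @ t.T                            # coarse visual-text matching scores
```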
{"title":"Long-FAS: Cross-domain face anti-spoofing with long text guidance","authors":"Jianwen Zhang , Jianfeng Zhang , Dedong Yang, Rongtao Li, Ziyang Li","doi":"10.1016/j.imavis.2026.105901","DOIUrl":"10.1016/j.imavis.2026.105901","url":null,"abstract":"<div><div>Recent studies have demonstrated that utilizing natural language as a supervisory signal can enhance face anti-spoofing (FAS) performance; however, these methods still fall short in fully addressing long-text inputs and fine-grained information. To mitigate these limitations, we leverage MiniGPT-4 to generate detailed long-form textual descriptions of facial features for input images, and propose a novel framework, Long-FAS, which extracts textual and visual information through a dual-branch architecture. Specifically, we incorporate positional encoding for knowledge retention to enable the learning of effective feature representations from long texts, and employ principal component analysis (PCA) matching to capture essential attribute information while prioritizing critical attributes. Furthermore, matching visual and textual features at both coarse and fine granularities enhances the model’s ability to effectively handle both long and short texts, thereby empowering it to learn robust discriminative cues from facial images. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art counterparts.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"167 ","pages":"Article 105901"},"PeriodicalIF":4.2,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145927625","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Distributed quantum model learning for traffic density estimation
Pub Date: 2026-03-01 | Epub Date: 2026-01-06 | DOI: 10.1016/j.imavis.2026.105900
Kewen Wang, Bin Wang, Wenzhe Zhai, Jing-an Cheng
In Intelligent Autonomous Transport Systems (IATS), the integration of lightweight machine learning techniques enables the deployment of real-time, efficient AI models on edge devices. A fundamental task is traffic density estimation, which is crucial for efficient intelligent traffic control. Rapid progress in deep neural networks (DNNs) has led to notable improvements in traffic density estimation accuracy. However, two main issues remain unsolved. First, current DNN models involve numerous parameters and consume substantial computing resources, and their performance degrades when detecting multi-scale vehicle targets. Second, growing privacy concerns have made individuals increasingly unwilling to share their data for model training, leading to data isolation challenges. To address these problems, we introduce the Distributed Quantum Model Learning (DQML) model for traffic density estimation. It incorporates an Efficient Quantum-driven Adaptive (EQA) module that captures multi-scale information using quantum states. In addition, we propose a distributed learning strategy that trains multiple client models on local data and aggregates them via a global parameter server. This strategy ensures privacy protection while significantly improving estimation performance compared to models trained on limited, isolated data. We evaluated the proposed model on six key benchmarks for vehicle and crowd density analysis, and comprehensive experiments demonstrated that it surpasses other state-of-the-art models in both accuracy and efficiency.
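A compact sketch of the distributed strategy described above, in the style of federated averaging: each client trains on its private data, and a parameter server averages the weights into a new global model. The quantum EQA module is omitted; this only illustrates the aggregation step, with all names assumed.

```python
import copy
import torch
import torch.nn as nn

def aggregate(client_models):
    """Average client state dicts into a new global state dict."""
    states = [m.state_dict() for m in client_models]
    avg = copy.deepcopy(states[0])
    for key in avg:
        avg[key] = torch.stack([s[key].float() for s in states]).mean(dim=0)
    return avg

global_model = nn.Linear(10, 1)                       # toy stand-in for the estimator
clients = [copy.deepcopy(global_model) for _ in range(4)]
# ... each client would train locally on its private data here ...
global_model.load_state_dict(aggregate(clients))      # server-side aggregation round
```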
{"title":"Distributed quantum model learning for traffic density estimation","authors":"Kewen Wang, Bin Wang, Wenzhe Zhai, Jing-an Cheng","doi":"10.1016/j.imavis.2026.105900","DOIUrl":"10.1016/j.imavis.2026.105900","url":null,"abstract":"<div><div>In Intelligent Autonomous Transport Systems (IATS), the integration of lightweight machine learning techniques enables the deployment of real-time and efficient AI models on edge devices. A fundamental aspect is to estimate traffic density, which is crucial for efficient intelligent traffic control. The rapid progress in deep neural networks (DNNs) has led to a notable improvement in the accuracy of traffic density estimation. However, two main issues remain unsolved. Firstly, current DNN models involve numerous parameters and consume large computing resources, and thus their performance degrades when detecting multi-scale vehicle targets. Secondly, growing privacy concerns have made individuals increasingly unwilling to share their data for model training, which leads to data isolation challenges. To address the problems above, we introduce the Distributed Quantum Model Learning (DQML) model for traffic density estimation. It combines an Efficient Quantum-driven Adaptive (EQA) module to capture multi-scale information using quantum states. In addition, we propose a distributed learning strategy that trains multiple client models with local data and aggregates them via a global parameter server. This strategy ensures privacy protection while offering a significant improvement in estimation performance compared to models trained on limited and isolated data. We evaluated the proposed model on six key benchmarks for vehicle and crowd density analysis, and comprehensive experiments demonstrated that it surpasses other state-of-the-art models in both accuracy and efficiency.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"167 ","pages":"Article 105900"},"PeriodicalIF":4.2,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145927626","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Compositional Gamba for 3D human pose estimation
Pub Date: 2026-03-01 | Epub Date: 2026-01-19 | DOI: 10.1016/j.imavis.2026.105913
Lu Zhou, Yingying Chen, Jinqiao Wang
2D-to-3D human pose estimation based on GCNs (graph convolutional networks) has sparked a wave of research and garnered widespread attention, profiting from its strong competence in joint relation modeling. Yet performance still lags behind, owing to the scarcity of universal and sophisticated human knowledge. Advances in state space models, notably Mamba, have demonstrated extraordinary sequential modeling ability and proved effective for long-sequence modeling and macro knowledge acquisition. To alleviate the modeling bias of existing techniques, we advance an innovative hybrid architecture in which GCNs are married with Mamba to learn multi-level human knowledge collaboratively, an effective way to conquer the dilemma caused by the ill-posed nature of the task. Concretely, we design a compositional Gamba (GCNs-Mamba) block in which GCNs and Mamba alternately enforce local-global modeling on different feature segments. Additionally, a compositional pattern is formulated in which multi-level human topological relations are learned and explicit human priors are embedded. The proposed approach outperforms previously published works on both the Human3.6M and MPI-INF-3DHP benchmarks, attesting to the efficacy of the hybrid architecture.
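A rough sketch of the compositional idea: channel features are split into two segments, one mixed locally through a skeleton adjacency (the GCN role) and one scanned globally along the joint sequence (the Mamba role, here a GRU stand-in). The learnable adjacency and all sizes are illustrative assumptions, not the paper's block.

```python
import torch
import torch.nn as nn

class GambaBlock(nn.Module):
    def __init__(self, dim=64, n_joints=17):
        super().__init__()
        self.adj = nn.Parameter(torch.eye(n_joints))   # learnable joint graph
        self.gcn_proj = nn.Linear(dim // 2, dim // 2)
        self.scan = nn.GRU(dim // 2, dim // 2, batch_first=True)

    def forward(self, x):                  # x: (B, n_joints, dim)
        local, glob = x.chunk(2, dim=-1)   # split into two feature segments
        # GCN-style mixing: propagate features over the joint graph
        local = self.gcn_proj(torch.einsum('jk,bkd->bjd', self.adj, local))
        glob, _ = self.scan(glob)          # global scan along the joint sequence
        return torch.cat([local, glob], dim=-1)

y = GambaBlock()(torch.randn(2, 17, 64))
```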
{"title":"Compositional Gamba for 3D human pose estimation","authors":"Lu Zhou , Yingying Chen , Jinqiao Wang","doi":"10.1016/j.imavis.2026.105913","DOIUrl":"10.1016/j.imavis.2026.105913","url":null,"abstract":"<div><div>GCNs (graph convolutional networks) based 2D to 3D human pose estimation has sparked a wave of research and garnered widespread attention profited from its strong competence in joint relation modeling. Yet, the performance still lags behind on account of the scarcity of universal and sophisticated human knowledge. Advancements in state space models, notably Mamba, which demonstrates extraordinary sequential modeling talents has proved its effectiveness on long sequence modeling and macro knowledge acquisition. To alleviate the modeling bias in existing techniques, we advance an innovative hybrid architecture where GCNs are married with the Mamba to learn the multi-level human knowledge in a collaborative manner which is an effective manner to conquer the dilemma caused by the ill-posed issue. Concretely, we design a compositional Gamba (GCNs-Mamba) block where GCNs and Mamba enforce the local–global modeling upon different feature segments alternatively. Additionally, a compositional pattern is skillfully formulated in which multi-level human topological relation is learned and explicit human prior is embedded. The proposed approach outperforms the preceding published works on both the Human3.6M and MPI-INF-3DHP benchmarks, attesting to the efficacy of the hybrid architecture.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"167 ","pages":"Article 105913"},"PeriodicalIF":4.2,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146023142","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cross-level fusion network for two-stage polyp segmentation via integrity learning
Pub Date: 2026-03-01 | Epub Date: 2025-12-19 | DOI: 10.1016/j.imavis.2025.105883
Junzhuo Liu, Dorit Merhof, Zhixiang Wang
Colorectal cancer is one of the most prevalent and lethal forms of cancer. The automated detection, segmentation, and classification of early polyp tissue in endoscopy images of the colorectum has demonstrated impressive potential for improving clinical diagnostic accuracy, avoiding missed detections, and reducing the incidence of colorectal cancer in the population. However, most existing studies fail to consider the potential of information fusion between different deep neural network layers or to optimize model complexity, resulting in poor clinical utility. To address these limitations, the concept of integrity learning is introduced, dividing polyp segmentation into two progressive stages, and a cross-level fusion lightweight network, IC-FusionNet, is proposed to accurately segment polyps in endoscopy images. First, the Context Fusion Module (CFM) aggregates information from neighboring encoder branches and the current level to achieve macro-integrity learning. In the second stage, polyp detail information from shallower layers is aggregated with deeper high-dimensional semantic information, so that complementary information across layers is mutually enhanced. IC-FusionNet is evaluated on five polyp segmentation benchmark datasets across eight evaluation metrics. It achieves mDice of 0.908 and 0.925 on the Kvasir and CVC-ClinicDB datasets, respectively, along with mIoU of 0.851 and 0.973. On three external polyp segmentation test datasets, the model obtains an average mDice of 0.788 and an average mIoU of 0.712. Compared to existing methods, IC-FusionNet achieves superior or near-optimal performance on most evaluation metrics. Moreover, IC-FusionNet contains only 3.84 M parameters and 0.76 G MACs, a reduction of 9.22% in parameter count and 74.15% in computational complexity compared to recent lightweight segmentation networks.
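A simplified sketch of a context-fusion step of this kind: features from the shallower and deeper encoder levels are resampled to the current resolution and fused with a 1x1 convolution. Channel counts and the fusion operator are assumptions; the paper's CFM is more elaborate.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextFusion(nn.Module):
    def __init__(self, c_prev, c_cur, c_next):
        super().__init__()
        self.fuse = nn.Conv2d(c_prev + c_cur + c_next, c_cur, kernel_size=1)

    def forward(self, f_prev, f_cur, f_next):
        size = f_cur.shape[-2:]
        # bring neighboring encoder levels to the current spatial resolution
        f_prev = F.interpolate(f_prev, size=size, mode='bilinear', align_corners=False)
        f_next = F.interpolate(f_next, size=size, mode='bilinear', align_corners=False)
        return self.fuse(torch.cat([f_prev, f_cur, f_next], dim=1))

cfm = ContextFusion(32, 64, 128)
out = cfm(torch.randn(1, 32, 64, 64), torch.randn(1, 64, 32, 32),
          torch.randn(1, 128, 16, 16))
```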
{"title":"Cross-level fusion network for two-stage polyp segmentation via integrity learning","authors":"Junzhuo Liu , Dorit Merhof , Zhixiang Wang","doi":"10.1016/j.imavis.2025.105883","DOIUrl":"10.1016/j.imavis.2025.105883","url":null,"abstract":"<div><div>Colorectal cancer is one of the most prevalent and lethal forms of cancer. The automated detection, segmentation and classification of early polyp tissues from endoscopy images of the colorectum has demonstrated impressive potential in improving clinical diagnostic accuracy, avoiding missed detections and reducing the incidence of colorectal cancer in the population. However, most existing studies fail to consider the potential of information fusion between different deep neural network layers and optimization with respect to model complexity, resulting in poor clinical utility. To address the above limitations, the concept of integrity learning is introduced, which divides polyp segmentation into two stages for progressive completion, and a cross-level fusion lightweight network, IC-FusionNet, is proposed to accurately segment polyps from endoscopy images. First, the Context Fusion Module (CFM) of the network aggregates the encoder neighboring branches and current level information to achieve macro-integrity learning. In the second stage, polyp detail information from shallower layers and deeper high-dimensional semantic information are aggregated to achieve enhancement between different layers of complementary information. IC-FusionNet is evaluated on five polyp segmentation benchmark datasets across eight evaluation metrics to assess its performance. IC-FusionNet achieves <span><math><mi>mDice</mi></math></span> of 0.908 and 0.925 on the Kvasir and CVC-ClinicDB datasets, respectively, along with <span><math><mi>mIou</mi></math></span> of 0.851 and 0.973. On three external polyp segmentation test datasets, the model obtains an average <span><math><mi>mDice</mi></math></span> of 0.788 and an average <span><math><mi>mIou</mi></math></span> of 0.712. Compared to existing methods, IC-FusionNet achieves superior or near-optimal performance across most evaluation metrics. Moreover, IC-FusionNet contains only 3.84 M parameters and 0.76G MACs, representing a reduction of 9.22% in parameter count and 74.15% in computational complexity compared to recent lightweight segmentation networks.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"167 ","pages":"Article 105883"},"PeriodicalIF":4.2,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145842587","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Enhanced medical image segmentation via synergistic feature guidance and multi-scale refinement
Pub Date: 2026-03-01 | Epub Date: 2026-01-20 | DOI: 10.1016/j.imavis.2026.105914
Shaoqiang Wang, Guiling Shi, Xiaofeng Xu, Tiyao Liu, Yawu Zhao, Xiaochun Cheng, Yuchen Wang
Medical image segmentation is pivotal for clinical diagnosis but remains challenged by the inherent trade-offs between global context modeling and local detail preservation, as well as the susceptibility of deep networks to acquisition noise and scale variations. While hybrid CNN-Transformer architectures have emerged to address receptive field limitations, they often incur prohibitive computational costs and lack the inductive bias required for small-sample medical datasets. To resolve these systemic bottlenecks efficiently, we propose SFRNet V2. By integrating parallel local-regional perception, active noise filtration in skip connections, and elastic multi-scale aggregation at the bottleneck, our approach systematically overcomes the limitations of fixed receptive fields and feature ambiguity. Extensive experiments on four diverse public datasets (CVC-ClinicDB, ISIC 2017, TN3K, and MICCAI Tooth) demonstrate that SFRNet V2 consistently outperforms recent competitors. Notably, our model achieves the highest accuracy with only 19.85 M parameters and a rapid inference speed of 2.7 ms, offering a superior balance between precision and clinical deployability.
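A minimal sketch of the "active noise filtration" idea on a skip connection: the decoder feature gates the encoder feature pixel- and channel-wise before the skip is concatenated, suppressing noisy activations. This is a generic attention-gate pattern assumed here for illustration, not the paper's exact module.

```python
import torch
import torch.nn as nn

class FilteredSkip(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(channels * 2, channels, kernel_size=1),
            nn.Sigmoid())

    def forward(self, enc_feat, dec_feat):
        # gate in [0, 1] computed from both streams filters the encoder feature
        g = self.gate(torch.cat([enc_feat, dec_feat], dim=1))
        return torch.cat([enc_feat * g, dec_feat], dim=1)   # filtered skip connection

skip = FilteredSkip(64)
fused = skip(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
```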
{"title":"Enhanced medical image segmentation via synergistic feature guidance and multi-scale refinement","authors":"Shaoqiang Wang , Guiling Shi , Xiaofeng Xu , Tiyao Liu , Yawu Zhao , Xiaochun Cheng , Yuchen Wang","doi":"10.1016/j.imavis.2026.105914","DOIUrl":"10.1016/j.imavis.2026.105914","url":null,"abstract":"<div><div>Medical image segmentation is pivotal for clinical diagnosis but remains challenged by the inherent trade-offs between global context modeling and local detail preservation, as well as the susceptibility of deep networks to acquisition noise and scale variations. While hybrid CNN-Transformer architectures have emerged to address receptive field limitations, they often incur prohibitive computational costs and lack the inductive bias required for small-sample medical datasets. To resolve these systemic bottlenecks efficiently, we propose SFRNet V2. By integrating parallel local-regional perception, active noise filtration in skip connections, and elastic multi-scale aggregation at the bottleneck, our approach systematically overcomes the limitations of fixed receptive fields and feature ambiguity. Extensive experiments on four diverse public datasets (CVC-ClinicDB, ISIC 2017, TN3K, and MICCAI Tooth) demonstrate that SFRNet V2 consistently outperforms recent competitors. Notably, our model achieves the highest accuracy with only 19.85 M parameters and a rapid inference speed of 2.7 ms, offering a superior balance between precision and clinical deployability.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"167 ","pages":"Article 105914"},"PeriodicalIF":4.2,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146023141","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CMALDD-PTAF: Cross-modal adversarial learning for deepfake detection by leveraging pre-trained models and cross-attention fusion
Pub Date: 2026-02-01 | Epub Date: 2025-12-20 | DOI: 10.1016/j.imavis.2025.105885
Yuanfan Jin, Yongfang Wang
The emergence of novel deepfake algorithms capable of generating highly realistic manipulated audio-visual content has sparked significant public concern regarding the authenticity and trustworthiness of digital media, driving the development of multimodal deepfake detection methods. In this paper, we present a novel two-stage multimodal detection framework that harnesses pre-trained audio-visual speech recognition models and cross-attention fusion to achieve state-of-the-art performance with efficient cross-domain adversarial training. In the first stage, we use a pre-trained audio-visual representation learning model from the speech recognition domain to extract unimodal features; comprehensive analysis confirms the efficacy of these features for deepfake detection. In the second stage, we propose a specialized cross-modality fusion module that integrates the unimodal features for multimodal deepfake detection. Furthermore, we use a transformer model for final classification and implement an adversarial learning strategy to enhance the robustness of the model. Our proposed method achieves 98.9% accuracy and 99.6% AUC on the multimodal deepfake detection benchmark FakeAVCeleb, outperforming the latest multimodal detector NPVForensics by 0.57 percentage points in AUC, while maintaining low training cost and a relatively simple architecture.
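A hedged sketch of cross-attention fusion between the two pre-trained streams: visual tokens attend to audio tokens and vice versa, and the two attended summaries are concatenated for the classifier. Dimensions and pooling are illustrative; the paper's fusion module may differ in detail.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis, aud):               # (B, Lv, dim), (B, La, dim)
        v, _ = self.v2a(vis, aud, aud)         # visual queries attend over audio
        a, _ = self.a2v(aud, vis, vis)         # audio queries attend over visual
        return torch.cat([v.mean(1), a.mean(1)], dim=-1)   # (B, 2*dim) fused feature

fused = CrossModalFusion()(torch.randn(2, 75, 256), torch.randn(2, 100, 256))
```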
{"title":"CMALDD-PTAF: Cross-modal adversarial learning for deepfake detection by leveraging pre-trained models and cross-attention fusion","authors":"Yuanfan Jin, Yongfang Wang","doi":"10.1016/j.imavis.2025.105885","DOIUrl":"10.1016/j.imavis.2025.105885","url":null,"abstract":"<div><div>The emergence of novel deepfake algorithms capable of generating highly realistic manipulated audio–visual content has sparked significant public concern regarding the authenticity and trustworthiness of digital media. This concern has driven the development of multimodal deepfake detection methods. In this paper, we present a novel two-stage multimodal detection framework that harnesses pre-trained audio–visual speech recognition models and cross-attention fusion to achieve state-of-the-art performance with efficient cross-domain adversarial training. Our approach consists of two stages. In the first stage, we utilize a pre-trained audio–visual representation learning model from the speech recognition domain to extract unimodal features. Comprehensive analysis confirms the efficacy of these features for deepfake detection. In the second stage, we propose a specialized cross-modality fusion module to integrate the unimodal features for multimodal deepfake detection. Furthermore, we utilize a transformer model for final classification and implement an adversarial learning strategy to enhance robustness of the model. Our proposed method achieves 98.9% accuracy and 99.6% AUC on the multimodal deepfake detection benchmark FakeAVCeleb, outperforming the latest multimodal detector NPVForensics by 0.57 percentage points in AUC , while maintaining low training cost and a relatively simple architecture.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"166 ","pages":"Article 105885"},"PeriodicalIF":4.2,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145840004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ShadowMamba: State-space model with boundary-region selective scan for shadow removal
Pub Date: 2026-02-01 | Epub Date: 2025-12-12 | DOI: 10.1016/j.imavis.2025.105872
Xiujin Zhu, Chee-Onn Chow, Joon Huang Chuah
Image shadow removal is a typical low-level vision task, as shadows introduce abrupt local brightness variations that degrade the performance of downstream tasks. Due to the quadratic complexity of Transformers, many existing methods adopt local attention to balance accuracy and efficiency. However, restricting attention to local windows prevents true long-range dependency modeling and limits shadow removal performance. Recently, Mamba has shown strong ability in vision tasks by achieving global modeling with linear complexity. Despite this advantage, existing scanning mechanisms in the Mamba architecture are not suitable for shadow removal because they ignore the semantic continuity within the same region. To address this, a boundary-region selective scanning mechanism is proposed that captures local details while enhancing continuity among semantically related pixels, effectively improving shadow removal performance. In addition, a shadow mask denoising preprocessing method is introduced to improve the accuracy of the scanning mechanism and further enhance the data quality. Based on this, this paper presents ShadowMamba, the first Mamba-based model for shadow removal. Experimental results show that the proposed method outperforms existing mainstream approaches on the AISTD, ISTD, SRD, and WSRD+ datasets, and demonstrates good generalization ability in cross-dataset testing on USR and SBU. Meanwhile, the model also has significant advantages in parameter efficiency and computational complexity. Code is available at: https://github.com/ZHUXIUJINChris/ShadowMamba.
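An illustrative reordering in the spirit of a boundary-region selective scan: pixels are grouped into shadow interior, boundary band, and non-shadow exterior before being flattened for a sequential model, so that semantically related pixels stay contiguous. The grouping rule and band width are assumptions, not the paper's scanning mechanism.

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def boundary_region_order(mask):
    """mask: (H, W) bool shadow mask. Returns flat pixel indices ordered as
    [shadow interior, boundary band, non-shadow exterior]."""
    interior = binary_erosion(mask, iterations=2)
    band = binary_dilation(mask, iterations=2) & ~interior   # boundary region
    exterior = ~(interior | band)
    flat = lambda m: np.flatnonzero(m.ravel())
    return np.concatenate([flat(interior), flat(band), flat(exterior)])

mask = np.zeros((8, 8), dtype=bool)
mask[2:6, 2:6] = True
order = boundary_region_order(mask)   # a permutation of all 64 pixel indices
# A sequence model would scan features[order] and scatter results back afterwards.
```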
{"title":"ShadowMamba: State-space model with boundary-region selective scan for shadow removal","authors":"Xiujin Zhu , Chee-Onn Chow , Joon Huang Chuah","doi":"10.1016/j.imavis.2025.105872","DOIUrl":"10.1016/j.imavis.2025.105872","url":null,"abstract":"<div><div>Image shadow removal is a typical low-level vision task, as shadows introduce abrupt local brightness variations that degrade the performance of downstream tasks. Due to the quadratic complexity of Transformers, many existing methods adopt local attention to balance accuracy and efficiency. However, restricting attention to local windows prevents true long-range dependency modeling and limits shadow removal performance. Recently, Mamba has shown strong ability in vision tasks by achieving global modeling with linear complexity. Despite this advantage, existing scanning mechanisms in the Mamba architecture are not suitable for shadow removal because they ignore the semantic continuity within the same region. To address this, a boundary-region selective scanning mechanism is proposed that captures local details while enhancing continuity among semantically related pixels, effectively improving shadow removal performance. In addition, a shadow mask denoising preprocessing method is introduced to improve the accuracy of the scanning mechanism and further enhance the data quality. Based on this, this paper presents ShadowMamba, the first Mamba-based model for shadow removal. Experimental results show that the proposed method outperforms existing mainstream approaches on the AISTD, ISTD, SRD, and WSRD+ datasets, and demonstrates good generalization ability in cross-dataset testing on USR and SBU. Meanwhile, the model also has significant advantages in parameter efficiency and computational complexity. Code is available at: <span><span>https://github.com/ZHUXIUJINChris/ShadowMamba</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"166 ","pages":"Article 105872"},"PeriodicalIF":4.2,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145737140","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}