Pub Date: 2025-10-30 | DOI: 10.1016/j.cviu.2025.104560
Title: Enhancing action recognition by leveraging the hierarchical structure of actions and textual context
Manuel Benavent-Lledo, David Mulero-Pérez, David Ortiz-Perez, Jose Garcia-Rodriguez, Antonis Argyros
We propose a novel approach to improve action recognition by exploiting the hierarchical organization of actions and by incorporating contextualized textual information, including location and previous actions, to reflect the action’s temporal context. To achieve this, we introduce a transformer architecture tailored for action recognition that employs both visual and textual features. Visual features are obtained from RGB and optical flow data, while text embeddings represent contextual information. Furthermore, we define a joint loss function to simultaneously train the model for both coarse- and fine-grained action recognition, effectively exploiting the hierarchical nature of actions. To demonstrate the effectiveness of our method, we extend the Toyota Smarthome Untrimmed (TSU) dataset by incorporating action hierarchies, resulting in the Hierarchical TSU dataset, a hierarchical dataset designed for monitoring activities of the elderly in home environments. An ablation study assesses the performance impact of different strategies for integrating contextual and hierarchical data. Experimental results demonstrate that the proposed method consistently outperforms SOTA methods on the Hierarchical TSU dataset, Assembly101 and IkeaASM, achieving over a 17% improvement in top-1 accuracy.
Computer Vision and Image Understanding, Volume 262, Article 104560.
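The joint coarse- and fine-grained objective described in the abstract above could be sketched as below. This is a minimal illustration, not the paper's implementation: the weighting factor `alpha`, the two linear heads, and the use of plain cross-entropy are assumptions.

```python
# Minimal sketch of a joint coarse-/fine-grained action loss (assumed form:
# a weighted sum of two cross-entropy terms; the paper's exact loss may differ).
import torch
import torch.nn as nn

class HierarchicalHead(nn.Module):
    def __init__(self, feat_dim: int, n_coarse: int, n_fine: int, alpha: float = 0.5):
        super().__init__()
        self.coarse_head = nn.Linear(feat_dim, n_coarse)  # coarse action logits
        self.fine_head = nn.Linear(feat_dim, n_fine)      # fine action logits
        self.alpha = alpha                                 # assumed weighting factor
        self.ce = nn.CrossEntropyLoss()

    def forward(self, fused_feats, coarse_labels, fine_labels):
        # fused_feats: (B, feat_dim) fused visual + textual features from the transformer
        coarse_logits = self.coarse_head(fused_feats)
        fine_logits = self.fine_head(fused_feats)
        loss = self.alpha * self.ce(coarse_logits, coarse_labels) + \
               (1 - self.alpha) * self.ce(fine_logits, fine_labels)
        return loss, coarse_logits, fine_logits

# Example usage with random tensors
head = HierarchicalHead(feat_dim=512, n_coarse=10, n_fine=50)
feats = torch.randn(4, 512)
loss, _, _ = head(feats, torch.randint(0, 10, (4,)), torch.randint(0, 50, (4,)))
```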
Pub Date: 2025-10-29 | DOI: 10.1016/j.cviu.2025.104541
Title: Real-time fusion of stereo vision and hyperspectral imaging for objective decision support during surgery
Eric L. Wisotzky, Jost Triller, Michael Knoke, Brigitta Globke, Anna Hilsmann, Peter Eisert
We present a real-time stereo hyperspectral imaging (stereo-HSI) system for intraoperative tissue and organ analysis that integrates multispectral snapshot imaging with stereo vision to support clinical decision-making. The system visualizes both RGB and high-dimensional spectral data while simultaneously reconstructing 3D surfaces, offering a compact, non-contact solution for seamless integration into surgical workflows. A modular processing pipeline enables robust demosaicing, spectral and spatial fusion, and pixel-wise medical assessment, including perfusion and tissue classification. Our spectral warping algorithm leverages a custom learned mapping, our white-balance network method is the first for snapshot MSI cameras, and our fusion CNN employs spectral-attention modules to exploit the rich hyperspectral domain. Clinical feasibility was demonstrated in 57 surgical procedures, including kidney transplantation, parotidectomy, and neck dissection, achieving high spatial and spectral resolution under standard surgical lighting conditions. The system enables real-time visualization of oxygenation and tissue composition, offering surgeons a novel tool for image-guided interventions. This study establishes the stereo-HSI platform as a clinically viable and effective method for enhancing intraoperative insight and surgical precision.
Computer Vision and Image Understanding, Volume 262, Article 104541.
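The fusion CNN above is said to use spectral-attention modules; a minimal squeeze-and-excitation style sketch over spectral bands is shown below. The module structure and reduction ratio are assumptions, not details taken from the paper.

```python
# Minimal sketch of a spectral-attention block in the spirit of the fusion CNN
# described above (squeeze-and-excitation over spectral bands; the actual module
# design in the paper may differ).
import torch
import torch.nn as nn

class SpectralAttention(nn.Module):
    def __init__(self, n_bands: int, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # global spatial pooling per band
        self.mlp = nn.Sequential(
            nn.Linear(n_bands, n_bands // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(n_bands // reduction, n_bands),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # x: (B, n_bands, H, W) demosaiced multispectral snapshot cube
        b, c, _, _ = x.shape
        w = self.mlp(self.pool(x).view(b, c))        # per-band attention weights
        return x * w.view(b, c, 1, 1)                # reweight spectral bands

cube = torch.randn(2, 16, 64, 64)                    # e.g. a 16-band snapshot cube (assumed)
out = SpectralAttention(n_bands=16)(cube)
```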
Pub Date: 2025-10-27 | DOI: 10.1016/j.cviu.2025.104531
Title: SCENE-Net: Geometric induction for interpretable and low-resource 3D pole detection with Group-Equivariant Non-Expansive Operators
Diogo Lavado, Alessandra Micheletti, Giovanni Bocchi, Patrizio Frosini, Cláudia Soares
This paper introduces SCENE-Net, a novel low-resource, white-box model that serves as a compelling proof-of-concept for 3D point cloud segmentation. At its core, SCENE-Net employs Group Equivariant Non-Expansive Operators (GENEOs), a mechanism that leverages geometric priors for enhanced object identification. Our contribution extends the theoretical landscape of geometric learning, highlighting the utility of geometric observers as intrinsic biases in analyzing 3D environments. Through empirical testing and efficiency analysis, we demonstrate the performance of SCENE-Net in detecting power line supporting towers, a key application in forest fire prevention. Our results showcase the superior accuracy and resilience of our model to label noise, achieved with minimal computational resources—this instantiation of SCENE-Net has only eleven trainable parameters—thereby marking a significant step forward in trustworthy machine learning applied to 3D scene understanding. Our code is available at https://github.com/dlavado/scene-net.
Computer Vision and Image Understanding, Volume 262, Article 104531.
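The abstract stresses that this instantiation of SCENE-Net has only eleven trainable parameters, because each operator is driven by a small set of geometric parameters rather than a free-form kernel. The sketch below illustrates that general idea with a hypothetical soft-cylinder operator (two trainable parameters) applied to a voxelized point cloud; it is not the paper's GENEO formulation.

```python
# Illustration of the "few trainable parameters" idea: a 3D convolution whose
# kernel is generated from a handful of geometric parameters (here a soft
# vertical-cylinder prior with a learnable radius and gain). Hypothetical sketch,
# not the GENEO operator defined in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CylinderOperator(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.k = kernel_size
        self.radius = nn.Parameter(torch.tensor(1.5))     # learnable radius (in voxels)
        self.gain = nn.Parameter(torch.tensor(1.0))       # learnable response gain
        # Precompute horizontal distances of each kernel cell to the vertical axis.
        r = torch.arange(kernel_size) - (kernel_size - 1) / 2
        _, yy, xx = torch.meshgrid(r, r, r, indexing="ij")
        self.register_buffer("dist", torch.sqrt(xx**2 + yy**2))  # constant along the vertical axis

    def forward(self, occupancy):
        # occupancy: (B, 1, D, H, W) voxelized point cloud (1 = occupied voxel)
        kernel = self.gain * torch.exp(-((self.dist / self.radius.clamp(min=0.1)) ** 2))
        kernel = kernel / kernel.sum()                     # normalized soft-cylinder kernel
        kernel = kernel.view(1, 1, self.k, self.k, self.k)
        return F.conv3d(occupancy, kernel, padding=self.k // 2)

op = CylinderOperator()
vox = (torch.rand(1, 1, 32, 32, 32) > 0.97).float()
response = op(vox)                                         # higher where occupancy forms column-like structures
print(sum(p.numel() for p in op.parameters()))             # only 2 trainable parameters
```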
Pub Date: 2025-10-26 | DOI: 10.1016/j.cviu.2025.104548
Title: RefineHOS: A high-performance hand–object segmentation with fine-grained spatial features
Wenrun Wang, Jianwu Dang, Yangping Wang, Rui Pan
Accurate segmentation of hands and interacting objects is a critical challenge in computer vision, owing to complex factors such as mutual occlusion, finger self-similarity, and the high flexibility of hand movement. To tackle these issues, we present RefineHOS, an innovative framework for precise pixel-level segmentation of hands and interacting objects. Based on RefineMask, RefineHOS features substantial architectural optimizations for hand–object interaction scenarios. Specifically, we design an Augmentation Feature Pyramid Path module (AFPN) integrated with a Dual Attention module (DAM) in the backbone to capture multi-scale feature information of hands and interacting objects. Additionally, we enhance segmentation performance by introducing a Triplet Attention module (TAM) to optimize both the mask head and the semantic head. We also present a new Boundary Refinement Module (BRM), which uses an iterative subdivision approach to improve the precision of boundary details in the segmentation results. Extensive experiments on multiple benchmark datasets (including VISOR, Ego-HOS, and ENIGMA-51) show that our method achieves state-of-the-art performance. To comprehensively evaluate segmentation quality, we introduce Boundary Average Precision (Boundary AP) as a key metric to complement existing benchmark segmentation metrics.
Computer Vision and Image Understanding, Volume 262, Article 104548.
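Boundary Average Precision is introduced above as a boundary-focused complement to standard mask metrics. The sketch below shows a related, simpler boundary measure (IoU between thin boundary bands) to make the idea concrete; the band width and the use of IoU rather than AP are assumptions, and the paper's exact protocol may differ.

```python
# Minimal sketch of a boundary-quality measure: compare thin boundary bands of
# predicted and ground-truth masks with a simple boundary IoU.
import numpy as np
from scipy.ndimage import binary_erosion

def mask_boundary(mask: np.ndarray, width: int = 2) -> np.ndarray:
    """Return a boundary band of the given width (in pixels) for a binary mask."""
    eroded = binary_erosion(mask, iterations=width, border_value=0)
    return mask & ~eroded

def boundary_iou(pred: np.ndarray, gt: np.ndarray, width: int = 2) -> float:
    pb, gb = mask_boundary(pred.astype(bool), width), mask_boundary(gt.astype(bool), width)
    inter, union = np.logical_and(pb, gb).sum(), np.logical_or(pb, gb).sum()
    return float(inter) / union if union > 0 else 1.0

gt = np.zeros((64, 64), dtype=bool); gt[16:48, 16:48] = True
pred = np.zeros_like(gt); pred[18:50, 16:48] = True        # slightly shifted prediction
print(round(boundary_iou(pred, gt), 3))
```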
Pub Date: 2025-10-24 | DOI: 10.1016/j.cviu.2025.104551
Title: SCESS-Net: Semantic consistency enhancement and segment selection network for audio–visual event localization
Jichen Gao, Suiping Zhou, Hang Yu, Chenyang Li, Xiaoxi Hu
As a central task in multi-modal learning, audio–visual event localization seeks to identify consistent event information within visual–audio segments and to determine event categories. Existing works often neglect the influence of visual features on audio features, as well as the segment information lost when selecting semantically consistent segments. To address these issues, we introduce a network that enhances the multi-task learning performance of visual–audio modalities and resolves the semantic inconsistency present in audio–visual segments by employing bi-directional collaborative guided attention and semantic consistency enhancement. First, we introduce a bi-directional collaborative guided attention module, which integrates multi-modal linear pooling and spatial-channel attention to bolster the semantic information of both audio and visual features across the visual-guided audio attention and audio-guided visual attention pathways. Second, we propose an innovative multi-modal similarity learning model that addresses the loss of information during the filtering of low-similarity segments, a common problem in existing approaches. By incorporating random masking of multi-modal features, this model is capable of learning robust audio–visual relationships. Finally, we capture global semantic information across the entire video in the temporal dimension and enhance the semantic consistency of events by using the differential semantics between global semantics and audio–visual segment semantics. Experimental results on the AVE dataset indicate that our network achieves superior performance.
Computer Vision and Image Understanding, Volume 262, Article 104551.
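One direction of the bi-directional collaborative guided attention (audio guiding visual features) could look roughly like the sketch below, using standard scaled dot-product attention; the actual module additionally integrates multi-modal linear pooling and spatial-channel attention, which are omitted here.

```python
# Minimal sketch of audio-guided visual attention: per-segment audio features
# attend over per-segment visual region features (simplified, not the paper's module).
import torch
import torch.nn as nn

class AudioGuidedVisualAttention(nn.Module):
    def __init__(self, a_dim: int, v_dim: int, d: int = 256):
        super().__init__()
        self.q = nn.Linear(a_dim, d)   # audio as query
        self.k = nn.Linear(v_dim, d)   # visual regions as keys
        self.v = nn.Linear(v_dim, d)   # visual regions as values

    def forward(self, audio, visual):
        # audio: (B, T, a_dim) segment-level audio features
        # visual: (B, T, R, v_dim) per-segment visual region features
        q = self.q(audio).unsqueeze(2)                                       # (B, T, 1, d)
        k, v = self.k(visual), self.v(visual)                                # (B, T, R, d)
        attn = torch.softmax((q * k).sum(-1) / k.shape[-1] ** 0.5, dim=-1)   # (B, T, R)
        return (attn.unsqueeze(-1) * v).sum(dim=2)                           # (B, T, d) attended visual

m = AudioGuidedVisualAttention(a_dim=128, v_dim=512)
out = m(torch.randn(2, 10, 128), torch.randn(2, 10, 49, 512))
```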
Pub Date: 2025-10-24 | DOI: 10.1016/j.cviu.2025.104552
Title: Channel-aware feature mining network for Visible–Infrared Person Re-identification
Pengxia Li, Zhonghao Du, Linhui Zhang, Yanyi Lv, Yujie Liu
Visible–Infrared Person Re-identification (VI-ReID) aims to match the identities of pedestrians captured by non-overlapping cameras in both visible and infrared modalities. The key to overcoming the VI-ReID challenge lies in extracting diverse modality-shared features. Current methods mainly focus on channel-level operations during data preprocessing, with the aim of expanding the dataset. However, these methods often overlook the complex relationships among channel features, leading to insufficient utilization of unique information in each channel. To address this issue, we propose the Channel-Aware Feature Mining Network (CAFMNet) to improve VI-ReID effectiveness. Specifically, we design three core modules: a Channel-Level Feature Optimization (CLFO) module, which captures channel-level key features for identity recognition and directly extracts identity-relevant information at the channel level; a Channel-Level Feature Refinement (CLFR) module, which enhances channel-level features while retaining useful information—addressing the irrelevant content in initially extracted features; a Multi-Dimensional Feature Optimization (MDFO) module, which comprehensively processes multi-dimensional feature information to enhance the model’s ability to understand and describe input data. Extensive experiments on the SYSU-MM01 and LLCM datasets demonstrate that our CAFMNet outperforms existing approaches in terms of VI-ReID effectiveness. The code is available at https://github.com/cobeibei/CAFMNet-1.
Computer Vision and Image Understanding, Volume 262, Article 104552.
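The channel-level optimization and refinement idea above can be illustrated with a generic channel-recalibration block that weights channels using pooled descriptors; this CBAM-style sketch is only an analogy and not the CLFO/CLFR design from the paper.

```python
# Minimal sketch of channel-level feature recalibration: per-channel weights
# computed from average- and max-pooled descriptors (CBAM-style analogy).
import torch
import torch.nn as nn

class ChannelRecalibration(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        # x: (B, C, H, W) backbone features from either modality (visible or infrared)
        avg = x.mean(dim=(2, 3))                  # (B, C) average-pooled descriptor
        mx = x.amax(dim=(2, 3))                   # (B, C) max-pooled descriptor
        w = torch.sigmoid(self.mlp(avg) + self.mlp(mx))
        return x * w.unsqueeze(-1).unsqueeze(-1)  # emphasize identity-relevant channels

feat = torch.randn(2, 256, 24, 12)
out = ChannelRecalibration(256)(feat)
```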
Pub Date: 2025-10-24 | DOI: 10.1016/j.cviu.2025.104550
Title: MSFENet: Multi-Scale Filter-Enhanced Network architecture for digital image forgery trace localization
Min Mao, Ge Jiao, Wanhui Gao, Jixun Ye
With the rapid advancement of image editing technologies, forensic analysis for detecting malicious image manipulations has become a critical research topic. While existing deep learning-based forgery localization methods have demonstrated promising results, they face three fundamental limitations: (1) heavy reliance on large-scale annotated datasets, (2) computationally intensive training processes, and (3) insufficient capability in capturing diverse forgery traces. To address these challenges, we present MSFENet (Multi-Scale Filter-Enhanced Network), a novel framework that synergistically integrates multiple forensic filters for comprehensive forgery detection. Our approach introduces three key innovations: First, we employ a multi-filter feature extraction module that combines NoisePrint++, SRM, and Bayar Conv to capture complementary forensic traces, including noise patterns, texture inconsistencies, and boundary artifacts. Second, we introduce a dual-branch multi-scale encoder that effectively preserves both local and global manipulation characteristics. Third, we design two novel components: the Coordinate Attention-based Cross-modal Feature Rectification (CAFR) module, which adaptively recalibrates feature representations across different modalities and learns the complementary properties of different extracted features, and the Multi-Scale Selective Fusion (MSF) module, which intelligently integrates discriminative features while suppressing redundant information. Extensive experiments on six benchmark datasets demonstrate the superiority of MSFENet. Our method achieves state-of-the-art performance, with F1-score improvements of 6.36%, 0.84%, 6.22%, and 48.8% on Casiav1, COVER, IMD20, and DSO-1, respectively, compared to existing methods.
Computer Vision and Image Understanding, Volume 262, Article 104550.
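One of the forensic filters listed above, Bayar Conv, is commonly implemented as a constrained convolution whose centre weight is fixed to -1 while the remaining weights are renormalized to sum to 1, turning the layer into a learnable prediction-error (noise residual) filter. A minimal sketch follows; the kernel size and channel counts are assumptions, not the paper's configuration.

```python
# Minimal sketch of a Bayar-style constrained convolution for extracting
# noise-residual (boundary-artifact-sensitive) features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayarConv2d(nn.Module):
    def __init__(self, in_ch: int = 3, out_ch: int = 3, k: int = 5):
        super().__init__()
        self.k = k
        # Positive random init keeps the per-kernel normalization well-conditioned here.
        self.weight = nn.Parameter(torch.rand(out_ch, in_ch, k, k) * 1e-2)

    def forward(self, x):
        w = self.weight.clone()
        c = self.k // 2
        w[:, :, c, c] = 0.0                                    # zero out the centre tap
        w = w / w.sum(dim=(2, 3), keepdim=True)                # off-centre weights sum to 1
        w[:, :, c, c] = -1.0                                   # fix centre to -1 (Bayar constraint)
        return F.conv2d(x, w, padding=c)

residual = BayarConv2d()(torch.randn(1, 3, 256, 256))          # noise-residual feature map
```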
Pub Date: 2025-10-24 | DOI: 10.1016/j.cviu.2025.104553
Title: PlanarTrack: A high-quality and challenging benchmark for large-scale planar object tracking
Yifan Jiao, Xinran Liu, Xiaoqiong Liu, Xiaohui Yuan, Heng Fan, Libo Zhang
Planar tracking has drawn increasing interest owing to its key roles in robotics and augmented reality. Despite great recent advances, further development of planar tracking, particularly in the deep learning era, is largely limited compared to generic tracking due to the lack of large-scale platforms. To mitigate this, we propose PlanarTrack, a large-scale, high-quality and challenging benchmark for planar tracking. Specifically, PlanarTrack consists of 1150 sequences with over 733K frames, including 1000 short-term and 150 new long-term videos, which enables comprehensive evaluation of short- and long-term tracking performance. All videos in PlanarTrack are recorded in unconstrained conditions in the wild, which makes PlanarTrack challenging but more realistic for real-world applications. To ensure high-quality annotations, each video frame is manually annotated with four corner points through multi-round meticulous inspection and refinement. To enhance target diversity, we capture only a unique target in each sequence, which differs from existing benchmarks. To the best of our knowledge, PlanarTrack is by far the largest, most diverse, and most challenging dataset dedicated to planar tracking. To understand the performance of existing methods on PlanarTrack and to provide a comparison for future research, we evaluate 10 representative planar trackers with extensive comparison and in-depth analysis. Our evaluation reveals that, unsurprisingly, the top planar trackers degrade heavily on the challenging PlanarTrack, which indicates that more effort is required to improve planar tracking. Moreover, we derive a variant named PlanarTrackBB from PlanarTrack for generic tracking. Evaluation with 15 generic trackers shows that, surprisingly, our PlanarTrackBB is even more challenging than several popular generic tracking benchmarks, and more attention should be paid to dealing with planar targets, even though they are rigid. Our data and results will be released at https://github.com/HengLan/PlanarTrack.
Computer Vision and Image Understanding, Volume 262, Article 104553.
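Since every frame is annotated with four corner points, planar trackers are typically scored with corner-based errors. The sketch below shows one such measure (mean corner distance per frame and the fraction of frames under a pixel threshold) for illustration; PlanarTrack's official evaluation protocol may define its metrics differently.

```python
# Minimal sketch of a corner-based evaluation for planar tracking.
import numpy as np

def corner_error(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    # pred, gt: (N_frames, 4, 2) arrays of corner coordinates in pixels
    return np.linalg.norm(pred - gt, axis=-1).mean(axis=-1)    # (N_frames,) mean corner distance

def precision_at(pred: np.ndarray, gt: np.ndarray, thr: float = 5.0) -> float:
    # Fraction of frames whose mean corner distance is within the threshold.
    return float((corner_error(pred, gt) <= thr).mean())

gt = np.tile(np.array([[0, 0], [100, 0], [100, 60], [0, 60]], float), (50, 1, 1))
pred = gt + np.random.randn(50, 4, 2) * 3.0                    # jittered predicted corners
print(precision_at(pred, gt, thr=5.0))
```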
Pub Date: 2025-10-24 | DOI: 10.1016/j.cviu.2025.104554
Title: DiffuseDoc: Document geometric rectification via diffusion model
Wenfei Xiong, Huabing Zhou, Yanduo Zhang, Tao Lu, Jiayi Ma
Document images captured by sensors often suffer from intricate geometric distortions, hindering readability and impeding downstream document analysis tasks. While deep learning-based methods for document geometric rectification have shown promising results, their training heavily relies on high quality ground truth for the mapping field, resulting in challenging and expensive dataset creation. To address this issue, we propose DiffuseDoc, a novel framework for document image geometric rectification based on the diffusion model. Unlike existing methods, the training process of DiffuseDoc only requires pairs of distorted and distortion-free images, eliminating the need for ground truth mapping field supervision. Specifically, DiffuseDoc consists of two primary components: the geometric rectification module and the conditional diffusion module. By jointly training the two components, the rectification results are optimized while simultaneously learning the latent feature distribution of the distortion-free image. Also, we contribute the DocReal dataset, comprising document images captured by diverse high-resolution sensors in real-world scenarios, alongside their corresponding scanned versions. Extensive evaluations demonstrate that DiffuseDoc achieves state-of-the-art performance on both the Doc-U-Net benchmark and DocReal datasets.
Computer Vision and Image Understanding, Volume 262, Article 104554.
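Rectification methods of this kind ultimately apply a dense backward mapping field to the distorted image. The sketch below shows that warping step with `grid_sample`, using an identity field as a placeholder; the field produced by DiffuseDoc's geometric rectification module is learned, and its exact parameterization is not specified here.

```python
# Minimal sketch of applying a backward mapping field to a distorted document
# image (the identity field below is a placeholder, not a learned one).
import torch
import torch.nn.functional as F

def rectify(image: torch.Tensor, mapping_field: torch.Tensor) -> torch.Tensor:
    # image: (B, 3, H, W) distorted input; mapping_field: (B, H, W, 2) sampling
    # coordinates normalized to [-1, 1] as (x, y), as expected by grid_sample.
    return F.grid_sample(image, mapping_field, mode="bilinear", align_corners=True)

B, H, W = 1, 128, 96
ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
identity_field = torch.stack([xs, ys], dim=-1).unsqueeze(0)    # (1, H, W, 2) identity mapping
out = rectify(torch.rand(B, 3, H, W), identity_field)          # unchanged under the identity field
```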
Pub Date: 2025-10-22 | DOI: 10.1016/j.cviu.2025.104522
Title: MOSAIC: A multi-view 2.5D organ slice selector with cross-attentional reasoning for anatomically-aware CT localization in medical organ segmentation
Hania Ghouse, Muzammil Behzad
Efficient and accurate multi-organ segmentation from abdominal CT volumes is a fundamental challenge in medical image analysis. Existing 3D segmentation approaches are computationally and memory intensive, often processing entire volumes that contain many anatomically irrelevant slices. Meanwhile, 2D methods suffer from class imbalance and lack cross-view contextual awareness. To address these limitations, we propose a novel, anatomically-aware slice selector pipeline that reduces the input volume prior to segmentation. Our unified framework introduces a vision-language model (VLM) for cross-view organ presence detection using fused tri-slice (2.5D) representations from axial, sagittal, and coronal planes. Our proposed model acts as an “expert” in anatomical localization, reasoning over multi-view representations to selectively retain slices with high structural relevance. This enables spatially consistent filtering across orientations while preserving contextual cues. More importantly, since standard segmentation metrics such as Dice or IoU fail to measure the spatial precision of such slice selection, we introduce a novel metric, Slice Localization Concordance (SLC), which jointly captures anatomical coverage and spatial alignment with organ-centric reference slices. Unlike segmentation-specific metrics, SLC provides a model-agnostic evaluation of localization fidelity. Our model achieves substantial gains over several baselines across all organs, demonstrating accurate and reliable organ-focused slice filtering. These results show that our method enables efficient and spatially consistent organ filtering, thereby significantly reducing downstream segmentation cost while maintaining high anatomical fidelity.
Computer Vision and Image Understanding, Volume 262, Article 104522.
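A fused tri-slice (2.5D) representation, as described above, can be built by taking one axial, one coronal, and one sagittal slice, resizing them to a common resolution, and stacking them as channels. The sketch below is an illustration under those assumptions; the slice indices and the fusion-by-stacking choice are not taken from the paper.

```python
# Minimal sketch of building a fused tri-slice (2.5D) input from a CT volume.
import torch
import torch.nn.functional as F

def tri_slice(volume: torch.Tensor, idx: tuple, size: int = 256) -> torch.Tensor:
    # volume: (D, H, W) CT volume; idx: (axial, coronal, sagittal) slice indices (assumed)
    axial = volume[idx[0], :, :]        # (H, W)
    coronal = volume[:, idx[1], :]      # (D, W)
    sagittal = volume[:, :, idx[2]]     # (D, H)
    views = [F.interpolate(v[None, None], size=(size, size), mode="bilinear",
                           align_corners=False)[0, 0] for v in (axial, coronal, sagittal)]
    return torch.stack(views, dim=0)    # (3, size, size) fused 2.5D input

ct = torch.rand(120, 512, 512)
x = tri_slice(ct, idx=(60, 256, 256))
print(x.shape)                           # torch.Size([3, 256, 256])
```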