
Latest publications in Computer Vision and Image Understanding

Enhancing action recognition by leveraging the hierarchical structure of actions and textual context
IF 3.5 | CAS Tier 3, Computer Science | Q2 Computer Science, Artificial Intelligence | Pub Date: 2025-10-30 | DOI: 10.1016/j.cviu.2025.104560
Manuel Benavent-Lledo, David Mulero-Pérez, David Ortiz-Perez, Jose Garcia-Rodriguez, Antonis Argyros
We propose a novel approach to improve action recognition by exploiting the hierarchical organization of actions and by incorporating contextualized textual information, including location and previous actions, to reflect the action’s temporal context. To achieve this, we introduce a transformer architecture tailored for action recognition that employs both visual and textual features. Visual features are obtained from RGB and optical flow data, while text embeddings represent contextual information. Furthermore, we define a joint loss function to simultaneously train the model for both coarse- and fine-grained action recognition, effectively exploiting the hierarchical nature of actions. To demonstrate the effectiveness of our method, we extend the Toyota Smarthome Untrimmed (TSU) dataset by incorporating action hierarchies, resulting in the Hierarchical TSU dataset, a hierarchical dataset designed for monitoring activities of the elderly in home environments. An ablation study assesses the performance impact of different strategies for integrating contextual and hierarchical data. Experimental results demonstrate that the proposed method consistently outperforms SOTA methods on the Hierarchical TSU dataset, Assembly101 and IkeaASM, achieving over a 17% improvement in top-1 accuracy.
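To make the joint objective concrete, here is a minimal sketch of how a combined coarse- and fine-grained loss could be written in PyTorch. The weighting factor lambda_coarse and the fine_to_coarse label mapping are illustrative assumptions, not values taken from the paper.

```python
# Hedged sketch of a joint coarse/fine loss; lambda_coarse and fine_to_coarse
# are illustrative assumptions, not the authors' settings.
import torch
import torch.nn.functional as F

def joint_hierarchical_loss(fine_logits, coarse_logits, fine_labels,
                            fine_to_coarse, lambda_coarse=0.5):
    # Derive the coarse label of each sample from its fine-grained label.
    coarse_labels = fine_to_coarse[fine_labels]
    loss_fine = F.cross_entropy(fine_logits, fine_labels)
    loss_coarse = F.cross_entropy(coarse_logits, coarse_labels)
    return loss_fine + lambda_coarse * loss_coarse

# Toy usage: 8 samples, 20 fine-grained classes grouped into 5 coarse classes.
fine_to_coarse = torch.randint(0, 5, (20,))
loss = joint_hierarchical_loss(torch.randn(8, 20), torch.randn(8, 5),
                               torch.randint(0, 20, (8,)), fine_to_coarse)
```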
Citations: 0
Real-time fusion of stereo vision and hyperspectral imaging for objective decision support during surgery
IF 3.5 | CAS Tier 3, Computer Science | Q2 Computer Science, Artificial Intelligence | Pub Date: 2025-10-29 | DOI: 10.1016/j.cviu.2025.104541
Eric L. Wisotzky, Jost Triller, Michael Knoke, Brigitta Globke, Anna Hilsmann, Peter Eisert
We present a real-time stereo hyperspectral imaging (stereo-HSI) system for intraoperative tissue and organ analysis that integrates multispectral snapshot imaging with stereo vision to support clinical decision-making. The system visualizes both RGB and high-dimensional spectral data while simultaneously reconstructing 3D surfaces, offering a compact, non-contact solution for seamless integration into surgical workflows. A modular processing pipeline enables robust demosaicing, spectral and spatial fusion, and pixel-wise medical assessment, including perfusion and tissue classification. Our spectral warping algorithm leverages a custom learned mapping, our white-balance network method is the first for snapshot MSI cameras, and our fusion CNN employs spectral-attention modules to exploit the rich hyperspectral domain. Clinical feasibility was demonstrated in 57 surgical procedures, including kidney transplantation, parotidectomy, and neck dissection, achieving high spatial and spectral resolution under standard surgical lighting conditions. The system enables real-time visualization of oxygenation and tissue composition, offering surgeons a novel tool for image-guided interventions. This study establishes the stereo-HSI platform as a clinically viable and effective method for enhancing intraoperative insight and surgical precision.
Citations: 0
SCENE-Net: Geometric induction for interpretable and low-resource 3D pole detection with Group-Equivariant Non-Expansive Operators
IF 3.5 | CAS Tier 3, Computer Science | Q2 Computer Science, Artificial Intelligence | Pub Date: 2025-10-27 | DOI: 10.1016/j.cviu.2025.104531
Diogo Lavado, Alessandra Micheletti, Giovanni Bocchi, Patrizio Frosini, Cláudia Soares
This paper introduces SCENE-Net, a novel low-resource, white-box model that serves as a compelling proof-of-concept for 3D point cloud segmentation. At its core, SCENE-Net employs Group Equivariant Non-Expansive Operators (GENEOs), a mechanism that leverages geometric priors for enhanced object identification. Our contribution extends the theoretical landscape of geometric learning, highlighting the utility of geometric observers as intrinsic biases in analyzing 3D environments. Through empirical testing and efficiency analysis, we demonstrate the performance of SCENE-Net in detecting power line supporting towers, a key application in forest fire prevention. Our results showcase the superior accuracy and resilience of our model to label noise, achieved with minimal computational resources (this instantiation of SCENE-Net has only eleven trainable parameters), thereby marking a significant step forward in trustworthy machine learning applied to 3D scene understanding. Our code is available at https://github.com/dlavado/scene-net.
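As a rough intuition for what a GENEO is, the toy operator below is translation-equivariant (it is a convolution) and non-expansive in the sup norm (its kernel's absolute weights are normalised to sum to at most 1). It only illustrates these two defining properties; SCENE-Net's actual operators act on 3D point clouds and encode a cylindrical pole prior, which is not reproduced here.

```python
# Toy illustration of the two GENEO properties (equivariance + non-expansiveness);
# not SCENE-Net's operator, which works on 3D point clouds with a pole prior.
import numpy as np
from scipy.ndimage import convolve

def toy_translation_geneo(signal, kernel):
    # Normalising the kernel to L1 norm <= 1 makes the operator 1-Lipschitz
    # (non-expansive) in the sup norm; convolution gives translation equivariance.
    kernel = kernel / max(np.abs(kernel).sum(), 1.0)
    return convolve(signal, kernel, mode='nearest')

signal = np.sin(np.linspace(0.0, 6.28, 100))
smoothed = toy_translation_geneo(signal, np.ones(5))
```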
Citations: 0
RefineHOS: A high-performance hand–object segmentation with fine-grained spatial features
IF 3.5 | CAS Tier 3, Computer Science | Q2 Computer Science, Artificial Intelligence | Pub Date: 2025-10-26 | DOI: 10.1016/j.cviu.2025.104548
Wenrun Wang, Jianwu Dang, Yangping Wang, Rui Pan
Accurate segmentation of hands and interacting objects is a critical challenge in computer vision, owing to complex factors such as mutual occlusion, finger self-similarity, and high hand movement flexibility. To tackle these issues, we present RefineHOS, an innovative framework for precise pixel-level segmentation of hands and interacting objects. Based on RefineMask, RefineHOS features substantial architectural optimizations for hand–object interaction scenarios. Specifically, we design an Augmentation Feature Pyramid Path module (AFPN) integrated with a Dual Attention module (DAM) in the backbone to capture multi-scale feature information of hands and interacting objects. Additionally, we enhance segmentation performance by introducing a Triplet Attention module (TAM) to optimize both the mask head and the semantic head. We also present a new Boundary Refinement Module (BRM), which uses an iterative subdivision approach to sharpen boundary details in the segmentation results. Extensive experiments on multiple benchmark datasets (including VISOR, Ego-HOS, and ENIGMA-51) show that our method achieves state-of-the-art performance. To comprehensively evaluate segmentation quality, we introduce Boundary Average Precision (Boundary AP) as a key metric to complement existing benchmark segmentation metrics.
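The Triplet Attention idea the paper builds on is a publicly documented attention mechanism; the sketch below is a generic PyTorch rendering of that module (three rotated branches over a channel-pooled map), offered only as an illustration and not as the exact TAM configuration used in RefineHOS.

```python
# Generic Triplet Attention sketch; RefineHOS's exact TAM configuration may differ.
import torch
import torch.nn as nn

class ZPool(nn.Module):
    # Stack max- and mean-pooled maps along dim 1: (B, C, H, W) -> (B, 2, H, W).
    def forward(self, x):
        return torch.cat([x.max(dim=1, keepdim=True)[0],
                          x.mean(dim=1, keepdim=True)], dim=1)

class AttentionGate(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.pool = ZPool()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        return x * torch.sigmoid(self.conv(self.pool(x)))

class TripletAttention(nn.Module):
    # Three branches capture (C,W), (C,H) and (H,W) interactions via permutation.
    def __init__(self):
        super().__init__()
        self.branch_ch = AttentionGate()
        self.branch_cw = AttentionGate()
        self.branch_hw = AttentionGate()

    def forward(self, x):                                    # x: (B, C, H, W)
        x_ch = self.branch_ch(x.permute(0, 2, 1, 3)).permute(0, 2, 1, 3)
        x_cw = self.branch_cw(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)
        x_hw = self.branch_hw(x)
        return (x_ch + x_cw + x_hw) / 3.0

out = TripletAttention()(torch.randn(2, 64, 32, 32))         # same shape as input
```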
Citations: 0
SCESS-Net: Semantic consistency enhancement and segment selection network for audio–visual event localization
IF 3.5 | CAS Tier 3, Computer Science | Q2 Computer Science, Artificial Intelligence | Pub Date: 2025-10-24 | DOI: 10.1016/j.cviu.2025.104551
Jichen Gao, Suiping Zhou, Hang Yu, Chenyang Li, Xiaoxi Hu
As a central task in multi-modal learning, audio–visual event localization seeks to identify event information that is consistent across visual and audio segments and to classify event categories. Existing works often overlook the influence of visual features on audio features and the loss of segment information incurred when selecting semantically consistent segments. To address these issues, we introduce a network that enhances the multi-task learning performance of visual–audio modalities and resolves the semantic inconsistency present in audio–visual segments by employing bi-directional collaborative guided attention and semantic consistency enhancement. First, we introduce a bi-directional collaborative guided attention module, which integrates multi-modal linear pooling and spatial-channel attention to bolster the semantic information of both audio and visual features across the visual-guided audio attention and audio-guided visual attention pathways. Second, we propose an innovative multi-modal similarity learning model that addresses the information loss caused by filtering out low-similarity segments, a common problem in existing approaches. By incorporating multi-modal feature random masking, this model is capable of learning robust audio–visual relationships. Finally, we capture global semantic information across the entire video in the temporal dimension, and enhance the semantic consistency of events by using the differential semantics between global semantics and audio–visual segment semantics. Experimental results on the AVE dataset indicate that our network achieves superior performance.
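A minimal sketch of bi-directional guided attention between audio and visual token sequences, written with PyTorch's standard multi-head attention. It stands in for the general idea only; the dimensions, head count, and residual connections are assumptions, and the paper's multi-modal linear pooling and spatial-channel attention are omitted.

```python
# Minimal bidirectional guided attention sketch; the paper's module additionally
# uses multi-modal linear pooling and spatial-channel attention.
import torch
import torch.nn as nn

class BiGuidedAttention(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.audio_guides_visual = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.visual_guides_audio = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio, visual):            # both (B, T, dim)
        a_out, _ = self.visual_guides_audio(audio, visual, visual)   # visual-guided audio
        v_out, _ = self.audio_guides_visual(visual, audio, audio)    # audio-guided visual
        return audio + a_out, visual + v_out     # residual connections

audio, visual = torch.randn(2, 10, 256), torch.randn(2, 10, 256)
audio_enh, visual_enh = BiGuidedAttention()(audio, visual)
```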
Citations: 0
Channel-aware feature mining network for Visible–Infrared Person Re-identification
IF 3.5 | CAS Tier 3, Computer Science | Q2 Computer Science, Artificial Intelligence | Pub Date: 2025-10-24 | DOI: 10.1016/j.cviu.2025.104552
Pengxia Li, Zhonghao Du, Linhui Zhang, Yanyi Lv, Yujie Liu
Visible–Infrared Person Re-identification (VI-ReID) aims to match the identities of pedestrians captured by non-overlapping cameras in both visible and infrared modalities. The key to overcoming the VI-ReID challenge lies in extracting diverse modality-shared features. Current methods mainly focus on channel-level operations during data preprocessing, with the aim of expanding the dataset. However, these methods often overlook the complex relationships among channel features, leading to insufficient utilization of unique information in each channel. To address this issue, we propose the Channel-Aware Feature Mining Network (CAFMNet) to improve VI-ReID effectiveness. Specifically, we design three core modules: a Channel-Level Feature Optimization (CLFO) module, which captures channel-level key features for identity recognition and directly extracts identity-relevant information at the channel level; a Channel-Level Feature Refinement (CLFR) module, which enhances channel-level features while retaining useful information—addressing the irrelevant content in initially extracted features; a Multi-Dimensional Feature Optimization (MDFO) module, which comprehensively processes multi-dimensional feature information to enhance the model’s ability to understand and describe input data. Extensive experiments on the SYSU-MM01 and LLCM datasets demonstrate that our CAFMNet outperforms existing approaches in terms of VI-ReID effectiveness. The code is available at https://github.com/cobeibei/CAFMNet-1.
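For readers unfamiliar with channel-level recalibration, the generic squeeze-and-excitation-style gate below illustrates what weighting individual channels looks like. It is only an illustration of the underlying idea; it is not the paper's CLFO, CLFR, or MDFO module, and the reduction ratio is an assumption.

```python
# Generic SE-style channel gate for illustration; not CAFMNet's actual modules.
import torch
import torch.nn as nn

class ChannelGate(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                        # x: (B, C, H, W)
        weights = self.fc(x.mean(dim=(2, 3)))    # global average pool -> (B, C)
        return x * weights[:, :, None, None]     # reweight each channel

out = ChannelGate(64)(torch.randn(2, 64, 16, 16))
```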
Citations: 0
MSFENet: Multi-Scale Filter-Enhanced Network architecture for digital image forgery trace localization
IF 3.5 | CAS Tier 3, Computer Science | Q2 Computer Science, Artificial Intelligence | Pub Date: 2025-10-24 | DOI: 10.1016/j.cviu.2025.104550
Min Mao, Ge Jiao, Wanhui Gao, Jixun Ye
With the rapid advancement of image editing technologies, forensic analysis for detecting malicious image manipulations has become a critical research topic. While existing deep learning-based forgery localization methods have demonstrated promising results, they face three fundamental limitations: (1) heavy reliance on large-scale annotated datasets, (2) computationally intensive training processes, and (3) insufficient capability in capturing diverse forgery traces. To address these challenges, we present MSFENet (Multi-Scale Filter-Enhanced Network), a novel framework that synergistically integrates multiple forensic filters for comprehensive forgery detection. Our approach introduces three key innovations: First, we employ a multi-filter feature extraction module that combines NoisePrint++, SRM, and Bayar Conv to capture complementary forensic traces, including noise patterns, texture inconsistencies, and boundary artifacts. Second, we introduce a dual-branch multi-scale encoder that effectively preserves both local and global manipulation characteristics. Third, we design two novel components: the Coordinate Attention-based Cross-modal Feature Rectification (CAFR) module, which adaptively recalibrates feature representations across different modalities and learns the complementary properties of different extracted features, and the Multi-Scale Selective Fusion (MSF) module, which intelligently integrates discriminative features while suppressing redundant information. Extensive experiments on six benchmark datasets demonstrate the superiority of MSFENet. Our method achieves state-of-the-art performance, with F1-score improvements of 6.36%, 0.84%, 6.22%, and 48.8% on Casiav1, COVER, IMD20, and DSO-1, respectively, compared to existing methods.
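Among the forensic filters mentioned, the Bayar-style constrained convolution is a well-known building block; the sketch below shows one common way to enforce its constraint (centre tap fixed to -1, remaining taps summing to 1), without claiming it matches MSFENet's implementation.

```python
# Bayar-style constrained convolution sketch; MSFENet's implementation may differ.
import torch
import torch.nn as nn

class BayarConv2d(nn.Conv2d):
    def _apply_constraint(self):
        with torch.no_grad():
            w = self.weight                                  # (out, in, k, k)
            c = w.shape[-1] // 2
            w[:, :, c, c] = 0.0
            w /= w.sum(dim=(2, 3), keepdim=True) + 1e-8      # off-centre taps sum to ~1
            w[:, :, c, c] = -1.0                             # centre tap fixed to -1

    def forward(self, x):
        self._apply_constraint()                             # re-impose constraint each step
        return super().forward(x)

residual = BayarConv2d(3, 8, kernel_size=5, padding=2)(torch.randn(1, 3, 64, 64))
```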
Citations: 0
PlanarTrack: A high-quality and challenging benchmark for large-scale planar object tracking
IF 3.5 | CAS Tier 3, Computer Science | Q2 Computer Science, Artificial Intelligence | Pub Date: 2025-10-24 | DOI: 10.1016/j.cviu.2025.104553
Yifan Jiao, Xinran Liu, Xiaoqiong Liu, Xiaohui Yuan, Heng Fan, Libo Zhang
Planar tracking has drawn increasing interest owing to its key roles in robotics and augmented reality. Despite recent great advancement, further development of planar tracking, particularly in the deep learning era, remains largely limited compared to generic tracking due to the lack of large-scale platforms. To mitigate this, we propose PlanarTrack, a large-scale, high-quality, and challenging benchmark for planar tracking. Specifically, PlanarTrack consists of 1150 sequences with over 733K frames, including 1000 short-term and 150 new long-term videos, which enables comprehensive evaluation of short- and long-term tracking performance. All videos in PlanarTrack are recorded in unconstrained conditions in the wild, which makes PlanarTrack challenging but more realistic for real-world applications. To ensure high-quality annotations, each video frame is manually annotated with four corner points, followed by multi-round meticulous inspection and refinement. To enhance target diversity of PlanarTrack, we capture only a unique target in each sequence, which differs from existing benchmarks. To the best of our knowledge, PlanarTrack is by far the largest, most diverse, and most challenging dataset dedicated to planar tracking. To understand the performance of existing methods on PlanarTrack and to provide a comparison for future research, we evaluate 10 representative planar trackers with extensive comparison and in-depth analysis. Our evaluation reveals that, unsurprisingly, the top planar trackers heavily degrade on the challenging PlanarTrack, which indicates more effort is required to improve planar tracking. Moreover, we derive a variant named PlanarTrackBB from PlanarTrack for generic tracking. Evaluation with 15 generic trackers shows that, surprisingly, our PlanarTrackBB is even more challenging than several popular generic tracking benchmarks, and more attention should be paid to dealing with planar targets, though they are rigid. Our data and results will be released at https://github.com/HengLan/PlanarTrack.
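Since targets are annotated with four corner points, evaluation typically reduces to a corner-distance error per frame. The snippet below sketches one plausible way to turn four-corner annotations into a precision curve; it is an assumption for illustration, not the benchmark's official evaluation protocol.

```python
# Plausible corner-based evaluation sketch; not necessarily PlanarTrack's
# official protocol.
import numpy as np

def corner_error(pred, gt):
    # pred, gt: (n_frames, 4, 2) corner coordinates in pixels.
    return np.linalg.norm(pred - gt, axis=-1).mean(axis=-1)   # (n_frames,)

def precision_curve(pred, gt, thresholds=np.arange(1, 51)):
    err = corner_error(pred, gt)
    return np.array([(err <= t).mean() for t in thresholds])  # fraction of frames per threshold

gt = np.random.rand(100, 4, 2) * 500                          # synthetic ground-truth corners
pred = gt + np.random.randn(100, 4, 2) * 5                    # noisy predictions
curve = precision_curve(pred, gt)
```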
Citations: 0
DiffuseDoc: Document geometric rectification via diffusion model
IF 3.5 | CAS Tier 3, Computer Science | Q2 Computer Science, Artificial Intelligence | Pub Date: 2025-10-24 | DOI: 10.1016/j.cviu.2025.104554
Wenfei Xiong, Huabing Zhou, Yanduo Zhang, Tao Lu, Jiayi Ma
Document images captured by sensors often suffer from intricate geometric distortions, hindering readability and impeding downstream document analysis tasks. While deep learning-based methods for document geometric rectification have shown promising results, their training heavily relies on high-quality ground truth for the mapping field, making dataset creation challenging and expensive. To address this issue, we propose DiffuseDoc, a novel framework for document image geometric rectification based on the diffusion model. Unlike existing methods, the training process of DiffuseDoc requires only pairs of distorted and distortion-free images, eliminating the need for ground-truth mapping-field supervision. Specifically, DiffuseDoc consists of two primary components: the geometric rectification module and the conditional diffusion module. By jointly training the two components, the rectification results are optimized while simultaneously learning the latent feature distribution of the distortion-free image. We also contribute the DocReal dataset, comprising document images captured by diverse high-resolution sensors in real-world scenarios, alongside their corresponding scanned versions. Extensive evaluations demonstrate that DiffuseDoc achieves state-of-the-art performance on both the Doc-U-Net benchmark and DocReal datasets.
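Geometric rectification of this kind is usually applied as a backward warp: a network predicts a sampling grid and the distorted image is resampled with it. The snippet below shows only that generic warping step with PyTorch's grid_sample; the modules that actually predict the mapping in DiffuseDoc are not reproduced here, and the identity-grid example is purely illustrative.

```python
# Generic backward-warping step used in document rectification pipelines;
# the mapping itself would come from the rectification network.
import torch
import torch.nn.functional as F

def rectify(distorted, backward_map):
    # distorted: (B, 3, H, W); backward_map: (B, H, W, 2) with normalised
    # (x, y) sampling coordinates in [-1, 1], as grid_sample expects.
    return F.grid_sample(distorted, backward_map, mode='bilinear',
                         padding_mode='border', align_corners=True)

# Identity mapping example: the output equals the input.
B, H, W = 1, 64, 48
ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing='ij')
grid = torch.stack([xs, ys], dim=-1).unsqueeze(0)            # (1, H, W, 2)
flat = rectify(torch.rand(B, 3, H, W), grid)
```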
Citations: 0
MOSAIC: A multi-view 2.5D organ slice selector with cross-attentional reasoning for anatomically-aware CT localization in medical organ segmentation
IF 3.5 | CAS Tier 3, Computer Science | Q2 Computer Science, Artificial Intelligence | Pub Date: 2025-10-22 | DOI: 10.1016/j.cviu.2025.104522
Hania Ghouse, Muzammil Behzad
Efficient and accurate multi-organ segmentation from abdominal CT volumes is a fundamental challenge in medical image analysis. Existing 3D segmentation approaches are computationally and memory intensive, often processing entire volumes that contain many anatomically irrelevant slices. Meanwhile, 2D methods suffer from class imbalance and lack cross-view contextual awareness. To address these limitations, we propose a novel, anatomically-aware slice selector pipeline that reduces the input volume prior to segmentation. Our unified framework introduces a vision-language model (VLM) for cross-view organ presence detection using fused tri-slice (2.5D) representations from axial, sagittal, and coronal planes. The proposed model acts as an “expert” in anatomical localization, reasoning over multi-view representations to selectively retain slices with high structural relevance. This enables spatially consistent filtering across orientations while preserving contextual cues. More importantly, since standard segmentation metrics such as Dice or IoU fail to measure the spatial precision of such slice selection, we introduce a novel metric, Slice Localization Concordance (SLC), which jointly captures anatomical coverage and spatial alignment with organ-centric reference slices. Unlike segmentation-specific metrics, SLC provides a model-agnostic evaluation of localization fidelity. Our model delivers substantial gains over several baselines across all organs, demonstrating accurate and reliable organ-focused slice filtering. These results show that our method enables efficient and spatially consistent organ filtering, thereby significantly reducing downstream segmentation cost while maintaining high anatomical fidelity.
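To illustrate the tri-slice (2.5D) input described above, the snippet below stacks one axial, one coronal, and one sagittal slice of a CT volume into a three-channel image. It assumes a cubic, isotropic volume so the three slices share a shape, which is a simplification of what the paper's pipeline would need; the function name and indices are purely illustrative.

```python
# Simplified tri-slice (2.5D) construction; assumes a cubic, isotropic volume
# so all three slices share the same shape (real CT data would need resampling).
import numpy as np

def tri_slice(volume, z, y, x):
    axial    = volume[z, :, :]
    coronal  = volume[:, y, :]
    sagittal = volume[:, :, x]
    return np.stack([axial, coronal, sagittal], axis=0)   # (3, H, W)

vol = np.random.rand(128, 128, 128).astype(np.float32)    # synthetic CT volume
fused = tri_slice(vol, 64, 64, 64)                         # shape (3, 128, 128)
```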
Citations: 0