
Computer Vision and Image Understanding: Latest Articles

MFDiff: Diffusion probabilistic model for medical image segmentation with multi-scale features and frequency-aware attention
IF 3.5 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-17 | DOI: 10.1016/j.cviu.2025.104605
Xingli Zhang, Yameng Liu, Haiyang Yu, Zhihui Wang
Medical image segmentation serves as a critical technique in clinical applications such as disease diagnosis, surgical planning, and image-guided therapy, where segmentation accuracy directly impacts the precision of clinical decisions. However, existing methods still face significant challenges in handling inherent issues of medical images, including blurred boundaries, complex multi-scale structures, and difficulties in fine-grained feature representation. To address these challenges, this paper proposes a medical image segmentation method based on a diffusion probabilistic model, MFDiff, which aims to enhance multi-scale contextual awareness and fine-grained structural modeling capabilities. The method incorporates a frequency-aware attention fusion module that effectively strengthens the model’s ability to represent complex structures and ambiguous boundaries. Additionally, a multi-scale feature enhancement module is introduced to expand the receptive field while maintaining low computational cost, thereby improving the extraction and fusion of multi-scale features. Furthermore, an uncertainty-weighted majority voting fusion strategy is proposed to enhance the robustness and consistency of fused predictions from multiple sampling iterations. The proposed method was validated on five medical image segmentation datasets. Experimental results demonstrate that MFDiff outperforms current mainstream methods across all datasets, exhibiting strong generalization ability and robustness.
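To make the fusion step above concrete, the following minimal PyTorch sketch shows one way an uncertainty-weighted majority vote over multiple diffusion sampling runs could look. The entropy-based uncertainty measure, the inverse-uncertainty weights, and the 0.5 thresholds are illustrative assumptions, not the authors' exact formulation.

```python
# Sketch: fuse N segmentation samples from a diffusion model by weighting each
# sample's per-pixel vote with the inverse of its predictive entropy.
import torch

def uncertainty_weighted_vote(prob_maps: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """prob_maps: (N, H, W) foreground probabilities from N sampling iterations."""
    p = prob_maps.clamp(eps, 1 - eps)
    # Per-sample, per-pixel entropy as an uncertainty proxy (assumed measure).
    entropy = -(p * p.log() + (1 - p) * (1 - p).log())       # (N, H, W)
    weights = 1.0 / (entropy + eps)                           # confident samples vote more
    weights = weights / weights.sum(dim=0, keepdim=True)      # normalize over samples
    votes = (p > 0.5).float()                                 # hard vote per sample
    fused = (weights * votes).sum(dim=0)                      # weighted vote share per pixel
    return (fused > 0.5).float()                              # final binary mask

if __name__ == "__main__":
    samples = torch.rand(8, 256, 256)   # 8 diffusion sampling iterations
    mask = uncertainty_weighted_vote(samples)
    print(mask.shape, mask.unique())
```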
Citations: 0
Generalized prompt-driven zero-shot domain adaptive segmentation with feature rectification and semantic modulation
IF 3.5 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-17 | DOI: 10.1016/j.cviu.2025.104615
Jinyi Li, Longyu Yang, Donghyun Kim, Kuniaki Saito, Kate Saenko, Stan Sclaroff, Xiaofeng Zhu, Ping Hu
Recent prompt-driven zero-shot adaptation methods offer a promising way to handle domain shifts in semantic segmentation by learning with features simulated from natural language prompts. However, these methods typically depend on a fixed set of predefined domain descriptions, which limits their capacity to generalize to previously undefined domains and often necessitates retraining when encountering novel environments. To address this challenge, we propose a Generalized Prompt-driven Zero-shot Domain Adaptive Segmentation framework that enables flexible and robust cross-domain segmentation by learning to map target domain features into the source domain space. This allows inference to be performed through a unified and well-optimized source model, without requiring target data-based or prompt-based retraining when encountering novel conditions. Our framework comprises two key modules: a Low-level Feature Rectification (LLFR) module that aligns visual styles using a historical source-style memory bank, and a High-level Semantic Modulation (HLSM) module that applies language-guided affine transformations to align high-level semantics. Together, these modules enable adaptive multi-level feature adaptation that maps target inputs into the source domain space, thus allowing the model to handle unseen domains effectively at test time. Extensive experiments on multiple zero-shot domain adaptation benchmarks are conducted, and the results show that our method consistently outperforms previous approaches.
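The High-level Semantic Modulation module is described as applying language-guided affine transformations to align high-level semantics. The sketch below shows a generic FiLM-style modulation of visual features by a text embedding under that reading; the layer shapes, the (1 + gamma) scaling, and the 512-dimensional text embedding (e.g., from a frozen CLIP text encoder) are assumptions, not the authors' exact design.

```python
# Sketch: a text embedding predicts per-channel scale and shift applied to
# visual features, nudging target-domain features toward the source space.
import torch
import torch.nn as nn

class LanguageGuidedAffine(nn.Module):
    def __init__(self, text_dim: int = 512, feat_channels: int = 256):
        super().__init__()
        self.to_gamma = nn.Linear(text_dim, feat_channels)
        self.to_beta = nn.Linear(text_dim, feat_channels)

    def forward(self, feats: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W); text_emb: (B, text_dim)
        gamma = self.to_gamma(text_emb).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        beta = self.to_beta(text_emb).unsqueeze(-1).unsqueeze(-1)
        return feats * (1 + gamma) + beta  # affine modulation of high-level semantics

if __name__ == "__main__":
    mod = LanguageGuidedAffine()
    out = mod(torch.randn(2, 256, 32, 32), torch.randn(2, 512))
    print(out.shape)
```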
Citations: 0
SGCNet: Silhouette Guided Cascaded Network for multi-modal image fusion
IF 3.5 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-16 | DOI: 10.1016/j.cviu.2025.104603
Yuxuan Wang, Zhongwei Shen, Hui Li, Yuning Zhang, Zhenping Xia
To generate high-quality fused images, it is essential to effectively capture local detail information (e.g., texture) alongside global information (e.g., color blocks). However, conventional fusion techniques often fail to balance local and global information. This imbalance can lead to fused results that excessively favor either infrared or visible-light characteristics, compromising the contrast and detail of the fused image. To tackle this problem, we propose the Silhouette Guided Cascaded Network (SGCNet). The encoder of our method employs a Cascaded Dense Connection structure that integrates CNN- and Transformer-based encoders to extract both local and global features in a compatible manner. In the fusion stage, the silhouettes of the targets are extracted by a pretrained semantic segmentation model that provides global spatial weighting for detailed features, guiding the alignment of features across different modalities. Extensive experiments demonstrate that SGCNet outperforms existing fusion methods across a variety of tasks, including infrared-visible and medical image fusion, highlighting its technological advancements and broad practical application potential.
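One way to read the silhouette-guided fusion stage is as a spatial gating of the two modalities by the target silhouette. The sketch below implements that reading as a per-pixel convex combination; the rule of favoring infrared features inside the silhouette and visible features outside it is an assumption, not SGCNet's exact fusion operator.

```python
# Sketch: a silhouette map (e.g., from a pretrained segmentation model) acts as
# a spatial weight deciding how strongly each modality contributes per pixel.
import torch

def silhouette_guided_fuse(feat_ir: torch.Tensor,
                           feat_vis: torch.Tensor,
                           silhouette: torch.Tensor) -> torch.Tensor:
    """feat_ir, feat_vis: (B, C, H, W); silhouette: (B, 1, H, W) in [0, 1]."""
    # Inside the silhouette, favor infrared features (salient targets);
    # outside, favor visible features (texture and color context).
    return silhouette * feat_ir + (1.0 - silhouette) * feat_vis

if __name__ == "__main__":
    ir = torch.randn(1, 64, 128, 128)
    vis = torch.randn(1, 64, 128, 128)
    sil = torch.rand(1, 1, 128, 128)
    print(silhouette_guided_fuse(ir, vis, sil).shape)
```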
Citations: 0
Spatio-temporal side tuning pre-trained foundation models for video-based pedestrian attribute recognition
IF 3.5 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-16 | DOI: 10.1016/j.cviu.2025.104588
Xiao Wang, Qian Zhu, Jiandong Jin, Jun Zhu, Futian Wang, Bo Jiang, Yaowei Wang, Yonghong Tian
Pedestrian Attribute Recognition (PAR) models based on static images struggle to handle issues such as occlusion and motion blur, and recently proposed video-PAR models have not fully utilized the potential of larger models, resulting in sub-optimal performance. In this work, we propose a video-PAR framework that leverages temporal information by efficiently fine-tuning a multi-modal foundation model. Specifically, we cast video-based PAR as a vision-language fusion task, using CLIP for visual feature extraction and prompt engineering to convert attributes into sentences for text embedding. We introduce a spatiotemporal side-tuning strategy for parameter-efficient optimization and fuse visual and textual tokens via a Transformer for interactive learning. The enhanced tokens are used for final attribute prediction. Experiments on two video-PAR datasets validate the effectiveness of our method. The source code of this paper is available at https://github.com/Event-AHU/OpenPAR.
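Two ingredients of the abstract lend themselves to a short sketch: converting attribute names into sentence prompts for the text encoder, and a lightweight side branch tuned while the frozen backbone features pass through unchanged. The prompt template, the adapter width, and the learned blending gate are illustrative assumptions rather than the paper's exact spatiotemporal side-tuning design.

```python
# Sketch: (i) attribute-to-sentence prompt engineering, (ii) a small trainable
# side branch blended with frozen per-frame tokens from the foundation model.
import torch
import torch.nn as nn

def attributes_to_prompts(attrs):
    # Hypothetical template; the paper's exact wording is not specified here.
    return [f"a photo of a pedestrian who is {a}." for a in attrs]

class SideTuner(nn.Module):
    def __init__(self, dim: int = 768, hidden: int = 128):
        super().__init__()
        self.side = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.alpha = nn.Parameter(torch.tensor(0.0))  # learned blend weight

    def forward(self, frozen_feats: torch.Tensor) -> torch.Tensor:
        # frozen_feats: (B, T, dim) per-frame tokens from the frozen backbone
        gate = torch.sigmoid(self.alpha)
        return gate * frozen_feats + (1 - gate) * self.side(frozen_feats)

if __name__ == "__main__":
    print(attributes_to_prompts(["wearing a hat", "carrying a backpack"]))
    out = SideTuner()(torch.randn(2, 8, 768))
    print(out.shape)
```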
Citations: 0
FreqOR: Frequency-guided sampling initialization with attention enhancements for training-free object repositioning
IF 3.5 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-15 | DOI: 10.1016/j.cviu.2025.104610
Yuanxiang Fang, Jingyue Wang, Meiqing Wang, Shujie Zhang, Huimin Liu
Object repositioning in real images remains a challenging task. Existing approaches are typically built upon the DDIM inversion framework, whose sampling initialization tends to preserve strong layout priors in the latent space, thereby leading to object residuals or ghosting artifacts in the vacated region. Additionally, masking low-resolution self-attention maps often results in boundary misjudgments, which impair the inpainting capability. To address these limitations, we propose FreqOR, a training-free framework that integrates sampling initialization optimization with attention-level enhancements. For sampling initialization, high-frequency components of the inverted latent in the vacated region are suppressed to weaken inherited priors, thereby providing a cleaner sampling initialization. For attention enhancement, we incorporate two complementary strategies. The first is Resolution-Aligned Key–Value Interpolation, which achieves precise regional control by enabling pixel-wise masking of attention maps. The second is Query-Guided Consistency, which preserves the identity and texture consistency of the designated object by reusing inversion queries as priors during sampling. Integrated into the energy-based guidance framework, FreqOR is evaluated on the COCO-130 and VOC-100 datasets. The results demonstrate that it effectively suppresses residuals in the vacated region and enhances object consistency.
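The sampling-initialization step, suppressing high-frequency components of the inverted latent inside the vacated region, can be sketched with a masked low-pass filter in the Fourier domain. The ideal low-pass cutoff of 0.25 and the hard blend between filtered and original latent are assumptions for illustration, not FreqOR's exact filter.

```python
# Sketch: low-pass the inverted latent in frequency space, then paste the
# filtered result back only inside the vacated (masked) region.
import torch

def suppress_high_freq(latent: torch.Tensor, mask: torch.Tensor, cutoff: float = 0.25) -> torch.Tensor:
    """latent: (C, H, W); mask: (1, H, W), 1 marks the vacated region."""
    _, H, W = latent.shape
    fy = torch.fft.fftfreq(H).view(H, 1)
    fx = torch.fft.fftfreq(W).view(1, W)
    radius = (fy ** 2 + fx ** 2).sqrt()
    lowpass = (radius <= cutoff).to(latent.dtype)     # ideal low-pass filter (assumed shape)
    spec = torch.fft.fft2(latent)                     # per-channel 2D FFT
    filtered = torch.fft.ifft2(spec * lowpass).real   # high frequencies removed
    return mask * filtered + (1 - mask) * latent      # only touch the vacated region

if __name__ == "__main__":
    z = torch.randn(4, 64, 64)
    m = torch.zeros(1, 64, 64)
    m[:, 16:48, 16:48] = 1.0
    print(suppress_high_freq(z, m).shape)
```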
Citations: 0
AdaMulti: An adaptive cascaded multi-modal recognition framework for sports action analysis
IF 3.5 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-15 | DOI: 10.1016/j.cviu.2025.104604
Jianwei Li, Rui Cao, Haiqing Hu, Xiaomei Zhao, Pengju Zhang
Computer vision-based sports action analysis has emerged as a pivotal research domain, driving transformative applications in healthcare and sports analytics. While deep learning advancements have significantly improved automatic human action recognition and assessment, existing approaches typically rely exclusively on either RGB video streams or skeletal key points, each presenting unique advantages. RGB data offers rich contextual information and widespread accessibility, whereas skeleton data provides a compact representation ideal for direct pose analysis. To harness the complementary strengths of both modalities, we propose AdaMulti, an adaptive cascaded multi-modal framework for fine-grained human action analysis. Our approach integrates RGB and skeleton data through two key innovations: (1) an intelligent policy network that dynamically selects the optimal modality (RGB or skeleton) for each frame, and (2) a cascaded recognition architecture that effectively fuses multi-modal features. We evaluate AdaMulti on a newly constructed multi-modal dataset derived from our 3D-Yoga project, comprising extensive yoga poses with detailed performance annotations. Experimental results demonstrate that AdaMulti outperforms single-modal methods by 17% and 32% in recognition accuracy. Furthermore, comparative studies on the public NTU-RGB+D 60 benchmark show that our method achieves 0.6% higher accuracy than the state-of-the-art method, validating its effectiveness for complex action analysis tasks.
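The per-frame modality selection can be sketched as a small policy head that scores RGB and skeleton features for every frame and commits to one of them. Using a Gumbel-softmax relaxation to keep the hard choice differentiable, and the 256-dimensional features, are assumptions rather than the authors' exact policy network.

```python
# Sketch: a policy head emits per-frame logits over [RGB, skeleton] and the
# selected modality's features are passed on for recognition.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityPolicy(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.score = nn.Linear(2 * dim, 2)  # logits for [rgb, skeleton]

    def forward(self, rgb_feat: torch.Tensor, skel_feat: torch.Tensor) -> torch.Tensor:
        # rgb_feat, skel_feat: (B, T, dim) per-frame features from each branch
        logits = self.score(torch.cat([rgb_feat, skel_feat], dim=-1))   # (B, T, 2)
        choice = F.gumbel_softmax(logits, tau=1.0, hard=True)           # one-hot per frame
        return choice[..., 0:1] * rgb_feat + choice[..., 1:2] * skel_feat

if __name__ == "__main__":
    policy = ModalityPolicy()
    out = policy(torch.randn(2, 16, 256), torch.randn(2, 16, 256))
    print(out.shape)
```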
Citations: 0
A detector-free feature matching method with dual-frequency transformer
IF 3.5 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-13 | DOI: 10.1016/j.cviu.2025.104597
Zhen Han, Ning Lv, Chen Chen, Li Cong, Chengbin Huang, Bin Wang
Detector-free methods have achieved notable progress in recent years, but the limited capacity of existing models to leverage multi-frequency features continues to constrain matching performance. To address this challenge, we propose a novel feature matching approach based on a dual-frequency Transformer model, which effectively exploits multi-level image information. The proposed architecture employs dual attention branches, specifically designed to capture high-frequency details and low-frequency structural features. The high-frequency attention branch incorporates a feature enhancement module to accentuate edge visual features, which play a pivotal role in matching tasks. In addition, a frequency-based loss function is designed to constrain the consistency and integrity of features in the frequency domain during the feature extraction process, effectively mitigating frequency feature distortion. The proposed method not only enhances the model’s ability to represent contextual features across different frequency components but also improves selective attention to reliable feature details. Experimental results demonstrate the proposed method achieves superior performance in multiple feature matching tasks.
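The frequency-based loss, which constrains the consistency and integrity of features in the frequency domain, can be sketched as an L1 distance between FFT amplitude spectra of two feature maps. Comparing amplitudes only (ignoring phase) is an assumption; the paper's exact formulation may differ.

```python
# Sketch: penalize spectral discrepancy between two feature maps so that
# feature extraction does not distort their frequency content.
import torch

def frequency_consistency_loss(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    """feat_a, feat_b: (B, C, H, W) feature maps to keep spectrally consistent."""
    amp_a = torch.fft.rfft2(feat_a).abs()   # real 2D FFT amplitude spectrum
    amp_b = torch.fft.rfft2(feat_b).abs()
    return (amp_a - amp_b).abs().mean()     # L1 distance in the frequency domain

if __name__ == "__main__":
    a = torch.randn(2, 64, 60, 80)
    b = torch.randn(2, 64, 60, 80)
    print(frequency_consistency_loss(a, b).item())
```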
Citations: 0
Temporal prompt guided visual–text–object alignment for zero-shot video captioning
IF 3.5 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-12 | DOI: 10.1016/j.cviu.2025.104601
Ping Li, Tao Wang, Zeyu Pan
Video captioning generates a descriptive sentence for a video. Existing methods rely on a large number of annotated captions for training the model, but collecting so many captions is usually very expensive. This raises the challenge of how to generate video captions from unpaired videos and sentences, i.e., zero-shot video captioning. While some progress using Large Language Models (LLMs) has been made in zero-shot image captioning, these methods still fail to consider the temporal relations in the video domain. Directly adapting LLM-based image methods to video can therefore easily produce incorrect verbs and nouns in the generated sentences. To address this problem, we propose the Temporal Prompt guided Visual–text–object Alignment (TPVA) approach for zero-shot video captioning. It consists of a temporal prompt guidance module and a visual–text–object alignment module. The former employs a pre-trained action recognition model to yield the action class as the key word of the temporal prompt, which guides the LLM to generate a text phrase containing the verb identifying the action. The latter implements both visual–text alignment and text–object alignment by computing their respective similarity scores, which allows the model to generate words that better reveal the video semantics. Experimental results on several benchmarks demonstrate the superiority of the proposed method for zero-shot video captioning. Code is available at https://github.com/mlvccn/TPVA_VidCap_ZeroShot.
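The sketch below illustrates two pieces of the pipeline under simplifying assumptions: building a temporal prompt from the predicted action class, and scoring a candidate sentence by averaging its cosine similarities to video and object embeddings. The prompt wording, the shared 512-dimensional embedding space, and the equal weighting of the two alignment terms are assumptions, not TPVA's exact design.

```python
# Sketch: action-class-driven temporal prompt plus a similarity-based
# visual-text-object alignment score for a candidate caption.
import torch
import torch.nn.functional as F

def build_temporal_prompt(action_class: str) -> str:
    # Hypothetical template guiding the LLM toward the detected action.
    return (f"The key action in this video is '{action_class}'. "
            f"Describe the video in one sentence using this action.")

def alignment_score(sent_emb: torch.Tensor,
                    video_emb: torch.Tensor,
                    object_emb: torch.Tensor) -> torch.Tensor:
    """All inputs: (D,) embeddings assumed to share one vision-language space."""
    vis_text = F.cosine_similarity(sent_emb, video_emb, dim=0)
    text_obj = F.cosine_similarity(sent_emb, object_emb, dim=0)
    return 0.5 * (vis_text + text_obj)   # simple average of the two alignment terms

if __name__ == "__main__":
    print(build_temporal_prompt("playing guitar"))
    s, v, o = torch.randn(512), torch.randn(512), torch.randn(512)
    print(alignment_score(s, v, o).item())
```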
Citations: 0
Real-time habitat mapping with YOLOv8: A multi-threaded approach to biodiversity preservation
IF 3.5 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-12 | DOI: 10.1016/j.cviu.2025.104606
Oluwakemi Akinwehinmi, Alberto Tena, Javier Mora, Francesc Solsona, Pedro Arnau del Amo
This paper presents a robust system for real-time object detection and counting in ecological video streams. It is based on the YOLOv8 architecture integrated within a multi-threaded video processing pipeline. The system reduces latency and improves throughput by parallelizing object detection and preprocessing tasks, and as a result outperforms traditional single-threaded implementations in continuous video analysis.
The system also incorporates dynamic thresholding methods, fine-tuning, and data augmentation to enhance object detection reliability in dynamic natural environments. These mechanisms improve robustness to changing lighting, occlusions, and background complexity, common challenges in outdoor footage. The system is thoroughly evaluated through performance comparisons between multi-threaded and single-threaded implementations, environmental stress tests, and an ablation study.
Results demonstrate improved consistency in object detection and counting in dynamic environments, along with significant gains in processing speed. Designed for deployment on lightweight and low-power devices, the system is suitable for remote or resource-constrained settings.
While designed for biodiversity monitoring, the approach is applicable to other domains requiring efficient, real-time video analysis in unstructured environments.
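A minimal producer-consumer sketch of the multi-threaded design follows: one thread reads and preprocesses frames while a second runs YOLOv8 inference and counts detections, so I/O and inference overlap. The queue size, the yolov8n.pt weights, the 640x640 resize, and the file name habitat_clip.mp4 are illustrative assumptions; the paper's dynamic thresholding and fine-tuning steps are not shown.

```python
# Sketch: frame reading/preprocessing and YOLOv8 detection run in parallel
# threads connected by a bounded queue.
import queue
import threading
import cv2
from ultralytics import YOLO

frame_queue = queue.Queue(maxsize=8)

def reader(video_path: str) -> None:
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame_queue.put(cv2.resize(frame, (640, 640)))   # preprocessing step
    cap.release()
    frame_queue.put(None)                                 # end-of-stream marker

def detector() -> None:
    model = YOLO("yolov8n.pt")        # lightweight model suited to low-power devices
    total = 0
    while True:
        frame = frame_queue.get()
        if frame is None:
            break
        results = model(frame, verbose=False)
        total += len(results[0].boxes)                    # count detections per frame
    print(f"total detections: {total}")

if __name__ == "__main__":
    t1 = threading.Thread(target=reader, args=("habitat_clip.mp4",))  # hypothetical clip
    t2 = threading.Thread(target=detector)
    t1.start(); t2.start()
    t1.join(); t2.join()
```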
Citations: 0
Distractor suppression Siamese network with task-aware attention for visual tracking
IF 3.5 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-11 | DOI: 10.1016/j.cviu.2025.104607
Zhigang Liu, Fuyuan Xing, Hao Huang, Kexin Wang, Yuxuan Shao
Existing IoU-guided trackers suppress background distractors by weighting the classification scores with IoU predictions, which limits their effectiveness in complex tracking scenarios. In this paper, we propose a Distractor feature suppression Siamese network with Task-aware attention (SiamDT) for visual tracking. Firstly, we design a distractor feature suppression network that uses IoU scores to suppress distractor features in the classification feature, achieving distractor suppression at the feature level. At the same time, we design a task-aware attention network that reconstructs the cross-correlation feature by using a hybrid attention mechanism, which enhances the semantic representation capability of the features from the classification and regression branches across spatial and channel domains. Extensive experiments on benchmarks including OTB2013, OTB2015, UAV123, LaSOT, and GOT10k demonstrate that the proposed SiamDT achieves state-of-the-art tracking performance.
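Feature-level distractor suppression with IoU predictions can be sketched as gating the classification feature map by a learned transform of the predicted IoU map, so low-IoU (distractor) locations are attenuated before scoring. The 1x1 convolution gate and sigmoid activation are assumptions, not the exact SiamDT design.

```python
# Sketch: a predicted IoU map modulates the classification feature so that
# background distractors are suppressed at the feature level.
import torch
import torch.nn as nn

class IoUGate(nn.Module):
    def __init__(self, channels: int = 256):
        super().__init__()
        self.to_gate = nn.Conv2d(1, channels, kernel_size=1)

    def forward(self, cls_feat: torch.Tensor, iou_map: torch.Tensor) -> torch.Tensor:
        # cls_feat: (B, C, H, W); iou_map: (B, 1, H, W) predicted IoU per location
        gate = torch.sigmoid(self.to_gate(iou_map))
        return cls_feat * gate   # low-IoU (distractor) locations are attenuated

if __name__ == "__main__":
    gate = IoUGate()
    out = gate(torch.randn(1, 256, 25, 25), torch.rand(1, 1, 25, 25))
    print(out.shape)
```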
Citations: 0