
Latest articles in Computer Vision and Image Understanding

Distractor suppression Siamese network with task-aware attention for visual tracking
IF 3.5 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-01-01 Epub Date: 2025-12-11 DOI: 10.1016/j.cviu.2025.104607
Zhigang Liu , Fuyuan Xing , Hao Huang , Kexin Wang , Yuxuan Shao
Existing IoU-guided trackers suppress background distractors by weighting the classification scores with IoU predictions, which limits their effectiveness in complex tracking scenarios. In this paper, we propose a Distractor feature suppression Siamese network with Task-aware attention (SiamDT) for visual tracking. Firstly, we design a distractor feature suppression network that uses IoU scores to suppress distractor features in the classification feature, achieving distractor suppression at the feature level. At the same time, we design a task-aware attention network that reconstructs the cross-correlation feature by using a hybrid attention mechanism, which enhances the semantic representation capability of the features from the classification and regression branches across spatial and channel domains. Extensive experiments on benchmarks including OTB2013, OTB2015, UAV123, LaSOT, and GOT10k demonstrate that the proposed SiamDT achieves state-of-the-art tracking performance.
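To make the distinction concrete, here is a minimal numpy sketch contrasting the conventional score-level IoU weighting that the abstract criticizes with a feature-level attenuation in the spirit of SiamDT. All names are illustrative; this is not the paper's implementation.

```python
import numpy as np

def score_level_suppression(cls_scores, iou_pred):
    """Conventional IoU-guided weighting: multiply classification
    scores by predicted IoU, so low-IoU (distractor) locations are
    down-weighted only at the score level."""
    return cls_scores * iou_pred

def feature_level_suppression(features, iou_pred):
    """Illustrative feature-level variant: attenuate each spatial
    feature vector by its IoU score before classification, so
    distractor evidence is suppressed earlier in the pipeline."""
    return features * iou_pred[..., None]  # broadcast over channels

# Toy 4x4 response map: true target at (1, 1), distractor at (3, 3).
cls = np.zeros((4, 4)); cls[1, 1] = 0.9; cls[3, 3] = 0.8
iou = np.zeros((4, 4)); iou[1, 1] = 0.95; iou[3, 3] = 0.2
weighted = score_level_suppression(cls, iou)
```

In the toy map, score-level weighting keeps the peak at the target, but the distractor response survives in the scores; the feature-level variant removes it before the classifier sees it.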
Citations: 0
LoTeR: Localized text prompt refinement for zero-shot referring image segmentation
IF 3.5 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-01-01 Epub Date: 2025-12-04 DOI: 10.1016/j.cviu.2025.104596
Lei Zhang , Yongqiu Huang , Yingjun Du , Fang Lei , Zhiying Yang , Cees G.M. Snoek , Yehui Wang
This paper addresses the challenge of segmenting an object in an image based solely on a textual description, without requiring any training on specific object classes. In contrast to traditional methods that rely on generating numerous mask proposals, we introduce a novel patch-based approach. Our method computes the similarity between small image patches, extracted using a sliding window, and textual descriptions, producing a patch score map that identifies the regions most likely to contain the target object. This score map guides a segment-anything model to generate precise mask proposals. To further improve segmentation accuracy, we refine the textual prompts by generating detailed object descriptions using a multi-modal large language model. Our method’s effectiveness is validated through extensive experiments on the RefCOCO, RefCOCO+, and RefCOCOg datasets, where it outperforms state-of-the-art zero-shot referring image segmentation methods. Ablation studies confirm the key contributions of our patch-based segmentation and localized text prompt refinement, demonstrating their significant role in enhancing both precision and robustness.
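The patch-scoring step can be sketched as follows: average-pool sliding-window patches of an image feature grid and score each by cosine similarity to the text embedding. This is a hypothetical minimal version; window size, pooling, and the underlying encoders are assumptions.

```python
import numpy as np

def patch_score_map(feat_grid, text_emb, win=3):
    """Score each sliding-window patch of a feature grid against a
    text embedding. feat_grid: (H, W, D); text_emb: (D,).
    Returns a (H-win+1, W-win+1) similarity map."""
    H, W, _ = feat_grid.shape
    t = text_emb / np.linalg.norm(text_emb)
    scores = np.empty((H - win + 1, W - win + 1))
    for i in range(scores.shape[0]):
        for j in range(scores.shape[1]):
            patch = feat_grid[i:i + win, j:j + win].mean(axis=(0, 1))
            scores[i, j] = patch @ t / (np.linalg.norm(patch) + 1e-8)
    return scores
```

The peak of the resulting map marks the region most aligned with the description, which in the paper's pipeline is then handed to a segment-anything model as a prompt.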
Citations: 0
Unsupervised multi-modal domain adaptation for RGB-T Semantic Segmentation
IF 3.5 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-01-01 Epub Date: 2025-12-03 DOI: 10.1016/j.cviu.2025.104573
Zeyang Chen , Chunyu Lin , Yao Zhao , Tammam Tillo
This paper proposes an Unsupervised multi-modal domain adaptation approach for semantic segmentation of visible and thermal images. The method addresses the issue of data scarcity by transferring knowledge from existing semantic segmentation networks, thereby helping to avoid the high costs associated with data labeling. We take into account changes in temperature and light to reduce the intra-domain gap between visible and thermal images captured during the day and night. Additionally, we narrow the inter-domain gap between visible and thermal images using a self-distillation loss. Our approach allows for high-quality semantic segmentation without the need for annotations, even under challenging conditions such as nighttime and adverse weather. Experiments conducted on both visible and thermal benchmarks demonstrate the effectiveness of our method, quantitatively and qualitatively.
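A minimal sketch of a self-distillation term for narrowing the visible/thermal inter-domain gap, assuming a generic L2 formulation (the paper's exact loss may differ, and the visible-branch target would normally be treated as stop-gradient):

```python
import numpy as np

def self_distillation_loss(feat_visible, feat_thermal):
    """L2 penalty pulling the thermal branch's features toward the
    visible branch's features, so both modalities map into a shared
    representation without needing labels."""
    return float(np.mean((feat_thermal - feat_visible) ** 2))
```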
Citations: 0
Open-vocabulary object detection for high-resolution remote sensing images
IF 3.5 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-01-01 Epub Date: 2025-12-01 DOI: 10.1016/j.cviu.2025.104566
HuaDong Li
In high-resolution remote sensing interpretation, object detection is evolving from closed-set to open-set, i.e., generalizing traditional detection models to detect objects described by open-vocabulary. The rapid development of vision-language pre-training in recent years has made research on open-vocabulary detection (OVD) feasible, which is also considered a critical step in the transition from weak to strong artificial intelligence. However, limited by the scarcity of large-scale vision-language paired datasets, research on open-vocabulary detection for high-resolution remote sensing images (RS-OVD) significantly lags behind that of natural images. Additionally, the high-scale variability of remote-sensing objects poses more significant challenges for open-vocabulary object detection. To address these challenges, we innovatively disentangle the generalizing process into an object-level task transformation problem and a semantic expansion problem. Furthermore, we propose a Cascade Knowledge Distillation model addressing the problems stage by stage. We evaluate our method on the DIOR and NWPU VHR-10 datasets. The experimental results demonstrate that the proposed method effectively generalizes the object detector to unknown categories.
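The Cascade Knowledge Distillation model applies distillation stage by stage (task transformation, then semantic expansion); the abstract does not give the loss, so the sketch below shows only a generic temperature-softened per-stage distillation term, not the paper's formulation.

```python
import numpy as np

def kd_loss(student_logits, teacher_logits, tau=2.0):
    """Hinton-style knowledge distillation: KL divergence between
    temperature-softened teacher and student distributions, scaled by
    tau^2 to keep gradient magnitudes comparable across temperatures."""
    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)
    p = softmax(teacher_logits / tau)
    q = softmax(student_logits / tau)
    return float(np.sum(p * (np.log(p) - np.log(q))) * tau ** 2)
```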
Citations: 0
Few-shot Medical Image Segmentation via Boundary-extended Prototypes and Momentum Inference
IF 3.5 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-01-01 Epub Date: 2025-11-20 DOI: 10.1016/j.cviu.2025.104571
Bin Xu , Yazhou Zhu , Shidong Wang , Yang Long , Haofeng Zhang
Few-Shot Medical Image Segmentation (FSMIS) aims to achieve precise segmentation of different organs using minimal annotated data. Current prototype-based FSMIS methods primarily extract prototypes from support samples through random sampling or local averaging. However, due to the extremely small proportion of boundary features, traditional methods have difficulty generating boundary prototypes, resulting in poorly delineated boundaries in segmentation results. Moreover, their reliance on a single support image for segmenting all query images leads to significant performance degradation when substantial discrepancies exist between support and query images. To address these challenges, we propose an innovative solution namely Boundary-extended Prototypes and Momentum Inference (BePMI), which includes two key modules: a Boundary-extended Prototypes (BePro) module and a Momentum Inference (MoIf) module. BePro constructs boundary prototypes by explicitly clustering the internal and external boundary features to alleviate the problem of boundary ambiguity. MoIf employs the spatial consistency of adjacent slices in 3D medical images to dynamically optimize the prototype representation, thereby reducing the reliance on a single sample. Extensive experiments on three publicly available medical image datasets demonstrate that our method outperforms the state-of-the-art methods. Code is available at https://github.com/xubin471/BePMI.
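The momentum idea, exploiting spatial consistency of adjacent slices, amounts to an exponential moving average over per-slice foreground features rather than trusting a single support image. A hypothetical sketch (the MoIf module's actual update rule may differ):

```python
import numpy as np

def momentum_update(prototype, slice_feature, m=0.9):
    """Refine a running prototype as an exponential moving average of
    per-slice foreground features while sweeping through a 3D volume."""
    return m * prototype + (1.0 - m) * slice_feature
```

Starting from a zero prototype and repeatedly observing the same slice feature, the prototype converges toward that feature at a rate controlled by the momentum `m`.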
Citations: 0
FreqOR: Frequency-guided sampling initialization with attention enhancements for training-free object repositioning
IF 3.5 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-01-01 Epub Date: 2025-12-15 DOI: 10.1016/j.cviu.2025.104610
Yuanxiang Fang, Jingyue Wang, Meiqing Wang, Shujie Zhang, Huimin Liu
Object repositioning in real images remains a challenging task. Existing approaches are typically built upon the DDIM inversion framework, whose sampling initialization tends to preserve strong layout priors in the latent space, thereby leading to object residuals or ghosting artifacts in the vacated region. Additionally, masking low-resolution self-attention maps often results in boundary misjudgments, which impair the inpainting capability. To address these limitations, we propose FreqOR, a training-free framework that integrates sampling initialization optimization with attention-level enhancements. For sampling initialization, high-frequency components of the inverted latent in the vacated region are suppressed to weaken inherited priors, thereby providing a cleaner sampling initialization. For attention enhancement, we incorporate two complementary strategies. The first is Resolution-Aligned Key–Value Interpolation, which achieves precise regional control by enabling pixel-wise masking of attention maps. The second is Query-Guided Consistency, which preserves the identity and texture consistency of the designated object by reusing inversion queries as priors during sampling. Integrated into the energy-based guidance framework, FreqOR is evaluated on the COCO-130 and VOC-100 datasets. The results demonstrate that it effectively suppresses residuals in the vacated region and enhances object consistency.
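The frequency-guided initialization can be sketched as a masked FFT low-pass: frequencies beyond a cutoff are zeroed, and the filtered latent is blended back only inside the vacated region. Function name, cutoff value, and filter shape are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def suppress_high_freq(latent, mask, cutoff=0.25):
    """Zero out frequency components beyond `cutoff` (as a fraction of
    Nyquist) via a centered 2D FFT, then keep the low-passed values
    only where mask is True, leaving the rest of the latent intact."""
    H, W = latent.shape
    f = np.fft.fftshift(np.fft.fft2(latent))
    # Centered frequency coordinates matching fftshift's bin order.
    yy, xx = np.mgrid[-(H // 2):H - H // 2, -(W // 2):W - W // 2]
    radius = np.sqrt((yy / (H / 2)) ** 2 + (xx / (W / 2)) ** 2)
    f[radius > cutoff] = 0.0
    low = np.real(np.fft.ifft2(np.fft.ifftshift(f)))
    return np.where(mask, low, latent)
```

Applied to a pure high-frequency pattern (e.g. a checkerboard), the filter leaves only the DC mean inside the masked region, which is exactly the "cleaner initialization" effect the abstract describes.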
Citations: 0
MFDiff: Diffusion probabilistic model for medical image segmentation with multi-scale features and frequency-aware attention
IF 3.5 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-01-01 Epub Date: 2025-12-17 DOI: 10.1016/j.cviu.2025.104605
Xingli Zhang , Yameng Liu , Haiyang Yu , Zhihui Wang
Medical image segmentation serves as a critical technique in clinical applications such as disease diagnosis, surgical planning, and image-guided therapy, where segmentation accuracy directly impacts the precision of clinical decisions. However, existing methods still face significant challenges in handling inherent issues of medical images, including blurred boundaries, complex multi-scale structures, and difficulties in fine-grained feature representation. To address these challenges, this paper proposes a medical image segmentation method based on a diffusion probabilistic model, MFDiff, which aims to enhance multi-scale contextual awareness and fine-grained structural modeling capabilities. The method incorporates a frequency-aware attention fusion module that effectively strengthens the model’s ability to represent complex structures and ambiguous boundaries. Additionally, a multi-scale feature enhancement module is introduced to expand the receptive field while maintaining low computational cost, thereby improving the extraction and fusion of multi-scale features. Furthermore, an uncertainty-weighted majority voting fusion strategy is proposed to enhance the robustness and consistency of fused predictions from multiple sampling iterations. The proposed method was validated on five medical image segmentation datasets. Experimental results demonstrate that MFDiff outperforms current mainstream methods across all datasets, exhibiting strong generalization ability and robustness.
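The uncertainty-weighted voting over sampling iterations can be sketched as below: each diffusion sampling run yields a probability map, and runs with high binary entropy (uncertain predictions) are down-weighted before the vote. The weighting scheme shown is one plausible instantiation, not necessarily the paper's.

```python
import numpy as np

def uncertainty_weighted_vote(prob_maps):
    """Fuse binary segmentation probability maps from several sampling
    runs: weight each run by its confidence (1 - mean binary entropy),
    then threshold the weighted average at 0.5."""
    p = np.clip(np.asarray(prob_maps), 1e-8, 1 - 1e-8)  # (S, H, W)
    entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    weights = 1.0 - entropy.mean(axis=(1, 2))           # one per run
    weights = weights / weights.sum()
    fused = np.tensordot(weights, np.asarray(prob_maps), axes=1)
    return (fused > 0.5).astype(np.uint8)
```

A maximally uncertain run (all probabilities 0.5) receives zero weight, so it cannot flip pixels on which the confident runs agree.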
Citations: 0
Human-in-the-loop adaptation in group activity feature learning for team sports video retrieval
IF 3.5 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-01-01 Epub Date: 2025-11-29 DOI: 10.1016/j.cviu.2025.104577
Chihiro Nakatani , Hiroaki Kawashima , Norimichi Ukita
This paper proposes human-in-the-loop adaptation for Group Activity Feature Learning (GAFL) without group activity annotations. This human-in-the-loop adaptation is employed in a group-activity video retrieval framework to improve its retrieval performance. Our method initially pre-trains the GAF space based on the similarity of group activities in a self-supervised manner, unlike prior work that classifies videos into pre-defined group activity classes in a supervised learning manner. Our interactive fine-tuning process updates the GAF space to allow a user to better retrieve videos similar to query videos given by the user. In this fine-tuning, our proposed data-efficient video selection process provides several videos, which are selected from a video database, to the user in order to manually label these videos as positive or negative. These labeled videos are used to update (i.e., fine-tune) the GAF space, so that the positive and negative videos move closer to and farther away from the query videos through contrastive learning. Our comprehensive experimental results on two team sports datasets validate that our method significantly improves the retrieval performance. Ablation studies also demonstrate that several components in our human-in-the-loop adaptation contribute to the improvement of the retrieval performance. Code: https://github.com/chihina/GAFL-FINE-CVIU.
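The contrastive fine-tuning step, pulling user-labeled positives toward the query and pushing negatives away in the GAF space, can be sketched as an InfoNCE-style loss. Function name and temperature are illustrative assumptions; see the linked repository for the actual objective.

```python
import numpy as np

def contrastive_finetune_loss(query, positives, negatives, tau=0.1):
    """InfoNCE-style loss over video embeddings: maximize the softmax
    probability assigned to user-labeled positives relative to
    user-labeled negatives, measured by cosine similarity."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    pos = sum(np.exp(cos(query, p) / tau) for p in positives)
    neg = sum(np.exp(cos(query, n) / tau) for n in negatives)
    return float(-np.log(pos / (pos + neg)))
```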
Cited by: 0
A configurable global context reconstruction hybrid detector for enhanced small object detection in UAV aerial imagery
IF 3.5 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date : 2026-01-01 Epub Date: 2025-12-04 DOI: 10.1016/j.cviu.2025.104598
Hongcheng Xue , Tong Gao , Zhan Tang , Yuantian Xia , Longhe Wang , Lin Li
To address the challenge of balancing detection accuracy and efficiency for small objects in complex aerial scenes, we propose a Configurable Global Context Reconstruction Hybrid Detector (GCRH) to enhance overall detection performance. The GCRH framework consists of three key components. First, the Efficient Re-parameterized Encoder (ERE) reduces the computational overhead of multi-head self-attention through re-parameterization while maintaining the integrity and independence of global–local feature interactions. Second, the Global-Aware Feature Pyramid Network (GAFPN) reconstructs and injects global contextual semantics, cascading selective feature fusion to distribute this semantic information across feature layers, thereby alleviating small-object feature degradation and cross-level semantic inconsistency. Finally, two configurable model variants are provided, allowing the control of high-resolution feature layers to balance detection accuracy and inference efficiency. Experiments on the VisDrone2019 and TinyPerson datasets demonstrate that GCRH achieves an effective trade-off between precision and efficiency, validating its applicability to small object detection in aerial imagery. The code is available at: https://github.com/Mundane-X/GCRH.
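The ERE's re-parameterization rests on the standard identity that parallel linear branches can be folded into a single equivalent layer once training is done, removing the extra inference cost. A minimal sketch of that folding for plain linear layers (illustrative only — the paper applies the idea to multi-head self-attention, and all names here are hypothetical):

```python
def apply_linear(W, b, x):
    # y = W x + b for a row-major weight matrix W and bias vector b.
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def merge_branches(W1, b1, W2, b2):
    """Fold two parallel branches y = (W1 x + b1) + (W2 x + b2) into a
    single equivalent layer with weights W1 + W2 and bias b1 + b2."""
    W = [[a + c for a, c in zip(r1, r2)] for r1, r2 in zip(W1, W2)]
    b = [a + c for a, c in zip(b1, b2)]
    return W, b
```

After merging, one matrix multiply reproduces the two-branch output exactly, which is why re-parameterization trades no accuracy for the reduced computational overhead.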
{"title":"A configurable global context reconstruction hybrid detector for enhanced small object detection in UAV aerial imagery","authors":"Hongcheng Xue ,&nbsp;Tong Gao ,&nbsp;Zhan Tang ,&nbsp;Yuantian Xia ,&nbsp;Longhe Wang ,&nbsp;Lin Li","doi":"10.1016/j.cviu.2025.104598","DOIUrl":"10.1016/j.cviu.2025.104598","url":null,"abstract":"<div><div>To address the challenge of balancing detection accuracy and efficiency for small objects in complex aerial scenes, we propose a Configurable Global Context Reconstruction Hybrid Detector (GCRH) to enhance overall detection performance. The GCRH framework consists of three key components. First, the Efficient Re-parameterized Encoder (ERE) reduces the computational overhead of multi-head self-attention through re-parameterization while maintaining the integrity and independence of global–local feature interactions. Second, the Global-Aware Feature Pyramid Network (GAFPN) reconstructs and injects global contextual semantics, cascading selective feature fusion to distribute this semantic information across feature layers, thereby alleviating small-object feature degradation and cross-level semantic inconsistency. Finally, two configurable model variants are provided, allowing the control of high-resolution feature layers to balance detection accuracy and inference efficiency. Experiments on the VisDrone2019 and TinyPerson datasets demonstrate that GCRH achieves an effective trade-off between precision and efficiency, validating its applicability to small object detection in aerial imagery. 
The code is available at: <span><span>https://github.com/Mundane-X/GCRH</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"263 ","pages":"Article 104598"},"PeriodicalIF":3.5,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145736973","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 0
Generalized prompt-driven zero-shot domain adaptive segmentation with feature rectification and semantic modulation
IF 3.5 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date : 2026-01-01 Epub Date: 2025-12-17 DOI: 10.1016/j.cviu.2025.104615
Jinyi Li , Longyu Yang , Donghyun Kim , Kuniaki Saito , Kate Saenko , Stan Sclaroff , Xiaofeng Zhu , Ping Hu
Recent prompt-driven zero-shot adaptation methods offer a promising way to handle domain shifts in semantic segmentation by learning with features simulated from natural language prompts. However, these methods typically depend on a fixed set of predefined domain descriptions, which limits their capacity to generalize to previously undefined domains and often necessitates retraining when encountering novel environments. To address this challenge, we propose a Generalized Prompt-driven Zero-shot Domain Adaptive Segmentation framework that enables flexible and robust cross-domain segmentation by learning to map target domain features into the source domain space. This allows inference to be performed through a unified and well-optimized source model, without requiring target data-based or prompt-based retraining when encountering novel conditions. Our framework comprises two key modules: a Low-level Feature Rectification (LLFR) module that aligns visual styles using a historical source-style memory bank, and a High-level Semantic Modulation (HLSM) module that applies language-guided affine transformations to align high-level semantics. Together, these modules enable adaptive multi-level feature adaptation that maps target inputs into the source domain space, thus allowing the model to handle unseen domains effectively at test time. Extensive experiments on multiple zero-shot domain adaptation benchmarks are conducted, and the results show that our method consistently outperforms previous approaches.
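The HLSM's language-guided affine transformation follows the familiar feature-wise modulation pattern f' = γ(t) · f + β(t), where γ and β are predicted from a text embedding t. A toy channel-wise sketch of that mechanism (an assumption about the general pattern, not the paper's architecture; all names hypothetical):

```python
def project(text_emb, W):
    # Linear projection of a text embedding into per-channel parameters.
    return [sum(w * t for w, t in zip(row, text_emb)) for row in W]

def language_guided_modulation(features, text_emb, W_gamma, W_beta):
    """Affine-modulate per-channel features with scale gamma and shift
    beta, each predicted from the text embedding."""
    gamma = project(text_emb, W_gamma)
    beta = project(text_emb, W_beta)
    return [g * f + b for f, g, b in zip(features, gamma, beta)]
```

Because γ and β are functions of the prompt, the same visual features can be re-aligned toward the source domain's semantics for whatever condition the text describes, without retraining the segmentation model.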
{"title":"Generalized prompt-driven zero-shot domain adaptive segmentation with feature rectification and semantic modulation","authors":"Jinyi Li ,&nbsp;Longyu Yang ,&nbsp;Donghyun Kim ,&nbsp;Kuniaki Saito ,&nbsp;Kate Saenko ,&nbsp;Stan Sclaroff ,&nbsp;Xiaofeng Zhu ,&nbsp;Ping Hu","doi":"10.1016/j.cviu.2025.104615","DOIUrl":"10.1016/j.cviu.2025.104615","url":null,"abstract":"<div><div>Recent prompt-driven zero-shot adaptation methods offer a promising way to handle domain shifts in semantic segmentation by learning with features simulated from natural language prompts. However, these methods typically depend on a fixed set of predefined domain descriptions, which limits their capacity to generalize to previously undefined domains and often necessitates retraining when encountering novel environments. To address this challenge, we propose a Generalized Prompt-driven Zero-shot Domain Adaptive Segmentation framework that enables flexible and robust cross-domain segmentation by learning to map target domain features into the source domain space. This allows inference to be performed through a unified and well-optimized source model, without requiring target data-based or prompt-based retraining when encountering novel conditions. Our framework comprises two key modules: a Low-level Feature Rectification (LLFR) module that aligns visual styles using a historical source-style memory bank, and a High-level Semantic Modulation (HLSM) module that applies language-guided affine transformations to align high-level semantics. Together, these modules enable adaptive multi-level feature adaptation that maps target inputs into the source domain space, thus allowing the model to handle unseen domains effectively at test time. 
Extensive experiments on multiple zero-shot domain adaptation benchmarks are conducted, and the results show that our method consistently outperforms previous approaches.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"263 ","pages":"Article 104615"},"PeriodicalIF":3.5,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145790197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 0
Computer Vision and Image Understanding
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1