Pub Date: 2026-01-01 | Epub Date: 2025-12-11 | DOI: 10.1016/j.cviu.2025.104607
Zhigang Liu, Fuyuan Xing, Hao Huang, Kexin Wang, Yuxuan Shao
Existing IoU-guided trackers suppress background distractors by weighting the classification scores with IoU predictions, which limits their effectiveness in complex tracking scenarios. In this paper, we propose a Distractor feature suppression Siamese network with Task-aware attention (SiamDT) for visual tracking. First, we design a distractor feature suppression network that uses IoU scores to suppress distractor features in the classification features, achieving distractor suppression at the feature level. Second, we design a task-aware attention network that reconstructs the cross-correlation features using a hybrid attention mechanism, enhancing the semantic representation of the classification and regression branches across the spatial and channel domains. Extensive experiments on benchmarks including OTB2013, OTB2015, UAV123, LaSOT, and GOT10k demonstrate that the proposed SiamDT achieves state-of-the-art tracking performance.
Title: Distractor suppression Siamese network with task-aware attention for visual tracking (Computer Vision and Image Understanding, vol. 263, Article 104607)
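The contrast the abstract draws, weighting classification scores by IoU versus suppressing distractor features, can be sketched as follows. The feature-level function is an illustrative reading of SiamDT's idea (shapes, names, and the scaling rule are assumptions), not the paper's implementation:

```python
import numpy as np

def iou_weighted_scores(cls_scores, iou_preds):
    # Conventional IoU-guided suppression: classification scores are
    # down-weighted wherever the predicted IoU is low.
    return cls_scores * iou_preds

def feature_level_suppression(cls_feat, iou_map):
    # Hypothetical feature-level variant in the spirit of SiamDT:
    # scale each spatial feature vector by its IoU score *before*
    # classification. cls_feat: (C, H, W), iou_map: (H, W) in [0, 1].
    return cls_feat * iou_map[None, :, :]
```

A distractor location with high classification score but low predicted IoU is attenuated in both variants; the feature-level form lets the suppression influence everything computed downstream of the features.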
Pub Date: 2026-01-01 | Epub Date: 2025-12-04 | DOI: 10.1016/j.cviu.2025.104596
Lei Zhang, Yongqiu Huang, Yingjun Du, Fang Lei, Zhiying Yang, Cees G.M. Snoek, Yehui Wang
This paper addresses the challenge of segmenting an object in an image based solely on a textual description, without requiring any training on specific object classes. In contrast to traditional methods that rely on generating numerous mask proposals, we introduce a novel patch-based approach. Our method computes the similarity between small image patches, extracted using a sliding window, and textual descriptions, producing a patch score map that identifies the regions most likely to contain the target object. This score map guides a segment-anything model to generate precise mask proposals. To further improve segmentation accuracy, we refine the textual prompts by generating detailed object descriptions using a multi-modal large language model. Our method’s effectiveness is validated through extensive experiments on the RefCOCO, RefCOCO+, and RefCOCOg datasets, where it outperforms state-of-the-art zero-shot referring image segmentation methods. Ablation studies confirm the key contributions of our patch-based segmentation and localized text prompt refinement, demonstrating their significant role in enhancing both precision and robustness.
Title: LoTeR: Localized text prompt refinement for zero-shot referring image segmentation (Computer Vision and Image Understanding, vol. 263, Article 104596)
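The patch-scoring step described above can be sketched as a sliding window over an image feature map, scoring each average-pooled patch against a text embedding by cosine similarity. The encoders (e.g. a CLIP-like model) are abstracted away, and all shapes and names are illustrative assumptions:

```python
import numpy as np

def patch_score_map(image_feats, text_emb, patch=4, stride=4):
    # image_feats: (C, H, W) feature map; text_emb: (C,) text embedding.
    # Returns a grid of cosine similarities, one per window position.
    C, H, W = image_feats.shape
    rows = (H - patch) // stride + 1
    cols = (W - patch) // stride + 1
    scores = np.zeros((rows, cols))
    t = text_emb / np.linalg.norm(text_emb)
    for i in range(rows):
        for j in range(cols):
            y, x = i * stride, j * stride
            # Average-pool the patch into a single feature vector.
            p = image_feats[:, y:y + patch, x:x + patch].mean(axis=(1, 2))
            scores[i, j] = np.dot(p / (np.linalg.norm(p) + 1e-8), t)
    return scores
```

The argmax of this score map would then seed the segment-anything model with a point or box prompt for the most likely region.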
Pub Date: 2026-01-01 | Epub Date: 2025-12-03 | DOI: 10.1016/j.cviu.2025.104573
Zeyang Chen, Chunyu Lin, Yao Zhao, Tammam Tillo
This paper proposes an unsupervised multi-modal domain adaptation approach for semantic segmentation of visible and thermal images. The method addresses data scarcity by transferring knowledge from existing semantic segmentation networks, thereby avoiding the high costs of data labeling. We take changes in temperature and light into account to reduce the intra-domain gap between visible and thermal images captured during the day and at night. Additionally, we narrow the inter-domain gap between visible and thermal images using a self-distillation loss. Our approach enables high-quality semantic segmentation without annotations, even under challenging conditions such as nighttime and adverse weather. Experiments conducted on both visible and thermal benchmarks demonstrate the effectiveness of our method, quantitatively and qualitatively.
Title: Unsupervised multi-modal domain adaptation for RGB-T Semantic Segmentation (Computer Vision and Image Understanding, vol. 263, Article 104573)
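One plausible form of the self-distillation loss mentioned above is a KL divergence between temperature-softened predictions of the visible and thermal branches; the paper's exact formulation may differ, and the temperature value is an assumption:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_distillation_loss(logits_rgb, logits_thermal, T=2.0):
    # KL(p_rgb || p_thermal) on temperature-softened predictions,
    # averaged over the batch dimension. Minimizing this pulls the
    # two modality branches toward consistent outputs.
    p = softmax(logits_rgb / T)
    q = softmax(logits_thermal / T)
    return float(np.sum(p * (np.log(p + 1e-8) - np.log(q + 1e-8))) / p.shape[0])
```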
Pub Date: 2026-01-01 | Epub Date: 2025-12-01 | DOI: 10.1016/j.cviu.2025.104566
HuaDong Li
In high-resolution remote sensing interpretation, object detection is evolving from closed-set to open-set, i.e., generalizing traditional detection models to detect objects described by an open vocabulary. The rapid development of vision-language pre-training in recent years has made research on open-vocabulary detection (OVD) feasible, which is also considered a critical step in the transition from weak to strong artificial intelligence. However, limited by the scarcity of large-scale vision-language paired datasets, research on open-vocabulary detection for high-resolution remote sensing images (RS-OVD) significantly lags behind that for natural images. Additionally, the high scale variability of remote-sensing objects poses further challenges for open-vocabulary object detection. To address these challenges, we disentangle the generalization process into an object-level task transformation problem and a semantic expansion problem, and propose a Cascade Knowledge Distillation model that addresses these problems stage by stage. We evaluate our method on the DIOR and NWPU VHR-10 datasets. The experimental results demonstrate that the proposed method effectively generalizes the object detector to unknown categories.
Title: Open-vocabulary object detection for high-resolution remote sensing images (Computer Vision and Image Understanding, vol. 263, Article 104566)
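Stage-wise (cascade) feature distillation can be sketched, at its simplest, as an L2 regression of the student's features onto a frozen teacher's at each stage; the paper's Cascade Knowledge Distillation model is more involved, and the loss form here is an assumption:

```python
import numpy as np

def cascade_distill_losses(student_feats, teacher_feats):
    # One L2 alignment term per stage of the cascade; downstream code
    # would typically sum or weight these per-stage losses.
    return [float(np.mean((s - t) ** 2))
            for s, t in zip(student_feats, teacher_feats)]
```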
Pub Date: 2026-01-01 | Epub Date: 2025-11-20 | DOI: 10.1016/j.cviu.2025.104571
Bin Xu, Yazhou Zhu, Shidong Wang, Yang Long, Haofeng Zhang
Few-Shot Medical Image Segmentation (FSMIS) aims to achieve precise segmentation of different organs using minimal annotated data. Current prototype-based FSMIS methods primarily extract prototypes from support samples through random sampling or local averaging. However, because boundary features make up an extremely small proportion of the support features, traditional methods have difficulty generating boundary prototypes, resulting in poorly delineated boundaries in segmentation results. Moreover, their reliance on a single support image for segmenting all query images leads to significant performance degradation when substantial discrepancies exist between support and query images. To address these challenges, we propose Boundary-extended Prototypes and Momentum Inference (BePMI), which includes two key modules: a Boundary-extended Prototypes (BePro) module and a Momentum Inference (MoIf) module. BePro constructs boundary prototypes by explicitly clustering internal and external boundary features to alleviate boundary ambiguity. MoIf exploits the spatial consistency of adjacent slices in 3D medical images to dynamically optimize the prototype representation, thereby reducing reliance on a single sample. Extensive experiments on three publicly available medical image datasets demonstrate that our method outperforms the state-of-the-art methods. Code is available at https://github.com/xubin471/BePMI.
Title: Few-shot Medical Image Segmentation via Boundary-extended Prototypes and Momentum Inference (Computer Vision and Image Understanding, vol. 263, Article 104571)
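The prototype pipeline underlying methods like BePMI can be sketched with masked average pooling, nearest-prototype assignment, and a momentum (EMA) update across adjacent slices. BePro's boundary-feature clustering is omitted, and all names and the momentum coefficient are illustrative:

```python
import numpy as np

def masked_average_prototype(feats, mask):
    # feats: (C, H, W) support features; mask: (H, W) binary.
    # Returns the class prototype (C,) via masked average pooling.
    return (feats * mask[None]).sum(axis=(1, 2)) / (mask.sum() + 1e-8)

def prototype_predict(query_feats, prototypes):
    # Assign each query pixel to the nearest prototype by cosine similarity.
    C, H, W = query_feats.shape
    q = query_feats.reshape(C, -1)
    q = q / (np.linalg.norm(q, axis=0, keepdims=True) + 1e-8)
    P = np.stack([p / (np.linalg.norm(p) + 1e-8) for p in prototypes])
    return (P @ q).argmax(axis=0).reshape(H, W)

def momentum_update(proto, new_proto, m=0.9):
    # MoIf-style exponential moving average across adjacent 3D slices,
    # so no single slice dominates the prototype.
    return m * proto + (1 - m) * new_proto
```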
Pub Date: 2026-01-01 | Epub Date: 2025-12-15 | DOI: 10.1016/j.cviu.2025.104610
Yuanxiang Fang, Jingyue Wang, Meiqing Wang, Shujie Zhang, Huimin Liu
Object repositioning in real images remains a challenging task. Existing approaches are typically built upon the DDIM inversion framework, whose sampling initialization tends to preserve strong layout priors in the latent space, thereby leading to object residuals or ghosting artifacts in the vacated region. Additionally, masking low-resolution self-attention maps often results in boundary misjudgments, which impair the inpainting capability. To address these limitations, we propose FreqOR, a training-free framework that integrates sampling initialization optimization with attention-level enhancements. For sampling initialization, high-frequency components of the inverted latent in the vacated region are suppressed to weaken inherited priors, thereby providing a cleaner sampling initialization. For attention enhancement, we incorporate two complementary strategies. The first is Resolution-Aligned Key–Value Interpolation, which achieves precise regional control by enabling pixel-wise masking of attention maps. The second is Query-Guided Consistency, which preserves the identity and texture consistency of the designated object by reusing inversion queries as priors during sampling. Integrated into the energy-based guidance framework, FreqOR is evaluated on the COCO-130 and VOC-100 datasets. The results demonstrate that it effectively suppresses residuals in the vacated region and enhances object consistency.
Title: FreqOR: Frequency-guided sampling initialization with attention enhancements for training-free object repositioning (Computer Vision and Image Understanding, vol. 263, Article 104610)
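The frequency suppression step can be illustrated as a low-pass filter on a 2D latent channel: zero out FFT coefficients beyond some radius from the DC term. FreqOR applies this kind of suppression only inside the vacated region; the hard circular cutoff and the radius here are assumptions for the sketch:

```python
import numpy as np

def suppress_high_freq(latent, keep_radius=4):
    # latent: (H, W) single latent channel. Zero all frequency
    # components farther than keep_radius from DC, then invert.
    H, W = latent.shape
    F = np.fft.fftshift(np.fft.fft2(latent))        # DC moved to the center
    yy, xx = np.mgrid[0:H, 0:W]
    dist = np.hypot(yy - H // 2, xx - W // 2)
    F[dist > keep_radius] = 0                        # drop high frequencies
    return np.real(np.fft.ifft2(np.fft.ifftshift(F)))
```

Suppressing the high-frequency components weakens the fine-grained layout priors inherited from DDIM inversion, which is what lets the vacated region be re-synthesized cleanly.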
Pub Date: 2026-01-01 | Epub Date: 2025-12-17 | DOI: 10.1016/j.cviu.2025.104605
Xingli Zhang, Yameng Liu, Haiyang Yu, Zhihui Wang
Medical image segmentation serves as a critical technique in clinical applications such as disease diagnosis, surgical planning, and image-guided therapy, where segmentation accuracy directly impacts the precision of clinical decisions. However, existing methods still face significant challenges in handling inherent issues of medical images, including blurred boundaries, complex multi-scale structures, and difficulties in fine-grained feature representation. To address these challenges, this paper proposes a medical image segmentation method based on a diffusion probabilistic model, MFDiff, which aims to enhance multi-scale contextual awareness and fine-grained structural modeling capabilities. The method incorporates a frequency-aware attention fusion module that effectively strengthens the model’s ability to represent complex structures and ambiguous boundaries. Additionally, a multi-scale feature enhancement module is introduced to expand the receptive field while maintaining low computational cost, thereby improving the extraction and fusion of multi-scale features. Furthermore, an uncertainty-weighted majority voting fusion strategy is proposed to enhance the robustness and consistency of fused predictions from multiple sampling iterations. The proposed method was validated on five medical image segmentation datasets. Experimental results demonstrate that MFDiff outperforms current mainstream methods across all datasets, exhibiting strong generalization ability and robustness.
Title: MFDiff: Diffusion probabilistic model for medical image segmentation with multi-scale features and frequency-aware attention (Computer Vision and Image Understanding, vol. 263, Article 104605)
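A plausible reading of the uncertainty-weighted voting strategy: fuse per-sample foreground probability maps from repeated diffusion sampling, weighting each vote by its confidence. The specific confidence measure below (distance of each probability from 0.5) is an assumption, not the paper's formula:

```python
import numpy as np

def uncertainty_weighted_vote(prob_maps):
    # prob_maps: list of (H, W) foreground probabilities, one per
    # diffusion sampling run. Confident pixels (far from p = 0.5)
    # get larger voting weight; uncertain pixels get less.
    probs = np.stack(prob_maps)              # (N, H, W)
    conf = np.abs(2.0 * probs - 1.0)         # 0 at p = 0.5, 1 when certain
    fused = (conf * probs).sum(axis=0) / (conf.sum(axis=0) + 1e-8)
    return (fused > 0.5).astype(np.uint8)    # final binary mask
```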
This paper proposes human-in-the-loop adaptation for Group Activity Feature Learning (GAFL) without group activity annotations. This human-in-the-loop adaptation is employed in a group-activity video retrieval framework to improve its retrieval performance. Our method initially pre-trains the GAF space based on the similarity of group activities in a self-supervised manner, unlike prior work that classifies videos into pre-defined group activity classes in a supervised learning manner. Our interactive fine-tuning process updates the GAF space to allow a user to better retrieve videos similar to query videos given by the user. In this fine-tuning, our proposed data-efficient video selection process provides several videos, which are selected from a video database, to the user in order to manually label these videos as positive or negative. These labeled videos are used to update (i.e., fine-tune) the GAF space, so that the positive and negative videos move closer to and farther away from the query videos through contrastive learning. Our comprehensive experimental results on two team sports datasets validate that our method significantly improves the retrieval performance. Ablation studies also demonstrate that several components in our human-in-the-loop adaptation contribute to the improvement of the retrieval performance. Code: https://github.com/chihina/GAFL-FINE-CVIU.
Title: Human-in-the-loop adaptation in group activity feature learning for team sports video retrieval (Computer Vision and Image Understanding, vol. 263, Article 104577)
Chihiro Nakatani, Hiroaki Kawashima, Norimichi Ukita
Pub Date: 2026-01-01 | DOI: 10.1016/j.cviu.2025.104577
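The interactive fine-tuning objective described above can be sketched as an InfoNCE-style loss that pulls user-labeled positive videos toward the query embedding and pushes negatives away; the embedding space and temperature are illustrative, not the paper's exact configuration:

```python
import numpy as np

def contrastive_loss(query, positives, negatives, tau=0.1):
    # query: (D,) query-video embedding; positives/negatives: lists of
    # (D,) embeddings the user labeled during the fine-tuning round.
    def sim(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    pos = np.array([np.exp(sim(query, p) / tau) for p in positives])
    neg = np.array([np.exp(sim(query, n) / tau) for n in negatives])
    # Lower loss when positives are close to the query and negatives far.
    return float(-np.log(pos.sum() / (pos.sum() + neg.sum())))
```

Minimizing this over the user's labels is what updates (fine-tunes) the GAF space between retrieval rounds.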
Pub Date: 2026-01-01 | Epub Date: 2025-12-04 | DOI: 10.1016/j.cviu.2025.104598
Hongcheng Xue, Tong Gao, Zhan Tang, Yuantian Xia, Longhe Wang, Lin Li
To address the challenge of balancing detection accuracy and efficiency for small objects in complex aerial scenes, we propose a Configurable Global Context Reconstruction Hybrid Detector (GCRH) to enhance overall detection performance. The GCRH framework consists of three key components. First, the Efficient Re-parameterized Encoder (ERE) reduces the computational overhead of multi-head self-attention through re-parameterization while maintaining the integrity and independence of global–local feature interactions. Second, the Global-Aware Feature Pyramid Network (GAFPN) reconstructs and injects global contextual semantics, cascading selective feature fusion to distribute this semantic information across feature layers, thereby alleviating small-object feature degradation and cross-level semantic inconsistency. Finally, two configurable model variants are provided, allowing the control of high-resolution feature layers to balance detection accuracy and inference efficiency. Experiments on the VisDrone2019 and TinyPerson datasets demonstrate that GCRH achieves an effective trade-off between precision and efficiency, validating its applicability to small object detection in aerial imagery. The code is available at: https://github.com/Mundane-X/GCRH.
Title: A configurable global context reconstruction hybrid detector for enhanced small object detection in UAV aerial imagery (Computer Vision and Image Understanding, vol. 263, Article 104598)
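The re-parameterization idea the ERE builds on can be illustrated with the classic structural merge: a parallel 1x1 branch folds exactly into a 3x3 kernel at inference time, so the multi-branch training structure collapses to a single convolution. This shows the general (RepVGG-style) technique, not the paper's specific encoder:

```python
import numpy as np

def conv2d(x, k):
    # Valid-mode single-channel 2D cross-correlation, for demonstration.
    kh, kw = k.shape
    H, W = x.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (x[i:i + kh, j:j + kw] * k).sum()
    return out

def merge_branches(k3, k1):
    # Fold a parallel 1x1 branch into the 3x3 kernel: zero-pad the
    # 1x1 kernel to 3x3 (i.e., add it at the center) and sum.
    merged = k3.copy()
    merged[1, 1] += k1[0, 0]
    return merged
```

With "same" padding, conv(x, k3) + conv(x, k1) equals conv(x, merge_branches(k3, k1)), so the merged kernel is a drop-in replacement with one branch's compute.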
Pub Date : 2026-01-01Epub Date: 2025-12-17DOI: 10.1016/j.cviu.2025.104615
Jinyi Li , Longyu Yang , Donghyun Kim , Kuniaki Saito , Kate Saenko , Stan Sclaroff , Xiaofeng Zhu , Ping Hu
Recent prompt-driven zero-shot adaptation methods offer a promising way to handle domain shifts in semantic segmentation by learning with features simulated from natural language prompts. However, these methods typically depend on a fixed set of predefined domain descriptions, which limits their capacity to generalize to previously undefined domains and often necessitates retraining when encountering novel environments. To address this challenge, we propose a Generalized Prompt-driven Zero-shot Domain Adaptive Segmentation framework that enables flexible and robust cross-domain segmentation by learning to map target domain features into the source domain space. This allows inference to be performed through a unified and well-optimized source model, without requiring target data-based or prompt-based retraining when encountering novel conditions. Our framework comprises two key modules: a Low-level Feature Rectification (LLFR) module that aligns visual styles using a historical source-style memory bank, and a High-level Semantic Modulation (HLSM) module that applies language-guided affine transformations to align high-level semantics. Together, these modules enable adaptive multi-level feature adaptation that maps target inputs into the source domain space, thus allowing the model to handle unseen domains effectively at test time. Extensive experiments on multiple zero-shot domain adaptation benchmarks are conducted, and the results show that our method consistently outperforms previous approaches.
{"title":"Generalized prompt-driven zero-shot domain adaptive segmentation with feature rectification and semantic modulation","authors":"Jinyi Li , Longyu Yang , Donghyun Kim , Kuniaki Saito , Kate Saenko , Stan Sclaroff , Xiaofeng Zhu , Ping Hu","doi":"10.1016/j.cviu.2025.104615","DOIUrl":"10.1016/j.cviu.2025.104615","url":null,"abstract":"<div><div>Recent prompt-driven zero-shot adaptation methods offer a promising way to handle domain shifts in semantic segmentation by learning with features simulated from natural language prompts. However, these methods typically depend on a fixed set of predefined domain descriptions, which limits their capacity to generalize to previously undefined domains and often necessitates retraining when encountering novel environments. To address this challenge, we propose a Generalized Prompt-driven Zero-shot Domain Adaptive Segmentation framework that enables flexible and robust cross-domain segmentation by learning to map target domain features into the source domain space. This allows inference to be performed through a unified and well-optimized source model, without requiring target data-based or prompt-based retraining when encountering novel conditions. Our framework comprises two key modules: a Low-level Feature Rectification (LLFR) module that aligns visual styles using a historical source-style memory bank, and a High-level Semantic Modulation (HLSM) module that applies language-guided affine transformations to align high-level semantics. Together, these modules enable adaptive multi-level feature adaptation that maps target inputs into the source domain space, thus allowing the model to handle unseen domains effectively at test time. Extensive experiments on multiple zero-shot domain adaptation benchmarks are conducted, and the results show that our method consistently outperforms previous approaches.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"263 ","pages":"Article 104615"},"PeriodicalIF":3.5,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145790197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
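The HLSM module in the abstract above applies "language-guided affine transformations" to align high-level semantics. A minimal sketch of that general idea (a FiLM-style per-channel scale and shift predicted from a text embedding; the projection matrices and shapes here are hypothetical, not the paper's):

```python
import numpy as np

def semantic_modulation(feat, text_embed, W_gamma, W_beta):
    """Illustrative language-guided affine modulation: a text embedding
    predicts a per-channel scale (gamma) and shift (beta) applied to a
    (C, H, W) feature map. W_gamma and W_beta are assumed learned
    projections from text-embedding dim to channel dim."""
    gamma = W_gamma @ text_embed  # (C,) per-channel scale
    beta = W_beta @ text_embed    # (C,) per-channel shift
    # Affine transform broadcast over the spatial dimensions
    return feat * gamma[:, None, None] + beta[:, None, None]

rng = np.random.default_rng(1)
feat = rng.standard_normal((16, 8, 8))   # target-domain feature map
text = rng.standard_normal(32)           # prompt/text embedding
Wg = rng.standard_normal((16, 32))
Wb = rng.standard_normal((16, 32))
out = semantic_modulation(feat, text, Wg, Wb)
```

Because the transform is affine in the features, it can re-map target-domain statistics toward the source space without retraining the segmentation backbone, which is the property the framework exploits at test time.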