
Latest articles in IEEE Transactions on Image Processing: A Publication of the IEEE Signal Processing Society

Uncertainty-Guided Refinement for Fine-Grained Salient Object Detection
Yao Yuan;Pan Gao;Qun Dai;Jie Qin;Wei Xiang
Recently, salient object detection (SOD) methods have achieved impressive performance. However, salient regions predicted by existing methods usually contain unsaturated regions and shadows, which limits the model's ability to make reliable fine-grained predictions. To address this, we introduce an uncertainty guidance learning approach to SOD, intended to enhance the model's perception of uncertain regions. Specifically, we design a novel Uncertainty Guided Refinement Attention Network (UGRAN), which incorporates three important components, i.e., the Multilevel Interaction Attention (MIA) module, the Scale Spatial-Consistent Attention (SSCA) module, and the Uncertainty Refinement Attention (URA) module. Unlike conventional methods dedicated to enhancing features, the proposed MIA facilitates the interaction and perception of multilevel features, leveraging their complementary characteristics. Then, through the proposed SSCA, the salient information across diverse scales within the aggregated features can be integrated more comprehensively. In the subsequent steps, we utilize the uncertainty map generated from the saliency prediction map to enhance the model's perception of uncertain regions, generating a highly saturated, fine-grained saliency prediction map. Additionally, we devise an adaptive dynamic partition (ADP) mechanism to minimize the computational overhead of the URA module and improve the utilization of uncertainty guidance. Experiments on seven benchmark datasets demonstrate the superiority of the proposed UGRAN over state-of-the-art methods. Code will be released at https://github.com/I2-Multimedia-Lab/UGRAN
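The key step in URA is turning the coarse saliency map into an uncertainty map and using it to re-weight features before re-prediction. Below is a minimal PyTorch sketch of that idea, assuming a sigmoid saliency head and using 1 − |2p − 1| as the uncertainty proxy; the module layout and names are illustrative, not the released UGRAN code.

```python
# Illustrative sketch of uncertainty-guided refinement (not the authors' code).
import torch
import torch.nn as nn

class UncertaintyRefinement(nn.Module):
    """Refine features where the coarse saliency prediction is uncertain."""
    def __init__(self, channels: int):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        self.head = nn.Conv2d(channels, 1, 1)

    def forward(self, feat: torch.Tensor, coarse_logit: torch.Tensor) -> torch.Tensor:
        p = torch.sigmoid(coarse_logit)            # coarse saliency map in [0, 1]
        uncertainty = 1.0 - (2.0 * p - 1.0).abs()  # peaks where p is near 0.5
        # Emphasise uncertain regions while keeping a residual path for confident ones.
        refined = self.refine(feat * (1.0 + uncertainty)) + feat
        return coarse_logit + self.head(refined)   # fine-grained prediction

feat = torch.randn(2, 64, 56, 56)
coarse = torch.randn(2, 1, 56, 56)
print(UncertaintyRefinement(64)(feat, coarse).shape)  # torch.Size([2, 1, 56, 56])
```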
{"title":"Uncertainty-Guided Refinement for Fine-Grained Salient Object Detection","authors":"Yao Yuan;Pan Gao;Qun Dai;Jie Qin;Wei Xiang","doi":"10.1109/TIP.2025.3557562","DOIUrl":"10.1109/TIP.2025.3557562","url":null,"abstract":"Recently, salient object detection (SOD) methods have achieved impressive performance. However, salient regions predicted by existing methods usually contain unsaturated regions and shadows, which limits the model for reliable fine-grained predictions. To address this, we introduce the uncertainty guidance learning approach to SOD, intended to enhance the model’s perception of uncertain regions. Specifically, we design a novel Uncertainty Guided Refinement Attention Network (UGRAN), which incorporates three important components, i.e., the Multilevel Interaction Attention (MIA) module, the Scale Spatial-Consistent Attention (SSCA) module, and the Uncertainty Refinement Attention (URA) module. Unlike conventional methods dedicated to enhancing features, the proposed MIA facilitates the interaction and perception of multilevel features, leveraging the complementary characteristics among multilevel features. Then, through the proposed SSCA, the salient information across diverse scales within the aggregated features can be integrated more comprehensively and integrally. In the subsequent steps, we utilize the uncertainty map generated from the saliency prediction map to enhance the model’s perception capability of uncertain regions, generating a highly-saturated fine-grained saliency prediction map. Additionally, we devise an adaptive dynamic partition (ADP) mechanism to minimize the computational overhead of the URA module and improve the utilization of uncertainty guidance. Experiments on seven benchmark datasets demonstrate the superiority of the proposed UGRAN over the state-of-the-art methodologies. Codes will be released at <uri>https://github.com/I2-Multimedia-Lab/UGRAN</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"2301-2314"},"PeriodicalIF":0.0,"publicationDate":"2025-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143813801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Prediction and Reference Quality Adaptation for Learned Video Compression
Xihua Sheng;Li Li;Dong Liu;Houqiang Li
Temporal prediction is one of the most important technologies for video compression. Various prediction coding modes are designed in traditional video codecs, which adaptively decide the optimal coding mode according to the prediction quality and reference quality. Recently, learned video codecs have made great progress; however, they have not effectively addressed the problem of prediction and reference quality adaptation, which limits the effective utilization of temporal prediction and the reduction of reconstruction error propagation. Therefore, in this paper, we first propose a confidence-based prediction quality adaptation (PQA) module to provide explicit discrimination of spatial and channel-wise prediction quality differences. With this module, low-quality predictions are suppressed and high-quality ones are enhanced, so the codec can adaptively decide which spatial or channel locations of the prediction to use. Then, we further propose a reference quality adaptation (RQA) module and an associated repeat-long training strategy to provide dynamic spatially variant filters for diverse reference qualities. With these filters, our codec can adapt to different reference qualities, making it easier to achieve the target reconstruction quality and reduce reconstruction error propagation. Experimental results verify that our proposed modules effectively help our codec achieve higher compression performance.
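As a rough illustration of confidence-based prediction quality adaptation, the sketch below predicts a spatial- and channel-wise confidence map from the current and temporally predicted features and uses it to damp low-quality predictions; the layer layout is an assumption, not the paper's PQA module.

```python
# Hypothetical sketch of confidence-weighted temporal prediction (not the paper's code).
import torch
import torch.nn as nn

class PredictionQualityAdaptation(nn.Module):
    """Estimate per-pixel, per-channel confidence and reweight the temporal prediction."""
    def __init__(self, channels: int):
        super().__init__()
        self.confidence = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Sigmoid(),  # confidence in [0, 1]
        )

    def forward(self, current_feat: torch.Tensor, predicted_feat: torch.Tensor) -> torch.Tensor:
        conf = self.confidence(torch.cat([current_feat, predicted_feat], dim=1))
        # Low-quality predictions are suppressed; high-quality ones pass through.
        return predicted_feat * conf

cur, pred = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
print(PredictionQualityAdaptation(64)(cur, pred).shape)  # torch.Size([1, 64, 32, 32])
```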
{"title":"Prediction and Reference Quality Adaptation for Learned Video Compression","authors":"Xihua Sheng;Li Li;Dong Liu;Houqiang Li","doi":"10.1109/TIP.2025.3555401","DOIUrl":"10.1109/TIP.2025.3555401","url":null,"abstract":"Temporal prediction is one of the most important technologies for video compression. Various prediction coding modes are designed in traditional video codecs. Traditional video codecs will adaptively to decide the optimal coding mode according to the prediction quality and reference quality. Recently, learned video codecs have made great progress. However, they did not effectively address the problem of prediction and reference quality adaptation, which limits the effective utilization of temporal prediction and reduction of reconstruction error propagation. Therefore, in this paper, we first propose a confidence-based prediction quality adaptation (PQA) module to provide explicit discrimination for the spatial and channel-wise prediction quality difference. With this module, the prediction with low quality will be suppressed and that with high quality will be enhanced. The codec can adaptively decide which spatial or channel location of predictions to use. Then, we further propose a reference quality adaptation (RQA) module and an associated repeat-long training strategy to provide dynamic spatially variant filters for diverse reference qualities. With these filters, our codec can adapt to different reference qualities, making it easier to achieve the target reconstruction quality and reduce the reconstruction error propagation. Experimental results verify that our proposed modules can effectively help our codec achieve a higher compression performance.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"2285-2300"},"PeriodicalIF":0.0,"publicationDate":"2025-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143805696","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Corrections to “Windowed Two-Dimensional Fourier Transform Concentration and Its Application to ISAR Imaging”
Karol Abratkiewicz
Presents corrections to the paper “Windowed Two-Dimensional Fourier Transform Concentration and Its Application to ISAR Imaging.”
{"title":"Corrections to “Windowed Two-Dimensional Fourier Transform Concentration and Its Application to ISAR Imaging”","authors":"Karol Abratkiewicz","doi":"10.1109/TIP.2024.3517252","DOIUrl":"https://doi.org/10.1109/TIP.2024.3517252","url":null,"abstract":"Presents corrections to the paper, (Corrections to “Windowed Two-Dimensional Fourier Transform Concentration and Its Application to ISAR Imaging”).","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"2241-2241"},"PeriodicalIF":0.0,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10949651","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143777958","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A Swiss Army Knife for Tracking by Natural Language Specification
Kaige Mao;Xiaopeng Hong;Xiaopeng Fan;Wangmeng Zuo
Tracking by natural language specification requires trackers to jointly perform grounding and tracking tasks. Existing methods either use separate models or a single shared network, failing to jointly account for the link and the diversity between the two tasks. In this paper, we propose a novel framework that performs dynamic task switching to customize its network path routing for each task within a unified model. For this purpose, we design a task-switchable attention module, which enables the acquisition of modal relation patterns with a different dominant modality for each task via dynamic task switching. In addition, to alleviate the inconsistency between the static language description and the dynamic target appearance during tracking, we propose a language renovation mechanism that renovates the initial language online via visual-context-aware linguistic prompting. Extensive experimental results on five datasets demonstrate that the proposed method performs favorably against state-of-the-art approaches for both grounding and tracking. Our project will be available at: https://github.com/mkg1204/SAKTrack.
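A minimal sketch of the dynamic task switching idea: the same attention block is routed with a different dominant (query) modality depending on whether the model is grounding or tracking. The interface and names are illustrative, not the released SAKTrack code.

```python
# Illustrative task-switchable cross-attention (assumed interface, not SAKTrack itself).
import torch
import torch.nn as nn

class TaskSwitchableAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis_tokens, lang_tokens, task: str):
        if task == "grounding":
            # Language dominates: linguistic tokens query the visual tokens.
            out, _ = self.attn(lang_tokens, vis_tokens, vis_tokens)
        else:
            # Tracking: visual tokens query the linguistic tokens.
            out, _ = self.attn(vis_tokens, lang_tokens, lang_tokens)
        return out

vis, lang = torch.randn(2, 196, 256), torch.randn(2, 16, 256)
block = TaskSwitchableAttention(256)
print(block(vis, lang, "grounding").shape)  # torch.Size([2, 16, 256])
print(block(vis, lang, "tracking").shape)   # torch.Size([2, 196, 256])
```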
{"title":"A Swiss Army Knife for Tracking by Natural Language Specification","authors":"Kaige Mao;Xiaopeng Hong;Xiaopeng Fan;Wangmeng Zuo","doi":"10.1109/TIP.2025.3553290","DOIUrl":"10.1109/TIP.2025.3553290","url":null,"abstract":"Tracking by natural language specification requires trackers to jointly perform grounding and tracking tasks. Existing methods either use separate models or a single shared network, failing to account for the link and diversity between tasks jointly. In this paper, we propose a novel framework that performs dynamic task switching to customize its network path routing for each task within a unified model. For this purpose, we design a task-switchable attention module, which enables the acquisition of modal relation patterns with different dominant modalities for each task via dynamic task switching. In addition, to alleviate the inconsistency between the static language description and the dynamic target appearance during tracking, we propose a language renovation mechanism that renovates the initial language online via visual-context-aware linguistic prompting. Extensive experimental results on five datasets demonstrate that the proposed method performs favorably against state-of-the-art approaches for both grounding and tracking. Our project will be available at: <uri>https://github.com/mkg1204/SAKTrack</uri>.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"2254-2268"},"PeriodicalIF":0.0,"publicationDate":"2025-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143757740","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Object Adaptive Self-Supervised Dense Visual Pre-Training
Yu Zhang;Tao Zhang;Hongyuan Zhu;Zihan Chen;Siya Mi;Xi Peng;Xin Geng
Self-supervised visual pre-training models have achieved significant success without employing expensive annotations. Nevertheless, most of these models focus on iconic single-instance datasets (e.g., ImageNet) and yield insufficiently discriminative representations for non-iconic multi-instance datasets (e.g., COCO). In this paper, we propose a novel Object Adaptive Dense Pre-training (OADP) method to learn visual representations directly on multi-instance datasets (e.g., PASCAL VOC and COCO) for dense prediction tasks (e.g., object detection and instance segmentation). We present a novel object-aware and learning-adaptive random view augmentation that focuses contrastive learning on enhancing the discrimination of object representations from large to small scale across different learning stages. Furthermore, representations across different scales and resolutions are integrated so that the method can learn diverse representations. In the experiments, we evaluated OADP pre-trained on PASCAL VOC and COCO. Results show that our method performs better than most existing state-of-the-art methods when transferring to various downstream tasks, including image classification, object detection, instance segmentation and semantic segmentation.
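To make the object-aware, learning-adaptive view augmentation concrete, the sketch below crops a view centred on an annotated object box and shrinks it from broad context to a tight object crop as training progresses; the box format and scale schedule are assumptions rather than the paper's exact augmentation.

```python
# Hypothetical object-aware view cropping (assumed schedule, not the OADP implementation).
import random
from PIL import Image

def object_aware_view(img: Image.Image, box, progress: float) -> Image.Image:
    """Crop a view around an object box; as progress -> 1 the view tightens on the object."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    bw, bh = x2 - x1, y2 - y1
    # Scale decays from about 2x the box (large context) to 1x (tight object crop).
    scale = 2.0 - progress + random.uniform(-0.1, 0.1)
    w, h = bw * scale, bh * scale
    left, upper = max(int(cx - w / 2), 0), max(int(cy - h / 2), 0)
    right, lower = min(int(cx + w / 2), img.width), min(int(cy + h / 2), img.height)
    return img.crop((left, upper, right, lower))

img = Image.new("RGB", (256, 256))
view = object_aware_view(img, (64, 64, 160, 192), progress=0.5)
print(view.size)
```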
{"title":"Object Adaptive Self-Supervised Dense Visual Pre-Training","authors":"Yu Zhang;Tao Zhang;Hongyuan Zhu;Zihan Chen;Siya Mi;Xi Peng;Xin Geng","doi":"10.1109/TIP.2025.3555073","DOIUrl":"10.1109/TIP.2025.3555073","url":null,"abstract":"Self-supervised visual pre-training models have achieved significant success without employing expensive annotations. Nevertheless, most of these models focus on iconic single-instance datasets (e.g. ImageNet), ignoring the insufficient discriminative representation for non-iconic multi-instance datasets (e.g. COCO). In this paper, we propose a novel Object Adaptive Dense Pre-training (OADP) method to learn the visual representation directly on the multi-instance datasets (e.g., PASCAL VOC and COCO) for dense prediction tasks (e.g., object detection and instance segmentation). We present a novel object-aware and learning-adaptive random view augmentation to focus the contrastive learning to enhance the discrimination of object presentations from large to small scale during different learning stages. Furthermore, the representations across different scale and resolutions are integrated so that the method can learn diverse representations. In the experiment, we evaluated OADP pre-trained on PASCAL VOC and COCO. Results show that our method has better performances than most existing state-of-the-art methods when transferring to various downstream tasks, including image classification, object detection, instance segmentation and semantic segmentation.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"2228-2240"},"PeriodicalIF":0.0,"publicationDate":"2025-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143757764","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Exploring Effective Factors for Improving Visual In-Context Learning
Yanpeng Sun;Qiang Chen;Jian Wang;Jingdong Wang;Zechao Li
In-Context Learning (ICL) aims to understand a new task from a few demonstrations (a.k.a. a prompt) and to predict new inputs without tuning the model. While it has been widely studied in NLP, it is still a relatively new area of research in computer vision. To reveal the factors influencing the performance of visual in-context learning, this paper shows that Prompt Selection and Prompt Fusion are two major factors that have a direct impact on the inference performance of visual in-context learning. Prompt selection is the process of selecting the most suitable prompt for a query image. This is crucial because high-quality prompts assist large-scale visual models in rapidly and accurately comprehending new tasks. Prompt fusion involves combining prompts and query images to activate knowledge within large-scale visual models, and altering the prompt fusion method significantly impacts performance on new tasks. Based on these findings, we propose a simple framework, prompt-SelF, to improve visual in-context learning. Specifically, we first use a pixel-level retrieval method to select a suitable prompt, then use different prompt fusion methods to activate the diverse knowledge stored in the large-scale vision model, and finally ensemble the prediction results obtained from the different prompt fusion methods to obtain the final prediction. We conducted extensive experiments on single-object segmentation and detection tasks to demonstrate the effectiveness of prompt-SelF. Remarkably, prompt-SelF outperforms OSLSM-based meta-learning in 1-shot segmentation for the first time. This indicates the great potential of visual in-context learning. The source code and models will be available at https://github.com/syp2ysy/prompt-SelF.
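A minimal sketch of the two factors: similarity-based retrieval to select a prompt and ensembling predictions over several prompt-query arrangements; the model interface and the arrangement callables are placeholders, not the prompt-SelF implementation.

```python
# Hypothetical prompt retrieval and fusion ensembling (assumed interface, not prompt-SelF).
import torch
import torch.nn.functional as F

def select_prompt(query_feat: torch.Tensor, bank_feats: torch.Tensor) -> int:
    """Pick the prompt whose pooled feature is most similar to the query image's feature."""
    sims = F.cosine_similarity(query_feat[None], bank_feats, dim=-1)
    return int(sims.argmax())

def ensemble_predictions(model, prompt, query, arrangements):
    """Run the frozen model once per prompt-query arrangement and average the predictions."""
    preds = [model(arrange(prompt, query)) for arrange in arrangements]
    return torch.stack(preds).mean(dim=0)

bank = torch.randn(50, 512)   # pooled features of 50 candidate prompts
query = torch.randn(512)      # pooled feature of the query image
print(select_prompt(query, bank))
```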
{"title":"Exploring Effective Factors for Improving Visual In-Context Learning","authors":"Yanpeng Sun;Qiang Chen;Jian Wang;Jingdong Wang;Zechao Li","doi":"10.1109/TIP.2025.3554410","DOIUrl":"10.1109/TIP.2025.3554410","url":null,"abstract":"The In-Context Learning (ICL) is to understand a new task via a few demonstrations (aka. prompt) and predict new inputs without tuning the models. While it has been widely studied in NLP, it is still a relatively new area of research in computer vision. To reveal the factors influencing the performance of visual in-context learning, this paper shows that Prompt Selection and Prompt Fusion are two major factors that have a direct impact on the inference performance of visual in-context learning. Prompt selection is the process of selecting the most suitable prompt for query image. This is crucial because high-quality prompts assist large-scale visual models in rapidly and accurately comprehending new tasks. Prompt fusion involves combining prompts and query images to activate knowledge within large-scale visual models. However, altering the prompt fusion method significantly impacts its performance on new tasks. Based on these findings, we propose a simple framework prompt-SelF to improve visual in-context learning. Specifically, we first use the pixel-level retrieval method to select a suitable prompt, and then use different prompt fusion methods to activate diverse knowledge stored in the large-scale vision model, and finally, ensemble the prediction results obtained from different prompt fusion methods to obtain the final prediction results. We conducted extensive experiments on single-object segmentation and detection tasks to demonstrate the effectiveness of prompt-SelF. Remarkably, prompt-SelF has outperformed OSLSM method-based meta-learning in 1-shot segmentation for the first time. This indicated the great potential of visual in-context learning. The source code and models will be available at <uri>https://github.com/syp2ysy/prompt-SelF</uri>.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"2147-2160"},"PeriodicalIF":0.0,"publicationDate":"2025-03-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143744924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Local Cross-Patch Activation From Multi-Direction for Weakly Supervised Object Localization
Pei Lv;Junying Ren;Genwang Han;Jiwen Lu;Mingliang Xu
Weakly supervised object localization (WSOL) learns to localize objects using only image-level labels. Recently, some studies have applied transformers to WSOL to capture long-range feature dependencies and alleviate the partial activation issue of CNN-based methods. However, existing transformer-based methods still face two challenges. The first is the over-activation of backgrounds: object boundaries and the background are often semantically similar, and localization models may misidentify the background as part of the objects. The second is the incomplete activation of occluded objects, since the transformer architecture ignores semantic and spatial coherence, making it difficult to capture local features across patches. To address these issues, in this paper we propose LCA-MD, a novel transformer-based WSOL method using local cross-patch activation from multiple directions, which can capture more details of local features while inhibiting background over-activation. In LCA-MD, first, combining contrastive learning with the transformer, we propose a token feature contrast module (TCM) that maximizes the difference between foregrounds and backgrounds and further separates them more accurately. Second, we propose a semantic-spatial fusion module (SFM), which leverages multi-directional perception to capture local cross-patch features and diffuse activation across occlusions. Experimental results on the CUB-200-2011 and ILSVRC datasets demonstrate that our LCA-MD is significantly superior and achieves state-of-the-art results in WSOL. The project code is available at https://github.com/rjy-fighting/LCA-MD.
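To make the token-level contrast concrete, the sketch below pulls foreground patch tokens toward their prototype and pushes background tokens away; the foreground mask source and the temperature are assumptions, not the paper's TCM.

```python
# Hypothetical foreground/background token contrast loss (not the LCA-MD code).
import torch
import torch.nn.functional as F

def token_contrast_loss(tokens: torch.Tensor, fg_mask: torch.Tensor, temperature: float = 0.1):
    """tokens: (N, D) patch embeddings; fg_mask: (N,) bool, True for foreground tokens."""
    tokens = F.normalize(tokens, dim=-1)
    fg_proto = F.normalize(tokens[fg_mask].mean(dim=0), dim=-1)
    logits = tokens @ fg_proto / temperature   # similarity of every token to the prototype
    target = fg_mask.float()                   # 1 = foreground, 0 = background
    return F.binary_cross_entropy_with_logits(logits, target)

tok = torch.randn(196, 384)
mask = torch.rand(196) > 0.5
print(token_contrast_loss(tok, mask))
```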
{"title":"Local Cross-Patch Activation From Multi-Direction for Weakly Supervised Object Localization","authors":"Pei Lv;Junying Ren;Genwang Han;Jiwen Lu;Mingliang Xu","doi":"10.1109/TIP.2025.3554398","DOIUrl":"10.1109/TIP.2025.3554398","url":null,"abstract":"Weakly supervised object localization (WSOL) learns to localize objects using only image-level labels. Recently, some studies apply transformers in WSOL to capture the long-range feature dependency and alleviate the partial activation issue of CNN-based methods. However, existing transformer-based methods still face two challenges. The first challenge is the over-activation of backgrounds. Specifically, the object boundaries and background are often semantically similar, and localization models may misidentify the background as a part of objects. The second challenge is the incomplete activation of occluded objects, since transformer architecture makes it difficult to capture local features across patches due to ignoring semantic and spatial coherence. To address these issues, in this paper, we propose LCA-MD, a novel transformer-based WSOL method using local cross-patch activation from multi-direction, which can capture more details of local features while inhibiting the background over-activation. In LCA-MD, first, combining contrastive learning with the transformer, we propose a token feature contrast module (TCM) that can maximize the difference between foregrounds and backgrounds and further separate them more accurately. Second, we propose a semantic-spatial fusion module (SFM), which leverages multi-directional perception to capture the local cross-patch features and diffuse activation across occlusions. Experiment results on the CUB-200-2011 and ILSVRC datasets demonstrate that our LCA-MD is significantly superior and has achieved state-of-the-art results in WSOL. The project code is available at <uri>https://github.com/rjy-fighting/LCA-MD</uri>.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"2213-2227"},"PeriodicalIF":0.0,"publicationDate":"2025-03-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143744926","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
IDENet: An Inter-Domain Equilibrium Network for Unsupervised Cross-Domain Person Re-Identification
Xi Yang;Wenjiao Dong;Gu Zheng;Nannan Wang;Xinbo Gao
Unsupervised person re-identification aims to retrieve a given pedestrian image from unlabeled data. For training on unlabeled data, clustering and assigning pseudo-labels has become the mainstream approach, but the pseudo-labels themselves are noisy and reduce accuracy. To overcome this problem, several pseudo-label improvement methods have been proposed. However, on the one hand, they use only target-domain data for fine-tuning and do not make sufficient use of the high-quality labeled data in the source domain; on the other hand, they ignore the critical fine-grained features of pedestrians and the overfitting problem in the later training period. In this paper, we propose a novel unsupervised cross-domain person re-identification network (IDENet) based on an inter-domain equilibrium structure to improve the quality of pseudo-labels. Specifically, we make full use of both source-domain and target-domain information and construct a small learning network to equalize label allocation between the two domains. Building on this, we also develop a dynamic neural network with adaptive convolution kernels that generates adaptive residuals for adapting domain-agnostic deep fine-grained features. In addition, we design the network structure based on ordinary differential equations and embed modules to solve the problem of network overfitting. Extensive cross-domain experimental results on Market1501, PersonX, and MSMT17 show that our proposed method outperforms state-of-the-art methods.
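For context, the clustering-and-pseudo-labelling step that such pipelines build on can be sketched with DBSCAN over L2-normalized target-domain features, as below; the parameters are illustrative, and this is not IDENet's inter-domain equilibrium structure itself.

```python
# Illustrative pseudo-label assignment by clustering (assumed parameters).
import numpy as np
from sklearn.cluster import DBSCAN

def assign_pseudo_labels(features: np.ndarray, eps: float = 0.6) -> np.ndarray:
    """Cluster L2-normalized features; label -1 marks outliers dropped from the next epoch."""
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    return DBSCAN(eps=eps, min_samples=4, metric="euclidean").fit_predict(feats)

feats = np.random.randn(500, 256).astype(np.float32)
pseudo = assign_pseudo_labels(feats)
print(len(set(pseudo)) - (1 if -1 in pseudo else 0), "clusters")
```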
{"title":"IDENet: An Inter-Domain Equilibrium Network for Unsupervised Cross-Domain Person Re-Identification","authors":"Xi Yang;Wenjiao Dong;Gu Zheng;Nannan Wang;Xinbo Gao","doi":"10.1109/TIP.2025.3554408","DOIUrl":"10.1109/TIP.2025.3554408","url":null,"abstract":"Unsupervised person re-identification aims to retrieve a given pedestrian image from unlabeled data. For training on the unlabeled data, the method of clustering and assigning pseudo-labels has become mainstream, but the pseudo-labels themselves are noisy and will reduce the accuracy. To overcome this problem, several pseudo-label improvement methods have been proposed. But on the one hand, they only use target domain data for fine-tuning and do not make sufficient use of high-quality labeled data in the source domain. On the other hand, they ignore the critical fine-grained features of pedestrians and overfitting problems in the later training period. In this paper, we propose a novel unsupervised cross-domain person re-identification network (IDENet) based on an inter-domain equilibrium structure to improve the quality of pseudo-labels. Specifically, we make full use of both source domain and target domain information and construct a small learning network to equalize label allocation between the two domains. Based on it, we also develop a dynamic neural network with adaptive convolution kernels to generate adaptive residuals for adapting domain-agnostic deep fine-grained features. In addition, we design the network structure based on ordinary differential equations and embed modules to solve the problem of network overfitting. Extensive cross-domain experimental results on Market1501, PersonX, and MSMT17 prove that our proposed method outperforms the state-of-the-art methods.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"2133-2146"},"PeriodicalIF":0.0,"publicationDate":"2025-03-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143744939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Segment Anything Model Is a Good Teacher for Local Feature Learning
Jingqian Wu;Rongtao Xu;Zach Wood-Doughty;Changwei Wang;Shibiao Xu;Edmund Y. Lam
Local feature detection and description play an important role in many computer vision tasks and are designed to detect and describe keypoints in any scene and for any downstream task. Data-driven local feature learning methods rely on pixel-level correspondence for training. However, a vast number of existing approaches ignore the semantic information on which humans rely to describe image pixels. In addition, it is not feasible to enhance generic scene keypoint detection and description simply by using traditional semantic segmentation models, because they can only recognize a limited number of coarse-grained object classes. In this paper, we propose SAMFeat, which introduces SAM (Segment Anything Model), a foundation model trained on 11 million images, as a teacher to guide local feature learning. SAMFeat learns the additional semantic information brought by SAM and thus achieves higher performance even with limited training samples. To do so, first, we construct an auxiliary task of Attention-weighted Semantic Relation Distillation (ASRD), which adaptively distills feature relations, carrying the category-agnostic semantic information learned by the SAM encoder, into a local feature learning network to improve local feature description using semantic discrimination. Second, we develop a technique called Weakly Supervised Contrastive Learning Based on Semantic Grouping (WSC), which utilizes semantic groupings derived from SAM as weakly supervised signals to optimize the metric space of local descriptors. Third, we design an Edge Attention Guidance (EAG) to further improve the accuracy of local feature detection and description by prompting the network to pay more attention to the edge regions guided by SAM. SAMFeat's performance on various tasks, such as image matching on HPatches and long-term visual localization on Aachen Day-Night, showcases its superiority over previous local features. The released code is available at https://github.com/vignywang/SAMFeat.
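A minimal sketch of relation distillation: match the pairwise cosine-similarity matrices of the student features and the frozen SAM-encoder features so the student inherits SAM's semantic grouping. The uniform weighting here stands in for the paper's attention weighting, and the function is illustrative rather than the released SAMFeat code.

```python
# Hypothetical relation distillation loss (uniform weighting, not the ASRD module).
import torch
import torch.nn.functional as F

def relation_distillation_loss(student_feats: torch.Tensor, teacher_feats: torch.Tensor):
    """Both inputs are (N, D) token/pixel features; widths may differ between networks."""
    s = F.normalize(student_feats, dim=-1)
    t = F.normalize(teacher_feats, dim=-1)
    rel_s = s @ s.t()   # (N, N) student relation matrix
    rel_t = t @ t.t()   # (N, N) teacher (SAM) relation matrix
    return F.mse_loss(rel_s, rel_t)

student = torch.randn(256, 128)
teacher = torch.randn(256, 256)   # SAM features typically have a different width
print(relation_distillation_loss(student, teacher))
```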
{"title":"Segment Anything Model Is a Good Teacher for Local Feature Learning","authors":"Jingqian Wu;Rongtao Xu;Zach Wood-Doughty;Changwei Wang;Shibiao Xu;Edmund Y. Lam","doi":"10.1109/TIP.2025.3554033","DOIUrl":"10.1109/TIP.2025.3554033","url":null,"abstract":"Local feature detection and description play an important role in many computer vision tasks, which are designed to detect and describe keypoints in any scene and any downstream task. Data-driven local feature learning methods need to rely on pixel-level correspondence for training. However, a vast number of existing approaches ignored the semantic information on which humans rely to describe image pixels. In addition, it is not feasible to enhance generic scene keypoints detection and description simply by using traditional common semantic segmentation models because they can only recognize a limited number of coarse-grained object classes. In this paper, we propose SAMFeat to introduce SAM (segment anything model), a foundation model trained on 11 million images, as a teacher to guide local feature learning. SAMFeat learns additional semantic information brought by SAM and thus is inspired by higher performance even with limited training samples. To do so, first, we construct an auxiliary task of Attention-weighted Semantic Relation Distillation (ASRD), which adaptively distillates feature relations with category-agnostic semantic information learned by the SAM encoder into a local feature learning network, to improve local feature description using semantic discrimination. Second, we develop a technique called Weakly Supervised Contrastive Learning Based on Semantic Grouping (WSC), which utilizes semantic groupings derived from SAM as weakly supervised signals, to optimize the metric space of local descriptors. Third, we design an Edge Attention Guidance (EAG) to further improve the accuracy of local feature detection and description by prompting the network to pay more attention to the edge region guided by SAM. SAMFeat’s performance on various tasks, such as image matching on HPatches, and long-term visual localization on Aachen Day-Night showcases its superiority over previous local features. The release code is available at <uri>https://github.com/vignywang/SAMFeat</uri>.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"2097-2111"},"PeriodicalIF":0.0,"publicationDate":"2025-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143733961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Frequency-Spatial Complementation: Unified Channel-Specific Style Attack for Cross-Domain Few-Shot Learning
Zhong Ji;Zhilong Wang;Xiyao Liu;Yunlong Yu;Yanwei Pang;Jungong Han
Cross-Domain Few-Shot Learning (CD-FSL) addresses the challenge of recognizing targets from out-of-domain data when only a few instances are available. Many current CD-FSL approaches focus primarily on enhancing the generalization capability of models in the spatial domain, neglecting the role of the frequency domain in domain generalization. To take advantage of the frequency domain in processing global information, we propose a Frequency-Spatial Complementation (FSC) model, which combines frequency-domain information with spatial-domain information to learn domain-invariant information from style-attacked data. Specifically, we design a Frequency and Spatial Fusion (FusionFS) module to enhance the ability of the model to capture style-related information. Besides, we propose two attack strategies, i.e., the Gradient-guided Unified Style Attack (GUSA) strategy and the Channel-specific Attack Intensity Calculation (CAIC) strategy, which conduct targeted attacks on different channels to provide more diversified style data during the training phase, especially in single-source-domain scenarios where the source-domain data style is homogeneous. Extensive experiments across eight target domains demonstrate that our method significantly improves the model's performance under various styles.
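As a rough picture of a frequency-domain style attack, the sketch below jitters the per-channel amplitude spectrum while preserving phase, so image content is kept but low-level style statistics change; the noise model and strength are assumptions, not the paper's GUSA or CAIC strategies.

```python
# Hypothetical amplitude-spectrum style perturbation (not the FSC attack itself).
import torch

def amplitude_style_attack(img: torch.Tensor, strength: float = 0.3) -> torch.Tensor:
    """img: (C, H, W). Scale each channel's amplitude spectrum by random noise, keep phase."""
    spec = torch.fft.fft2(img)
    amp, phase = spec.abs(), spec.angle()
    noise = 1.0 + strength * torch.randn(img.shape[0], 1, 1)  # per-channel style jitter
    attacked = torch.polar(amp * noise, phase)
    return torch.fft.ifft2(attacked).real

x = torch.rand(3, 224, 224)
print(amplitude_style_attack(x).shape)  # torch.Size([3, 224, 224])
```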
{"title":"Frequency-Spatial Complementation: Unified Channel-Specific Style Attack for Cross-Domain Few-Shot Learning","authors":"Zhong Ji;Zhilong Wang;Xiyao Liu;Yunlong Yu;Yanwei Pang;Jungong Han","doi":"10.1109/TIP.2025.3553781","DOIUrl":"10.1109/TIP.2025.3553781","url":null,"abstract":"Cross-Domain Few-Shot Learning (CD-FSL) addresses the challenges of recognizing targets with out-of-domain data when only a few instances are available. Many current CD-FSL approaches primarily focus on enhancing the generalization capabilities of models in spatial domain, which neglects the role of the frequency domain in domain generalization. To take advantage of frequency domain in processing global information, we propose a Frequency-Spatial Complementation (FSC) model, which combines frequency domain information with spatial domain information to learn domain-invariant information from attacked data style. Specifically, we design a Frequency and Spatial Fusion (FusionFS) module to enhance the ability of the model to capture style-related information. Besides, we propose two attack strategies, i.e., the Gradient-guided Unified Style Attack (GUSA) strategy and the Channel-specific Attack Intensity Calculation (CAIC) strategy, which conduct targeted attacks on different channels to provide more diversified style data during the training phase, especially in single-source domain scenarios where the source domain data style is homogeneous. Extensive experiments across eight target domains demonstrate that our method significantly improves the model’s performance under various styles.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"2242-2253"},"PeriodicalIF":0.0,"publicationDate":"2025-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143733994","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0