Pub Date: 2025-11-28 | DOI: 10.1109/TIP.2025.3633580
Boosting Geometric Invariants for Discriminative Forensics of Large-Scale Generated Visual Content
Shuren Qi; Chao Wang; Zhiqiu Huang; Yushu Zhang; Xiangyu Chen; Yi Zhang; Tieyong Zeng; Fenglei Fan
Generative artificial intelligence has become so successful at visual content synthesis that humans struggle to distinguish real from synthesized images. Forensic research seeks to reveal artifacts in such generated images, serving information security and, in turn, the improvement of generation itself. For forensics to be trustworthy, robustness and interpretability are essential. However, typical forensic models and their underlying data representations rely on empirical learning algorithms, which cannot meet robustness and interpretability requirements that go beyond the training experience. As an effective alternative, we extend classical geometric invariants to the forensics of large-scale generated images. Invariants are handcrafted representations built on robust and interpretable geometric principles, but their discriminability falls far short of today’s large-scale forensic tasks. We boost this discriminability by extending the classical invariants into the hierarchical architecture of convolutional neural networks. The resulting overcompleteness allows an automatic selection of task-discriminative features while retaining the original advantages of robustness and interpretability. From generative adversarial networks to diffusion models, forensics with our boosted invariants achieves state-of-the-art discriminability under large-scale content diversity. It also exhibits high training-sample efficiency, intrinsic invariance to geometric variations, and better interpretability of the forensic process.
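The classical geometric invariants referred to above include moment invariants in the style of Hu. As a concrete reference point, the NumPy sketch below computes the first two Hu invariants (translation-, scale-, and rotation-invariant) from normalized central moments; this is a minimal illustration of the handcrafted building blocks, not the paper's boosted hierarchical construction, and all function and variable names are ours.

```python
import numpy as np

def hu_invariants(img):
    """First two Hu moment invariants of a grayscale patch (translation-,
    scale-, and rotation-invariant). Minimal illustration only."""
    img = img.astype(np.float64)
    h, w = img.shape
    y, x = np.mgrid[:h, :w]
    m00 = img.sum()
    xc, yc = (x * img).sum() / m00, (y * img).sum() / m00

    def mu(p, q):                      # central moment mu_pq
        return ((x - xc) ** p * (y - yc) ** q * img).sum()

    def eta(p, q):                     # scale-normalized central moment
        return mu(p, q) / m00 ** (1 + (p + q) / 2)

    phi1 = eta(2, 0) + eta(0, 2)
    phi2 = (eta(2, 0) - eta(0, 2)) ** 2 + 4 * eta(1, 1) ** 2
    return np.array([phi1, phi2])

# Example: the invariants are unchanged (up to float error) under a 90-degree rotation.
rng = np.random.default_rng(0)
patch = rng.random((32, 32))
print(hu_invariants(patch))
print(hu_invariants(np.rot90(patch)))
```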
{"title":"Boosting Geometric Invariants for Discriminative Forensics of Large-Scale Generated Visual Content","authors":"Shuren Qi;Chao Wang;Zhiqiu Huang;Yushu Zhang;Xiangyu Chen;Yi Zhang;Tieyong Zeng;Fenglei Fan","doi":"10.1109/TIP.2025.3633580","DOIUrl":"10.1109/TIP.2025.3633580","url":null,"abstract":"Generative artificial intelligence has shown great success in visual content synthesis such that humans struggle to distinguish between real and synthesized images. Forensic research seeks to reveal artifacts in such generated images, ensuring information security or improving generation capability. In this regard, the robustness and interpretability are important for the trustworthy purpose of forensic tasks. However, typical forensic models and their underlying data representations rely on empirical learning algorithms, which cannot effectively handle the high robustness and interpretability requirements beyond experience. As an effective solution, we extend the classical geometric invariants to the forensic research of large-scale generated images. Invariants are handcrafted representations with robust and interpretable geometric principles. However, their discriminability is far from the large scale of today’s forensic tasks. We boost the discriminability by extending the classical invariants to the hierarchical architecture of convolutional neural networks. The resulting overcompleteness allows for an automatic selection of task-discriminative features, while retaining the previous advantages of robustness and interpretability. From generative adversarial networks to diffusion models, the forensic with our boosted invariants demonstrates state-of-the-art discriminability against large-scale content diversity. It also exhibits high efficiency on training examples, intrinsic invariance to geometric variations, and better interpretability of the forensic process.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"7959-7974"},"PeriodicalIF":13.7,"publicationDate":"2025-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145613236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-27 | DOI: 10.1109/TIP.2025.3635025
DeepDC: Deep Distance Correlation as a Perceptual Image Quality Evaluator
Hanwei Zhu; Baoliang Chen; Lingyu Zhu; Shiqi Wang; Weisi Lin
Deep neural networks pre-trained on ImageNet have demonstrated remarkable transferability for developing effective full-reference image quality assessment (FR-IQA) models. However, existing approaches typically demand pixel-level alignment between reference and distorted images—a requirement that poses significant challenges in practical scenarios involving natural photography and texture similarity evaluation. To address this limitation, we propose a novel FR-IQA model leveraging deep statistical similarity derived from pre-trained features without relying on spatial co-location of these features or requiring fine-tuning with mean opinion scores. Specifically, we employ distance correlation, a potent yet relatively underexplored statistical measure, to quantify similarity between reference and distorted images within a deep feature space. The distance correlation is computed via the ratio of the distance covariance to the product of their respective distance standard deviations, for which we derive a closed-form solution using the inner product of deep double-centered distance matrices. Extensive experimental evaluations across diverse IQA benchmarks demonstrate the superiority and robustness of the proposed model. Furthermore, we demonstrate the utility of our model for optimizing texture synthesis and neural style transfer tasks, achieving state-of-the-art performance in both quantitative measures and qualitative assessments. The implementation is publicly available at https://github.com/h4nwei/DeepDC
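As a reference for the computation described above (distance covariance over the product of distance standard deviations, via double-centered distance matrices), here is a minimal NumPy sketch of the biased sample distance correlation between two deep feature sets. It is an illustrative implementation of the statistic itself, not the released DeepDC code; names are ours.

```python
import numpy as np

def distance_correlation(X, Y):
    """Biased sample distance correlation between two feature sets.
    X: (n, p) array, Y: (n, q) array; rows are paired observations
    (e.g., per-location deep features of reference and distorted images)."""
    def centered_dist(Z):
        D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)  # pairwise distances
        # double centering: subtract row and column means, add the grand mean
        return D - D.mean(0, keepdims=True) - D.mean(1, keepdims=True) + D.mean()

    A, B = centered_dist(X), centered_dist(Y)
    dcov2 = (A * B).mean()                         # squared distance covariance
    dvar_x, dvar_y = (A * A).mean(), (B * B).mean()
    denom = np.sqrt(np.sqrt(dvar_x * dvar_y))      # product of distance std. deviations
    return np.sqrt(max(dcov2, 0.0)) / denom if denom > 0 else 0.0

# Toy check: strongly dependent feature sets give a value near 1.
rng = np.random.default_rng(0)
F_ref = rng.standard_normal((64, 128))
F_dis = F_ref + 0.05 * rng.standard_normal((64, 128))
print(distance_correlation(F_ref, F_dis))
```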
{"title":"DeepDC: Deep Distance Correlation as a Perceptual Image Quality Evaluator","authors":"Hanwei Zhu;Baoliang Chen;Lingyu Zhu;Shiqi Wang;Weisi Lin","doi":"10.1109/TIP.2025.3635025","DOIUrl":"10.1109/TIP.2025.3635025","url":null,"abstract":"Deep neural networks pre-trained on ImageNet have demonstrated remarkable transferability for developing effective full-reference image quality assessment (FR-IQA) models. However, existing approaches typically demand pixel-level alignment between reference and distorted images—a requirement that poses significant challenges in practical scenarios involving natural photography and texture similarity evaluation. To address this limitation, we propose a novel FR-IQA model leveraging deep statistical similarity derived from pre-trained features without relying on spatial co-location of these features or requiring fine-tuning with mean opinion scores. Specifically, we employ distance correlation, a potent yet relatively underexplored statistical measure, to quantify similarity between reference and distorted images within a deep feature space. The distance correlation is computed via the ratio of the distance covariance to the product of their respective distance standard deviations, for which we derive a closed-form solution using the inner product of deep double-centered distance matrices. Extensive experimental evaluations across diverse IQA benchmarks demonstrate the superiority and robustness of the proposed model. Furthermore, we demonstrate the utility of our model for optimizing texture synthesis and neural style transfer tasks, achieving state-of-the-art performance in both quantitative measures and qualitative assessments. The implementation is publicly available at <uri>https://github.com/h4nwei/DeepDC</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"7859-7873"},"PeriodicalIF":13.7,"publicationDate":"2025-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145610931","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-27 | DOI: 10.1109/TIP.2025.3635483
Quality-Aware Spatio-Temporal Transformer Network for RGBT Tracking
Zhaodong Ding; Chenglong Li; Tao Wang; Futian Wang
Transformer-based RGBT tracking has attracted much attention due to the strong modeling capacity of self-attention and cross-attention mechanisms. These attention mechanisms exploit correlations among tokens to construct powerful feature representations, but they are easily affected by low-quality tokens. To address this issue, we propose a novel Quality-aware Spatio-temporal Transformer Network (QSTNet) for robust RGBT tracking, which computes quality weights for search-region tokens from their correlation with multimodal template tokens, thereby suppressing the negative effects of low-quality tokens in spatio-temporal feature representations. In particular, we argue that the correlation between one modality’s search tokens and the multimodal template tokens reflects the quality of those search tokens, and we therefore design the Quality-aware Token Weighting Module (QTWM) around the correlation matrix of search and template tokens. Specifically, we compute a difference matrix from the attention matrices between the search tokens of both modalities and the multimodal template tokens, and assign each search token a quality weight based on this difference matrix, which reflects the relative correlation of search tokens from different modalities to the multimodal template tokens. In addition, we propose a Prompt-based Spatio-temporal Encoder Module (PSEM) to exploit spatio-temporal multimodal information while alleviating the impact of low-quality spatio-temporal features. Extensive experiments on four RGBT benchmark datasets demonstrate that the proposed QSTNet outperforms other state-of-the-art tracking methods. Our code and supplementary video are available at https://zhaodongah.github.io/QSTNet
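One plausible reading of the quality-weighting idea is sketched below, under the assumption that raw search-to-template attention scores are available per modality; the aggregation rule, the sigmoid mapping, and all names are our assumptions, and the paper's QTWM may differ in detail.

```python
import numpy as np

def quality_weights(score_vis, score_ir, tau=1.0):
    """Toy sketch: per-token quality weights for visible-modality search tokens,
    derived from how much more strongly they correlate with the multimodal
    template than their infrared counterparts do. score_*: (Ns, Nt) raw
    attention scores (e.g., scaled dot products between search and template
    tokens). Not the paper's QTWM formulation."""
    diff = score_vis - score_ir                 # difference matrix
    s = diff.mean(axis=1)                       # aggregate over template tokens
    return 1.0 / (1.0 + np.exp(-s / tau))       # squash to (0, 1) weights

rng = np.random.default_rng(0)
sv = rng.standard_normal((16, 8))               # 16 search tokens, 8 template tokens
si = rng.standard_normal((16, 8))
w = quality_weights(sv, si)                     # would rescale the visible search tokens
print(w.round(2))
```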
{"title":"Quality-Aware Spatio-Temporal Transformer Network for RGBT Tracking","authors":"Zhaodong Ding;Chenglong Li;Tao Wang;Futian Wang","doi":"10.1109/TIP.2025.3635483","DOIUrl":"10.1109/TIP.2025.3635483","url":null,"abstract":"Transformer-based RGBT tracking has attracted much attention due to the strong modeling capacity of self attention and cross attention mechanisms. These attention mechanisms utilize the correlations among tokens to construct powerful feature representations, but are easily affected by low-quality tokens. To address this issue, we propose a novel Quality-aware Spatio-temporal Transformer Network (QSTNet), which calculates the quality weights of tokens in search regions based on the correlation with multimodal template tokens to suppress the negative effects of low-quality tokens in spatio-temporal feature representations, for robust RGBT tracking. In particular, we argue that the correlation between search tokens of one modality and multimodal template tokens could reflect the quality of these search tokens, and thus design the Quality-aware Token Weighting Module (QTWM) based on the correlation matrix of search and template tokens to suppress the negative effects of low-quality tokens. Specifically, we calculate the difference matrix derived from the attention matrices of the search tokens from both modalities and the multimodal template tokens, and then assign the quality weight for each search token based on the difference matrix, which reflects the relative correlation of search tokens from different modalities to multimodal template tokens. In addition, we propose the Prompt-based Spatio-temporal Encoder Module (PSEM) to utilize spatio-temporal multimodal information while alleviating the impact of low-quality spatio-temporal features. Extensive experiments on four RGBT benchmark datasets demonstrate that the proposed QSTNet exhibits superior performance compared to other state-of-the-art tracking methods. Our code and supplementary video are now available: <uri>https://zhaodongah.github.io/QSTNet</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"7845-7858"},"PeriodicalIF":13.7,"publicationDate":"2025-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145610933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-26 | DOI: 10.1109/TIP.2025.3635479
A Hyperspectral Change Detection Method for Small Vehicles
Shuyi Xu; He Sun; Xu Sun; Li Ni; Lianru Gao
Small vehicle (SV) detection is crucial for urban security and traffic management. However, detecting such targets from a single image is challenging because their dynamic movements are difficult to discern. In this paper, we propose IFNet, a deep joint image-level and feature-level processing network designed to detect changes in SVs using bi-temporal hyperspectral images. At the image level, a new Gumbel-Softmax trick (GS)-based band selection strategy is introduced to handle the inconsistent spectral resolutions of bi-temporal images. At the feature level, to capture the edge and shape details of SVs, we propose a feature-based edge enhancement module that extracts target edges from high-level difference features; the refined change map is then generated under the guidance of the edge map. Moreover, current deep learning-based hyperspectral change detection (HCD) methods are limited by the available HCD datasets. We therefore release a benchmark dataset, the Hyperspectral Vehicle Change Detection (HVCD) dataset, which consists of 201 pairs of aerial hyperspectral images, each of size $256\times 256$, with inconsistent spectral resolutions across the bi-temporal data. Extensive experiments on the HVCD dataset demonstrate that IFNet achieves state-of-the-art performance at an acceptable computational cost.
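For reference, the Gumbel-Softmax trick behind such a band selection strategy can be sketched in a few lines of NumPy: Gumbel noise added to per-band logits, followed by a temperature-controlled softmax, yields a nearly one-hot band weighting that is differentiable with respect to the logits when written in a training framework. This is a generic illustration, not the IFNet implementation; names are ours.

```python
import numpy as np

def gumbel_softmax_select(logits, tau=0.5, rng=None):
    """Draw a relaxed (nearly one-hot) band-selection vector from per-band
    logits using the Gumbel-Softmax trick. Illustrative only."""
    rng = rng or np.random.default_rng()
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0, 1) noise
    z = (logits + g) / tau
    z -= z.max()                                           # numerical stability
    w = np.exp(z)
    return w / w.sum()

# Softly select among 30 spectral bands and reduce a hyperspectral cube.
rng = np.random.default_rng(0)
logits = rng.standard_normal(30)                 # learnable in practice
w = gumbel_softmax_select(logits, tau=0.5, rng=rng)
cube = rng.random((30, 64, 64))                  # (bands, H, W)
reduced = np.tensordot(w, cube, axes=1)          # weighted band combination -> (64, 64)
print(w.argmax(), reduced.shape)
```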
{"title":"A Hyperspectral Change Detection Method for Small Vehicles","authors":"Shuyi Xu;He Sun;Xu Sun;Li Ni;Lianru Gao","doi":"10.1109/TIP.2025.3635479","DOIUrl":"10.1109/TIP.2025.3635479","url":null,"abstract":"Small vehicles (SV) detection is crucial for urban security and traffic management. However, detecting such targets from a single image presents significant challenges due to the difficulty in discerning their dynamic movements. In this paper, we propose a deep joint image-level and feature-level processing network, IFNet, designed for detecting changes in SV using bi-temporal hyperspectral images. At the image-level, a new Gumbel Softmax trick (GS)-based band selection strategy is introduced to address the problem of inconsistent spectral resolutions of bi-temporal images. At the feature-level, to tackle the challenge of capturing edge and shape details of SV, we propose a feature-based edge enhancement module, it can extract the target edge using high-level difference features, and the refined change map will be generated with the guidance of the edge map. Moreover, current deep learning-based hyperspectral change detection (HCD) methods are limited by HCD datasets. Therefore, we propose a benchmark dataset, the Hyperspectral Vehicle Change Detection (HVCD) dataset, which consists of 201 pairs of aerial hyperspectral images, each with a size of <inline-formula> <tex-math>$256times 256$ </tex-math></inline-formula>, and exhibits inconsistent spectral resolutions across the bi-temporal data. Extensive experiments conducted on the HVCD dataset demonstrate that our IFNet obtains state-of-the-art performance with an acceptable computational cost.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"7874-7888"},"PeriodicalIF":13.7,"publicationDate":"2025-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145609209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-25 | DOI: 10.1109/TIP.2025.3634988
X-Fake: Juggling Utility Evaluation and Explanation of Simulated SAR Images
Zhongling Huang; Yihan Zhuang; Zipei Zhong; Feng Xu; Gong Cheng; Junwei Han
Synthetic aperture radar (SAR) image simulation has attracted much attention for its potential to supplement the scarce training data of deep learning algorithms. Evaluating the quality of simulated SAR images is therefore crucial for practical applications. The current literature primarily relies on image quality assessment (IQA) techniques that reflect human observers’ perception. However, because of the unique imaging mechanism of SAR, these techniques may produce evaluation results that are not entirely valid. The distribution inconsistency between real and simulated data is the main obstacle that limits the utility of simulated SAR images. To this end, we propose X-Fake, the first trustworthy utility evaluation framework with counterfactual explanation for simulated SAR images. It unifies a probabilistic evaluator and a causal explainer to achieve a trustworthy utility assessment. We construct the evaluator as a probabilistic Bayesian deep model that learns the posterior distribution conditioned on real data; quantitatively, the predicted uncertainty of simulated data reflects the distribution discrepancy. We build the causal explainer with an introspective variational auto-encoder (IntroVAE) to generate high-resolution counterfactuals. The latent code of IntroVAE is optimized with the evaluation indicators and prior information to generate the counterfactual explanation, thus explicitly revealing the inauthentic details of simulated data. The proposed framework is validated on four simulated SAR image datasets obtained from electromagnetic models and generative artificial intelligence approaches. The results demonstrate that X-Fake outperforms other IQA methods in terms of utility, and that the generated counterfactual explanations are trustworthy and can further improve data utility in applications.
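The evaluator's key signal is predictive uncertainty under a learned posterior. As a rough stand-in (explicitly not the paper's Bayesian evaluator), the sketch below estimates per-sample uncertainty with Monte Carlo dropout on a toy classifier; all weights and names are illustrative.

```python
import numpy as np

def mc_dropout_uncertainty(x, W1, W2, p=0.5, T=50, rng=None):
    """Predictive mean and variance of a toy 2-layer classifier under Monte
    Carlo dropout: T stochastic forward passes with dropout kept on at test
    time. Higher variance indicates inputs the model is less certain about
    (a crude proxy for the paper's probabilistic evaluator)."""
    rng = rng or np.random.default_rng()
    probs = []
    for _ in range(T):
        h = np.maximum(x @ W1, 0.0)
        h *= rng.binomial(1, 1 - p, size=h.shape) / (1 - p)   # dropout mask
        logits = h @ W2
        e = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs.append(e / e.sum(axis=-1, keepdims=True))
    probs = np.stack(probs)                                   # (T, n, classes)
    return probs.mean(0), probs.var(0).mean(-1)               # mean prediction, uncertainty

rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((64, 32)), rng.standard_normal((32, 10))
x = rng.standard_normal((8, 64))                  # e.g., pooled features of 8 simulated images
mean_pred, uncertainty = mc_dropout_uncertainty(x, W1, W2, rng=rng)
print(uncertainty.round(4))                       # larger values -> less trustworthy samples
```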
{"title":"X-Fake: Juggling Utility Evaluation and Explanation of Simulated SAR Images","authors":"Zhongling Huang;Yihan Zhuang;Zipei Zhong;Feng Xu;Gong Cheng;Junwei Han","doi":"10.1109/TIP.2025.3634988","DOIUrl":"10.1109/TIP.2025.3634988","url":null,"abstract":"Synthetic aperture radar (SAR) image simulation has attracted much attention due to its great potential to supplement the scarce training data for deep learning algorithms. Consequently, evaluating the quality of the simulated SAR image is crucial for practical applications. The current literature primarily uses image quality assessment (IQA) techniques for evaluation that rely on human observers’ perceptions. However, because of the unique imaging mechanism of SAR, these techniques may produce evaluation results that are not entirely valid. The distribution inconsistency between real and simulated data is the main obstacle that influences the utility of simulated SAR images. To this end, we propose a novel trustworthy utility evaluation framework with a counterfactual explanation for simulated SAR images for the first time, denoted as X-Fake. It unifies a probabilistic evaluator and a causal explainer to achieve a trustworthy utility assessment. We construct the evaluator using a probabilistic Bayesian deep model to learn the posterior distribution, conditioned on real data. Quantitatively, the predicted uncertainty of simulated data can reflect the distribution discrepancy. We build the causal explainer with an introspective variational auto-encoder (IntroVAE) to generate high-resolution counterfactuals. The latent code of IntroVAE is finally optimized with evaluation indicators and prior information to generate the counterfactual explanation, thus revealing the inauthentic details of simulated data explicitly. The proposed framework is validated on four simulated SAR image datasets obtained from electromagnetic models and generative artificial intelligence approaches. The results demonstrate the proposed X-Fake framework outperforms other IQA methods in terms of utility. Furthermore, the results illustrate that the generated counterfactual explanations are trustworthy, and can further improve the data utility in applications.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"7830-7844"},"PeriodicalIF":13.7,"publicationDate":"2025-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145599029","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-25 | DOI: 10.1109/TIP.2025.3635048
Text-Guided Semantic Alignment Network With Spatial-Frequency Interaction for Infrared-Visible Image Fusion Under Extreme Illumination
Guanghui Yue; Wentao Li; Cheng Zhao; Zhiliang Wu; Tianwei Zhou; Qiuping Jiang; Runmin Cong
Although text-guided infrared-visible image fusion helps improve content understanding under extreme illumination, existing methods usually ignore the semantic differences between textual and visual features, resulting in limited improvement. To address this challenge, we propose a Text-Guided Semantic Alignment Network, termed TSANet, for extreme-illumination infrared-visible image fusion. The network follows an encoder-decoder structure with two image encoders, two text encoders, and one decoder, and uses a Semantic Alignment and Fusion (SAF) block to bridge the two image encoders in each layer. Specifically, the SAF block consists of two parallel Semantic Alignment (SA) modules, corresponding to the infrared and visible modalities, and a Spatial-Frequency Interaction (SFI) module. The SA module aligns the visual feature from the image encoder with its corresponding textual feature from the text encoder, guiding the network to focus on key semantic regions of the infrared and visible images. The SFI module aggregates the spatial and frequency information extracted from the modality-aligned features of the two SA modules for complementary representation learning. The network progressively complements the two image modalities by connecting the SAF blocks from top to bottom, and finally produces a visually pleasing fusion result by feeding the output of the last block into the decoder. Recognizing that existing datasets lack illumination diversity, we contribute a new dataset specifically designed for extreme-illumination image fusion. Extensive experiments show the effectiveness and superiority of TSANet over seven state-of-the-art methods. The source code and dataset are available at https://github.com/WentaoLi-CV/TSANet
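The alignment step can be pictured as text-to-image cross-attention, where visual tokens query textual features so that semantically relevant regions are emphasized. The sketch below is a generic, simplified stand-in for the SA module under assumed shapes and projection matrices, not its actual design.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def text_guided_alignment(vis_tokens, txt_tokens, Wq, Wk, Wv):
    """Generic cross-attention sketch: visual tokens (queries) attend to text
    tokens (keys/values), so regions matching the textual description are
    emphasized. A simplified stand-in, not the exact SA module."""
    Q, K, V = vis_tokens @ Wq, txt_tokens @ Wk, txt_tokens @ Wv
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))        # (Nv, Nt)
    return vis_tokens + attn @ V                          # residual update of visual tokens

rng = np.random.default_rng(0)
d = 64
vis = rng.standard_normal((196, d))     # e.g., 14x14 visual tokens from one encoder layer
txt = rng.standard_normal((12, d))      # textual features from the text encoder
Wq, Wk, Wv = (rng.standard_normal((d, d)) * d ** -0.5 for _ in range(3))
aligned = text_guided_alignment(vis, txt, Wq, Wk, Wv)
print(aligned.shape)
```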
{"title":"Text-Guided Semantic Alignment Network With Spatial-Frequency Interaction for Infrared-Visible Image Fusion Under Extreme Illumination","authors":"Guanghui Yue;Wentao Li;Cheng Zhao;Zhiliang Wu;Tianwei Zhou;Qiuping Jiang;Runmin Cong","doi":"10.1109/TIP.2025.3635048","DOIUrl":"10.1109/TIP.2025.3635048","url":null,"abstract":"Although text-guided infrared-visible image fusion helps improve content understanding under extreme illumination, existing methods usually ignore semantic differences between textual and visual features, resulting in limited improvement. To address this challenge, we propose a Text-Guided Semantic Alignment Network, termed TSANet, for extreme-illumination infrared-visible image fusion. The network follows an encoder-decoder structure, with two image encoders, two text encoders, and one decoder. It uses a Semantic Alignment and Fusion (SAF) block to bridge the two image encoders in each layer. Specifically, the SAF block consists of two parallel Semantic Alignment (SA) modules, corresponding to the infrared and visible modalities, respectively, and a Spatial-Frequency Interaction (SFI) module. The SA module aligns the visual feature from the image encoder with its corresponding textual feature from the text encoder, to guide the network focus on key semantic regions of infrared and visible images. The SFI module aggregates the spatial and frequency information extracted from the modality-aligned features of two SA modules for complementary representation learning. The network progressively complements two image modalities by connecting the SAF blocks from top to down, and finally provides a visually pleasing fusion effect by feeding the output of the last block into the decoder. Recognizing that existing datasets lack illumination diversity, we contribute a new dataset specifically designed for extreme-illumination image fusion. Extensive experiments show the effectiveness and superiority of TSANet over seven state-of-the-art methods. The source code and dataset are available at <uri>https://github.com/WentaoLi-CV/TSANet</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"7943-7958"},"PeriodicalIF":13.7,"publicationDate":"2025-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145599031","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-25 | DOI: 10.1109/TIP.2025.3635015
Linear Complexity Multi-View Unsupervised Feature Selection via Anchor-Based Feature Relationship Construction
Qi Liu; Suyuan Liu; Jianhua Dai; Xueling Zhu; Xinwang Liu
In recent years, multi-view unsupervised feature selection has gained significant interest for its ability to handle multi-view datasets efficiently while offering better interpretability. Existing multi-view unsupervised feature selection methods construct graphs based on the relationships between samples; for feature selection, however, it is more important to focus on the relationships between features. Constructing a complete graph to capture these relationships would incur a space and time complexity of $O(d^{2})$ or higher. We therefore introduce an anchor-based strategy and build a feature bipartite graph to reduce the complexity. In addition, since existing methods cannot directly extract feature importance from a feature bipartite graph, we design an effective, low-complexity method that obtains feature scores directly from the graph. Compared with feature importance extraction based on the complete graph, the proposed method reduces the time complexity from $O(d^{3})$ to $O(d)$. To the best of our knowledge, this is the first multi-view unsupervised feature selection algorithm that achieves $O(nd)$ space and time complexity without data segmentation. Specifically, the method adaptively learns feature-level anchor graph structures through self-expressive multi-view subspace learning, which effectively captures the structural information between features and anchors. Meanwhile, it projects low-dimensional anchors to common dimensions and aligns them with consensus anchors to capture the consistency and complementary information across views. The superiority of the proposed algorithm is demonstrated by comparison with seven state-of-the-art algorithms on five public image and two biological-information multi-view datasets. The code is publicly available at https://github.com/getupLiu/AFRC
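To make the complexity claim concrete: with a constant number m of anchors, building the feature-anchor bipartite graph touches each of the d features once per anchor over n samples, i.e., O(ndm) = O(nd) work. The toy sketch below uses assumed choices (randomly sampled anchor features, absolute-cosine affinities, entropy-based scoring) and is not the paper's algorithm.

```python
import numpy as np

def anchor_feature_scores(X, m=32, rng=None):
    """Toy sketch of anchor-based feature scoring. X: (n, d) data from one view.
    Pick m anchor features, build a feature-anchor affinity (bipartite) graph,
    and score each feature by how sharply its affinity concentrates on a few
    anchors. Anchor choice, affinity, and scoring rule are illustrative
    assumptions only."""
    rng = rng or np.random.default_rng()
    n, d = X.shape
    Xc = X - X.mean(0)
    Xc /= np.linalg.norm(Xc, axis=0, keepdims=True) + 1e-12
    anchors = Xc[:, rng.choice(d, size=m, replace=False)]     # (n, m) anchor features
    Z = np.abs(Xc.T @ anchors)                                # (d, m) affinities, O(n d m) work
    Z /= Z.sum(1, keepdims=True) + 1e-12
    return -(Z * np.log(Z + 1e-12)).sum(1)                    # low entropy = more structured feature

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 200))
scores = anchor_feature_scores(X, m=32, rng=rng)
selected = np.argsort(scores)[:50]                            # keep the 50 most structured features
print(selected[:10])
```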
{"title":"Linear Complexity Multi-View Unsupervised Feature Selection via Anchor-Based Feature Relationship Construction","authors":"Qi Liu;Suyuan Liu;Jianhua Dai;Xueling Zhu;Xinwang Liu","doi":"10.1109/TIP.2025.3635015","DOIUrl":"10.1109/TIP.2025.3635015","url":null,"abstract":"In recent years, multi-view unsupervised feature selection has gained significant interest for its ability to efficiently handle multi-view datasets while offering better interpretability. Existing multi-view unsupervised feature selection methods construct graphs based on the relationship between samples. In fact, in feature selection, it is more important to focus on the relationships between features. However, constructing a complete graph to capture the relationship between features would incur a space and time complexity of <inline-formula> <tex-math>$O(d^{2})$ </tex-math></inline-formula> or even higher. Therefore, we introduce an anchor-based strategy and build a feature bipartite graph to reduce complexity. In addition, since existing methods cannot directly extract feature importance from a feature bipartite graph, we design an effective and low-complexity method to directly obtain feature scores from a feature bipartite graph. Compared with the feature importance extraction method based on the complete graph, our proposed method reduces the time complexity from <inline-formula> <tex-math>$O(d^{3})$ </tex-math></inline-formula> to <inline-formula> <tex-math>$O(d)$ </tex-math></inline-formula>. To the best of our knowledge, our proposed method is the first multi-view unsupervised feature selection algorithm that achieves <inline-formula> <tex-math>$O(nd)$ </tex-math></inline-formula> space and time complexity without data segmentation. Specifically, this method adaptively learns feature-level anchor graph structures through self-expressive multi-view subspace learning, which can effectively capture the structural information between features and anchors. Meanwhile, the proposed method projects low-dimensional anchors to common dimensions and aligns them with consensus anchors to capture the consistency and complementary information between different views. The superiority of the proposed algorithm is demonstrated by comparing it with seven state-of-the-art algorithms on five public image and two biological information multi-view datasets. The code of the proposed method is publicly available at <uri>https://github.com/getupLiu/AFRC</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"7889-7902"},"PeriodicalIF":13.7,"publicationDate":"2025-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145599028","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-24 | DOI: 10.1109/TIP.2025.3634001
Multi-Prior Fusion Transfer Plugin for Adapting In-Air Models to Underwater Image Enhancement and Detection
Jingchun Zhou; Dehuan Zhang; Zongxin He; Qilin Gai; Qiuping Jiang
Underwater data is inherently scarce and exhibits complex distributions, making it challenging to train high-performance models from scratch. In contrast, in-air models are structurally mature, resource-rich, and offer strong potential for transfer. However, significant discrepancies in visual characteristics and feature distributions between underwater and in-air environments often lead to severe performance degradation when applying in-air models directly. To address this issue, we propose IA2U, a lightweight plugin designed for efficient underwater adaptation without modifying the original model architecture. IA2U can be flexibly integrated into arbitrary in-air networks, offering high generalizability and low deployment costs. Specifically, IA2U incorporates three types of prior knowledge—water type, degradation pattern, and sample semantics—which are embedded into intermediate layers through feature injection and channel-wise modulation to guide the network’s response to underwater-specific features. Furthermore, a multi-scale feature alignment module is introduced to dynamically balance information across different resolution paths, enhancing consistency and contextual representation. Extensive experiments demonstrate that IA2U significantly improves both image enhancement and object detection performance. Specifically, on the UIEB dataset, IA2U boosts Shallow-UWNet by 5.2 dB in PSNR and reduces LPIPS by 52%; on the RUOD dataset, it increases AP by 1.8% when applied to the PAA detector. IA2U provides an effective and scalable solution for building robust underwater perception systems with minimal adaptation costs. Our code is available at https://github.com/zhoujingchun03/IA2U
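The "feature injection and channel-wise modulation" of priors can be pictured as FiLM-style conditioning: a prior embedding predicts a per-channel scale and shift applied to an intermediate feature map of the frozen in-air backbone. A minimal sketch under that assumption; shapes and names are ours, not the IA2U plugin.

```python
import numpy as np

def channelwise_modulation(feat, prior_emb, Wg, Wb):
    """FiLM-style conditioning: a prior embedding (e.g., encoding water type,
    degradation pattern, and semantics) is mapped to per-channel scale gamma
    and shift beta, which modulate an intermediate feature map of a frozen
    backbone. Sketch only; not the actual IA2U module."""
    gamma = 1.0 + np.tanh(prior_emb @ Wg)          # (C,) scales around identity
    beta = prior_emb @ Wb                          # (C,) shifts
    return feat * gamma[:, None, None] + beta[:, None, None]

rng = np.random.default_rng(0)
C, H, W, E = 64, 32, 32, 16
feat = rng.standard_normal((C, H, W))              # backbone feature map
prior = rng.standard_normal(E)                     # fused multi-prior embedding
Wg, Wb = rng.standard_normal((E, C)) * 0.1, rng.standard_normal((E, C)) * 0.1
print(channelwise_modulation(feat, prior, Wg, Wb).shape)
```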
{"title":"Multi-Prior Fusion Transfer Plugin for Adapting In-Air Models to Underwater Image Enhancement and Detection","authors":"Jingchun Zhou;Dehuan Zhang;Zongxin He;Qilin Gai;Qiuping Jiang","doi":"10.1109/TIP.2025.3634001","DOIUrl":"10.1109/TIP.2025.3634001","url":null,"abstract":"Underwater data is inherently scarce and exhibits complex distributions, making it challenging to train high-performance models from scratch. In contrast, in-air models are structurally mature, resource-rich, and offer strong potential for transfer. However, significant discrepancies in visual characteristics and feature distributions between underwater and in-air environments often lead to severe performance degradation when applying in-air models directly. To address this issue, we propose IA2U, a lightweight plugin designed for efficient underwater adaptation without modifying the original model architecture. IA2U can be flexibly integrated into arbitrary in-air networks, offering high generalizability and low deployment costs. Specifically, IA2U incorporates three types of prior knowledge—water type, degradation pattern, and sample semantics—which are embedded into intermediate layers through feature injection and channel-wise modulation to guide the network’s response to underwater-specific features. Furthermore, a multi-scale feature alignment module is introduced to dynamically balance information across different resolution paths, enhancing consistency and contextual representation. Extensive experiments demonstrate that IA2U significantly improves both image enhancement and object detection performance. Specifically, on the UIEB dataset, IA2U boosts Shallow-UWNet by 5.2 dB in PSNR and reduces LPIPS by 52%; on the RUOD dataset, it increases AP by 1.8% when applied to the PAA detector. IA2U provides an effective and scalable solution for building robust underwater perception systems with minimal adaptation costs. Our code is available at <uri>https://github.com/zhoujingchun03/IA2U</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"7773-7785"},"PeriodicalIF":13.7,"publicationDate":"2025-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145593068","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-24 | DOI: 10.1109/TIP.2025.3633177
SUIT: Spatial-Spectral Union-Intersection Interaction Network for Hyperspectral Object Tracking
Fengchao Xiong; Zhenxing Wu; Jun Zhou; Sen Jia; Yuntao Qian
Hyperspectral videos (HSVs), with their inherent spatial-spectral-temporal structure, offer distinct advantages in challenging tracking scenarios such as cluttered backgrounds and small objects. However, existing methods primarily focus on spatial interactions between the template and search regions, often overlooking spectral interactions, leading to suboptimal performance. To address this issue, this paper investigates spectral interactions from both the architectural and training perspectives. At the architectural level, we first establish band-wise long-range spatial relationships between the template and search regions using Transformers. We then model spectral interactions using the inclusion-exclusion principle from set theory, treating them as the union of spatial interactions across all bands. This enables the effective integration of both shared and band-specific spatial cues. At the training level, we introduce a spectral loss to enforce material distribution alignment between the template and predicted regions, enhancing robustness to shape deformation and appearance variations. Extensive experiments demonstrate that our tracker achieves state-of-the-art tracking performance. The source code, trained models, and results will be publicly available via https://github.com/bearshng/suit to support reproducibility.
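The union-over-bands idea can be written out with inclusion-exclusion: treating per-band interaction maps as soft memberships in [0, 1] with a product intersection, the union over all bands is one minus the product of the complements (for two bands, A + B - AB). A small NumPy sketch under that assumption; the network's actual fusion is learned end-to-end.

```python
import numpy as np

def union_of_band_interactions(band_maps):
    """Inclusion-exclusion union of per-band spatial interaction maps.
    band_maps: (B, Ns, Nt) soft correlations in [0, 1] between search and
    template tokens, one slice per spectral band. With a product intersection,
    the union over all bands is 1 - prod_b(1 - M_b). Illustrative sketch."""
    return 1.0 - np.prod(1.0 - band_maps, axis=0)

rng = np.random.default_rng(0)
maps = rng.uniform(size=(8, 256, 64))        # 8 bands, 256 search x 64 template tokens
fused = union_of_band_interactions(maps)
# Two-band check of the identity union(A, B) = A + B - A*B:
A, B = maps[0], maps[1]
assert np.allclose(union_of_band_interactions(maps[:2]), A + B - A * B)
print(fused.shape)
```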
{"title":"SUIT: Spatial-Spectral Union-Intersection Interaction Network for Hyperspectral Object Tracking","authors":"Fengchao Xiong;Zhenxing Wu;Jun Zhou;Sen Jia;Yuntao Qian","doi":"10.1109/TIP.2025.3633177","DOIUrl":"10.1109/TIP.2025.3633177","url":null,"abstract":"Hyperspectral videos (HSVs), with their inherent spatial-spectral-temporal structure, offer distinct advantages in challenging tracking scenarios such as cluttered backgrounds and small objects. However, existing methods primarily focus on spatial interactions between the template and search regions, often overlooking spectral interactions, leading to suboptimal performance. To address this issue, this paper investigates spectral interactions from both the architectural and training perspectives. At the architectural level, we first establish band-wise long-range spatial relationships between the template and search regions using Transformers. We then model spectral interactions using the inclusion-exclusion principle from set theory, treating them as the union of spatial interactions across all bands. This enables the effective integration of both shared and band-specific spatial cues. At the training level, we introduce a spectral loss to enforce material distribution alignment between the template and predicted regions, enhancing robustness to shape deformation and appearance variations. Extensive experiments demonstrate that our tracker achieves state-of-the-art tracking performance. The source code, trained models and results will be publicly available via <uri>https://github.com/bearshng/suit</uri> to support reproducibility","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"7786-7800"},"PeriodicalIF":13.7,"publicationDate":"2025-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145593514","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-24 | DOI: 10.1109/TIP.2025.3633566
Multidimensional Imaging Data Completion via Weighted Three-Directional Minimax Concave Penalty Regularization
Haifei Zeng; Wen Li; Xiaofei Peng; Mingqing Xiao
In this paper, we present a novel non-convex tensor completion model specifically tailored for multidimensional data. Our approach introduces a three-directional non-convex tensor rank surrogate regularized by the Minimax Concave Penalty (MCP) function. Crucially, the method simultaneously exploits low-rank structures across the data’s three modal directions, with the MCP function effectively mitigating the over-penalization of large singular values—a common drawback of convex nuclear norm minimization. To address the inherent challenges of this non-convex optimization, we develop an approximate convex model that accurately captures the essence of the original formulation, and then design a robust Alternating Direction Method of Multipliers (ADMM)-based algorithm for it, supported by a rigorous convergence guarantee that ensures both theoretical soundness and practical reliability. Extensive experiments on a variety of real-world datasets demonstrate the superior performance and robustness of the proposed method compared to state-of-the-art approaches.
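For intuition on why MCP avoids over-penalizing large singular values: its scalar proximal operator (firm thresholding) zeroes small values, shrinks mid-range values, and leaves values above gamma*lambda untouched, unlike the uniform shrinkage induced by the nuclear norm. The sketch below applies this generic scalar operator to the singular values of a noisy low-rank matrix; it illustrates the penalty only, not the paper's weighted three-directional model or its ADMM solver.

```python
import numpy as np

def mcp_penalty(t, lam, gamma):
    """Minimax concave penalty value for scalar(s) t (requires gamma > 1)."""
    t = np.abs(t)
    return np.where(t <= gamma * lam,
                    lam * t - t ** 2 / (2 * gamma),
                    0.5 * gamma * lam ** 2)

def mcp_prox(x, lam, gamma):
    """Proximal operator of MCP (firm thresholding): zero below lam, shrink
    between lam and gamma*lam, and leave larger values untouched, unlike
    soft thresholding which shrinks every singular value by lam."""
    a = np.abs(x)
    shrunk = np.sign(x) * (a - lam) / (1.0 - 1.0 / gamma)
    return np.where(a <= lam, 0.0, np.where(a <= gamma * lam, shrunk, x))

# Apply to the singular values of a low-rank-plus-noise matrix.
rng = np.random.default_rng(0)
L = rng.standard_normal((50, 5)) @ rng.standard_normal((5, 50))     # rank-5 signal
Y = L + 0.3 * rng.standard_normal((50, 50))
U, s, Vt = np.linalg.svd(Y, full_matrices=False)
s_mcp = mcp_prox(s, lam=2.0, gamma=3.0)
X_hat = (U * s_mcp) @ Vt                                            # MCP-regularized reconstruction
print(np.round(s[:8], 2), np.round(s_mcp[:8], 2))
```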
{"title":"Multidimensional Imaging Data Completion via Weighted Three-Directional Minimax Concave Penalty Regularization","authors":"Haifei Zeng;Wen Li;Xiaofei Peng;Mingqing Xiao","doi":"10.1109/TIP.2025.3633566","DOIUrl":"10.1109/TIP.2025.3633566","url":null,"abstract":"In this paper, we present a novel non-convex tensor completion model specifically tailored for multidimensional data. Our approach introduces a three-directional non-convex tensor rank surrogate regularized by the Minimax Concave Penalty (MCP) function. Crucially, the method processes data by simultaneously exploiting low-rank structures across its three modal directions, with the MCP function effectively mitigating the over-penalization of large singular values—a common drawback in convex nuclear norm minimization. To address the inherent challenges of this non-convex optimization, we develop an innovative approximate convex model that accurately captures the original formulation’s essence. We then develop a robust convex Alternating Direction Method of Multipliers (ADMM)-based algorithm, supported by a rigorous convergence guarantee, ensuring both theoretical soundness and practical reliability. Extensive experiments on a variety of real-world datasets demonstrate the superior performance and robustness of the proposed method compared to state-of-the-art approaches.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"7801-7816"},"PeriodicalIF":13.7,"publicationDate":"2025-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145593064","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}