Pub Date: 2025-12-08 | DOI: 10.1109/TIP.2025.3635016
Jie Zhao;Xin Chen;Shengming Li;Chunjuan Bo;Dong Wang;Huchuan Lu
Due to the substantial gap between vision and language modalities, along with the mismatch between fixed language descriptions and dynamic visual information, existing vision-language tracking methods perform on par with or slightly worse than vision-only tracking. Effectively exploiting the rich semantics of language to enhance tracking robustness remains an open challenge. To address these issues, we propose a self-adaptive vision-language tracking framework that leverages the pre-trained multi-modal CLIP model to obtain well-aligned visual-language representations. A novel context-aware prompting mechanism is introduced to dynamically adapt linguistic cues based on the evolving visual context during tracking. Specifically, our context prompter extracts dynamic visual features from the current search image and integrates them into the text encoding process, enabling self-updating language embeddings. Furthermore, our framework employs a unified one-stream Transformer architecture, supporting joint training for both vision-only and vision-language tracking scenarios. Our method not only bridges the modality gap but also enhances robustness by allowing language features to evolve with the visual context. Extensive experiments on four vision-language tracking benchmarks demonstrate that our method effectively leverages the advantages of language to enhance visual tracking. Our large model achieves 55.0% AUC on $\text{LaSOT}_{\text{EXT}}$ and 69.0% AUC on TNL2K. Additionally, our language-only tracking model achieves performance comparable to that of state-of-the-art vision-only tracking methods on TNL2K. Code is available at https://github.com/zj5559/SAVLT
{"title":"Self-Adaptive Vision-Language Tracking With Context Prompting","authors":"Jie Zhao;Xin Chen;Shengming Li;Chunjuan Bo;Dong Wang;Huchuan Lu","doi":"10.1109/TIP.2025.3635016","DOIUrl":"10.1109/TIP.2025.3635016","url":null,"abstract":"Due to the substantial gap between vision and language modalities, along with the mismatch problem between fixed language descriptions and dynamic visual information, existing vision-language tracking methods exhibit performance on par with or slightly worse than vision-only tracking. Effectively exploiting the rich semantics of language to enhance tracking robustness remains an open challenge. To address these issues, we propose a self-adaptive vision-language tracking framework that leverages the pre-trained multi-modal CLIP model to obtain well-aligned visual-language representations. A novel context-aware prompting mechanism is introduced to dynamically adapt linguistic cues based on the evolving visual context during tracking. Specifically, our context prompter extracts dynamic visual features from the current search image and integrates them into the text encoding process, enabling self-updating language embeddings. Furthermore, our framework employs a unified one-stream Transformer architecture, supporting joint training for both vision-only and vision-language tracking scenarios. Our method not only bridges the modality gap but also enhances robustness by allowing language features to evolve with visual context. Extensive experiments on four vision-language tracking benchmarks demonstrate that our method effectively leverages the advantages of language to enhance visual tracking. Our large model can obtain 55.0% AUC on <inline-formula> <tex-math>$text {LaSOT}_{text {EXT}}$ </tex-math></inline-formula> and 69.0% AUC on TNL2K. Additionally, our language-only tracking model achieves performance comparable to that of state-of-the-art vision-only tracking methods on TNL2K. Code is available at <uri>https://github.com/zj5559/SAVLT</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"8046-8058"},"PeriodicalIF":13.7,"publicationDate":"2025-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145704003","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Infrared small target detection (IRSTD) is of great practical significance in many real-world applications, such as maritime rescue and early warning systems, benefiting from the unique capability of infrared imaging in adverse weather and low-light conditions. Nevertheless, segmenting small targets from the background remains a challenge. When the subsampling frequency during image processing does not satisfy the Nyquist criterion, aliasing occurs, which makes it extremely difficult to identify small targets. To address this challenge, in this paper we propose a novel Wavelet Mamba with Reversible Structure Network (WMRNet) for infrared small target detection. Specifically, WMRNet consists of a Discrete Wavelet Mamba (DW-Mamba) module and a Third-order Difference Equation guided Reversible (TDE-Rev) structure. DW-Mamba employs the Discrete Wavelet Transform to decompose images into multiple subbands and integrates this information into the state equations of a state space model. This design minimizes frequency interference while preserving a global perspective, thereby effectively reducing background aliasing. TDE-Rev aims to suppress edge aliasing effects by refining the target edges: it first processes features with an explicit neural structure derived from second-order difference equations and then promotes feature interactions through a reversible structure. Extensive experiments on the public IRSTD-1k and SIRST datasets demonstrate that the proposed WMRNet outperforms state-of-the-art methods.
{"title":"WMRNet: Wavelet Mamba With Reversible Structure for Infrared Small Target Detection","authors":"Mingjin Zhang;Xiaolong Li;Jie Guo;Yunsong Li;Xinbo Gao","doi":"10.1109/TIP.2025.3637729","DOIUrl":"10.1109/TIP.2025.3637729","url":null,"abstract":"Infrared small target detection (IRSTD) is of great practical significance in many real-world applications, such as maritime rescue and early warning systems, benefiting from the unique and excellent infrared imaging ability in adverse weather and low-light conditions. Nevertheless, segmenting small targets from the background remains a challenge. When the subsampling frequency during image processing does not satisfy the Nyquist criterion, the aliasing effect occurs, which makes it extremely difficult to identify small targets. To address this challenge, we propose a novel Wavelet Mamba with Reversible Structure Network (WMRNet) for infrared small target detection in this paper. Specifically, WMRNet consists of a Discrete Wavelet Mamba (DW-Mamba) module and a Third-order Difference Equation guided Reversible (TDE-Rev) structure. DW-Mamba employs the Discrete Wavelet Transform to decompose images into multiple subbands, integrating this information into the state equations of a state space model. This method minimizes frequency interference while preserving a global perspective, thereby effectively reducing background aliasing. The TDE-Rev aims to suppress edge aliasing effects by refining the target edges, which first processes features with an explicit neural structure derived from the second-order difference equations and then promotes feature interactions through a reversible structure. Extensive experiments on the public IRSTD-1k and SIRST datasets demonstrate that the proposed WMRNet outperforms the state-of-the-art methods.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"8229-8242"},"PeriodicalIF":13.7,"publicationDate":"2025-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145680387","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-05 | DOI: 10.1109/TIP.2025.3638151
Min Zhao;Linruize Tang;Jie Chen;Bo Huang
Hyperspectral unmixing aims to decompose mixed pixels into pure spectra and estimate their corresponding fractional abundances, and it holds a critical position in hyperspectral image processing. Traditional model-based unmixing methods use convex optimization to iteratively solve the unmixing problem with hand-crafted regularizers; however, their performance is limited by these manually designed constraints, which may not fully capture the structural information of the data. Recently, deep learning-based unmixing methods have shown remarkable capability for this task, but they have limited generalizability and lack interpretability. In this paper, we propose a novel hyperspectral unmixing method regularized by a diffusion model (URDM) to overcome these shortcomings. Our method leverages the advantages of both conventional optimization algorithms and deep generative models. Specifically, we formulate the unmixing objective function from a variational perspective and integrate it into a diffusion sampling process to introduce generative priors from a denoising diffusion probabilistic model (DDPM). Since the original objective function is challenging to optimize, we introduce a splitting-based strategy to decouple it into simpler subproblems. Extensive experimental results on both synthetic and real datasets demonstrate the efficiency and superior performance of our proposed method.
{"title":"URDM: Hyperspectral Unmixing Regularized by Diffusion Models","authors":"Min Zhao;Linruize Tang;Jie Chen;Bo Huang","doi":"10.1109/TIP.2025.3638151","DOIUrl":"10.1109/TIP.2025.3638151","url":null,"abstract":"Hyperspectral unmixing aims to decompose the mixed pixels into pure spectra and calculate their corresponding fractional abundances. It holds a critical position in hyperspectral image processing. Traditional model-based unmixing methods use convex optimization to iteratively solve the unmixing problem with hand-crafted regularizers. While their performance is limited by these manually designed constraints, which may not fully capture the structural information of the data. Recently, deep learning-based unmixing methods have shown remarkable capability for this task. However, they have limited generalizability and lack interpretability. In this paper, we propose a novel hyperspectral unmixing method regularized by a diffusion model (URDM) to overcome these shortcomings. Our method leverages the advantages of both conventional optimization algorithms and deep generative models. Specifically, we formulate the unmixing objective function from a variational perspective and integrate it into a diffusion sampling process to introduce generative priors from a denoising diffusion probabilistic model (DDPM). Since the original objective function is challenging to optimize, we introduce a splitting-based strategy to decouple it into simpler subproblems. Extensive experiment results conducted on both synthetic and real datasets demonstrate the efficiency and superior performance of our proposed method.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"8072-8085"},"PeriodicalIF":13.7,"publicationDate":"2025-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145680389","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Convolutional neural networks (CNNs) can automatically learn data patterns to represent face images for facial expression recognition (FER). However, they may overlook the effect of facial segmentation on FER. In this paper, we propose a perception CNN for FER, termed PCNN. Firstly, PCNN uses five parallel networks to simultaneously learn local facial features from the eyes, cheeks, and mouth, enabling sensitive capture of the subtle changes relevant to FER. Secondly, we utilize a multi-domain interaction mechanism to register and fuse local sense-organ features with global facial structural features to better represent face images for FER. Finally, we design a two-phase loss function to constrain the accuracy of the obtained sense-organ information and the reconstructed face images, guaranteeing the performance of PCNN in FER. Experimental results show that our PCNN achieves superior results on several lab and real-world FER benchmarks: CK+, JAFFE, FER2013, FERPlus, RAF-DB, and the Occlusion and Pose Variant dataset. Its code is available at https://github.com/hellloxiaotian/PCNN
{"title":"A Perception CNN for Facial Expression Recognition","authors":"Chunwei Tian;Jingyuan Xie;Lingjun Li;Wangmeng Zuo;Yanning Zhang;David Zhang","doi":"10.1109/TIP.2025.3637715","DOIUrl":"10.1109/TIP.2025.3637715","url":null,"abstract":"Convolutional neural networks (CNNs) can automatically learn data patterns to express face images for facial expression recognition (FER). However, they may ignore effect of facial segmentation of FER. In this paper, we propose a perception CNN for FER as well as PCNN. Firstly, PCNN can use five parallel networks to simultaneously learn local facial features based on eyes, cheeks and mouth to realize the sensitive capture of the subtle changes in FER. Secondly, we utilize a multi-domain interaction mechanism to register and fuse between local sense organ features and global facial structural features to better express face images for FER. Finally, we design a two-phase loss function to restrict accuracy of obtained sense information and reconstructed face images to guarantee performance of obtained PCNN in FER. Experimental results show that our PCNN achieves superior results on several lab and real-world FER benchmarks: CK+, JAFFE, FER2013, FERPlus, RAF-DB and Occlusion and Pose Variant Dataset. Its code is available at <uri>https://github.com/hellloxiaotian/PCNN</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"8101-8113"},"PeriodicalIF":13.7,"publicationDate":"2025-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145680388","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-04 | DOI: 10.1109/TIP.2025.3637694
Xingyu Cui;Huanjing Yue;Shida Sun;Yue Li;Yusen Hou;Zhiwei Xiong;Jingyu Yang
Non-line-of-sight (NLOS) imaging aims to reconstruct scenes hidden from direct view and has broad applications in robotic vision, rescue operations, autonomous driving, and remote sensing. However, most existing methods rely on densely sampled transients from large, continuous relay surfaces, which limits their practicality in real-world scenarios with aperture constraints. To address this limitation, we propose an unsupervised zero-shot framework tailored for confocal NLOS imaging with aperture-limited relay surfaces. Our method leverages latent diffusion models to recover fully-sampled transients from undersampled versions by enforcing measurement consistency during the sampling process. To further improve recovered transient quality, we introduce a progressive recovery strategy that incrementally recovers missing transient values, effectively mitigating the impact of severe aperture limitations. In addition, to suppress error propagation during recovery, we develop a backpropagation-based error correction reconstruction algorithm that refines intermediate recovered transients by enforcing sparsity regularization in the voxel domain, enabling high-fidelity final reconstructions. Extensive experiments on both simulated and real-world datasets validate the robustness and generalization capability of our method across diverse aperture-limited relay surfaces. Notably, our method follows a zero-shot paradigm, requiring only a single pretraining stage without paired data or pattern-specific retraining, which makes it a more practical and generalizable framework for NLOS imaging.
{"title":"TransDiff: Unsupervised Non-Line-of-Sight Imaging With Aperture-Limited Relay Surfaces","authors":"Xingyu Cui;Huanjing Yue;Shida Sun;Yue Li;Yusen Hou;Zhiwei Xiong;Jingyu Yang","doi":"10.1109/TIP.2025.3637694","DOIUrl":"10.1109/TIP.2025.3637694","url":null,"abstract":"Non-line-of-sight (NLOS) imaging aims to reconstruct scenes hidden from direct view and has broad applications in robotic vision, rescue operations, autonomous driving, and remote sensing. However, most existing methods rely on densely sampled transients from large, continuous relay surfaces, which limits their practicality in real-world scenarios with aperture constraints. To address this limitation, we propose an unsupervised zero-shot framework tailored for confocal NLOS imaging with aperture-limited relay surfaces. Our method leverages latent diffusion models to recover fully-sampled transients from undersampled versions by enforcing measurement consistency during the sampling process. To further improve recovered transient quality, we introduce a progressive recovery strategy that incrementally recovers missing transient values, effectively mitigating the impact of severe aperture limitations. In addition, to suppress error propagation during recovery, we develop a backpropagation-based error correction reconstruction algorithm that refines intermediate recovered transients by enforcing sparsity regularization in the voxel domain, enabling high-fidelity final reconstructions. Extensive experiments on both simulated and real-world datasets validate the robustness and generalization capability of our method across diverse aperture-limited relay surfaces. Notably, our method follows a zero-shot paradigm, requiring only a single pretraining stage without paired data or pattern-specific retraining, which makes it a more practical and generalizable framework for NLOS imaging.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"8018-8031"},"PeriodicalIF":13.7,"publicationDate":"2025-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145673713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-04 | DOI: 10.1109/TIP.2025.3637675
Anjing Guo;Renwei Dian;Nan Wang;Shutao Li
The modulation transfer function tailored image filter (MTF-TIF) has long been regarded as the optimal filter for multispectral image pansharpening. It excels at simulating the camera’s frequency response, thereby capturing finer image details and significantly improving pansharpening performance. However, we are skeptical about whether the pre-measured MTF is sufficient to describe the characteristics of the actually acquired panchromatic image (PAN) and multispectral image (MSI). For example, any image resampling operation in geometric correction or image registration inevitably changes the sharpness of the acquired PAN and MSI, and the processed images no longer conform to the camera’s MTF. Further, under the Wald protocol, using the MTF-TIF to downsample images and construct training data for deep learning (DL) methods does not satisfy the generalization consistency between training and testing. To verify this point, we propose in this paper a pair of symmetric DL-based frameworks to find better image filters suitable for both traditional and DL pansharpening methods. We embed two learnable filters into the frameworks to simulate the optimal image filter, namely an anisotropic Gaussian image filter and an arbitrary image filter. Furthermore, the proposed frameworks can capture subtle offsets between images and maintain the smoothness of the global deformation field. Extensive experiments on various satellite datasets demonstrate that the proposed frameworks can find better image filters than MTF-TIFs, achieving better pansharpening performance with stronger generalization ability.
{"title":"Better Image Filter for Pansharpening","authors":"Anjing Guo;Renwei Dian;Nan Wang;Shutao Li","doi":"10.1109/TIP.2025.3637675","DOIUrl":"10.1109/TIP.2025.3637675","url":null,"abstract":"The modulation transfer function tailored image filter (MTF-TIF) has long been regarded as the optimal filter for multispectral image pansharpening. It excels at simulating the camera’s frequency response, thereby capturing finer image details and significantly improving pansharpening performance. However, we are skeptical about whether the pre-measured MTF is sufficient to describe the characteristics of actually acquired panchromatic image (PAN) and multispectral image (MSI). For example, any image resampling operations in geometric correction or image registration inevitably change the sharpness of acquired PAN and MSI, and the processed images no longer conform to the camera’s MTF. Further, following the Wald protocol, in deep learning (DL) methods using MTF-TIF for downsampling images to construct training data does not satisfy the generalization consistency of training and testing. To prove our point, we propose a pair of symmetric frameworks based on DL in this paper, to find better image filters suitable for both traditional and DL pansharpening methods. We embed two learnable filters into the frameworks to simulate the optimal image filter, namely anisotropic Gaussian image filter and arbitrary image filter. Further, the proposed frameworks can capture subtle offsets between images and maintain the smoothness of the global deformation field. Extensive experiments on various satellite datasets demonstrate that the proposed frameworks can find better image filters than MTF-TIFs, which can achieve better pansharpening performance with stronger generalization ability.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"8171-8184"},"PeriodicalIF":13.7,"publicationDate":"2025-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145673712","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Adversarial distillation (AD) aims to mitigate deep neural networks’ inherent vulnerability to adversarial attacks, thereby providing robust protection for compact models through teacher-student interactions. Despite these advancements, existing AD studies still suffer from insufficient robustness due to the limitations of fixed attack strength and attention region shifts. To address these challenges, we propose a strength-adaptive Info-maximizing Adversarial Robustness Distillation paradigm, namely “InfoARD”, which strategically incorporates Attack-Strength Adaptation (ASA) and Mutual-Information Maximization (MIM) to enhance robustness against adversarial attacks and perturbations. Unlike previous adversarial training (AT) methods that use a fixed attack strength, the ASA mechanism is designed to capture smoother and more generalized classification boundaries by dynamically tailoring the attack strength to the characteristics of individual instances. Benefiting from mutual information constraints, our MIM strategy ensures the student model effectively learns from various levels of feature representations and attention patterns, thereby deepening the student model’s understanding of the teacher model’s decision-making processes. Furthermore, a comprehensive multi-granularity distillation is conducted to capture knowledge across multiple dimensions, enabling a more effective transfer of knowledge from the teacher model to the student model. Note that our InfoARD can be seamlessly integrated into existing AD frameworks, further boosting the adversarial robustness of deep learning models. Extensive experiments on various challenging datasets consistently demonstrate the effectiveness and robustness of our InfoARD, surpassing previous state-of-the-art methods.
{"title":"InfoARD: Enhancing Adversarial Robustness Distillation With Attack-Strength Adaptation and Mutual-Information Maximization","authors":"Ruihan Liu;Jieyi Cai;Yishu Liu;Sudong Cai;Bingzhi Chen;Yulan Guo;Mohammed Bennamoun","doi":"10.1109/TIP.2025.3637689","DOIUrl":"10.1109/TIP.2025.3637689","url":null,"abstract":"Adversarial distillation (AD) aims to mitigate deep neural networks’ inherent vulnerability to adversarial attacks, thereby providing robust protection for compact models through teacher-student interactions. Despite advancements, existing AD studies still suffer from insufficient robustness due to the limitations of fixed attack strength and attention region shifts. To address these challenges, we propose a strength-adaptive Info-maximizing Adversarial Robustness Distillation paradigm, namely “InfoARD”, which strategically incorporates the Attack-Strength Adaptation (ASA) and Mutual-Information Maximization (MIM) to enhance adversarial robustness against adversarial attacks and perturbations. Unlike previous adversarial training (AT) methods that utilize fixed attack strength, the ASA mechanism is designed to capture smoother and generalized classification boundaries by dynamically tailoring the attack strength based on the characteristics of individual instances. Benefiting from mutual information constraints, our MIM strategy ensures the student model effectively learns from various levels of feature representations and attention patterns, thereby deepening the student model’s understanding of the teacher model’s decision-making processes. Furthermore, a comprehensive multi-granularity distillation is conducted to capture knowledge across multiple dimensions, enabling a more effective transfer of knowledge from the teacher model to the student model. Note that our InfoARD can be seamlessly integrated into existing AD frameworks, further boosting the adversarial robustness of deep learning models. Extensive experiments on various challenging datasets consistently demonstrate the effectiveness and robustness of our InfoARD, surpassing previous state-of-the-art methods.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"276-289"},"PeriodicalIF":13.7,"publicationDate":"2025-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145673710","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Existing attention-based, label-free weakly supervised group activity recognition methods can automatically learn tokens related to the actors, but they have difficulty generating sufficiently diverse token embeddings. To address these issues, we automatically obtain a grayscale motion mask of all moving objects based on the motion direction rather than the motion amplitude. A Motion-Guided Mask Generator (MGMG) module is proposed to estimate the attention region mask under the supervision of the grayscale motion mask. MGMG involves four parts. A correlation layer measures the relative displacement between two adjacent feature maps. A cosine attention mechanism is designed to reduce the module’s sensitivity to feature amplitude changes. A mask generator is built to generate the attention region mask. Finally, a specifically designed activation function refines the attention region mask and enhances its focus on actor motion regions. We also customize a normalized relative error loss function for the MGMG module, which addresses the value-range mismatch between the estimated attention mask and the grayscale motion mask. Furthermore, a Motion Attention-Guided Relational Reasoning (MAGRR) framework is presented for the weakly supervised setting. It uses the MGMG module to estimate the attention regions automatically, and a Spatial-temporal Aggregation Stack (SAS) module to activate the attention regions of the features at the spatial level and transform them into multiple tokens, whose temporal dependencies and interrelationships are further captured by the attention mechanism. MAGRR is evaluated on the Collective Activity dataset and the Collective Activity Extension dataset, achieving state-of-the-art performance, and achieves competitive performance on the Volleyball and NBA datasets.
{"title":"Motion Attention-Guided Relational Reasoning for Weakly Supervised Group Activity Recognition","authors":"Yihao Zheng;Zhuming Wang;Lifang Wu;Liang Wang;Chang Wen Chen","doi":"10.1109/TIP.2025.3636094","DOIUrl":"10.1109/TIP.2025.3636094","url":null,"abstract":"The existing attention-based label-free weakly supervised group activity recognition methods can automatically learn tokens related to the actors. And they have difficulties generating sufficiently diverse token embeddings. To address these issues, we automatically obtain the grayscale motion mask of all the moving objects based on the motion direction not the motion amplitude. A Motion-Guided Mask Generator module (MGMG) is proposed to estimate the attention region mask under the supervision of the grayscale motion mask. MGMG involves four parts. A correlation layer measures the relative displacement between two adjacent feature maps. A cosine attention mechanism is designed to reduce the module’s sensitivity to feature amplitude changes. A mask generator is built to generate the attention region mask. And a specifically designed activation function is used to refine the attention region mask and to enhance its focus on actor motion regions. We also customize a normalized relative error loss function for MGMG module. This loss can address the value range mismatch problem for the estimated attention mask as well as the grayscale motion mask. Furthermore, a Motion Attention-Guided Relational Reasoning (MAGRR) framework is presented for the weakly supervised condition. It uses the MGMG module to estimate the attention region automatically, and a Spatial-temporal Aggregation Stack (SAS) module to activate the attention regions of the features at the spatial level, then transform them into multiple tokens, which are further captured by the attention mechanism for their temporal dependencies and interrelationships. MAGRR is experimented on the Collective Activity dataset and the Collective Activity Extension dataset, achieving state-of-the-art performance and competitive performance on the Volleyball and the NBA datasets.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"8159-8170"},"PeriodicalIF":13.7,"publicationDate":"2025-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145663921","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-03 | DOI: 10.1109/TIP.2025.3636676
Jinyang Liu;Shutao Li;Heng Yang;Renwei Dian;Yuanye Liu
Existing hyperspectral fusion computational imaging methods primarily rely on using high-resolution multispectral images (HRMSI) to provide spatial details for low-resolution hyperspectral images (LRHSI), thereby enabling the reconstruction of hyperspectral images. However, these methods are often limited by the low spectral resolution of the HRMSI, making the sampled tensors unable to provide effective information for the LRHSI in a finer spectral range. To achieve more accurate computational imaging results, we propose a Heterospectral Structure Compensation Sampling (HSC-sampling) mechanism. Unlike traditional spatial sampling methods, which directly calculate the interpolation between adjacent pixels, this mechanism analyzes the structural complementarity among different bands in LRHSI. It utilizes the information from other bands to compensate for the missing details in the current band. Additionally, a novel Multi-phase Mixed Modeling (M2M) approach is designed, expanding the model’s analytical capabilities into multiple phases to accommodate the high-dimensional nature of HSI data. Specifically, it extracts fusion features from three phases and organizes the generated features along with the input features into a multi-variate mixed cube based on phase relationships, thereby capturing feature correlations across different phases. Based on the HSC-sampling mechanism and the M2M approach, we construct a Merging Residual Concatenation (MRC) hyperspectral fusion computational imaging network. Compared to other state-of-the-art methods, this network achieves significant improvements in fusion performance across multiple datasets. Moreover, the effectiveness of the HSC-sampling mechanism has been demonstrated in various hyperspectral imaging tasks. Code is available at: https://github.com/1318133/HSC-Sampling
{"title":"Heterospectral Structure Compensation Sampling for Hyperspectral Fusion Computational Imaging","authors":"Jinyang Liu;Shutao Li;Heng Yang;Renwei Dian;Yuanye Liu","doi":"10.1109/TIP.2025.3636676","DOIUrl":"10.1109/TIP.2025.3636676","url":null,"abstract":"Existing hyperspectral fusion computational imaging methods primarily rely on using high-resolution multispectral images (HRMSI) to provide spatial details for low-resolution hyperspectral images (LRHSI), thereby enabling the reconstruction of hyperspectral images. However, these methods are often limited by the low spectral resolution of the HRMSI, making the sampled tensors unable to provide effective information for the LRHSI in a finer spectral range. To achieve more accurate computational imaging results, we propose a Heterospectral Structure Compensation Sampling (HSC-sampling) mechanism. Unlike traditional spatial sampling methods, which directly calculate the interpolation between adjacent pixels, this mechanism analyzes the structural complementarity among different bands in LRHSI. It utilizes the information from other bands to compensate for the missing details in the current band. Additionally, a novel Multi-phase Mixed Modeling (M2M) approach is designed, expanding the model’s analytical capabilities into multiple phases to accommodate the high-dimensional nature of HSI data. Specifically, it extracts fusion features from three phases and organizes the generated features along with the input features into a multi-variate mixed cube based on phase relationships, thereby capturing feature correlations across different phases. Based on the HSC-sampling mechanism and the M2M approach, we construct a Merging Residual Concatenation (MRC) hyperspectral fusion computational imaging network. Compared to other state-of-the-art methods, this network achieves significant improvements in fusion performance across multiple datasets. Moreover, the effectiveness of the HSC-sampling mechanism has been demonstrated in various hyperspectral imaging tasks. Code is available at: <uri>https://github.com/1318133/HSC-Sampling</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"7930-7942"},"PeriodicalIF":13.7,"publicationDate":"2025-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145663922","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-03 | DOI: 10.1109/TIP.2025.3634979
Yang Xin;Xiang Zhong;Yu Zhou;Jianmin Jiang
At present, deep face recognition models trained on millions of images are confronted with the challenge that such large-scale datasets are often corrupted with noise and mislabeled identities, yet most deep models are primarily designed for clean datasets. In this paper, we propose a robust deep face recognition model that integrates the strength of margin-based learning with the strength of mining-based approaches to effectively mitigate the impact of noise during training. By monitoring recognition performance at the batch level to provide optimization-oriented feedback, we introduce a noise-adaptive mining strategy that dynamically adjusts the emphasis between hard and noisy samples, enabling direct training on noisy datasets without the requirement of pre-training. With a novel anti-noise loss function, learning is empowered for direct and robust training on noisy datasets while its effectiveness on clean datasets is still preserved, sustaining effective mining of both clean and noisy samples whilst weakening the learning intensity on noisy samples. Extensive experiments reveal that: (i) our proposed model achieves competitive performance in comparison with representative state-of-the-art models when trained on clean datasets; (ii) when trained on both real-world and synthesized noisy datasets, our proposed model significantly outperforms the existing models, especially when the synthesized datasets are corrupted with both close-set and open-set noise; (iii) while the existing deep models suffer from an average performance drop of around 20% on noise-corrupted large-scale datasets, our proposed model still delivers accuracy rates of more than 95%. Our source codes are publicly available on GitHub.
{"title":"Robust Face Recognition via Adaptive Mining and Margining of Noise and Hard Samples","authors":"Yang Xin;Xiang Zhong;Yu Zhou;Jianmin Jiang","doi":"10.1109/TIP.2025.3634979","DOIUrl":"10.1109/TIP.2025.3634979","url":null,"abstract":"At present, deep face recognition models working on millions of images are confronted with the challenge that such large-scale datasets are often corrupted with noises and mislabeled identities yet most deep models are primarily designed for clean datasets. In this paper, we propose a robust deep face recognition model by exploiting the advantage of integrating the strength of margin-based learning models with the strength of mining-based approaches to effectively mitigate the impact of noises during training. By monitoring the recognition performances at a batch level to provide optimization-oriented feedback, we introduce a noise-adaptive mining strategy to dynamically adjust the emphasis balance between hard and noise samples, enabling direct training on noisy datasets without the requirement of pre-training. With a novel anti-noise loss function, learning is empowered for direct and robust training on noisy datasets yet its effectiveness over clean datasets is still preserved, sustaining effective mining of both clean and noisy samples whilst weakening its learning intensiveness over noisy samples. Extensive experiments reveal that: (i) our proposed achieves competitive performances in comparison with representative existing SoTA models when trained with clean datasets; (ii) when trained with both real-world and synthesized noisy datasets, our proposed significantly outperforms the existing models, especially when the synthesized datasets are corrupted with both close-set and open-set noises; (iii) while the existing deep models suffer from an average performance drop of around 20% over noise-corrupted large scale datasets, our proposed still delivers accuracy rates of more than 95%. Our source codes are publicly available on GitHub.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"8114-8129"},"PeriodicalIF":13.7,"publicationDate":"2025-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145663920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}