Pub Date : 2025-06-02 DOI: 10.1109/TCSVT.2025.3575957
Ran Ran;Jiwei Wei;Shiyuan He;Yuyang Zhou;Peng Wang;Yang Yang;Heng Tao Shen
Video grounding has recently gained significant attention. However, existing methods fail to fully comprehend the semantics within queries and videos, often overlooking key content. Moreover, the lack of fine-grained cross-modal alignment and interaction to guide the semantic matching of complex texts and videos leads to inconsistent representational modeling. To address these issues, we propose a Semantic Hierarchical Grounding model, referred to as SHG, and design a cross-modal semantic hierarchical graph to achieve fine-grained semantic understanding. SHG decomposes both the query and each video moment into three levels: global, action, and element. This global-to-local topology establishes multi-granularity intrinsic connections between the two modalities, fostering a comprehensive understanding of dynamic semantics and fine-grained cross-modal matching. To fully leverage the rich information within the cross-modal semantic hierarchical graph, we employ contrastive learning that seeks samples sharing the same action and element semantics, and then perform node-moment cross-modal hierarchical matching for global alignment. This approach unearths fine-grained clues and aligns semantics across multiple granularities. Moreover, we combine the designed hierarchical graph interaction for coarse-to-fine fusion of text and video, enabling highly accurate video grounding. Extensive experiments on three challenging public datasets (ActivityNet-Captions, TACoS, and Charades-STA) demonstrate that the proposed approach outperforms state-of-the-art techniques, validating its effectiveness.
{"title":"Fine-Grained Alignment and Interaction for Video Grounding With Cross-Modal Semantic Hierarchical Graph","authors":"Ran Ran;Jiwei Wei;Shiyuan He;Yuyang Zhou;Peng Wang;Yang Yang;Heng Tao Shen","doi":"10.1109/TCSVT.2025.3575957","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3575957","url":null,"abstract":"Video grounding tasks have recently gained significant attention. However, existing methods failed to fully comprehend the semantics within queries and videos, often overlooking key content. Moreover, the lack of fine-grained cross-modal alignment and interaction to guide the semantic matching of complex texts and videos lead to inconsistent representational modeling. To address this issue, we propose a Semantic Hierarchical Grounding model, referred to as SHG, and design a cross-modal semantic hierarchical graph to achieve fine-grained semantic understanding. SHG decomposes both the query and each video moment into three levels: global, action, and element. This topology, ranging from global to local, establishes multi-granularity intrinsic connections between the two modalities, fostering a comprehensive understanding of dynamic semantics and fine-grained cross-modal matching. Accordingly, to fully leverage the rich information within the cross-modal semantic hierarchical graph, we employ contrastive learning by seeking samples with the same action and element semantics, then achieve node-moment cross-modal hierarchical matching for global alignment. This approach can unearth fine-grained clues and align semantics across multiple granularities. Moreover, we combine the designed hierarchical graph interaction for coarse-to-fine fusion of text and video, thereby enabling highly accurate video grounding. Extensive experiments conducted on three challenging public datasets (ActivityNet-Captions, TACoS, and Charades-STA) demonstrate that the proposed approach outperforms state-of-the-art techniques, validating its effectiveness.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 11","pages":"11641-11654"},"PeriodicalIF":11.1,"publicationDate":"2025-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145405268","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-06-02 DOI: 10.1109/TCSVT.2025.3575470
Yong Chen;Feiwang Yuan;Wenzhen Lai;Jinshan Zeng;Wei He;Qing Huang
Snapshot compressive imaging (SCI) captures a 3D hyperspectral image (HSI) as a 2D compressive measurement and reconstructs the desired 3D HSI from that 2D measurement. An effective reconstruction method is therefore crucial in SCI. Despite recent successes of deep learning (DL)-based methods over traditional approaches, they often ignore the intrinsic characteristics of HSI and are trained for a specific imaging system using sufficient paired datasets. To address this, we propose a novel self-supervised HSI reconstruction framework called low-rank tensor meets deep prior (LDMeet), which couples model-driven and data-driven methods. The design of LDMeet is inspired by the traditional model-driven low-rank tensor prior constructed from domain knowledge, which exploits the intrinsic global spatial-spectral correlation of HSI and makes the reconstruction method interpretable. To further utilize the powerful learning ability of DL-based approaches, we introduce a self-supervised spatial-spectral guided network (SSG-Net) into LDMeet to learn the implicit deep spatial-spectral prior of HSI without requiring training data, making it adaptable to various imaging systems. An efficient alternating direction method of multipliers (ADMM) algorithm is designed to solve the LDMeet model. Comprehensive experiments confirm that LDMeet achieves superior results compared to existing self-supervised HSI reconstruction methods, while yielding results competitive with supervised learning methods.
"Low-Rank Tensor Meets Deep Prior: Coupling Model-Driven and Data-Driven Methods for Hyperspectral Image Reconstruction," IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 11, pp. 11685-11697.
Pub Date : 2025-04-15 DOI: 10.1109/TCSVT.2025.3560637
Nanhua Chen;Dongshuo Zhang;Kai Jiang;Meng Yu;Yeqing Zhu;Tai-Shan Lou;Liangyu Zhao
Cross-view geo-localization provides an offline visual positioning strategy for unmanned aerial vehicles (UAVs) in Global Navigation Satellite System (GNSS)-denied environments. However, it still faces the following challenges, leading to suboptimal localization performance: 1) Existing methods primarily focus on extracting global features or local features by partitioning feature maps, neglecting the exploration of spatial information, which is essential for extracting consistent feature representations and aligning images of identical targets across different views. 2) Cross-view geo-localization encounters the challenge of data imbalance between UAV and satellite images. To address these challenges, the Spatial Hybrid Attention Network with Adaptive Cross-Entropy Loss Function (SHAA) is proposed. To tackle the first issue, the Spatial Hybrid Attention (SHA) method employs a Spatial Shift-MLP (SSM) to focus on the spatial geometric correspondences in feature maps across different views, extracting both global features and fine-grained features. Additionally, the SHA method utilizes a Hybrid Attention (HA) mechanism to enhance feature extraction diversity and robustness by capturing interactions between spatial and channel dimensions, thereby extracting consistent cross-view features and aligning images. For the second challenge, the Adaptive Cross-Entropy (ACE) loss function incorporates adaptive weights to emphasize hard samples, alleviating data imbalance issues and improving training effectiveness. Extensive experiments on widely recognized benchmarks, including University-1652, SUES-200, and DenseUAV, demonstrate that SHAA achieves state-of-the-art performance, outperforming existing methods by over 3.92%. Code will be released at: https://github.com/chennanhua001/SHAA.
"SHAA: Spatial Hybrid Attention Network With Adaptive Cross-Entropy Loss Function for UAV-View Geo-Localization," IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 9, pp. 9398-9413.
Pub Date : 2025-04-14 DOI: 10.1109/TCSVT.2025.3560369
Wuzhen Shi;Zibang Xue;Yang Wen
This paper introduces a high-quality talking head generation method jointly driven by keypoints and action units, aiming to strike a balance between low-bandwidth transmission and high-quality generation in video conferencing scenarios. Existing talking head generation methods often face limitations: they either require an excessive amount of driving information or struggle with accuracy and quality when adapted to low-bandwidth conditions. To address this, we decompose the talking head generation task into two components: a driving task focused on control with limited information, and an enhancement task aimed at high-quality, high-definition output. The proposed method incorporates the joint driving of keypoints and action units, improving the accuracy of pose and expression generation while remaining suitable for low-bandwidth environments. Furthermore, we implement a multi-step video quality enhancement process targeting both the entire frame and key regions, while incorporating temporal consistency constraints. By leveraging attention mechanisms, we enhance the realism of the challenging-to-generate mouth region and mitigate background jitter through background fusion. Finally, a prior-driven super-resolution network is employed to achieve high-quality display. Extensive experiments demonstrate that our method effectively supports low-resolution recording, low-bandwidth transmission, and high-definition display.
{"title":"Keypoints and Action Units Jointly Drive Talking Head Generation for Video Conferencing","authors":"Wuzhen Shi;Zibang Xue;Yang Wen","doi":"10.1109/TCSVT.2025.3560369","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3560369","url":null,"abstract":"This paper introduces a high-quality talking head generation method that is jointly driven by keypoints and action units, aiming to strike a balance between low-bandwidth transmission and high-quality generation in video conference scenarios. Existing methods for talking head generation often face limitations: they either require an excessive amount of driving information or struggle with accuracy and quality when adapted to low-bandwidth conditions. To address this, we decompose the talking head generation task into two components: a driving task, focused on information-limited control, and an enhancement task, aimed at achieving high-quality, high-definition output. Our proposed method innovatively incorporates the joint driving of keypoints and action units, improving the accuracy of pose and expression generation while remaining suitable for low-bandwidth environments. Furthermore, we implement a multi-step video quality enhancement process, targeting both the entire frame and key regions, while incorporating temporal consistency constraints. By leveraging attention mechanisms, we enhance the realism of the challenging-to-generate mouth regions and mitigate background jitter through background fusion. Finally, a prior-driven super-resolution network is employed to achieve high-quality display. Extensive experiments demonstrate that our method effectively supports low-resolution recording, low-bandwidth transmission, and high-definition display.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"8692-8706"},"PeriodicalIF":11.1,"publicationDate":"2025-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145021298","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-04-11 DOI: 10.1109/TCSVT.2025.3560118
Ruiqi Wu;Bingliang Jiao;Meng Liu;Shining Wang;Wenxuan Wang;Peng Wang
Visible-Infrared Person Re-identification (VI ReID) aims to achieve cross-modality re-identification by matching pedestrian images captured under visible and infrared illumination. A crucial challenge in this task is mitigating the impact of modality divergence so that the VI ReID model can learn cross-modality correspondence. Regarding this challenge, existing methods primarily focus on eliminating the information gap between modalities by extracting modality-invariant information or supplementing inputs with specific information from the other modality. However, these methods may focus too heavily on bridging the information gap, a challenging issue that can overshadow the inherent complexities of cross-modality ReID itself. Based on this insight, we propose a straightforward yet effective strategy that gives the VI ReID model sufficient flexibility to adapt to diverse modality inputs and thus achieve cross-modality ReID effectively. Specifically, we introduce a Modality-aware and Instance-aware Visual Prompts (MIP) network, which leverages a transformer architecture with customized visual prompts. In our MIP, a set of modality-aware prompts enables the model to dynamically adapt to diverse modality inputs and effectively extract information for identification, thereby alleviating the interference of modality divergence. In addition, we propose instance-aware prompts, which guide the model to adapt to individual pedestrians and capture discriminative clues for accurate identification. Extensive experiments on four mainstream VI ReID datasets evaluate the effectiveness of the designed modules; furthermore, the proposed MIP network outperforms most current state-of-the-art methods.
{"title":"Enhancing Visible-Infrared Person Re-Identification With Modality- and Instance-Aware Adaptation Learning","authors":"Ruiqi Wu;Bingliang Jiao;Meng Liu;Shining Wang;Wenxuan Wang;Peng Wang","doi":"10.1109/TCSVT.2025.3560118","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3560118","url":null,"abstract":"The Visible-Infrared Person Re-identification (VI ReID) aims to achieve cross-modality re-identification by matching pedestrian images from visible and infrared illumination. A crucial challenge in this task is mitigating the impact of modality divergence to enable the VI ReID model to learn cross-modality correspondence. Regarding this challenge, existing methods primarily focus on eliminating the information gap between different modalities by extracting modality-invariant information or supplementing inputs with specific information from another modality. However, these methods may overly focus on bridging the information gap, a challenging issue that could potentially overshadow the inherent complexities of cross-modality ReID itself. Based on this insight, we propose a straightforward yet effective strategy to empower the VI ReID model with sufficient flexibility to adapt diverse modality inputs to achieve cross-modality ReID effectively. Specifically, we introduce a Modality-aware and Instance-aware Visual Prompts (MIP) network, leveraging transformer architecture with customized visual prompts. In our MIP, a set of modality-aware prompts is designed to enable our model to dynamically adapt diverse modality inputs and effectively extract information for identification, thereby alleviating the interference of modality divergence. Besides, we also propose the instance-aware prompts, which are responsible for guiding the model to adapt individual pedestrians and capture discriminative clues for accurate identification. Through extensive experiments on four mainstream VI ReID datasets, the effectiveness of our designed modules is evaluated. Furthermore, our proposed MIP network outperforms most current state-of-the-art methods.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 8","pages":"8086-8103"},"PeriodicalIF":11.1,"publicationDate":"2025-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144781983","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-04-10 DOI: 10.1109/TCSVT.2025.3558753
Kangdao Liu;Tianhao Sun;Hao Zeng;Yongshan Zhang;Chi-Man Pun;Chi-Man Vong
Hyperspectral image (HSI) classification involves assigning a label to each pixel to identify various land cover categories. While deep classifiers have achieved high predictive accuracy in this field, they lack the ability to rigorously quantify confidence in their predictions. This limitation restricts their application in critical contexts where the cost of prediction errors is significant, as quantifying the uncertainty of model predictions is crucial for the safe deployment of predictive models. To address this limitation, a rigorous theoretical proof is first presented, demonstrating the validity of Conformal Prediction, an emerging uncertainty quantification technique, in the context of HSI classification. Building on this foundation, a conformal procedure is designed to equip any pre-trained HSI classifier with trustworthy prediction sets, ensuring that the true labels are included with a user-defined probability (e.g., 95%). Furthermore, a novel Conformal Prediction framework specifically designed for HSI data, called Spatial-Aware Conformal Prediction (SACP), is proposed. This framework integrates the essential spatial information of HSI by aggregating the non-conformity scores of pixels with high spatial correlation, effectively improving the statistical efficiency of prediction sets. Both theoretical and empirical results validate the effectiveness of the proposed approaches. The source code is available at https://github.com/J4ckLiu/SACP.
"Spatial-Aware Conformal Prediction for Trustworthy Hyperspectral Image Classification," IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 9, pp. 8754-8766.
Pub Date : 2025-04-10 DOI: 10.1109/TCSVT.2025.3558801
Nishang Xie;Tao Zhang;Lanyu Zhang;Jie Chen;Feiming Wei;Wenxian Yu
Due to the difficulty of obtaining data for valuable targets, few-shot learning plays a critical role in synthetic aperture radar (SAR) target recognition. However, the high noise levels and complex backgrounds inherent in SAR data make this technology difficult to implement. To improve recognition accuracy, in this paper we propose a novel vision-language framework, VLF-SAR, with two specialized models: VLF-SAR-P for polarimetric SAR (PolSAR) data and VLF-SAR-T for traditional SAR data. Both models start with a frequency embedded module (FEM) that generates key structural features. For VLF-SAR-P, a polarimetric feature selector (PFS) is further introduced to identify the most relevant polarimetric features, and a novel adaptive multimodal triple attention mechanism (AMTAM) is designed to facilitate dynamic interactions among different kinds of features. For VLF-SAR-T, a multimodal fusion attention mechanism (MFAM) is proposed after the FEM to fuse and adapt information extracted from frozen contrastive language-image pre-training (CLIP) encoders across different modalities. Extensive experiments on the OpenSARShip2.0, FUSAR-Ship, and SAR-AirCraft-1.0 datasets demonstrate the superiority of VLF-SAR over some state-of-the-art methods, offering a promising approach for few-shot SAR target recognition.
"VLF-SAR: A Novel Vision-Language Framework for Few-Shot SAR Target Recognition," IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 9, pp. 9530-9544.
Pub Date : 2025-04-10 DOI: 10.1109/TCSVT.2025.3559614
Mianzhao Wang;Fan Shi;Xu Cheng;Shengyong Chen
Event data can asynchronously capture variations in light intensity, thereby implicitly providing valuable complementary cues for RGB-Event tracking. Existing methods typically employ a direct interaction mechanism to fuse RGB and event data. However, due to differences in imaging mechanisms, the representational disparity between these two data types is not fixed, which can lead to tracking failures in certain challenging scenarios. To address this issue, we propose a novel prior knowledge-driven hybrid prompter learning (HPL) framework for RGB-Event tracking. Specifically, we develop a frame-event hybrid prompter that leverages prior tracking knowledge from the foundation model as intermediate-modality support to mitigate the heterogeneity between RGB and event data. By leveraging this rich prior tracking knowledge, the intermediate modality narrows the gap between dense RGB and sparse event interactions, effectively guiding complementary learning between the modalities. Meanwhile, to mitigate the internal learning disparities between the lightweight hybrid prompter and the deep transformer model, we introduce a pseudo-prompt learning strategy that lies between full fine-tuning and partial fine-tuning. This strategy adopts a divide-and-conquer approach that assigns different learning rates to modules with distinct functions, effectively reducing the dominant influence of RGB information in complex scenarios. Extensive experiments on two public RGB-Event tracking datasets show that the proposed HPL outperforms state-of-the-art tracking methods, achieving exceptional performance.
{"title":"Prior Knowledge-Driven Hybrid Prompter Learning for RGB-Event Tracking","authors":"Mianzhao Wang;Fan Shi;Xu Cheng;Shengyong Chen","doi":"10.1109/TCSVT.2025.3559614","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3559614","url":null,"abstract":"Event data can asynchronously capture variations in light intensity, thereby implicitly providing valuable complementary cues for RGB-Event tracking. Existing methods typically employ a direct interaction mechanism to fuse RGB and event data. However, due to differences in imaging mechanisms, the representational disparity between these two data types is not fixed, which can lead to tracking failures in certain challenging scenarios. To address this issue, we propose a novel prior knowledge-driven hybrid prompter learning framework for RGB-Event tracking. Specifically, we develop a frame-event hybrid prompter that leverages prior tracking knowledge from the foundation model as intermediate modal support to mitigate the heterogeneity between RGB and event data. By leveraging its rich prior tracking knowledge, the intermediate modal reduces the gap between the dense RGB and sparse event data interactions, effectively guiding complementary learning between modalities. Meanwhile, to mitigate the internal learning disparities between the lightweight hybrid prompter and the deep transformer model, we introduce a pseudo-prompt learning strategy that lies between full fine-tuning and partial fine-tuning. This strategy adopts a divide-and-conquer approach to assign different learning rates to modules with distinct functions, effectively reducing the dominant influence of RGB information in complex scenarios. Extensive experiments conducted on two public RGB-Event tracking datasets show that the proposed HPL outperforms state-of-the-art tracking methods, achieving exceptional performance.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"8679-8691"},"PeriodicalIF":11.1,"publicationDate":"2025-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145021433","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-04-09 DOI: 10.1109/TCSVT.2025.3558785
Jiaxin Chen;Jiawen Peng;Yanzuo Lu;Jian-Huang Lai;Andy J. Ma
Unsupervised few-shot action recognition is a practical but challenging task that adapts knowledge learned from unlabeled videos to novel action classes with only limited labeled data. Without annotated data from base action classes for meta-learning, existing approaches cannot achieve satisfactory performance due to low-quality pseudo-classes and episodes. Although vision-language pre-training models such as CLIP can be employed to improve the quality of pseudo-classes and episodes, the performance gains may still be limited when only the visual encoder is used and textual modality information is absent. In this paper, we propose to fully exploit the multimodal knowledge of the pre-trained vision-language model CLIP in a novel framework for unsupervised video meta-learning. A textual modality is automatically generated for each unlabeled video by a video-to-text transformer. Multimodal adaptive clustering for episodic sampling (MACES), based on a video-text ensemble distance metric, is proposed to accurately estimate pseudo-classes and construct high-quality few-shot tasks (episodes) for episodic training. Vision-language meta-adaptation (VLMA) is designed to adapt the pre-trained model to novel tasks through category-aware vision-language contrastive learning and confidence-based reliable bidirectional knowledge distillation. The final prediction is obtained by multimodal adaptive inference. Extensive experiments on five benchmarks demonstrate the superiority of our method for unsupervised few-shot action recognition.
{"title":"Vision-Language Adaptive Clustering and Meta-Adaptation for Unsupervised Few-Shot Action Recognition","authors":"Jiaxin Chen;Jiawen Peng;Yanzuo Lu;Jian-Huang Lai;Andy J. Ma","doi":"10.1109/TCSVT.2025.3558785","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3558785","url":null,"abstract":"Unsupervised few-shot action recognition is a practical but challenging task, which adapts knowledge learned from unlabeled videos to novel action classes with only limited labeled data. Without annotated data of base action classes for meta-learning, it cannot achieve satisfactory performance due to the low-quality pseudo-classes and episodes. Though vision-language pre-training models such as CLIP can be employed to improve the quality of pseudo-classes and episodes, the performance improvements may still be limited by using only the visual encoder in the absence of textual modality information. In this paper, we propose fully exploiting the multimodal knowledge of a pre-trained vision-language model CLIP in a novel framework for unsupervised video meta-learning. Textual modality is automatically generated for each unlabeled video by a video-to-text transformer. Multimodal adaptive clustering for episodic sampling (MACES) based on a video-text ensemble distance metric is proposed to accurately estimate pseudo-classes, which constructs high-quality few-shot tasks (episodes) for episodic training. Vision-language meta-adaptation (VLMA) is designed for adapting the pre-trained model to novel tasks by category-aware vision-language contrastive learning and confidence-based reliable bidirectional knowledge distillation. The final prediction is obtained by multimodal adaptive inference. Extensive experiments on five benchmarks demonstrate the superiority of our method for unsupervised few-shot action recognition.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"9246-9260"},"PeriodicalIF":11.1,"publicationDate":"2025-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145021454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-04-09 DOI: 10.1109/TCSVT.2025.3559214
Chao Qu;Zewei Chen;Jingyuan Zhang;Xiaoyu Chen;Jing Han
Digital images captured by unstable imaging systems often suffer simultaneously from random noise and stripe noise. Due to the complex noise distribution, denoising and destriping methods based on simple handcrafted priors may leave residual noise. Although supervised methods have achieved some progress, they rely on large-scale noisy-clean image pairs, which are challenging to obtain in practice. To address these problems, we propose a self-supervised image denoising and destriping method based on blind-spot regularization, named Self-BSR. This method transforms the overall denoising and destriping problem into a modeling task for two spatially correlated signals: image and stripe. Specifically, blind-spot regularization leverages the spatial continuity learned by the improved blind-spot network to separately constrain the reconstruction of image and stripe while suppressing pixel-wise independent noise. This regularization has two advantages: first, it is adaptively formulated based on implicit network priors, without any explicit parametric modeling of image and noise; second, it enables Self-BSR to learn denoising and destriping only from noisy images. In addition, we introduce a directional feature unshuffle in Self-BSR, which extracts multi-directional information to provide discriminative features for separating the image from the stripe. Furthermore, a feature-resampling refinement is proposed to improve the reconstruction ability of Self-BSR by resampling pixels with high spatial correlation in the receptive field. Extensive experiments on synthetic and real-world datasets demonstrate significant advantages of the proposed method over existing methods in denoising and destriping performance. The code will be publicly available at https://github.com/Jocobqc/Self-BSR.
{"title":"Self-BSR: Self-Supervised Image Denoising and Destriping Based on Blind-Spot Regularization","authors":"Chao Qu;Zewei Chen;Jingyuan Zhang;Xiaoyu Chen;Jing Han","doi":"10.1109/TCSVT.2025.3559214","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3559214","url":null,"abstract":"Digital images captured by unstable imaging systems often simultaneously suffer from random noise and stripe noise. Due to the complex noise distribution, denoising and destriping methods based on simple handcrafted priors may leave residual noise. Although supervised methods have achieved some progress, they rely on large-scale noisy-clean image pairs, which are challenging to obtain in practice. To address these problems, we propose a self-supervised image denoising and destriping method based on blind-spot regularization, named Self-BSR. This method transforms the overall denoising and destriping problem into a modeling task for two spatially correlated signals: image and stripe. Specifically, blind-spot regularization leverages spatial continuity learned by the improved blind-spot network to separately constrain the reconstruction of image and stripe while suppressing pixel-wise independent noise. This regularization has two advantages: first, it is adaptively formulated based on implicit network priors, without any explicit parametric modeling of image and noise; second, it enables Self-BSR to learn denoising and destriping only from noisy images. In addition, we introduce the directional feature unshuffle in Self-BSR, which extracts multi-directional information to provide discriminative features for separating image from stripe. Furthermore, the feature-resampling refinement is proposed to improve the reconstruction ability of Self-BSR by resampling pixels with high spatial correlation in the receptive field. Extensive experiments on synthetic and real-world datasets demonstrate significant advantages of the proposed method over existing methods in denoising and destriping performance. The code will be publicly available at <uri>https://github.com/Jocobqc/Self-BSR</uri>","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"8666-8678"},"PeriodicalIF":11.1,"publicationDate":"2025-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145021136","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}