
Latest Publications: IEEE Transactions on Circuits and Systems for Video Technology

Fine-Grained Alignment and Interaction for Video Grounding With Cross-Modal Semantic Hierarchical Graph
IF 11.1 | CAS Tier 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-06-02 | DOI: 10.1109/TCSVT.2025.3575957
Ran Ran;Jiwei Wei;Shiyuan He;Yuyang Zhou;Peng Wang;Yang Yang;Heng Tao Shen
Video grounding tasks have recently gained significant attention. However, existing methods fail to fully comprehend the semantics within queries and videos, often overlooking key content. Moreover, the lack of fine-grained cross-modal alignment and interaction to guide the semantic matching of complex texts and videos leads to inconsistent representational modeling. To address these issues, we propose a Semantic Hierarchical Grounding model, referred to as SHG, and design a cross-modal semantic hierarchical graph to achieve fine-grained semantic understanding. SHG decomposes both the query and each video moment into three levels: global, action, and element. This topology, ranging from global to local, establishes multi-granularity intrinsic connections between the two modalities, fostering a comprehensive understanding of dynamic semantics and fine-grained cross-modal matching. Accordingly, to fully leverage the rich information within the cross-modal semantic hierarchical graph, we employ contrastive learning by seeking samples with the same action and element semantics, and then achieve node-moment cross-modal hierarchical matching for global alignment. This approach can unearth fine-grained clues and align semantics across multiple granularities. Moreover, we combine the designed hierarchical graph interaction for coarse-to-fine fusion of text and video, thereby enabling highly accurate video grounding. Extensive experiments conducted on three challenging public datasets (ActivityNet-Captions, TACoS, and Charades-STA) demonstrate that the proposed approach outperforms state-of-the-art techniques, validating its effectiveness.
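The node-moment contrastive alignment described above can be pictured with a minimal sketch: query nodes and video moments that share action or element semantics are treated as positives under an InfoNCE-style loss. The function below is a generic PyTorch illustration under assumed inputs (the function name, temperature, and the `same_semantics` mask are hypothetical), not the authors' implementation.

```python
# Minimal sketch of cross-modal node-moment contrastive alignment (not the authors' code).
# Assumes text-node and video-moment embeddings at one granularity plus a boolean matrix
# marking which pairs share the same action/element semantics (hypothetical inputs).
import torch
import torch.nn.functional as F

def hierarchical_contrastive_loss(text_nodes, video_moments, same_semantics, temperature=0.07):
    """text_nodes: (N, D) query-node features at one granularity (e.g., action level).
    video_moments: (N, D) moment features at the same granularity.
    same_semantics: (N, N) boolean matrix; True where row i and column j share semantics."""
    t = F.normalize(text_nodes, dim=-1)
    v = F.normalize(video_moments, dim=-1)
    logits = t @ v.T / temperature                                   # (N, N) similarity matrix
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True) # log-softmax over moments
    pos = same_semantics.float()
    # average log-likelihood of the positive moments for each query node
    loss = -(pos * log_prob).sum(dim=1) / pos.sum(dim=1).clamp(min=1)
    return loss.mean()

# toy usage
N, D = 8, 256
text = torch.randn(N, D)
video = torch.randn(N, D)
labels = torch.randint(0, 3, (N,))
mask = labels[:, None] == labels[None, :]
print(hierarchical_contrastive_loss(text, video, mask).item())
```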
IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 11, pp. 11641-11654.
Citations: 0
Low-Rank Tensor Meets Deep Prior: Coupling Model-Driven and Data-Driven Methods for Hyperspectral Image Reconstruction
IF 11.1 | CAS Tier 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-06-02 | DOI: 10.1109/TCSVT.2025.3575470
Yong Chen;Feiwang Yuan;Wenzhen Lai;Jinshan Zeng;Wei He;Qing Huang
Snapshot compressive imaging (SCI) captures a 3D hyperspectral image (HSI) using a 2D compressive measurement and reconstructs the desired 3D HSI from that 2D measurement. An effective reconstruction method is thus crucial in SCI. Despite recent successes of deep learning (DL)-based methods over traditional approaches, they often ignore the intrinsic characteristics of HSI and are trained for a specific imaging system using sufficient paired datasets. To address this, we propose a novel self-supervised HSI reconstruction framework called low-rank tensor meets deep prior (LDMeet), which couples model-driven and data-driven methods. The design of LDMeet is inspired by the traditional model-driven low-rank tensor prior constructed from domain knowledge, which can explore the intrinsic global spatial-spectral correlation of HSI and make the reconstruction method interpretable. To further utilize the powerful learning ability of DL-based approaches, we introduce a self-supervised spatial-spectral guided network (SSG-Net) into LDMeet to learn the implicit deep spatial-spectral prior of HSI without requiring training data, making it adaptable to various imaging systems. An efficient alternating direction method of multipliers (ADMM) algorithm is designed to solve the LDMeet model. Comprehensive experiments confirm that our LDMeet achieves superior results compared to self-supervised HSI reconstruction methods, while also yielding results competitive with supervised learning methods.
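As a rough illustration of how an ADMM solver can alternate between data fidelity, a low-rank prior, and a learned prior, the sketch below uses a generic linear operator and a placeholder denoiser standing in for SSG-Net; the splitting, step size, and operators are assumptions, not the exact LDMeet updates.

```python
# Schematic plug-and-play ADMM loop coupling a low-rank prior (singular-value thresholding
# on the band-unfolded cube) with a learned prior; A/At and `denoiser` are user-supplied
# callables, and the update rules are simplified stand-ins for the paper's solver.
import numpy as np

def svt(M, tau):
    """Singular-value thresholding: proximal operator of the nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def admm_sci(y, A, At, denoiser, shape, rho=1.0, tau=0.1, step=0.1, iters=30):
    """y: 1D measurement; A/At: forward operator and adjoint; shape: (H, W, bands)."""
    x = At(y).reshape(shape)              # initial estimate from the adjoint
    z = x.copy()                          # splitting variable carrying the priors
    u = np.zeros_like(x)                  # scaled dual variable
    for _ in range(iters):
        # data-fidelity step: one gradient step on ||Ax - y||^2 + (rho/2)||x - z + u||^2
        grad = At(A(x.ravel()) - y).reshape(shape) + rho * (x - z + u)
        x = x - step * grad
        # prior step: low-rank prior on the band-unfolded matrix, then the deep prior
        v = svt((x + u).reshape(-1, shape[-1]), tau / rho).reshape(shape)
        z = denoiser(v)
        u = u + x - z
    return z

# toy usage with identity operators and an identity "denoiser"
shape = (8, 8, 4)
y = np.random.rand(int(np.prod(shape)))
rec = admm_sci(y, lambda v: v, lambda v: v, lambda z: z, shape)
print(rec.shape)
```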
IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 11, pp. 11685-11697.
Citations: 0
SHAA: Spatial Hybrid Attention Network With Adaptive Cross-Entropy Loss Function for UAV-View Geo-Localization
IF 11.1 | CAS Tier 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-04-15 | DOI: 10.1109/TCSVT.2025.3560637
Nanhua Chen;Dongshuo Zhang;Kai Jiang;Meng Yu;Yeqing Zhu;Tai-Shan Lou;Liangyu Zhao
Cross-view geo-localization provides an offline visual positioning strategy for unmanned aerial vehicles (UAVs) in Global Navigation Satellite System (GNSS)-denied environments. However, it still faces the following challenges, leading to suboptimal localization performance: 1) Existing methods primarily focus on extracting global features or local features by partitioning feature maps, neglecting the exploration of spatial information, which is essential for extracting consistent feature representations and aligning images of identical targets across different views. 2) Cross-view geo-localization encounters the challenge of data imbalance between UAV and satellite images. To address these challenges, the Spatial Hybrid Attention Network with Adaptive Cross-Entropy Loss Function (SHAA) is proposed. To tackle the first issue, the Spatial Hybrid Attention (SHA) method employs a Spatial Shift-MLP (SSM) to focus on the spatial geometric correspondences in feature maps across different views, extracting both global features and fine-grained features. Additionally, the SHA method utilizes a Hybrid Attention (HA) mechanism to enhance feature extraction diversity and robustness by capturing interactions between spatial and channel dimensions, thereby extracting consistent cross-view features and aligning images. For the second challenge, the Adaptive Cross-Entropy (ACE) loss function incorporates adaptive weights to emphasize hard samples, alleviating data imbalance issues and improving training effectiveness. Extensive experiments on widely recognized benchmarks, including University-1652, SUES-200, and DenseUAV, demonstrate that SHAA achieves state-of-the-art performance, outperforming existing methods by over 3.92%. Code will be released at: https://github.com/chennanhua001/SHAA.
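The idea of re-weighting the cross-entropy loss toward hard samples can be sketched as follows; the (1 - p)^gamma weighting is an assumption in the spirit of focal loss, and the class count is only illustrative, so this is not the exact ACE formulation from the paper.

```python
# Minimal sketch of a cross-entropy loss with adaptive per-sample weights that emphasize
# hard samples (low confidence on the true class); the weighting form is an assumption.
import torch
import torch.nn.functional as F

def adaptive_cross_entropy(logits, targets, gamma=2.0):
    log_p = F.log_softmax(logits, dim=-1)
    true_log_p = log_p.gather(1, targets[:, None]).squeeze(1)   # log-prob of the true class
    weight = (1.0 - true_log_p.exp()) ** gamma                  # larger weight for hard samples
    return -(weight * true_log_p).mean()

# toy usage; 701 classes is only illustrative (roughly the scale of University-1652 buildings)
logits = torch.randn(16, 701)
targets = torch.randint(0, 701, (16,))
print(adaptive_cross_entropy(logits, targets).item())
```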
IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 9, pp. 9398-9413.
Citations: 0
Keypoints and Action Units Jointly Drive Talking Head Generation for Video Conferencing
IF 11.1 | CAS Tier 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-04-14 | DOI: 10.1109/TCSVT.2025.3560369
Wuzhen Shi;Zibang Xue;Yang Wen
This paper introduces a high-quality talking head generation method that is jointly driven by keypoints and action units, aiming to strike a balance between low-bandwidth transmission and high-quality generation in video conference scenarios. Existing methods for talking head generation often face limitations: they either require an excessive amount of driving information or struggle with accuracy and quality when adapted to low-bandwidth conditions. To address this, we decompose the talking head generation task into two components: a driving task, focused on information-limited control, and an enhancement task, aimed at achieving high-quality, high-definition output. Our proposed method innovatively incorporates the joint driving of keypoints and action units, improving the accuracy of pose and expression generation while remaining suitable for low-bandwidth environments. Furthermore, we implement a multi-step video quality enhancement process, targeting both the entire frame and key regions, while incorporating temporal consistency constraints. By leveraging attention mechanisms, we enhance the realism of the challenging-to-generate mouth regions and mitigate background jitter through background fusion. Finally, a prior-driven super-resolution network is employed to achieve high-quality display. Extensive experiments demonstrate that our method effectively supports low-resolution recording, low-bandwidth transmission, and high-definition display.
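A back-of-the-envelope calculation shows why transmitting keypoints and action units instead of video frames suits low-bandwidth links; all counts below (landmark and AU numbers, numeric precision, reference video bitrate) are illustrative assumptions rather than figures from the paper.

```python
# Back-of-the-envelope bitrate comparison between sending driving parameters
# (keypoints + action units) and sending compressed conferencing video.
fps = 25
num_keypoints = 68            # assumed 2D facial landmarks
num_action_units = 17         # assumed AU intensities
bytes_per_value = 2           # assume float16 per value

values_per_frame = num_keypoints * 2 + num_action_units
driving_kbps = values_per_frame * bytes_per_value * 8 * fps / 1000
print(f"driving signal: ~{driving_kbps:.1f} kbit/s")          # on the order of tens of kbit/s

typical_video_kbps = 1000      # an assumed, common bitrate for 720p conferencing video
print(f"reduction factor vs. video: ~{typical_video_kbps / driving_kbps:.0f}x")
```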
IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 9, pp. 8692-8706.
Citations: 0
Enhancing Visible-Infrared Person Re-Identification With Modality- and Instance-Aware Adaptation Learning
IF 11.1 | CAS Tier 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-04-11 | DOI: 10.1109/TCSVT.2025.3560118
Ruiqi Wu;Bingliang Jiao;Meng Liu;Shining Wang;Wenxuan Wang;Peng Wang
Visible-Infrared Person Re-identification (VI ReID) aims to achieve cross-modality re-identification by matching pedestrian images captured under visible and infrared illumination. A crucial challenge in this task is mitigating the impact of modality divergence so that the VI ReID model can learn cross-modality correspondence. Regarding this challenge, existing methods primarily focus on eliminating the information gap between different modalities by extracting modality-invariant information or supplementing inputs with specific information from another modality. However, these methods may overly focus on bridging the information gap, a challenging issue that could potentially overshadow the inherent complexities of cross-modality ReID itself. Based on this insight, we propose a straightforward yet effective strategy that gives the VI ReID model sufficient flexibility to adapt to diverse modality inputs and thus achieve cross-modality ReID effectively. Specifically, we introduce a Modality-aware and Instance-aware Visual Prompts (MIP) network, leveraging a transformer architecture with customized visual prompts. In our MIP, a set of modality-aware prompts is designed to enable our model to dynamically adapt to diverse modality inputs and effectively extract information for identification, thereby alleviating the interference of modality divergence. In addition, we propose instance-aware prompts, which guide the model to adapt to individual pedestrians and capture discriminative clues for accurate identification. Extensive experiments on four mainstream VI ReID datasets validate the effectiveness of our designed modules. Furthermore, our proposed MIP network outperforms most current state-of-the-art methods.
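A minimal sketch of the prompt mechanism: learnable modality-aware prompt tokens (one bank per modality) plus an instance-aware prompt derived from the image itself are prepended to the patch tokens before a shared transformer encoder. The module below, its sizes, and the mean-pooled instance prompt are hypothetical illustrations, not the authors' MIP network.

```python
# Sketch of prepending modality-aware and instance-aware prompt tokens to patch tokens
# before a shared transformer encoder; layer sizes and the instance-prompt generator
# are assumptions for illustration.
import torch
import torch.nn as nn

class PromptedEncoder(nn.Module):
    def __init__(self, dim=256, num_prompts=4, num_modalities=2):
        super().__init__()
        # one learnable prompt bank per modality (0: visible, 1: infrared)
        self.modality_prompts = nn.Parameter(torch.randn(num_modalities, num_prompts, dim) * 0.02)
        self.instance_proj = nn.Linear(dim, dim)   # builds an instance-aware prompt from the image
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, patch_tokens, modality):     # patch_tokens: (B, N, D), modality: (B,)
        mod_prompts = self.modality_prompts[modality]                              # (B, P, D)
        inst_prompt = self.instance_proj(patch_tokens.mean(dim=1, keepdim=True))   # (B, 1, D)
        tokens = torch.cat([mod_prompts, inst_prompt, patch_tokens], dim=1)
        return self.encoder(tokens)

x = torch.randn(2, 16, 256)                        # toy patch tokens
out = PromptedEncoder()(x, torch.tensor([0, 1]))   # one visible, one infrared sample
print(out.shape)                                   # (2, 4 + 1 + 16, 256)
```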
IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 8, pp. 8086-8103.
Citations: 0
Spatial-Aware Conformal Prediction for Trustworthy Hyperspectral Image Classification
IF 11.1 | CAS Tier 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-04-10 | DOI: 10.1109/TCSVT.2025.3558753
Kangdao Liu;Tianhao Sun;Hao Zeng;Yongshan Zhang;Chi-Man Pun;Chi-Man Vong
Hyperspectral image (HSI) classification involves assigning unique labels to each pixel to identify various land cover categories. While deep classifiers have achieved high predictive accuracy in this field, they lack the ability to rigorously quantify confidence in their predictions. This limitation restricts their application in critical contexts where the cost of prediction errors is significant, as quantifying the uncertainty of model predictions is crucial for the safe deployment of predictive models. To address this limitation, a rigorous theoretical proof is presented first, which demonstrates the validity of Conformal Prediction, an emerging uncertainty quantification technique, in the context of HSI classification. Building on this foundation, a conformal procedure is designed to equip any pre-trained HSI classifier with trustworthy prediction sets, ensuring that the true labels are included with a user-defined probability (e.g., 95%). Furthermore, a novel framework of Conformal Prediction specifically designed for HSI data, called Spatial-Aware Conformal Prediction (SACP), is proposed. This framework integrates essential spatial information of HSI by aggregating the non-conformity scores of pixels with high spatial correlation, effectively improving the statistical efficiency of prediction sets. Both theoretical and empirical results validate the effectiveness of the proposed approaches. The source code is available at https://github.com/J4ckLiu/SACP
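A minimal sketch of split conformal prediction for pixel-wise classification with spatially aggregated non-conformity scores is given below; the simple neighborhood average stands in for SACP's correlation-based aggregation, and the function names and toy data are assumptions.

```python
# Split conformal prediction for HSI pixels with spatially smoothed non-conformity scores;
# the box-filter smoothing and the omission of test-time smoothing are simplifications.
import numpy as np

def smooth_scores(score_map, radius=1):
    """Average each pixel's non-conformity score with its spatial neighbors."""
    h, w = score_map.shape
    padded = np.pad(score_map, radius, mode="edge")
    out = np.zeros_like(score_map)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (2 * radius + 1) ** 2

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.05):
    """cal_probs: (H, W, C) softmax map with known labels; returns boolean prediction sets
    for the test map, targeting 1 - alpha marginal coverage."""
    scores = 1.0 - np.take_along_axis(cal_probs, cal_labels[..., None], axis=-1)[..., 0]
    scores = smooth_scores(scores)                        # spatial aggregation on calibration pixels
    n = scores.size
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)  # finite-sample conformal quantile level
    q = np.quantile(scores.ravel(), level)
    return (1.0 - test_probs) <= q                        # include every class whose score passes

rng = np.random.default_rng(0)
cal = rng.dirichlet(np.ones(5), size=(16, 16))            # toy calibration softmax maps
test = rng.dirichlet(np.ones(5), size=(16, 16))
sets = conformal_sets(cal, cal.argmax(-1), test)
print(sets.shape, sets.sum(-1).mean())                    # average prediction-set size
```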
IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 9, pp. 8754-8766.
Citations: 0
VLF-SAR: A Novel Vision-Language Framework for Few-Shot SAR Target Recognition
IF 11.1 | CAS Tier 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-04-10 | DOI: 10.1109/TCSVT.2025.3558801
Nishang Xie;Tao Zhang;Lanyu Zhang;Jie Chen;Feiming Wei;Wenxian Yu
Due to the challenges of obtaining data from valuable targets, few-shot learning plays a critical role in synthetic aperture radar (SAR) target recognition. However, the high noise levels and complex backgrounds inherent in SAR data make this technology difficult to implement. To improve the recognition accuracy, in this paper, we propose a novel vision-language framework, VLF-SAR, with two specialized models: VLF-SAR-P for polarimetric SAR (PolSAR) data and VLF-SAR-T for traditional SAR data. Both models start with a frequency embedded module (FEM) to generate key structural features. For VLF-SAR-P, a polarimetric feature selector (PFS) is further introduced to identify the most relevant polarimetric features. Also, a novel adaptive multimodal triple attention mechanism (AMTAM) is designed to facilitate dynamic interactions between different kinds of features. For VLF-SAR-T, after FEM, a multimodal fusion attention mechanism (MFAM) is correspondingly proposed to fuse and adapt information extracted from frozen contrastive language-image pre-training (CLIP) encoders across different modalities. Extensive experiments on the OpenSARShip2.0, FUSAR-Ship, and SAR-AirCraft-1.0 datasets demonstrate the superiority of VLF-SAR over some state-of-the-art methods, offering a promising approach for few-shot SAR target recognition.
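One way to picture fusing features from frozen CLIP encoders across modalities is a gated cross-attention block, sketched below; the dimensions, gating design, and token counts are assumptions for illustration and do not reproduce the MFAM module itself.

```python
# Sketch of fusing frozen image-encoder tokens with frozen text-encoder tokens via
# cross-attention plus a learned gate; not the paper's MFAM design.
import torch
import torch.nn as nn

class FusionAttention(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, img_tokens, text_tokens):
        # SAR image tokens query the (frozen) CLIP text tokens
        attended, _ = self.attn(img_tokens, text_tokens, text_tokens)
        g = self.gate(torch.cat([img_tokens, attended], dim=-1))
        return g * attended + (1 - g) * img_tokens     # gated residual fusion

img = torch.randn(4, 50, 512)    # e.g., frozen CLIP ViT patch tokens (illustrative shapes)
txt = torch.randn(4, 8, 512)     # e.g., frozen CLIP text tokens for class prompts
print(FusionAttention()(img, txt).shape)
```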
IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 9, pp. 9530-9544.
Citations: 0
Prior Knowledge-Driven Hybrid Prompter Learning for RGB-Event Tracking
IF 11.1 | CAS Tier 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-04-10 | DOI: 10.1109/TCSVT.2025.3559614
Mianzhao Wang;Fan Shi;Xu Cheng;Shengyong Chen
Event data can asynchronously capture variations in light intensity, thereby implicitly providing valuable complementary cues for RGB-Event tracking. Existing methods typically employ a direct interaction mechanism to fuse RGB and event data. However, due to differences in imaging mechanisms, the representational disparity between these two data types is not fixed, which can lead to tracking failures in certain challenging scenarios. To address this issue, we propose a novel prior knowledge-driven hybrid prompter learning framework for RGB-Event tracking. Specifically, we develop a frame-event hybrid prompter that leverages prior tracking knowledge from the foundation model as intermediate modality support to mitigate the heterogeneity between RGB and event data. By leveraging its rich prior tracking knowledge, the intermediate modality reduces the gap between the dense RGB and sparse event data interactions, effectively guiding complementary learning between modalities. Meanwhile, to mitigate the internal learning disparities between the lightweight hybrid prompter and the deep transformer model, we introduce a pseudo-prompt learning strategy that lies between full fine-tuning and partial fine-tuning. This strategy adopts a divide-and-conquer approach to assign different learning rates to modules with distinct functions, effectively reducing the dominant influence of RGB information in complex scenarios. Extensive experiments conducted on two public RGB-Event tracking datasets show that the proposed HPL outperforms state-of-the-art tracking methods, achieving exceptional performance.
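The divide-and-conquer assignment of learning rates to functionally different modules can be sketched with standard optimizer parameter groups; the module names and learning-rate values below are illustrative assumptions, not the configuration used in the paper.

```python
# Sketch of per-module learning rates: a nearly frozen backbone, a fast-learning prompter,
# and a head, each in its own parameter group (values and modules are placeholders).
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1))
prompter = nn.Linear(16, 16)      # stands in for the lightweight frame-event hybrid prompter
head = nn.Linear(16, 2)           # stands in for the tracking head

optimizer = torch.optim.AdamW([
    {"params": backbone.parameters(), "lr": 1e-6},   # barely moves: close to partial fine-tuning
    {"params": prompter.parameters(), "lr": 1e-3},   # the prompter adapts quickly
    {"params": head.parameters(),     "lr": 1e-4},
], weight_decay=1e-4)

x = torch.randn(2, 3, 32, 32)
loss = head(prompter(backbone(x).flatten(1))).sum()
loss.backward()
optimizer.step()
```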
IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 9, pp. 8679-8691.
Citations: 0
Vision-Language Adaptive Clustering and Meta-Adaptation for Unsupervised Few-Shot Action Recognition
IF 11.1 | CAS Tier 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-04-09 | DOI: 10.1109/TCSVT.2025.3558785
Jiaxin Chen;Jiawen Peng;Yanzuo Lu;Jian-Huang Lai;Andy J. Ma
Unsupervised few-shot action recognition is a practical but challenging task, which adapts knowledge learned from unlabeled videos to novel action classes with only limited labeled data. Without annotated data of base action classes for meta-learning, existing approaches cannot achieve satisfactory performance due to low-quality pseudo-classes and episodes. Though vision-language pre-training models such as CLIP can be employed to improve the quality of pseudo-classes and episodes, the performance improvements may still be limited when only the visual encoder is used in the absence of textual modality information. In this paper, we propose fully exploiting the multimodal knowledge of the pre-trained vision-language model CLIP in a novel framework for unsupervised video meta-learning. The textual modality is automatically generated for each unlabeled video by a video-to-text transformer. Multimodal adaptive clustering for episodic sampling (MACES) based on a video-text ensemble distance metric is proposed to accurately estimate pseudo-classes, which constructs high-quality few-shot tasks (episodes) for episodic training. Vision-language meta-adaptation (VLMA) is designed to adapt the pre-trained model to novel tasks through category-aware vision-language contrastive learning and confidence-based reliable bidirectional knowledge distillation. The final prediction is obtained by multimodal adaptive inference. Extensive experiments on five benchmarks demonstrate the superiority of our method for unsupervised few-shot action recognition.
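A minimal sketch of clustering with a video-text ensemble distance: after L2 normalization, concatenating weighted visual and textual embeddings makes squared Euclidean distance correspond to a weighted sum of the two cosine distances, which can then feed an off-the-shelf clustering step. The weighting, the use of scikit-learn's KMeans, and the toy features are assumptions, not the exact MACES procedure.

```python
# Sketch of pseudo-class estimation from an ensemble of visual and textual distances.
import numpy as np
from sklearn.cluster import KMeans

def ensemble_features(video_emb, text_emb, w=0.5):
    """Concatenate L2-normalized video and caption embeddings, scaled so that squared
    Euclidean distance on the result equals a w-weighted sum of the two cosine distances."""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return np.concatenate([np.sqrt(w) * v, np.sqrt(1 - w) * t], axis=1)

rng = np.random.default_rng(0)
video_emb = rng.normal(size=(200, 512))   # e.g., visual features of unlabeled clips (toy data)
text_emb = rng.normal(size=(200, 512))    # e.g., features of captions from a video-to-text model
feats = ensemble_features(video_emb, text_emb)
pseudo_classes = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(feats)
print(np.bincount(pseudo_classes))        # cluster sizes used to sample N-way K-shot episodes
```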
IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 9, pp. 9246-9260.
Citations: 0
Self-BSR: Self-Supervised Image Denoising and Destriping Based on Blind-Spot Regularization
IF 11.1 | CAS Tier 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-04-09 | DOI: 10.1109/TCSVT.2025.3559214
Chao Qu;Zewei Chen;Jingyuan Zhang;Xiaoyu Chen;Jing Han
Digital images captured by unstable imaging systems often simultaneously suffer from random noise and stripe noise. Due to the complex noise distribution, denoising and destriping methods based on simple handcrafted priors may leave residual noise. Although supervised methods have achieved some progress, they rely on large-scale noisy-clean image pairs, which are challenging to obtain in practice. To address these problems, we propose a self-supervised image denoising and destriping method based on blind-spot regularization, named Self-BSR. This method transforms the overall denoising and destriping problem into a modeling task for two spatially correlated signals: image and stripe. Specifically, blind-spot regularization leverages spatial continuity learned by the improved blind-spot network to separately constrain the reconstruction of image and stripe while suppressing pixel-wise independent noise. This regularization has two advantages: first, it is adaptively formulated based on implicit network priors, without any explicit parametric modeling of image and noise; second, it enables Self-BSR to learn denoising and destriping only from noisy images. In addition, we introduce the directional feature unshuffle in Self-BSR, which extracts multi-directional information to provide discriminative features for separating image from stripe. Furthermore, the feature-resampling refinement is proposed to improve the reconstruction ability of Self-BSR by resampling pixels with high spatial correlation in the receptive field. Extensive experiments on synthetic and real-world datasets demonstrate significant advantages of the proposed method over existing methods in denoising and destriping performance. The code will be publicly available at https://github.com/Jocobqc/Self-BSR
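The blind-spot principle can be illustrated with a Noise2Void-style masked training step, shown below: a few pixels are replaced by neighbors, and the loss is computed only at those positions so that pixel-wise independent noise cannot simply be copied to the output. This generic sketch does not include the stripe branch, the directional feature unshuffle, or the feature-resampling refinement of Self-BSR.

```python
# Generic blind-spot training step: mask a few pixels, predict the full image, and take the
# loss only at the masked positions; a stand-in for the improved blind-spot network.
import torch
import torch.nn as nn

def blind_spot_step(net, noisy, num_masked=64):
    b, c, h, w = noisy.shape
    inp = noisy.clone()
    ys = torch.randint(0, h, (b, num_masked))
    xs = torch.randint(0, w, (b, num_masked))
    for i in range(b):                                  # replace masked pixels by shifted neighbors
        inp[i, :, ys[i], xs[i]] = noisy[i, :, (ys[i] + 1) % h, xs[i]]
    pred = net(inp)
    loss = 0.0
    for i in range(b):                                  # supervise only the masked positions
        loss = loss + ((pred[i, :, ys[i], xs[i]] - noisy[i, :, ys[i], xs[i]]) ** 2).mean()
    return loss / b

net = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 1, 3, padding=1))
noisy = torch.rand(2, 1, 64, 64)                        # toy noisy batch
loss = blind_spot_step(net, noisy)
loss.backward()
print(float(loss))
```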
IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 9, pp. 8666-8678.
Citations: 0