
IEEE Transactions on Circuits and Systems for Video Technology: Latest Publications

UP-Person: Unified Parameter-Efficient Transfer Learning for Text-Based Person Retrieval
IF 11.1 | CAS Tier 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-07-15 | DOI: 10.1109/TCSVT.2025.3588406
Yating Liu;Yaowei Li;Xiangyuan Lan;Wenming Yang;Zimo Liu;Qingmin Liao
Text-based Person Retrieval (TPR), a multi-modal task that aims to retrieve the target person from a pool of candidate images given a text description, has recently garnered considerable attention due to the progress of contrastive visual-language pre-trained models. Prior works leverage pre-trained CLIP to extract person visual and textual features and fully fine-tune the entire network, which has shown notable performance improvements over uni-modal pre-training models. However, fully tuning a large model is prone to overfitting and hinders generalization. In this paper, we propose a novel Unified Parameter-Efficient Transfer Learning (PETL) method for Text-based Person Retrieval (UP-Person) to thoroughly transfer the multi-modal knowledge from CLIP. Specifically, UP-Person simultaneously integrates three lightweight PETL components, Prefix, LoRA, and Adapter, where Prefix and LoRA are devised together to mine local information with task-specific information prompts, and Adapter is designed to adjust global feature representations. Additionally, two vanilla submodules are optimized to adapt to the unified architecture of TPR. First, S-Prefix is proposed to boost the attention on prefix tokens and enhance their gradient propagation, improving the flexibility and performance of the vanilla prefix. Second, L-Adapter is designed in parallel with layer normalization to adjust the overall distribution, which resolves conflicts caused by overlap and interaction among multiple submodules. Extensive experimental results demonstrate that UP-Person achieves state-of-the-art results across various person retrieval datasets, including CUHK-PEDES, ICFG-PEDES, and RSTPReid, while fine-tuning merely 4.7% of the parameters. Code is available at https://github.com/Liu-Yating/UP-Person.
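A minimal, hypothetical sketch of an adapter placed in parallel with layer normalization, in the spirit of the L-Adapter described above. The module name, bottleneck width, and the simple summation of the two branches are assumptions for illustration, not UP-Person's implementation.

```python
import torch
import torch.nn as nn

class ParallelAdapter(nn.Module):
    """Adapter branch running in parallel with LayerNorm (illustrative only)."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.down = nn.Linear(dim, bottleneck)   # down-projection
        self.up = nn.Linear(bottleneck, dim)     # up-projection
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The two branches are summed, so the adapter adjusts the overall
        # distribution without replacing the normalization path.
        return self.norm(x) + self.up(self.act(self.down(x)))

x = torch.randn(2, 16, 512)            # (batch, tokens, dim)
print(ParallelAdapter(512)(x).shape)   # torch.Size([2, 16, 512])
```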
{"title":"UP-Person: Unified Parameter-Efficient Transfer Learning for Text-Based Person Retrieval","authors":"Yating Liu;Yaowei Li;Xiangyuan Lan;Wenming Yang;Zimo Liu;Qingmin Liao","doi":"10.1109/TCSVT.2025.3588406","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3588406","url":null,"abstract":"Text-based Person Retrieval (TPR) as a multi-modal task, which aims to retrieve the target person from a pool of candidate images given a text description, has recently garnered considerable attention due to the progress of contrastive visual-language pre-trained model. Prior works leverage pre-trained CLIP to extract person visual and textual features and fully fine-tune the entire network, which have shown notable performance improvements compared to uni-modal pre-training models. However, full-tuning a large model is prone to overfitting and hinders the generalization ability. In this paper, we propose a novel <italic>U</i>nified <italic>P</i>arameter-Efficient Transfer Learning (PETL) method for Text-based <italic>Person</i> Retrieval (UP-Person) to thoroughly transfer the multi-modal knowledge from CLIP. Specifically, UP-Person simultaneously integrates three lightweight PETL components including Prefix, LoRA and Adapter, where Prefix and LoRA are devised together to mine local information with task-specific information prompts, and Adapter is designed to adjust global feature representations. Additionally, two vanilla submodules are optimized to adapt to the unified architecture of TPR. For one thing, S-Prefix is proposed to boost attention of prefix and enhance the gradient propagation of prefix tokens, which improves the flexibility and performance of the vanilla prefix. For another thing, L-Adapter is designed in parallel with layer normalization to adjust the overall distribution, which can resolve conflicts caused by overlap and interaction among multiple submodules. Extensive experimental results demonstrate that our UP-Person achieves state-of-the-art results across various person retrieval datasets, including CUHK-PEDES, ICFG-PEDES and RSTPReid while merely fine-tuning 4.7% parameters. Code is available at <uri>https://github.com/Liu-Yating/UP-Person</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 12","pages":"12874-12889"},"PeriodicalIF":11.1,"publicationDate":"2025-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674738","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
BFSTAL: Bidirectional Feature Splitting With Cross-Layer Fusion for Temporal Action Localization
IF 11.1 | CAS Tier 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-07-14 | DOI: 10.1109/TCSVT.2025.3588710
Jinglin Xu;Yaqi Zhang;Wenhao Zhou;Hongmin Liu
Temporal Action Localization (TAL) aims to identify the boundaries of actions and their corresponding categories in untrimmed videos. Most existing methods simultaneously process past and future information, neglecting the inherently sequential nature of action occurrence. This confused treatment of past and future information hinders the model’s ability to understand action procedures effectively. To address these issues, we propose Bidirectional Feature Splitting with Cross-Layer Fusion for Temporal Action Localization (BFSTAL), a new bidirectional feature-splitting approach based on Mamba for the TAL task, composed of two core parts, Decomposed Bidirectionally Hybrid (DBH) and Cross-Layer Fusion Detection (CLFD), which explicitly enhances the model’s capacity to understand action procedures, especially to localize temporal boundaries of actions. Specifically, we introduce the Decomposed Bidirectionally Hybrid (DBH) component, which splits video features at a given timestamp into forward features (past information) and backward features (future information). DBH integrates three key modules: Bidirectional Multi-Head Self-Attention (Bi-MHSA), Bidirectional State Space Model (Bi-SSM), and Bidirectional Convolution (Bi-CONV). DBH effectively captures long-range dependencies by combining state-space modeling, attention mechanisms, and convolutional networks while improving spatial-temporal awareness. Furthermore, we propose Cross-Layer Fusion Detection (CLFD), which aggregates multi-scale features from different pyramid levels, enhancing contextual understanding and temporal action localization precision. Extensive experiments demonstrate that BFSTAL outperforms other methods on four widely used TAL benchmarks: THUMOS14, EPIC-KITCHENS 100, Charades, and MultiTHUMOS.
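As a rough illustration of the past/future splitting idea above (not the DBH module itself, which relies on Bi-MHSA, Bi-SSM, and Bi-CONV), the sketch below separates a temporal feature sequence into per-timestamp past and future summaries. The cumulative-mean summaries are an assumption made only to keep the example short.

```python
import torch

def split_past_future(feats: torch.Tensor):
    """feats: (T, D) clip-level features. Returns causal (past) and
    anti-causal (future) running averages for each timestamp."""
    T, _ = feats.shape
    past = torch.cumsum(feats, dim=0) / torch.arange(1, T + 1).unsqueeze(1)
    future = torch.flip(torch.cumsum(torch.flip(feats, [0]), dim=0), [0]) \
             / torch.arange(T, 0, -1).unsqueeze(1)
    return past, future

past, future = split_past_future(torch.randn(8, 256))
print(past.shape, future.shape)  # torch.Size([8, 256]) torch.Size([8, 256])
```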
{"title":"BFSTAL: Bidirectional Feature Splitting With Cross-Layer Fusion for Temporal Action Localization","authors":"Jinglin Xu;Yaqi Zhang;Wenhao Zhou;Hongmin Liu","doi":"10.1109/TCSVT.2025.3588710","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3588710","url":null,"abstract":"Temporal Action Localization (TAL) aims to identify the boundaries of actions and their corresponding categories in untrimmed videos. Most existing methods simultaneously process past and future information, neglecting the inherently sequential nature of action occurrence. This confused treatment of past and future information hinders the model’s ability to understand action procedures effectively. To address these issues, we propose Bidirectional Feature Splitting with Cross-Layer Fusion for Temporal Action Localization (BFSTAL), a new bidirectional feature-splitting approach based on Mamba for the TAL task, composed of two core parts, Decomposed Bidirectionally Hybrid (DBH) and Cross-Layer Fusion Detection (CLFD), which explicitly enhances the model’s capacity to understand action procedures, especially to localize temporal boundaries of actions. Specifically, we introduce the Decomposed Bidirectionally Hybrid (DBH) component, which splits video features at a given timestamp into forward features (past information) and backward features (future information). DBH integrates three key modules: Bidirectional Multi-Head Self-Attention (Bi-MHSA), Bidirectional State Space Model (Bi-SSM), and Bidirectional Convolution (Bi-CONV). DBH effectively captures long-range dependencies by combining state-space modeling, attention mechanisms, and convolutional networks while improving spatial-temporal awareness. Furthermore, we propose Cross-Layer Fusion Detection (CLFD), which aggregates multi-scale features from different pyramid levels, enhancing contextual understanding and temporal action localization precision. Extensive experiments demonstrate that BFSTAL outperforms other methods on four widely used TAL benchmarks: THUMOS14, EPIC-KITCHENS 100, Charades, and MultiTHUMOS.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 12","pages":"12707-12718"},"PeriodicalIF":11.1,"publicationDate":"2025-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674788","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Generalized Visual Relation Detection With Diffusion Models
IF 11.1 | CAS Tier 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-07-14 | DOI: 10.1109/TCSVT.2025.3588357
Kaifeng Gao;Siqi Chen;Hanwang Zhang;Jun Xiao;Yueting Zhuang;Qianru Sun
Visual relation detection (VRD) aims to identify relationships (or interactions) between object pairs in an image. Although recent VRD models have achieved impressive performance, they are all restricted to pre-defined relation categories, while failing to consider the semantic ambiguity characteristic of visual relations. Unlike objects, the appearance of visual relations is always subtle and can be described by multiple predicate words from different perspectives, e.g., “ride” can be depicted as “race” and “sit on”, from the sports and spatial position views, respectively. To this end, we propose to model visual relations as continuous embeddings, and design diffusion models to achieve generalized VRD in a conditional generative manner, termed Diff-VRD. We model the diffusion process in a latent space and generate all possible relations in the image as an embedding sequence. During the generation, the visual and text embeddings of subject-object pairs serve as conditional signals and are injected via cross-attention. After the generation, we design a subsequent matching stage to assign the relation words to subject-object pairs by considering their semantic similarities. Benefiting from the diffusion-based generative process, our Diff-VRD is able to generate visual relations beyond the pre-defined category labels of datasets. To properly evaluate this generalized VRD task, we introduce two evaluation metrics, i.e., text-to-image retrieval and SPICE PR Curve inspired by image captioning. Extensive experiments in both human-object interaction (HOI) detection and scene graph generation (SGG) benchmarks attest to the superiority and effectiveness of Diff-VRD.
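The post-generation matching stage can be pictured with the following hedged sketch, which assigns each generated relation embedding to its most semantically similar subject-object pair. The cosine-similarity criterion and the tensor shapes are assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def match_relations(rel_embs: torch.Tensor, pair_embs: torch.Tensor) -> torch.Tensor:
    """rel_embs: (R, D) generated relation embeddings.
    pair_embs: (P, D) embeddings of candidate subject-object pairs.
    Returns, for each relation, the index of the best-matching pair."""
    sim = F.normalize(rel_embs, dim=-1) @ F.normalize(pair_embs, dim=-1).T  # (R, P)
    return sim.argmax(dim=-1)

assign = match_relations(torch.randn(5, 128), torch.randn(3, 128))
print(assign)  # e.g. tensor([2, 0, 1, 1, 2])
```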
{"title":"Generalized Visual Relation Detection With Diffusion Models","authors":"Kaifeng Gao;Siqi Chen;Hanwang Zhang;Jun Xiao;Yueting Zhuang;Qianru Sun","doi":"10.1109/TCSVT.2025.3588357","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3588357","url":null,"abstract":"Visual relation detection (VRD) aims to identify relationships (or interactions) between object pairs in an image. Although recent VRD models have achieved impressive performance, they are all restricted to pre-defined relation categories, while failing to consider the e.gsemantic ambiguity characteristic of visual relations. Unlike objects, the appearance of visual relations is always subtle and can be described by multiple predicate words from different perspectives, e.g., “ride” can be depicted as “race” and “sit on”, from the sports and spatial position views, respectively. To this end, we propose to model visual relations as continuous embeddings, and design diffusion models to achieve generalized VRD in a conditional generative manner, termed Diff-VRD. We model the diffusion process in a latent space and generate all possible relations in the image as an embedding sequence. During the generation, the visual and text embeddings of subject-object pairs serve as conditional signals and are injected via cross-attention. After the generation, we design a subsequent matching stage to assign the relation words to subject-object pairs by considering their semantic similarities. Benefiting from the diffusion-based generative process, our Diff-VRD is able to generate visual relations beyond the pre-defined category labels of datasets. To properly evaluate this generalized VRD task, we introduce two evaluation metrics, e.gi.e., text-to-image retrieval and SPICE PR Curve inspired by image captioning. Extensive experiments in both human-object interaction (HOI) detection and scene graph generation (SGG) benchmarks attest to the superiority and effectiveness of Diff-VRD.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 1","pages":"1203-1215"},"PeriodicalIF":11.1,"publicationDate":"2025-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146049277","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Generative Augmentation Hashing for Few-Shot Cross-Modal Retrieval
IF 11.1 | CAS Tier 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-07-14 | DOI: 10.1109/TCSVT.2025.3588769
Fengling Li;Zequn Wang;Tianshi Wang;Lei Zhu;Xiaojun Chang
Deep cross-modal hashing has demonstrated strong performance in large-scale retrieval but remains challenging in few-shot scenarios due to limited data and weak cross-modal alignment. We propose Generative Augmentation Hashing (GAH), a new framework that synergizes Visual-Language Models (VLMs) and generation-driven hashing to address these limitations. GAH first introduces a cycle generative augmentation mechanism: VLMs generate descriptive textual captions for images, which, combined with label semantics, guide diffusion models to synthesize semantically aligned images via inconsistency filtering. These images then regenerate coherent textual descriptions through VLMs, forming a self-reinforcing cycle that iteratively expands cross-modal data. To resolve the diversity-alignment trade-off in augmentation, we design cross-modal perturbation enhancement, injecting synchronized perturbations with controlled noise to preserve inter-modal semantic relationships while enhancing robustness. Finally, GAH employs dual-level adversarial hash learning, where adversarial alignment of modality-specific and shared latent spaces optimizes both cross-modal consistency and discriminative hash code generation, effectively bridging heterogeneous gaps. Extensive experiments on benchmark datasets show that GAH outperforms state-of-the-art methods in few-shot cross-modal retrieval, achieving significant improvements in retrieval accuracy. Our source codes and datasets are available at https://github.com/xiaolaohuuu/GAH
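The cross-modal perturbation enhancement can be illustrated with a toy sketch that injects one shared, scaled noise draw into both modalities of a pair, so the relative geometry of the pair is preserved. The noise scale and the simple additive scheme are assumptions for illustration only.

```python
import torch

def perturb_pair(img_emb: torch.Tensor, txt_emb: torch.Tensor, sigma: float = 0.05):
    """Inject the same controlled noise into both modalities of a pair."""
    noise = torch.randn_like(img_emb) * sigma   # one noise draw...
    return img_emb + noise, txt_emb + noise     # ...applied synchronously to both modalities

img, txt = torch.randn(4, 256), torch.randn(4, 256)
img_p, txt_p = perturb_pair(img, txt)
print((img_p - img).allclose(txt_p - txt))  # True: identical perturbation per pair
```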
{"title":"Generative Augmentation Hashing for Few-Shot Cross-Modal Retrieval","authors":"Fengling Li;Zequn Wang;Tianshi Wang;Lei Zhu;Xiaojun Chang","doi":"10.1109/TCSVT.2025.3588769","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3588769","url":null,"abstract":"Deep cross-modal hashing has demonstrated strong performance in large-scale retrieval but remains challenging in few-shot scenarios due to limited data and weak cross-modal alignment. We propose Generative Augmentation Hashing (GAH), a new framework that synergizes Visual-Language Models (VLMs) and generation-driven hashing to address these limitations. GAH first introduces a cycle generative augmentation mechanism: VLMs generate descriptive textual captions for images, which, combined with label semantics, guide diffusion models to synthesize semantically aligned images via inconsistency filtering. These images then regenerate coherent textual descriptions through VLMs, forming a self-reinforcing cycle that iteratively expands cross-modal data. To resolve the diversity-alignment trade-off in augmentation, we design cross-modal perturbation enhancement, injecting synchronized perturbations with controlled noise to preserve inter-modal semantic relationships while enhancing robustness. Finally, GAH employs dual-level adversarial hash learning, where adversarial alignment of modality-specific and shared latent spaces optimizes both cross-modal consistency and discriminative hash code generation, effectively bridging heterogeneous gaps. Extensive experiments on benchmark datasets show that GAH outperforms state-of-the-art methods in few-shot cross-modal retrieval, achieving significant improvements in retrieval accuracy. Our source codes and datasets are available at <uri>https://github.com/xiaolaohuuu/GAH</uri>","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 12","pages":"12861-12873"},"PeriodicalIF":11.1,"publicationDate":"2025-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145729339","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Hyperspectral Tracker With Constrained Object Adaptive Learning and Trajectory Construction
IF 11.1 | CAS Tier 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-07-11 | DOI: 10.1109/TCSVT.2025.3588230
Ye Wang;Mingyang Ma;Ge Zhang;Yuheng Liu;Tao Gao;Shaohui Mei
Hyperspectral imaging offers significant potential for precise object tracking, yet the scarcity of dataset volumes specifically tailored for hyperspectral tracking algorithms hinders progress, particularly for deep models with complex structures. Additionally, current deep learning-based hyperspectral trackers typically enhance model accuracy via online or adversarial learning, adversely affecting tracking speed. To address these challenges, this paper introduces the Constrained Object Adaptive Learning hyperspectral Tracker (COALT), an effective parameter-efficient fine-tuning tracker tailored for hyperspectral tracking. COALT integrates Pixel-level Object Constrained Spectral Prompt (POCSP) and Temporal Sequence Trajectory Prompt (TSTP) through Adaptive Learning with Parameter-efficient Fine-tuning (ALPEFT), enabling a transformer-based tracker to capture detailed spectral features and relationships in hyperspectral image sequences through trainable rank decomposition matrices. Specifically, POCSP is designed to retain optimal spectral information with low internal correlation and high object representativeness, enabling rapid image reconstruction. Then, the most representative spectral template and search are fused into a single stream as spectral prompts for the Encoder and Decoder layers. Concurrently, the previous coordinates within the same sequence are tokenized and utilized as temporal prompts by TSTP in the decoder layers. The model is trained with ALPEFT to optimize spectral information learning, which substantially reduces the number of training parameters, alleviating overfitting issues arising from limited data. Meanwhile, the proposed tracker not only retains the ability of the pre-trained model to estimate object trajectories in an autoregressive manner but also effectively utilizes spectral information and enhances target location perception during the fine-tuning process. Extensive experiments and evaluations are conducted on two public hyperspectral tracking datasets. The results demonstrate that the proposed COALT tracker achieves satisfactory performance with leading processing speed. The code will be available at https://github.com/PING-CHUANG/COALT
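The "trainable rank decomposition matrices" used for parameter-efficient fine-tuning can be sketched as a frozen linear layer plus a low-rank update, in the style of LoRA. The rank, scaling, and initialization below are assumptions, and this is not COALT's code.

```python
import torch
import torch.nn as nn

class LowRankAdaptedLinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # backbone stays frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LowRankAdaptedLinear(nn.Linear(512, 512))
print(layer(torch.randn(2, 512)).shape)  # torch.Size([2, 512])
```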
{"title":"Hyperspectral Tracker With Constrained Object Adaptive Learning and Trajectory Construction","authors":"Ye Wang;Mingyang Ma;Ge Zhang;Yuheng Liu;Tao Gao;Shaohui Mei","doi":"10.1109/TCSVT.2025.3588230","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3588230","url":null,"abstract":"Hyperspectral imaging offers significant potential for precise object tracking, yet the scarcity of dataset volumes specifically tailored for hyperspectral tracking algorithms hinders progress, particularly for deep models with complex structures. Additionally, current deep learning-based hyperspectral trackers typically enhance model accuracy via online or adversarial learning, adversely affecting tracking speed. To address these challenges, this paper introduces the Constrained Object Adaptive Learning hyperspectral Tracker (COALT), an effective parameter-efficient fine-tuning tracker tailored for hyperspectral tracking. COALT integrates Pixel-level Object Constrained Spectral Prompt (POCSP) and Temporal Sequence Trajectory Prompt (TSTP) through Adaptive Learning with Parameter-efficient Fine-tuning (ALPEFT), enabling a transformer-based tracker to capture detailed spectral features and relationships in hyperspectral image sequences through trainable rank decomposition matrices. Specifically, POCSP is designed to retain optimal spectral information with low internal correlation and high object representativeness, enabling rapid image reconstruction. Then, the most representative spectral template and search are fused into a single stream as spectral prompts for the Encoder and Decoder layers. Concurrently, the previous coordinates within the same sequence are tokenized and utilized as temporal prompts by TSTP in the decoder layers. The model is trained with ALPEFT to optimize spectral information learning, which substantially reduces the number of training parameters, alleviating overfitting issues arising from limited data. Meanwhile, the proposed tracker not only retains the ability of pre-trained model to estimate object trajectories in an autoregressive manner but also effectively utilizes spectral information and enhances target location perception during the fine-tuning process. Extensive experiments and evaluations are conducted on two public hyperspectral tracking datasets. The results demonstrate that the proposed COALT tracker achieves satisfactory performance with leading processing speed. The code will be available at <uri>https://github.com/ PING-CHUANG/COALT</uri>","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 12","pages":"12666-12679"},"PeriodicalIF":11.1,"publicationDate":"2025-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Dual Prototypes-Based Personalized Federated Adversarial Cross-Modal Hashing
IF 11.1 | CAS Tier 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-07-11 | DOI: 10.1109/TCSVT.2025.3588161
Lingchen Gu;Xiaojuan Shen;Jiande Sun;Yan Liu;Jing Li;Zhihui Li;Sen-Ching S. Cheung;Wenbo Wan
With the rapid advances in wireless communication and IoT platforms, it is increasingly difficult to analyze relevant multi-modal data distributed across geographically diverse and heterogeneous platforms. One promising approach is to rely on federated learning to build compact cross-modal hash codes. However, existing federated learning methods readily exhibit degraded global-model performance because the distributed data are drawn from diverse domains. In addition, directly forcing each client to adopt the same global parameters as local parameters, without effective local training, significantly reduces the performance of each client. To overcome these challenges, we propose a novel federated adversarial cross-modal hashing method, Dual Prototypes-based personalized Federated Adversarial (DP-FeAd), which provides iterative training of shared dual prototypes. Specifically, aiming to expand local hashing models beyond their knowledge realms, DP-FeAd enables participating clients to engage in cooperative learning through two constructions, cluster prototypes and unbiased prototypes, instead of the traditional global prototypes, ensuring both generalization and stability. In particular, the cluster prototypes are derived from local class-level prototypes and adversarially trained with local approximate hash codes to align their distributions. The unbiased prototypes are averaged from cluster prototypes and integrated into the training of local hashing models to further maintain consistency across different local class-level prototypes. The experiments conducted on two benchmark datasets demonstrate that our proposed method significantly enhances the performance of deep cross-modal hashing models in both IID (Independent and Identically Distributed) and non-IID scenarios.
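A minimal sketch of the prototype aggregation step is given below: per-client class-level prototypes are averaged per class into shared prototypes. The adversarial training of cluster prototypes is omitted, and the tensor shapes are illustrative assumptions rather than DP-FeAd's actual pipeline.

```python
import torch

def aggregate_prototypes(client_protos: list) -> torch.Tensor:
    """client_protos: list of (C, D) tensors, one per client (C classes, D dims).
    Returns (C, D) prototypes averaged per class across clients."""
    stacked = torch.stack(client_protos, dim=0)   # (num_clients, C, D)
    return stacked.mean(dim=0)                    # class-wise average across clients

protos = [torch.randn(10, 64) for _ in range(4)]  # 4 clients, 10 classes
print(aggregate_prototypes(protos).shape)         # torch.Size([10, 64])
```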
{"title":"Dual Prototypes-Based Personalized Federated Adversarial Cross-Modal Hashing","authors":"Lingchen Gu;Xiaojuan Shen;Jiande Sun;Yan Liu;Jing Li;Zhihui Li;Sen-Ching S. Cheung;Wenbo Wan","doi":"10.1109/TCSVT.2025.3588161","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3588161","url":null,"abstract":"With the rapid advances in wireless communication and IoT platforms, it is increasingly difficult to analyze relevant multi-modal data distributed across geographically diverse and heterogeneous platforms. One promising approach is to rely on federated learning to build compact cross-modal hash codes. However, existing federated learning methods easily exhibit degenerative performance in the global model due to the distributed data being derived from diverse domains. In addition, directly forcing each client to adopt the same global parameters as local parameters, without effective local training, significantly reduces the performance of each client. To overcome these challenges, we propose a novel federated adversarial cross-modal hashing, called Dual Prototypes-based personalized Federated Adversarial (DP-FeAd), which provides iterated training of shared dual prototypes. Specifically, aiming to expand local hashing models beyond their knowledge realms, DP-FeAd enables participating clients to engage in cooperative learning through two constructions: cluster prototypes and unbiased prototypes, instead of the traditional global prototypes, ensuring both generalization and stability. Specifically, the cluster prototypes are derived from local class-level prototypes and adversarially trained with local approximate hash codes to align their distributions. The unbiased prototypes are averaged from cluster prototypes and integrated into the training of local hashing models to maintain consistency across different local class-level prototypes further. The experiments conducted on two benchmark datasets demonstrate that our proposed method significantly enhances the performance of deep cross-modal hashing models in both IID (Independent and Identically Distributed) and non-IID scenarios.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 12","pages":"12846-12860"},"PeriodicalIF":11.1,"publicationDate":"2025-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145729378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
LFSSMam: Efficient Aggregation of Multi-Spatial-Angular-Modal Information Using Selective SSM for Light Field Semantic Segmentation
IF 11.1 | CAS Tier 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-07-11 | DOI: 10.1109/TCSVT.2025.3588269
Wenbin Yan;Hua Chen;Qingwei Wu;Xiaogang Zhang;Qiu Fang;Shengjie Hu;Yaonan Wang
Efficiently aggregating 4D light field information for accurate semantic segmentation has long faced two challenges: CNN-based methods struggle to capture long-range dependencies, while Transformer-based methods are constrained by the memory cost of quadratic computational complexity. Recently, the Mamba architecture, which utilizes the state space model (SSM), has achieved high performance with linear complexity in various vision tasks. However, directly applying Mamba to 4D light field scanning leads to an inherent loss of multi-spatial-angular information. To address these challenges, we introduce LFSSMam, a novel Light Field Semantic Segmentation architecture based on the selective state space model (Mamba). First, LFSSMam presents an innovative spatial-angular selective scanning mechanism to decouple and scan 4D multi-dimensional light field data. It separately captures the rich spatial context and the complementary angular and structural information of light field 2D slices within the state space. In addition, we design an SSM-attention Cross-Fusion Enhance Module to perform preferential scanning and fusion across multi-spatial-angular-modal light field information, adaptively aggregating and enhancing the central view features. Comprehensive experiments on synthetic and real-world datasets demonstrate that LFSSMam achieves leading state-of-the-art (SOTA) performance (a 6.97% improvement over LF-based methods) while reducing memory and computational complexity. This work provides valuable guidance for the efficient modeling and application of multi-spatial-angular information in light field semantic segmentation. Our code is available at https://github.com/HNU-WQW/LFSSMam
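The spatial-angular decoupling can be pictured as reshaping a 4D light field tensor into two token sequences: one scanned per view over spatial positions and one scanned per pixel over angular views. The layout below is an assumption for illustration and does not reproduce LFSSMam's exact scanning order.

```python
import torch

def spatial_angular_sequences(lf: torch.Tensor):
    """lf: (U, V, H, W, C) light field with a U x V grid of angular views."""
    U, V, H, W, C = lf.shape
    spatial_seq = lf.reshape(U * V, H * W, C)                          # per-view spatial scan
    angular_seq = lf.permute(2, 3, 0, 1, 4).reshape(H * W, U * V, C)   # per-pixel angular scan
    return spatial_seq, angular_seq

lf = torch.randn(5, 5, 32, 32, 16)
s, a = spatial_angular_sequences(lf)
print(s.shape, a.shape)  # torch.Size([25, 1024, 16]) torch.Size([1024, 25, 16])
```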
{"title":"LFSSMam: Efficient Aggregation of Multi-Spatial-Angular-Modal Information Using Selective SSM for Light Field Semantic Segmentation","authors":"Wenbin Yan;Hua Chen;Qingwei Wu;Xiaogang Zhang;Qiu Fang;Shengjie Hu;Yaonan Wang","doi":"10.1109/TCSVT.2025.3588269","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3588269","url":null,"abstract":"Efficiently aggregating 4D light field information to achieve accurate semantic segmentation has always faced challenges in capturing long range dependency information (CNN-based) and the memory limitations of quadratic computational complexity (Transformer-based). Recently, the Mamba architecture, which utilizes the state space model (SSM), has achieved high performance under linear complexity in various vision tasks. However, directly applying Mamba to 4D light field scanning will lead to an inherent loss of multi-spatial-angular information. To address the above challenges, we introduce LFSSMam, a novel Light Field Semantic Segmentation architecture based on the selective state space model (Mamba). Firstly, LFSSMam presents an innovative spatial-angular selective scanning mechanism to decouple and scan 4D multi-dimensional light field data. It separately captures the rich spatial context, complementary angular and structural information of light field 2D slices within the state space. In addition, we design an SSM-attention Cross-Fusion Enhance Module to perform preferential scanning and fusion across multi-spatial-angular-modal light field information, adaptively aggregating and enhancing the central view features. Comprehensive experiments on synthetic and real world datasets demonstrate that LFSSMam achieves leading edge SOTA (State-Of-The-Art) performance (with a 6.97% improvement to LF-based methods) while reducing memory and computational complexity. This work provides valuable guidance for the efficient modeling and application of multi-spatial-angular information in light field semantic segmentation. Our code is available at <uri>https://github.com/HNU-WQW/LFSSMam</uri>","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 12","pages":"12592-12606"},"PeriodicalIF":11.1,"publicationDate":"2025-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674853","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Variable-Size Symmetry-Based Graph Fourier Transforms for Image Compression
IF 11.1 | CAS Tier 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-07-10 | DOI: 10.1109/TCSVT.2025.3587753
Alessandro Gnutti;Fabrizio Guerrini;Riccardo Leonardi;Antonio Ortega
Modern compression systems use linear transformations in their encoding and decoding processes, with transforms providing compact signal representations. While multiple data-dependent transforms for image/video coding can adapt to diverse statistical characteristics, assembling large datasets to learn each transform is challenging. Also, the resulting transforms typically lack fast implementation, leading to significant computational costs. Thus, despite many papers proposing new transform families, the most recent compression standards predominantly use traditional separable sinusoidal transforms. This paper proposes integrating a new family of Symmetry-based Graph Fourier Transforms (SBGFTs) of variable sizes into a coding framework, focusing on the extension from our previously introduced 8×8 SBGFTs to the general case of N×N grids. SBGFTs are non-separable transforms that achieve sparse signal representation while maintaining low computational complexity thanks to their symmetry properties. Their design is based on our proposed algorithm, which generates symmetric graphs on the grid by adding specific symmetrical connections between nodes and does not require any data-dependent adaptation. Furthermore, for video intra-frame coding, we exploit the correlations between optimal graphs and prediction modes to reduce the cardinality of the transform sets, thus proposing a low-complexity framework. Experiments show that SBGFTs outperform the primary transforms integrated in the explicit Multiple Transform Selection (MTS) used in the latest VVC intra-coding, providing a bit rate saving percentage of 6.23%, with only a marginal increase in average complexity. A MATLAB implementation of the proposed algorithm is available online at https://github.com/AlessandroGnutti/Variable-SBGFTs.
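For background, a graph Fourier transform is obtained from the eigenvectors of a graph Laplacian. The sketch below builds a plain 4x4 grid graph and its GFT basis with NumPy; the symmetry-based edge additions that define SBGFTs are not reproduced here, so this is only the generic construction the paper builds on.

```python
import numpy as np

def grid_gft_basis(n: int = 4) -> np.ndarray:
    """Return the GFT basis (Laplacian eigenvectors) of an n x n grid graph."""
    N = n * n
    A = np.zeros((N, N))
    for r in range(n):
        for c in range(n):
            i = r * n + c
            if c + 1 < n:               # horizontal neighbor
                A[i, i + 1] = A[i + 1, i] = 1.0
            if r + 1 < n:               # vertical neighbor
                A[i, i + n] = A[i + n, i] = 1.0
    L = np.diag(A.sum(axis=1)) - A      # combinatorial Laplacian
    _, U = np.linalg.eigh(L)            # columns of U form the GFT basis
    return U

U = grid_gft_basis()
block = np.random.rand(16)
coeffs = U.T @ block                    # forward GFT of a vectorized 4x4 block
print(coeffs.shape)                     # (16,)
```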
{"title":"Variable-Size Symmetry-Based Graph Fourier Transforms for Image Compression","authors":"Alessandro Gnutti;Fabrizio Guerrini;Riccardo Leonardi;Antonio Ortega","doi":"10.1109/TCSVT.2025.3587753","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3587753","url":null,"abstract":"Modern compression systems use linear transformations in their encoding and decoding processes, with transforms providing compact signal representations. While multiple data-dependent transforms for image/video coding can adapt to diverse statistical characteristics, assembling large datasets to learn each transform is challenging. Also, the resulting transforms typically lack fast implementation, leading to significant computational costs. Thus, despite many papers proposing new transform families, the most recent compression standards predominantly use traditional separable sinusoidal transforms. This paper proposes integrating a new family of Symmetry-based Graph Fourier Transforms (SBGFTs) of variable sizes into a coding framework, focusing on the extension from our previously introduced <inline-formula> <tex-math>$8times 8$ </tex-math></inline-formula> SBGFTs to the general case of NxN grids. SBGFTs are non-separable transforms that achieve sparse signal representation while maintaining low computational complexity thanks to their symmetry properties. Their design is based on our proposed algorithm, which generates symmetric graphs on the grid by adding specific symmetrical connections between nodes and does not require any data-dependent adaptation. Furthermore, for video intra-frame coding, we exploit the correlations between optimal graphs and prediction modes to reduce the cardinality of the transform sets, thus proposing a low-complexity framework. Experiments show that SBGFTs outperform the primary transforms integrated in the explicit Multiple Transform Selection (MTS) used in the latest VVC intra-coding, providing a bit rate saving percentage of <inline-formula> <tex-math>$mathbf {6.23%}$ </tex-math></inline-formula>, with only a marginal increase in average complexity. <italic>A</i> MATLAB <italic>implementation of the proposed algorithm is available online at</i> <uri>https://github.com/AlessandroGnutti/Variable-SBGFTs</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 12","pages":"12772-12787"},"PeriodicalIF":11.1,"publicationDate":"2025-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674753","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
No-Reference Image Quality Assessment: Exploring Intrinsic Distortion Characteristics via Generative Noise Estimation With Mamba
IF 11.1 | CAS Tier 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-07-08 | DOI: 10.1109/TCSVT.2025.3586106
Xuting Lan;Weizhi Xian;Mingliang Zhou;Jielu Yan;Xuekai Wei;Jun Luo;Weijia Jia;Sam Kwong
In the field of no-reference image quality assessment (NR-IQA), the visual masking effect has long been a challenging issue. Although existing methods attempt to alleviate the interference caused by masking by generating pseudoreference images, the quality of these images is often constrained by the accuracy and reconstruction capabilities of image restoration algorithms. This can introduce additional biases, thereby affecting the reliability of the evaluation results. To address this problem, we propose a novel generative “noise” estimation framework (GNE-Vim) that eliminates the need for pseudoreference images. Instead, it deeply decouples the distortion components from degraded images and performs quality-aware modelling of these components. During the training phase, the model leverages both reference images and distortion components to guide the learning of the true distortion distribution. In the inference phase, quality prediction is conducted directly on the basis of the decoupled distortion components, making the evaluation results more aligned with human subjective perception. The experimental results demonstrate that the proposed method achieves strong performance across datasets containing various types of distortions. The source code is publicly available at the following website: https://github.com/opencodelxt/GNE-Vim
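As a deliberately simplified stand-in for the idea of regressing quality from a decoupled distortion component, the sketch below treats that component as the residual between a degraded image and its reference during training and predicts quality from the residual alone. GNE-Vim learns this decoupling generatively, so every module, shape, and the residual definition here are assumptions.

```python
import torch
import torch.nn as nn

class ResidualQualityHead(nn.Module):
    """Toy regressor that scores quality from a distortion residual only."""
    def __init__(self):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(3, 1)   # toy regressor over channel-wise statistics

    def forward(self, distorted: torch.Tensor, reference: torch.Tensor) -> torch.Tensor:
        residual = distorted - reference              # stand-in distortion component
        feats = self.pool(residual.abs()).flatten(1)  # (B, 3) channel-wise energy
        return self.fc(feats)                         # predicted quality score

d, r = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
print(ResidualQualityHead()(d, r).shape)  # torch.Size([2, 1])
```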
{"title":"No-Reference Image Quality Assessment: Exploring Intrinsic Distortion Characteristics via Generative Noise Estimation With Mamba","authors":"Xuting Lan;Weizhi Xian;Mingliang Zhou;Jielu Yan;Xuekai Wei;Jun Luo;Weijia Jia;Sam Kwong","doi":"10.1109/TCSVT.2025.3586106","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3586106","url":null,"abstract":"In the field of no-reference image quality assessment (NR-IQA), the visual masking effect has long been a challenging issue. Although existing methods attempt to alleviate the interference caused by masking by generating pseudoreference images, the quality of these images is often constrained by the accuracy and reconstruction capabilities of image restoration algorithms. This can introduce additional biases, thereby affecting the reliability of the evaluation results. To address this problem, we propose a novel generative “noise” estimation framework (GNE-Vim) that eliminates the need for pseudoreference images. Instead, it deeply decouples the distortion components from degraded images and performs quality-aware modelling of these components. During the training phase, the model leverages both reference images and distortion components to guide the learning of the true distortion distribution. In the inference phase, quality prediction is conducted directly on the basis of the decoupled distortion components, making the evaluation results more aligned with human subjective perception. The experimental results demonstrate that the proposed method achieves strong performance across datasets containing various types of distortions. The source code is publicly available at the following website: <uri>https://github.com/opencodelxt/GNE-Vim</uri>","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 12","pages":"12692-12706"},"PeriodicalIF":11.1,"publicationDate":"2025-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674860","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
3DPortraitGAN: Learning One-Quarter Headshot 3D GANs From a Single-View Portrait Dataset With Diverse Body Poses
IF 11.1 | CAS Tier 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-07-07 | DOI: 10.1109/TCSVT.2025.3586442
Yiqian Wu;Hao Xu;Xiangjun Tang;Yue Shangguan;Hongbo Fu;Xiaogang Jin
3D-aware face generators are typically trained on 2D real-life face image datasets that primarily consist of near-frontal face data. Due to data limitations, these generators cannot generate one-quarter headshot 3D portraits with head, neck, and shoulder geometry, which is crucial for applications like talking heads. Two reasons account for this issue: First, existing facial recognition methods struggle with extracting facial data captured from large camera angles or back views. Second, it is challenging to learn a distribution of 3D portraits covering the one-quarter headshot region from single-view data due to significant geometric deformation caused by diverse body poses. To this end, we first create the dataset 360°-Portrait-HQ (360°PHQ for short) which consists of high-quality single-view real portraits annotated with a variety of camera parameters (the yaw angles span the entire 360° range) and body poses. We then propose 3DPortraitGAN, the first 3D-aware one-quarter headshot portrait generator that learns a canonical 3D avatar distribution from the 360°PHQ dataset with body pose self-learning. Our model can generate view-consistent portrait images from all camera angles with a canonical one-quarter headshot 3D representation. Our experiments show that the proposed framework can accurately predict portrait body poses and generate view-consistent, realistic portrait images with complete geometry from all camera angles.
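The claim that yaw angles span the entire 360° range can be illustrated with a toy camera-pose sampler. The radius, pitch range, and look-at-origin convention below are assumptions made for illustration, not the dataset's actual annotation pipeline.

```python
import numpy as np

def sample_camera(radius: float = 2.7, pitch_deg_range=(-30.0, 30.0)) -> np.ndarray:
    """Sample a camera position on a sphere around the subject, covering all yaws."""
    yaw = np.random.uniform(0.0, 360.0)          # full 360-degree yaw coverage
    pitch = np.random.uniform(*pitch_deg_range)
    yaw_r, pitch_r = np.deg2rad(yaw), np.deg2rad(pitch)
    # Camera position on a sphere, assumed to look at the origin.
    return radius * np.array([
        np.cos(pitch_r) * np.sin(yaw_r),
        np.sin(pitch_r),
        np.cos(pitch_r) * np.cos(yaw_r),
    ])

print(sample_camera())  # e.g. array([ 1.93, -0.41,  1.84])
```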
{"title":"3DPortraitGAN: Learning One-Quarter Headshot 3D GANs From a Single-View Portrait Dataset With Diverse Body Poses","authors":"Yiqian Wu;Hao Xu;Xiangjun Tang;Yue Shangguan;Hongbo Fu;Xiaogang Jin","doi":"10.1109/TCSVT.2025.3586442","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3586442","url":null,"abstract":"3D-aware face generators are typically trained on 2D real-life face image datasets that primarily consist of near-frontal face data. Due to data limitations, these generators cannot generate <italic>one-quarter headshot</i> 3D portraits with head, neck, and shoulder geometry, which is crucial for applications like talking heads. Two reasons account for this issue: First, existing facial recognition methods struggle with extracting facial data captured from large camera angles or back views. Second, it is challenging to learn a distribution of 3D portraits covering the one-quarter headshot region from single-view data due to significant geometric deformation caused by diverse body poses. To this end, we first create the dataset <inline-formula> <tex-math>$it {360}^{circ }$ </tex-math></inline-formula>-<italic>Portrait</i>-<italic>HQ</i> (<inline-formula> <tex-math>$it {360}^{circ }$ </tex-math></inline-formula><italic>PHQ</i> for short) which consists of high-quality single-view real portraits annotated with a variety of camera parameters (the yaw angles span the entire 360° range) and body poses. We then propose <italic>3DPortraitGAN</i>, the first 3D-aware one-quarter headshot portrait generator that learns a canonical 3D avatar distribution from the <inline-formula> <tex-math>$it {360}^{circ }$ </tex-math></inline-formula><italic>PHQ</i> dataset with body pose self-learning. Our model can generate view-consistent portrait images from all camera angles with a canonical one-quarter headshot 3D representation. Our experiments show that the proposed framework can accurately predict portrait body poses and generate view-consistent, realistic portrait images with complete geometry from all camera angles.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 12","pages":"12760-12771"},"PeriodicalIF":11.1,"publicationDate":"2025-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674678","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0