Text-based Person Retrieval (TPR) is a multi-modal task that aims to retrieve the target person from a pool of candidate images given a text description; it has recently garnered considerable attention due to the progress of contrastive vision-language pre-trained models. Prior works leverage pre-trained CLIP to extract visual and textual person features and fully fine-tune the entire network, which has shown notable performance improvements over uni-modal pre-training models. However, fully fine-tuning a large model is prone to overfitting and hinders generalization. In this paper, we propose a novel Unified Parameter-Efficient Transfer Learning (PETL) method for Text-based Person Retrieval (UP-Person) to thoroughly transfer the multi-modal knowledge from CLIP. Specifically, UP-Person simultaneously integrates three lightweight PETL components: Prefix, LoRA, and Adapter, where Prefix and LoRA are devised together to mine local information with task-specific prompts, and Adapter is designed to adjust global feature representations. Additionally, two vanilla submodules are optimized to adapt to the unified TPR architecture. First, S-Prefix is proposed to boost the attention on prefix tokens and enhance their gradient propagation, which improves the flexibility and performance of the vanilla prefix. Second, L-Adapter is designed in parallel with layer normalization to adjust the overall distribution, which resolves conflicts caused by overlap and interaction among multiple submodules. Extensive experimental results demonstrate that UP-Person achieves state-of-the-art results across various person retrieval datasets, including CUHK-PEDES, ICFG-PEDES and RSTPReid, while fine-tuning merely 4.7% of the parameters. Code is available at https://github.com/Liu-Yating/UP-Person.
{"title":"UP-Person: Unified Parameter-Efficient Transfer Learning for Text-Based Person Retrieval","authors":"Yating Liu;Yaowei Li;Xiangyuan Lan;Wenming Yang;Zimo Liu;Qingmin Liao","doi":"10.1109/TCSVT.2025.3588406","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3588406","url":null,"abstract":"Text-based Person Retrieval (TPR) as a multi-modal task, which aims to retrieve the target person from a pool of candidate images given a text description, has recently garnered considerable attention due to the progress of contrastive visual-language pre-trained model. Prior works leverage pre-trained CLIP to extract person visual and textual features and fully fine-tune the entire network, which have shown notable performance improvements compared to uni-modal pre-training models. However, full-tuning a large model is prone to overfitting and hinders the generalization ability. In this paper, we propose a novel <italic>U</i>nified <italic>P</i>arameter-Efficient Transfer Learning (PETL) method for Text-based <italic>Person</i> Retrieval (UP-Person) to thoroughly transfer the multi-modal knowledge from CLIP. Specifically, UP-Person simultaneously integrates three lightweight PETL components including Prefix, LoRA and Adapter, where Prefix and LoRA are devised together to mine local information with task-specific information prompts, and Adapter is designed to adjust global feature representations. Additionally, two vanilla submodules are optimized to adapt to the unified architecture of TPR. For one thing, S-Prefix is proposed to boost attention of prefix and enhance the gradient propagation of prefix tokens, which improves the flexibility and performance of the vanilla prefix. For another thing, L-Adapter is designed in parallel with layer normalization to adjust the overall distribution, which can resolve conflicts caused by overlap and interaction among multiple submodules. Extensive experimental results demonstrate that our UP-Person achieves state-of-the-art results across various person retrieval datasets, including CUHK-PEDES, ICFG-PEDES and RSTPReid while merely fine-tuning 4.7% parameters. Code is available at <uri>https://github.com/Liu-Yating/UP-Person</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 12","pages":"12874-12889"},"PeriodicalIF":11.1,"publicationDate":"2025-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674738","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-07-14, DOI: 10.1109/TCSVT.2025.3588710
Jinglin Xu;Yaqi Zhang;Wenhao Zhou;Hongmin Liu
Temporal Action Localization (TAL) aims to identify the boundaries of actions and their corresponding categories in untrimmed videos. Most existing methods process past and future information simultaneously, neglecting the inherently sequential nature of action occurrence. This conflated treatment of past and future information hinders the model’s ability to understand action procedures effectively. To address these issues, we propose Bidirectional Feature Splitting with Cross-Layer Fusion for Temporal Action Localization (BFSTAL), a new bidirectional feature-splitting approach based on Mamba for the TAL task. It is composed of two core parts, Decomposed Bidirectionally Hybrid (DBH) and Cross-Layer Fusion Detection (CLFD), which explicitly enhance the model’s capacity to understand action procedures and, in particular, to localize the temporal boundaries of actions. Specifically, the DBH component splits video features at a given timestamp into forward features (past information) and backward features (future information). DBH integrates three key modules: Bidirectional Multi-Head Self-Attention (Bi-MHSA), Bidirectional State Space Model (Bi-SSM), and Bidirectional Convolution (Bi-CONV). It effectively captures long-range dependencies by combining state-space modeling, attention mechanisms, and convolutional networks while improving spatial-temporal awareness. Furthermore, we propose Cross-Layer Fusion Detection (CLFD), which aggregates multi-scale features from different pyramid levels, enhancing contextual understanding and temporal action localization precision. Extensive experiments demonstrate that BFSTAL outperforms other methods on four widely used TAL benchmarks: THUMOS14, EPIC-KITCHENS 100, Charades, and MultiTHUMOS.
{"title":"BFSTAL: Bidirectional Feature Splitting With Cross-Layer Fusion for Temporal Action Localization","authors":"Jinglin Xu;Yaqi Zhang;Wenhao Zhou;Hongmin Liu","doi":"10.1109/TCSVT.2025.3588710","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3588710","url":null,"abstract":"Temporal Action Localization (TAL) aims to identify the boundaries of actions and their corresponding categories in untrimmed videos. Most existing methods simultaneously process past and future information, neglecting the inherently sequential nature of action occurrence. This confused treatment of past and future information hinders the model’s ability to understand action procedures effectively. To address these issues, we propose Bidirectional Feature Splitting with Cross-Layer Fusion for Temporal Action Localization (BFSTAL), a new bidirectional feature-splitting approach based on Mamba for the TAL task, composed of two core parts, Decomposed Bidirectionally Hybrid (DBH) and Cross-Layer Fusion Detection (CLFD), which explicitly enhances the model’s capacity to understand action procedures, especially to localize temporal boundaries of actions. Specifically, we introduce the Decomposed Bidirectionally Hybrid (DBH) component, which splits video features at a given timestamp into forward features (past information) and backward features (future information). DBH integrates three key modules: Bidirectional Multi-Head Self-Attention (Bi-MHSA), Bidirectional State Space Model (Bi-SSM), and Bidirectional Convolution (Bi-CONV). DBH effectively captures long-range dependencies by combining state-space modeling, attention mechanisms, and convolutional networks while improving spatial-temporal awareness. Furthermore, we propose Cross-Layer Fusion Detection (CLFD), which aggregates multi-scale features from different pyramid levels, enhancing contextual understanding and temporal action localization precision. Extensive experiments demonstrate that BFSTAL outperforms other methods on four widely used TAL benchmarks: THUMOS14, EPIC-KITCHENS 100, Charades, and MultiTHUMOS.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 12","pages":"12707-12718"},"PeriodicalIF":11.1,"publicationDate":"2025-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674788","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-07-14, DOI: 10.1109/TCSVT.2025.3588357
Kaifeng Gao;Siqi Chen;Hanwang Zhang;Jun Xiao;Yueting Zhuang;Qianru Sun
Visual relation detection (VRD) aims to identify relationships (or interactions) between object pairs in an image. Although recent VRD models have achieved impressive performance, they are all restricted to pre-defined relation categories and fail to consider the semantic ambiguity characteristic of visual relations. Unlike objects, the appearance of visual relations is always subtle and can be described by multiple predicate words from different perspectives, e.g., “ride” can be depicted as “race” and “sit on” from the sports and spatial-position views, respectively. To this end, we propose to model visual relations as continuous embeddings and design diffusion models to achieve generalized VRD in a conditional generative manner, termed Diff-VRD. We model the diffusion process in a latent space and generate all possible relations in the image as an embedding sequence. During generation, the visual and text embeddings of subject-object pairs serve as conditional signals and are injected via cross-attention. After generation, we design a subsequent matching stage that assigns relation words to subject-object pairs by considering their semantic similarities. Benefiting from the diffusion-based generative process, our Diff-VRD is able to generate visual relations beyond the pre-defined category labels of datasets. To properly evaluate this generalized VRD task, we introduce two evaluation metrics, i.e., text-to-image retrieval and a SPICE PR curve inspired by image captioning. Extensive experiments on both human-object interaction (HOI) detection and scene graph generation (SGG) benchmarks attest to the superiority and effectiveness of Diff-VRD.
{"title":"Generalized Visual Relation Detection With Diffusion Models","authors":"Kaifeng Gao;Siqi Chen;Hanwang Zhang;Jun Xiao;Yueting Zhuang;Qianru Sun","doi":"10.1109/TCSVT.2025.3588357","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3588357","url":null,"abstract":"Visual relation detection (VRD) aims to identify relationships (or interactions) between object pairs in an image. Although recent VRD models have achieved impressive performance, they are all restricted to pre-defined relation categories, while failing to consider the e.gsemantic ambiguity characteristic of visual relations. Unlike objects, the appearance of visual relations is always subtle and can be described by multiple predicate words from different perspectives, e.g., “ride” can be depicted as “race” and “sit on”, from the sports and spatial position views, respectively. To this end, we propose to model visual relations as continuous embeddings, and design diffusion models to achieve generalized VRD in a conditional generative manner, termed Diff-VRD. We model the diffusion process in a latent space and generate all possible relations in the image as an embedding sequence. During the generation, the visual and text embeddings of subject-object pairs serve as conditional signals and are injected via cross-attention. After the generation, we design a subsequent matching stage to assign the relation words to subject-object pairs by considering their semantic similarities. Benefiting from the diffusion-based generative process, our Diff-VRD is able to generate visual relations beyond the pre-defined category labels of datasets. To properly evaluate this generalized VRD task, we introduce two evaluation metrics, e.gi.e., text-to-image retrieval and SPICE PR Curve inspired by image captioning. Extensive experiments in both human-object interaction (HOI) detection and scene graph generation (SGG) benchmarks attest to the superiority and effectiveness of Diff-VRD.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 1","pages":"1203-1215"},"PeriodicalIF":11.1,"publicationDate":"2025-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146049277","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep cross-modal hashing has demonstrated strong performance in large-scale retrieval but remains challenging in few-shot scenarios due to limited data and weak cross-modal alignment. We propose Generative Augmentation Hashing (GAH), a new framework that synergizes Visual-Language Models (VLMs) and generation-driven hashing to address these limitations. GAH first introduces a cycle generative augmentation mechanism: VLMs generate descriptive textual captions for images, which, combined with label semantics, guide diffusion models to synthesize semantically aligned images via inconsistency filtering. These images then regenerate coherent textual descriptions through VLMs, forming a self-reinforcing cycle that iteratively expands cross-modal data. To resolve the diversity-alignment trade-off in augmentation, we design cross-modal perturbation enhancement, injecting synchronized perturbations with controlled noise to preserve inter-modal semantic relationships while enhancing robustness. Finally, GAH employs dual-level adversarial hash learning, where adversarial alignment of modality-specific and shared latent spaces optimizes both cross-modal consistency and discriminative hash code generation, effectively bridging heterogeneous gaps. Extensive experiments on benchmark datasets show that GAH outperforms state-of-the-art methods in few-shot cross-modal retrieval, achieving significant improvements in retrieval accuracy. Our source codes and datasets are available at https://github.com/xiaolaohuuu/GAH
{"title":"Generative Augmentation Hashing for Few-Shot Cross-Modal Retrieval","authors":"Fengling Li;Zequn Wang;Tianshi Wang;Lei Zhu;Xiaojun Chang","doi":"10.1109/TCSVT.2025.3588769","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3588769","url":null,"abstract":"Deep cross-modal hashing has demonstrated strong performance in large-scale retrieval but remains challenging in few-shot scenarios due to limited data and weak cross-modal alignment. We propose Generative Augmentation Hashing (GAH), a new framework that synergizes Visual-Language Models (VLMs) and generation-driven hashing to address these limitations. GAH first introduces a cycle generative augmentation mechanism: VLMs generate descriptive textual captions for images, which, combined with label semantics, guide diffusion models to synthesize semantically aligned images via inconsistency filtering. These images then regenerate coherent textual descriptions through VLMs, forming a self-reinforcing cycle that iteratively expands cross-modal data. To resolve the diversity-alignment trade-off in augmentation, we design cross-modal perturbation enhancement, injecting synchronized perturbations with controlled noise to preserve inter-modal semantic relationships while enhancing robustness. Finally, GAH employs dual-level adversarial hash learning, where adversarial alignment of modality-specific and shared latent spaces optimizes both cross-modal consistency and discriminative hash code generation, effectively bridging heterogeneous gaps. Extensive experiments on benchmark datasets show that GAH outperforms state-of-the-art methods in few-shot cross-modal retrieval, achieving significant improvements in retrieval accuracy. Our source codes and datasets are available at <uri>https://github.com/xiaolaohuuu/GAH</uri>","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 12","pages":"12861-12873"},"PeriodicalIF":11.1,"publicationDate":"2025-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145729339","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-07-11, DOI: 10.1109/TCSVT.2025.3588230
Ye Wang;Mingyang Ma;Ge Zhang;Yuheng Liu;Tao Gao;Shaohui Mei
Hyperspectral imaging offers significant potential for precise object tracking, yet the scarcity of datasets specifically tailored for hyperspectral tracking algorithms hinders progress, particularly for deep models with complex structures. Additionally, current deep learning-based hyperspectral trackers typically enhance model accuracy via online or adversarial learning, adversely affecting tracking speed. To address these challenges, this paper introduces the Constrained Object Adaptive Learning hyperspectral Tracker (COALT), an effective parameter-efficient fine-tuning tracker tailored for hyperspectral tracking. COALT integrates a Pixel-level Object Constrained Spectral Prompt (POCSP) and a Temporal Sequence Trajectory Prompt (TSTP) through Adaptive Learning with Parameter-efficient Fine-tuning (ALPEFT), enabling a transformer-based tracker to capture detailed spectral features and relationships in hyperspectral image sequences through trainable rank-decomposition matrices. Specifically, POCSP is designed to retain optimal spectral information with low internal correlation and high object representativeness, enabling rapid image reconstruction. The most representative spectral template and search features are then fused into a single stream as spectral prompts for the Encoder and Decoder layers. Concurrently, the previous coordinates within the same sequence are tokenized and utilized as temporal prompts by TSTP in the decoder layers. The model is trained with ALPEFT to optimize spectral information learning, which substantially reduces the number of training parameters and alleviates overfitting caused by limited data. Meanwhile, the proposed tracker not only retains the pre-trained model's ability to estimate object trajectories in an autoregressive manner but also effectively exploits spectral information and enhances target localization during fine-tuning. Extensive experiments and evaluations are conducted on two public hyperspectral tracking datasets. The results demonstrate that the proposed COALT tracker achieves satisfactory performance with leading processing speed. The code will be available at https://github.com/PING-CHUANG/COALT
{"title":"Hyperspectral Tracker With Constrained Object Adaptive Learning and Trajectory Construction","authors":"Ye Wang;Mingyang Ma;Ge Zhang;Yuheng Liu;Tao Gao;Shaohui Mei","doi":"10.1109/TCSVT.2025.3588230","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3588230","url":null,"abstract":"Hyperspectral imaging offers significant potential for precise object tracking, yet the scarcity of dataset volumes specifically tailored for hyperspectral tracking algorithms hinders progress, particularly for deep models with complex structures. Additionally, current deep learning-based hyperspectral trackers typically enhance model accuracy via online or adversarial learning, adversely affecting tracking speed. To address these challenges, this paper introduces the Constrained Object Adaptive Learning hyperspectral Tracker (COALT), an effective parameter-efficient fine-tuning tracker tailored for hyperspectral tracking. COALT integrates Pixel-level Object Constrained Spectral Prompt (POCSP) and Temporal Sequence Trajectory Prompt (TSTP) through Adaptive Learning with Parameter-efficient Fine-tuning (ALPEFT), enabling a transformer-based tracker to capture detailed spectral features and relationships in hyperspectral image sequences through trainable rank decomposition matrices. Specifically, POCSP is designed to retain optimal spectral information with low internal correlation and high object representativeness, enabling rapid image reconstruction. Then, the most representative spectral template and search are fused into a single stream as spectral prompts for the Encoder and Decoder layers. Concurrently, the previous coordinates within the same sequence are tokenized and utilized as temporal prompts by TSTP in the decoder layers. The model is trained with ALPEFT to optimize spectral information learning, which substantially reduces the number of training parameters, alleviating overfitting issues arising from limited data. Meanwhile, the proposed tracker not only retains the ability of pre-trained model to estimate object trajectories in an autoregressive manner but also effectively utilizes spectral information and enhances target location perception during the fine-tuning process. Extensive experiments and evaluations are conducted on two public hyperspectral tracking datasets. The results demonstrate that the proposed COALT tracker achieves satisfactory performance with leading processing speed. The code will be available at <uri>https://github.com/ PING-CHUANG/COALT</uri>","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 12","pages":"12666-12679"},"PeriodicalIF":11.1,"publicationDate":"2025-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-07-11, DOI: 10.1109/TCSVT.2025.3588161
Lingchen Gu;Xiaojuan Shen;Jiande Sun;Yan Liu;Jing Li;Zhihui Li;Sen-Ching S. Cheung;Wenbo Wan
With the rapid advances in wireless communication and IoT platforms, it is increasingly difficult to analyze relevant multi-modal data distributed across geographically diverse and heterogeneous platforms. One promising approach is to rely on federated learning to build compact cross-modal hash codes. However, existing federated learning methods often suffer degraded global-model performance because the distributed data are drawn from diverse domains. In addition, directly forcing each client to adopt the global parameters as its local parameters, without effective local training, significantly reduces the performance of each client. To overcome these challenges, we propose a novel federated adversarial cross-modal hashing method, called Dual Prototypes-based personalized Federated Adversarial hashing (DP-FeAd), which provides iterative training of shared dual prototypes. Aiming to expand local hashing models beyond their own knowledge realms, DP-FeAd enables participating clients to engage in cooperative learning through two constructions, cluster prototypes and unbiased prototypes, instead of traditional global prototypes, ensuring both generalization and stability. Specifically, the cluster prototypes are derived from local class-level prototypes and adversarially trained with local approximate hash codes to align their distributions. The unbiased prototypes are averaged from the cluster prototypes and integrated into the training of local hashing models to further maintain consistency across different local class-level prototypes. Experiments conducted on two benchmark datasets demonstrate that our proposed method significantly enhances the performance of deep cross-modal hashing models in both IID (Independent and Identically Distributed) and non-IID scenarios.
{"title":"Dual Prototypes-Based Personalized Federated Adversarial Cross-Modal Hashing","authors":"Lingchen Gu;Xiaojuan Shen;Jiande Sun;Yan Liu;Jing Li;Zhihui Li;Sen-Ching S. Cheung;Wenbo Wan","doi":"10.1109/TCSVT.2025.3588161","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3588161","url":null,"abstract":"With the rapid advances in wireless communication and IoT platforms, it is increasingly difficult to analyze relevant multi-modal data distributed across geographically diverse and heterogeneous platforms. One promising approach is to rely on federated learning to build compact cross-modal hash codes. However, existing federated learning methods easily exhibit degenerative performance in the global model due to the distributed data being derived from diverse domains. In addition, directly forcing each client to adopt the same global parameters as local parameters, without effective local training, significantly reduces the performance of each client. To overcome these challenges, we propose a novel federated adversarial cross-modal hashing, called Dual Prototypes-based personalized Federated Adversarial (DP-FeAd), which provides iterated training of shared dual prototypes. Specifically, aiming to expand local hashing models beyond their knowledge realms, DP-FeAd enables participating clients to engage in cooperative learning through two constructions: cluster prototypes and unbiased prototypes, instead of the traditional global prototypes, ensuring both generalization and stability. Specifically, the cluster prototypes are derived from local class-level prototypes and adversarially trained with local approximate hash codes to align their distributions. The unbiased prototypes are averaged from cluster prototypes and integrated into the training of local hashing models to maintain consistency across different local class-level prototypes further. The experiments conducted on two benchmark datasets demonstrate that our proposed method significantly enhances the performance of deep cross-modal hashing models in both IID (Independent and Identically Distributed) and non-IID scenarios.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 12","pages":"12846-12860"},"PeriodicalIF":11.1,"publicationDate":"2025-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145729378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-07-11, DOI: 10.1109/TCSVT.2025.3588269
Wenbin Yan;Hua Chen;Qingwei Wu;Xiaogang Zhang;Qiu Fang;Shengjie Hu;Yaonan Wang
Efficiently aggregating 4D light field information for accurate semantic segmentation has long faced two challenges: CNN-based methods struggle to capture long-range dependencies, while Transformer-based methods suffer from the memory limitations of quadratic computational complexity. Recently, the Mamba architecture, which utilizes the state space model (SSM), has achieved high performance with linear complexity in various vision tasks. However, directly applying Mamba to 4D light field scanning leads to an inherent loss of multi-spatial-angular information. To address these challenges, we introduce LFSSMam, a novel Light Field Semantic Segmentation architecture based on the selective state space model (Mamba). First, LFSSMam presents an innovative spatial-angular selective scanning mechanism to decouple and scan 4D multi-dimensional light field data. It separately captures the rich spatial context and the complementary angular and structural information of 2D light field slices within the state space. In addition, we design an SSM-attention Cross-Fusion Enhance Module to perform preferential scanning and fusion across multi-spatial-angular-modal light field information, adaptively aggregating and enhancing the central-view features. Comprehensive experiments on synthetic and real-world datasets demonstrate that LFSSMam achieves state-of-the-art (SOTA) performance (with a 6.97% improvement over LF-based methods) while reducing memory and computational complexity. This work provides valuable guidance for the efficient modeling and application of multi-spatial-angular information in light field semantic segmentation. Our code is available at https://github.com/HNU-WQW/LFSSMam
{"title":"LFSSMam: Efficient Aggregation of Multi-Spatial-Angular-Modal Information Using Selective SSM for Light Field Semantic Segmentation","authors":"Wenbin Yan;Hua Chen;Qingwei Wu;Xiaogang Zhang;Qiu Fang;Shengjie Hu;Yaonan Wang","doi":"10.1109/TCSVT.2025.3588269","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3588269","url":null,"abstract":"Efficiently aggregating 4D light field information to achieve accurate semantic segmentation has always faced challenges in capturing long range dependency information (CNN-based) and the memory limitations of quadratic computational complexity (Transformer-based). Recently, the Mamba architecture, which utilizes the state space model (SSM), has achieved high performance under linear complexity in various vision tasks. However, directly applying Mamba to 4D light field scanning will lead to an inherent loss of multi-spatial-angular information. To address the above challenges, we introduce LFSSMam, a novel Light Field Semantic Segmentation architecture based on the selective state space model (Mamba). Firstly, LFSSMam presents an innovative spatial-angular selective scanning mechanism to decouple and scan 4D multi-dimensional light field data. It separately captures the rich spatial context, complementary angular and structural information of light field 2D slices within the state space. In addition, we design an SSM-attention Cross-Fusion Enhance Module to perform preferential scanning and fusion across multi-spatial-angular-modal light field information, adaptively aggregating and enhancing the central view features. Comprehensive experiments on synthetic and real world datasets demonstrate that LFSSMam achieves leading edge SOTA (State-Of-The-Art) performance (with a 6.97% improvement to LF-based methods) while reducing memory and computational complexity. This work provides valuable guidance for the efficient modeling and application of multi-spatial-angular information in light field semantic segmentation. Our code is available at <uri>https://github.com/HNU-WQW/LFSSMam</uri>","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 12","pages":"12592-12606"},"PeriodicalIF":11.1,"publicationDate":"2025-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674853","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Modern compression systems use linear transformations in their encoding and decoding processes, with transforms providing compact signal representations. While multiple data-dependent transforms for image/video coding can adapt to diverse statistical characteristics, assembling large datasets to learn each transform is challenging. Moreover, the resulting transforms typically lack a fast implementation, leading to significant computational costs. Thus, despite many papers proposing new transform families, the most recent compression standards predominantly use traditional separable sinusoidal transforms. This paper proposes integrating a new family of Symmetry-based Graph Fourier Transforms (SBGFTs) of variable sizes into a coding framework, focusing on the extension from our previously introduced 8×8 SBGFTs to the general case of N×N grids. SBGFTs are non-separable transforms that achieve sparse signal representation while maintaining low computational complexity thanks to their symmetry properties. Their design is based on our proposed algorithm, which generates symmetric graphs on the grid by adding specific symmetrical connections between nodes and does not require any data-dependent adaptation. Furthermore, for video intra-frame coding, we exploit the correlations between optimal graphs and prediction modes to reduce the cardinality of the transform sets, thus proposing a low-complexity framework. Experiments show that SBGFTs outperform the primary transforms integrated in the explicit Multiple Transform Selection (MTS) used in the latest VVC intra-coding, providing a bit-rate saving of 6.23% with only a marginal increase in average complexity. A MATLAB implementation of the proposed algorithm is available online at https://github.com/AlessandroGnutti/Variable-SBGFTs.
{"title":"Variable-Size Symmetry-Based Graph Fourier Transforms for Image Compression","authors":"Alessandro Gnutti;Fabrizio Guerrini;Riccardo Leonardi;Antonio Ortega","doi":"10.1109/TCSVT.2025.3587753","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3587753","url":null,"abstract":"Modern compression systems use linear transformations in their encoding and decoding processes, with transforms providing compact signal representations. While multiple data-dependent transforms for image/video coding can adapt to diverse statistical characteristics, assembling large datasets to learn each transform is challenging. Also, the resulting transforms typically lack fast implementation, leading to significant computational costs. Thus, despite many papers proposing new transform families, the most recent compression standards predominantly use traditional separable sinusoidal transforms. This paper proposes integrating a new family of Symmetry-based Graph Fourier Transforms (SBGFTs) of variable sizes into a coding framework, focusing on the extension from our previously introduced <inline-formula> <tex-math>$8times 8$ </tex-math></inline-formula> SBGFTs to the general case of NxN grids. SBGFTs are non-separable transforms that achieve sparse signal representation while maintaining low computational complexity thanks to their symmetry properties. Their design is based on our proposed algorithm, which generates symmetric graphs on the grid by adding specific symmetrical connections between nodes and does not require any data-dependent adaptation. Furthermore, for video intra-frame coding, we exploit the correlations between optimal graphs and prediction modes to reduce the cardinality of the transform sets, thus proposing a low-complexity framework. Experiments show that SBGFTs outperform the primary transforms integrated in the explicit Multiple Transform Selection (MTS) used in the latest VVC intra-coding, providing a bit rate saving percentage of <inline-formula> <tex-math>$mathbf {6.23%}$ </tex-math></inline-formula>, with only a marginal increase in average complexity. <italic>A</i> MATLAB <italic>implementation of the proposed algorithm is available online at</i> <uri>https://github.com/AlessandroGnutti/Variable-SBGFTs</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 12","pages":"12772-12787"},"PeriodicalIF":11.1,"publicationDate":"2025-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674753","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In the field of no-reference image quality assessment (NR-IQA), the visual masking effect has long been a challenging issue. Although existing methods attempt to alleviate the interference caused by masking by generating pseudoreference images, the quality of these images is often constrained by the accuracy and reconstruction capabilities of image restoration algorithms. This can introduce additional biases, thereby affecting the reliability of the evaluation results. To address this problem, we propose a novel generative “noise” estimation framework (GNE-Vim) that eliminates the need for pseudoreference images. Instead, it deeply decouples the distortion components from degraded images and performs quality-aware modelling of these components. During the training phase, the model leverages both reference images and distortion components to guide the learning of the true distortion distribution. In the inference phase, quality prediction is conducted directly on the basis of the decoupled distortion components, making the evaluation results more aligned with human subjective perception. The experimental results demonstrate that the proposed method achieves strong performance across datasets containing various types of distortions. The source code is publicly available at the following website: https://github.com/opencodelxt/GNE-Vim
{"title":"No-Reference Image Quality Assessment: Exploring Intrinsic Distortion Characteristics via Generative Noise Estimation With Mamba","authors":"Xuting Lan;Weizhi Xian;Mingliang Zhou;Jielu Yan;Xuekai Wei;Jun Luo;Weijia Jia;Sam Kwong","doi":"10.1109/TCSVT.2025.3586106","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3586106","url":null,"abstract":"In the field of no-reference image quality assessment (NR-IQA), the visual masking effect has long been a challenging issue. Although existing methods attempt to alleviate the interference caused by masking by generating pseudoreference images, the quality of these images is often constrained by the accuracy and reconstruction capabilities of image restoration algorithms. This can introduce additional biases, thereby affecting the reliability of the evaluation results. To address this problem, we propose a novel generative “noise” estimation framework (GNE-Vim) that eliminates the need for pseudoreference images. Instead, it deeply decouples the distortion components from degraded images and performs quality-aware modelling of these components. During the training phase, the model leverages both reference images and distortion components to guide the learning of the true distortion distribution. In the inference phase, quality prediction is conducted directly on the basis of the decoupled distortion components, making the evaluation results more aligned with human subjective perception. The experimental results demonstrate that the proposed method achieves strong performance across datasets containing various types of distortions. The source code is publicly available at the following website: <uri>https://github.com/opencodelxt/GNE-Vim</uri>","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 12","pages":"12692-12706"},"PeriodicalIF":11.1,"publicationDate":"2025-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674860","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-07-07, DOI: 10.1109/TCSVT.2025.3586442
Yiqian Wu;Hao Xu;Xiangjun Tang;Yue Shangguan;Hongbo Fu;Xiaogang Jin
3D-aware face generators are typically trained on 2D real-life face image datasets that primarily consist of near-frontal face data. Due to data limitations, these generators cannot generate one-quarter headshot 3D portraits with head, neck, and shoulder geometry, which is crucial for applications like talking heads. Two reasons account for this issue: First, existing facial recognition methods struggle with extracting facial data captured from large camera angles or back views. Second, it is challenging to learn a distribution of 3D portraits covering the one-quarter headshot region from single-view data due to significant geometric deformation caused by diverse body poses. To this end, we first create the dataset 360°-Portrait-HQ (360°PHQ for short), which consists of high-quality single-view real portraits annotated with a variety of camera parameters (the yaw angles span the entire 360° range) and body poses. We then propose 3DPortraitGAN, the first 3D-aware one-quarter headshot portrait generator that learns a canonical 3D avatar distribution from the 360°PHQ dataset with body pose self-learning. Our model can generate view-consistent portrait images from all camera angles with a canonical one-quarter headshot 3D representation. Our experiments show that the proposed framework can accurately predict portrait body poses and generate view-consistent, realistic portrait images with complete geometry from all camera angles.
{"title":"3DPortraitGAN: Learning One-Quarter Headshot 3D GANs From a Single-View Portrait Dataset With Diverse Body Poses","authors":"Yiqian Wu;Hao Xu;Xiangjun Tang;Yue Shangguan;Hongbo Fu;Xiaogang Jin","doi":"10.1109/TCSVT.2025.3586442","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3586442","url":null,"abstract":"3D-aware face generators are typically trained on 2D real-life face image datasets that primarily consist of near-frontal face data. Due to data limitations, these generators cannot generate <italic>one-quarter headshot</i> 3D portraits with head, neck, and shoulder geometry, which is crucial for applications like talking heads. Two reasons account for this issue: First, existing facial recognition methods struggle with extracting facial data captured from large camera angles or back views. Second, it is challenging to learn a distribution of 3D portraits covering the one-quarter headshot region from single-view data due to significant geometric deformation caused by diverse body poses. To this end, we first create the dataset <inline-formula> <tex-math>$it {360}^{circ }$ </tex-math></inline-formula>-<italic>Portrait</i>-<italic>HQ</i> (<inline-formula> <tex-math>$it {360}^{circ }$ </tex-math></inline-formula><italic>PHQ</i> for short) which consists of high-quality single-view real portraits annotated with a variety of camera parameters (the yaw angles span the entire 360° range) and body poses. We then propose <italic>3DPortraitGAN</i>, the first 3D-aware one-quarter headshot portrait generator that learns a canonical 3D avatar distribution from the <inline-formula> <tex-math>$it {360}^{circ }$ </tex-math></inline-formula><italic>PHQ</i> dataset with body pose self-learning. Our model can generate view-consistent portrait images from all camera angles with a canonical one-quarter headshot 3D representation. Our experiments show that the proposed framework can accurately predict portrait body poses and generate view-consistent, realistic portrait images with complete geometry from all camera angles.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 12","pages":"12760-12771"},"PeriodicalIF":11.1,"publicationDate":"2025-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674678","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}