Pub Date: 2025-09-22. DOI: 10.1109/TCSVT.2025.3612697
Mingyue Niu;Zhuhong Shao;Yongjun He;Jianhua Tao;Björn W. Schuller
Physiological studies have shown that differences between depressed and healthy individuals are manifested in the audio and video modalities. Hence, some researchers have combined local and global information from the audio or video modality to obtain a unimodal representation. Attention mechanisms or Multi-Layer Perceptrons (MLPs) are then used to fuse the different representations. However, attention mechanisms and MLPs are essentially linear aggregation schemes and lack the ability to explore the element-wise interaction between local and global representations within and across modalities, which limits the accuracy of estimating depression severity. To this end, we propose a Representation Interaction (RI) module, which uses mutual linear adjustment to achieve element-wise interaction between representations. Thus, the RI module can be seen as a mutual observation between two representations, which helps to achieve complementary advantages and improves the model's ability to characterize depression cues. Furthermore, since the interaction process generates multiple representations, we propose a Multi-representation Prediction (MP) module. This module performs multi-representation vectorization in a hierarchical manner, from summarizing a single representation to aggregating multiple representations, and adopts an attention mechanism to estimate an individual's depression severity. In this way, we use the RI and MP modules to construct the Multimodal Local Global Interaction (MLGI) network. Experimental results on the AVEC 2013 and AVEC 2014 depression datasets demonstrate the effectiveness of our method.
{"title":"Multimodal Local Global Interaction Networks for Automatic Depression Severity Estimation","authors":"Mingyue Niu;Zhuhong Shao;Yongjun He;Jianhua Tao;Björn W. Schuller","doi":"10.1109/TCSVT.2025.3612697","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3612697","url":null,"abstract":"Physiological studies have shown that differences between depressed and healthy individuals are manifested in the audio and video modalities. Hence, some researchers have combined local and global information from audio or video modality to obtain the unimodal representation. Attention mechanisms or Multi-Layer Perceptrons (MLPs) are then used to complete the fusion of different representations. However, attention mechanisms or MLPs is essentially a linear aggregation manner, and lacks the ability to explore the element-wise interaction between local and global representations within and across modalities, which affects the accuracy of estimating the depression severity. To this end, we propose a Representation Interaction (RI) module, which uses the mutual linear adjustment to achieve element-wise interaction between representations. Thus, the RI module can be seen as an mutual observation of two representations, which helps to achieve complementary advantages and improve the model’s ability to characterize depression cues. Furthermore, since the interaction process generates multiple representations, we propose a Multi-representation Prediction (MP) module. This module implements multi-representation vectorization in a hierarchical manner from summarizing a single representation to aggregating multiple representations, and adopts the attention mechanism to obtain the estimation of an individual depression severity. In this way, we use the RI and MP modules to construct the Multimodal Local Global Interaction (MLGI) network. The experimental performance on AVEC 2013 and AVEC 2014 depression datasets demonstrates the effectiveness of our method.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 2","pages":"2649-2664"},"PeriodicalIF":11.1,"publicationDate":"2025-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146154461","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-17. DOI: 10.1109/TCSVT.2025.3610946
Longtao Feng;Qian Yin;Jiaqi Zhang;Yuwen He;Siwei Ma
In recent years, rate control (RC) for neural video coding (NVC) has become an active research area. However, existing RC methods in NVC neglect the actual rate-distortion (R-D) characteristics and lack dedicated optimization strategies for intra and inter modes, leading to significant bit rate errors. To address these issues, we propose a high-accuracy RC method for NVC based on R-D modeling, which integrates intra-frame RC, inter-frame RC, and bit allocation. Specifically, rate-quantization (R-Q) and R-D models are established for both intra frames and inter frames in NVC. To derive the model parameters, intra-frame parameters are estimated using high-dimensional features, while inter-frame parameters are derived using gradient-descent-based model updates. Based on the proposed R-Q model, intra-frame and inter-frame RC methods are proposed to determine the quantization parameters (QP). Meanwhile, a bit allocation method is developed based on the derived R-D models to allocate bits between intra frames and inter frames. Extensive experiments demonstrate that, benefiting from the accurate R-Q models derived by the proposed approach, highly accurate RC is achieved with only a 0.56% average bit rate error. Compared with other methods, the proposed method reduces the average bit rate error by more than 4.18% and achieves over 8.94% Bjøntegaard Delta Rate savings.
{"title":"High Accuracy Rate Control for Neural Video Coding Based on Rate-Distortion Modeling","authors":"Longtao Feng;Qian Yin;Jiaqi Zhang;Yuwen He;Siwei Ma","doi":"10.1109/TCSVT.2025.3610946","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3610946","url":null,"abstract":"In recent years, rate control (RC) for neural video coding (NVC) has become an active research area. However, existing RC methods in NVC neglect the actual rate-distortion (<italic>R-D</i>) characteristics and lack dedicated optimization strategies for intra and inter modes, leading to significant bit rate errors. To address these issues, we propose a high accuracy RC method for NVC based on <italic>R-D</i> modeling, which integrates intra frame RC, inter frame RC and bit allocation. Specifically, the rate-quantization parameter (<italic>R-Q</i>) model and <italic>R-D</i> model are established for both intra frame and inter frame in NVC. To derive the model parameters, intra frame parameters are estimated using high dimensional features, while inter frame parameters are derived using gradient descent based model update methods. Based on the proposed <italic>R-Q</i> model, intra frame and inter frame RC methods are proposed to determine the quantization parameters (QP). Meanwhile, a bit allocation method is developed based on the derived <italic>R-D</i> models to allocate bits for the intra frame and inter frame. Extensive experiments demonstrate that, benefiting from the accurate <italic>R-Q</i> models derived by the proposed approach, highly accurate RC is achieved with only 0.56% average bit rate error. Compared with other methods, the proposed method reduces the average bit rate error by more than 4.18%, and achieves over 8.94% Bjøntegaard Delta Rate savings.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 2","pages":"2551-2567"},"PeriodicalIF":11.1,"publicationDate":"2025-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146154451","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-16. DOI: 10.1109/TCSVT.2025.3609840
Liuxiang Qiu;Si Chen;Jing-Hao Xue;Da-Han Wang;Shunzhi Zhu;Yan Yan
Visible-infrared person re-identification (VI-ReID) is a cross-modality retrieval task that aims to match images of the same person across visible (VIS) and infrared (IR) modalities. Existing VI-ReID methods ignore the high-order structural information of features and struggle to learn a reliable common feature space due to the modality discrepancy between VIS and IR images. To alleviate these issues, we propose a novel high-order hierarchical middle-feature learning network (HOH-Net) for VI-ReID. We introduce a high-order structure learning (HSL) module to explore the high-order relationships of short- and long-range feature nodes, which significantly mitigates model collapse and yields discriminative features. We further develop a fine-coarse graph attention alignment (FCGA) module, which efficiently aligns multi-modality feature nodes from node-level and region-level perspectives, ensuring reliable middle-feature representations. Moreover, we exploit a hierarchical middle-feature agent learning (HMAL) loss that uses the agents of middle features to reduce the modality discrepancy at each stage of the network. The proposed HMAL loss also exchanges detailed and semantic information between low- and high-stage networks. Finally, we introduce a modality-range identity-center contrastive (MRIC) loss to minimize the distances between VIS, IR, and middle features. Extensive experiments demonstrate that the proposed HOH-Net yields state-of-the-art performance on image-based and video-based VI-ReID datasets. The code is available at: https://github.com/Jaulaucoeng/HOS-Net
{"title":"HOH-Net: High-Order Hierarchical Middle-Feature Learning Network for Visible-Infrared Person Re-Identification","authors":"Liuxiang Qiu;Si Chen;Jing-Hao Xue;Da-Han Wang;Shunzhi Zhu;Yan Yan","doi":"10.1109/TCSVT.2025.3609840","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3609840","url":null,"abstract":"Visible-infrared person re-identification (VI-ReID) is a cross-modality retrieval task that aims to match images of the same person across visible (VIS) and infrared (IR) modalities. Existing VI-ReID methods ignore high-order structure information of features and struggle to learn a reliable common feature space due to the modality discrepancy between VIS and IR images. To alleviate the above issues, we propose a novel high-order hierarchical middle-feature learning network (HOH-Net) for VI-ReID. We introduce a high-order structure learning (HSL) module to explore the high-order relationships of short- and long-range feature nodes, for significantly mitigating model collapse and effectively obtaining discriminative features. We further develop a fine-coarse graph attention alignment (FCGA) module, which efficiently aligns multi-modality feature nodes from node-level and region-level perspectives, ensuring reliable middle-feature representations. Moreover, we exploit a hierarchical middle-feature agent learning (HMAL) loss to hierarchically reduce the modality discrepancy at each stage of the network by using the agents of middle features. The proposed HMAL loss also exchanges detailed and semantic information between low- and high-stage networks. Finally, we introduce a modality-range identity-center contrastive (MRIC) loss to minimize the distances between VIS, IR, and middle features. Extensive experiments demonstrate that the proposed HOH-Net yields state-of-the-art performance on the image-based and video-based VI-ReID datasets. The code is available at: <uri>https://github.com/Jaulaucoeng/HOS-Net</uri>","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 2","pages":"2607-2622"},"PeriodicalIF":11.1,"publicationDate":"2025-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146154462","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-16. DOI: 10.1109/TCSVT.2025.3609776
Lang He;Weizhao Yang;Junnan Zhao;Haifeng Chen;Dongmei Jiang
Major depressive disorder (MDD) is projected to become one of the leading mental disorders by 2030. Audiovisual cues have garnered significant attention in depression recognition research owing to their non-invasive acquisition and rich emotional expressiveness. However, conventional centralized training paradigms raise substantial privacy concerns for individuals with depression and are further hindered by data heterogeneity and label inconsistency across datasets. To overcome these challenges, a hybrid architecture, termed Federated Domain Adversarial with Attention Mechanism (FedDAAM), is proposed for privacy-preserving multimodal depression assessment. FedDAAM introduces a mechanism that differentiates discriminative features into depression-public and depression-private features. Specifically, to extract visual depression-private features from the AVEC2013 and AVEC2014 datasets, a local attention-aware (LAA) architecture is developed. For the depression-public features, action units (AUs), landmarks, head poses, and eye-gaze features are adopted. In addition, to account for the transferability and performance of individual clients, a dynamic parameter aggregation mechanism, termed FedDyA, is proposed. Extensive validations are performed on the AVEC2013, AVEC2014, and AVEC2017 databases, resulting in root mean square error (RMSE) and mean absolute error (MAE) of 8.61/6.78, 8.59/6.77, and 4.71/3.68, respectively. More importantly, to the best of our knowledge, this is the first study to employ federated learning (FL) for multimodal depression assessment. The proposed framework offers a novel solution for privacy-aware, distributed clinical diagnosis of depression. Code will be available at: https://github.com/helang818/FedDAAM/
{"title":"FedDAAM: Federated Domain Adversarial Learning With Attention Mechanism for Privacy Preserving Multimodal Depression Assessment","authors":"Lang He;Weizhao Yang;Junnan Zhao;Haifeng Chen;Dongmei Jiang","doi":"10.1109/TCSVT.2025.3609776","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3609776","url":null,"abstract":"Major depressive disorder (MDD) is projected to become one of the leading mental disorders by 2030. While audiovisual cues have garnered significant attention in depression recognition research owing to their non-invasive acquisition and rich emotional expressiveness. However, conventional centralized training paradigms raise substantial privacy concerns for individuals with depression and are further hindered by data heterogeneity and label inconsistency across datasets. To overcome these challenges, a hybrid architecture, termed <bold>Fed</b>erated <bold>D</b>omain <bold>A</b>dversarial with <bold>A</b>ttention <bold>M</b>echanism (FedDAAM), for privacy preserving multimodal depression assessment, is proposed. FedDAAM introduces a mechanism by differentiating discriminative features into depression-public and depression-private features. Specifically, to extract visual depression-private features from the AVEC2013 and AVEC2014 datasets, a local attention-aware (LAA) architecture is developed. For the depression-public features, action units (AUs), landmarks, head poses, and eye gazes features are adopted. In addition, to consider the transferability and performance of individual client, a dynamic parameter aggregation mechanism, termed FedDyA, is proposed. Extensive validations are performed on the AVEC2013, AVEC2014 and AVEC2017 databases, resulting in root mean square error (RMSE) and mean absolute error (MAE) of 8.61/6.78, 8.59/6.77, and 4.71/3.68, respectively. More importantly, to the best of our knowledge, this is the first study to borrow federated learning (FL) for multimodal depression assessment. The proposed framework offers a novel solution for privacy-aware, distributed clinical diagnosis of depression. Code will be available at: <uri>https://github.com/helang818/FedDAAM/</uri>","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 2","pages":"2635-2648"},"PeriodicalIF":11.1,"publicationDate":"2025-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146154442","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-12. DOI: 10.1109/TCSVT.2025.3609410
Yidong Song;Shilei Wang;Zhaochuan Zeng;Jikai Zheng;Zhenhua Wang;Jifeng Ning
Transformer-based trackers have demonstrated remarkable advancements in real-time tracking tasks on edge devices. Since lightweight backbone networks are typically designed for general-purpose tasks, our analysis reveals that, when applied to target tracking, they often contain structurally redundant layers, which limits the model's efficiency. To address this issue, we propose a novel tracking framework that integrates backbone pruning with Hybrid Knowledge Distillation (HKD), effectively reducing model parameters and FLOPs while preserving high tracking accuracy. Inspired by the success of MiniLM and Focal and Global Distillation (FGD), we design an HKD framework tailored for tracking tasks. Our HKD introduces a multi-level and complementary distillation scheme consisting of Token Distillation, Local Distillation, and Global Distillation. In Token Distillation, unlike MiniLM, which distills attention via QK dot-products and V, we disentangle and separately distill the Q, K, and V representations to enhance structural attention alignment for tracking. For Local Distillation, we adopt the FGD concept, incorporating spatial foreground-background masks to capture region-specific discriminative cues more effectively. In Global Distillation, we use a Vision Mamba module to model long-range dependencies and enhance semantic-level feature alignment. Our tracker, HKDT, achieves state-of-the-art (SOTA) performance across multiple datasets. On the GOT-10k benchmark, it reaches 67.6% Average Overlap (AO), outperforming the current SOTA real-time tracker HiT-Base by 3.6% in accuracy while reducing computational costs by 64% and achieving 115% faster tracking speed on CPU platforms. The code and model will be available soon.
{"title":"Exploring Pruning-Based Efficient Object Tracking via Hybrid Knowledge Distillation","authors":"Yidong Song;Shilei Wang;Zhaochuan Zeng;Jikai Zheng;Zhenhua Wang;Jifeng Ning","doi":"10.1109/TCSVT.2025.3609410","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3609410","url":null,"abstract":"Transformer-based trackers have demonstrated remarkable advancements in real-time tracking tasks on edge devices. Since lightweight backbone networks are typically designed for general-purpose tasks, our analysis reveals that, when applied to target tracking, they often contain structurally redundant layers, which limits the model’s efficiency. To address this issue, we propose a novel tracking framework that integrates backbone pruning with Hybrid Knowledge Distillation (HKD), effectively reducing model parameters and FLOPs while preserving high tracking accuracy. Inspired by the success of MiniLM and Focal and Global Distillation (FGD), we design a HKD framework tailored for tracking tasks. Our HKD introduces a multi-level and complementary distillation scheme, consisting of Token Distillation, Local Distillation, and Global Distillation. In Token Distillation, unlike MiniLM, which distills attention via QK dot-products and V, we disentangle and separately distill Q, K, and V representations to enhance structural attention alignment for tracking. For Local Distillation, we use the FGD concept by incorporating spatial foreground-background masks to capture region-specific discriminative cues more effectively. In Global Distillation, we use Vision Mamba module to model long-range dependencies and enhance semantic-level feature alignment. Our tracker HKDT achieves state-of-the-art (SOTA) performance across multiple datasets. On the GOT-10k benchmark, it demonstrates a groundbreaking 67.6% Average Overlap (AO), outperforming the current SOTA real-time tracker HiT-Base by 3.6% in accuracy while reducing computational costs by 64% and achieving 115% faster tracking speed on CPU platforms. The code and model will be available soon.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 2","pages":"2433-2448"},"PeriodicalIF":11.1,"publicationDate":"2025-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146154443","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-12. DOI: 10.1109/TCSVT.2025.3609322
Xuan Xie;Xiang Yuan;Gong Cheng
Weakly supervised object detection has emerged as a cost-effective and promising solution in remote sensing, as it requires only image-level labels and alleviates the burden of labor-intensive instance-level annotations. Existing approaches tend to assign top-scoring proposals and their highly overlapping counterparts as positive samples, thereby overlooking the inherent gap between high classification confidence and precise localization, which in turn introduces the risks of part domination and missed instances. To address these concerns, this paper introduces an Instance-aware Label Assignment scheme for weakly supervised object detection in remote sensing images, termed ILA. Specifically, we propose a context-aware learning network that prioritizes regions fully covering the object over top-scoring yet incomplete candidates. This is empowered by the proposed context classification loss, which dynamically responds to the degree of object visibility, thereby driving the model toward representative proposals and mitigating the optimization dilemma caused by partial coverage. Additionally, an instance excavation module is implemented to reduce the risk of misclassifying object instances as negatives. At its core lies the proposed pseudo ground truth mining (PGM) algorithm, which constructs reliable pseudo boxes from the outputs of the basic multiple instance learning network to excavate potential object instances. Comprehensive evaluations on the challenging NWPU VHR-10.v2 and DIOR datasets underscore the efficacy of our approach, with mean average precision (mAP) scores of 76.56% and 31.73%, respectively.
{"title":"Weakly Supervised Object Detection for Aerial Images With Instance-Aware Label Assignment","authors":"Xuan Xie;Xiang Yuan;Gong Cheng","doi":"10.1109/TCSVT.2025.3609322","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3609322","url":null,"abstract":"Weakly supervised object detection has emerged as a cost-effective and promising solution in remote sensing, as it requires only image-level labels and alleviates the burden of labor-intensive instance-level annotations. Existing approaches tend to assign top-scoring proposals and their highly overlapping counterparts as positive samples, thereby overlooking the inherent gap between high classification confidence and precise localization, which in turn introduces the risk of part domination and instance missing. In order to address these concerns, this paper introduces an <underline>I</u>nstance-aware <underline>L</u>abel <underline>A</u>ssignment scheme for weakly supervised object detection in remote sensing images, termed ILA. Specifically, we propose a context-aware learning network that aims to prioritize regions fully covering the object over top-scoring yet incomplete candidates. This is empowered by the proposed context classification loss, which dynamically responds to the degree of object visibility, thereby driving the model toward representative proposals and mitigating the optimization dilemma caused by partial coverage. Additionally, an instance excavation module is implemented to reduce the risk of misclassifying object instances as negatives. At its core lies the proposed pseudo ground truth mining (PGM) algorithm, which constructs reliable pseudo boxes from the outputs of the basic multiple instance learning network to excavate potential object instances. Comprehensive evaluations on the challenging NWPU VHR-10.v2 and DIOR datasets underscore the efficacy of our approach, with achieved mean average precision (mAP) scores of 76.56% and 31.73%, respectively.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 2","pages":"2492-2504"},"PeriodicalIF":11.1,"publicationDate":"2025-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146154467","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-09. DOI: 10.1109/TCSVT.2025.3600974
Title: IEEE Circuits and Systems Society Information. IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 9, p. C3.
Pub Date: 2025-09-09. DOI: 10.1109/TCSVT.2025.3600972
Title: IEEE Transactions on Circuits and Systems for Video Technology Publication Information. IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 9, p. C2.
Pub Date: 2025-09-04. DOI: 10.1109/TCSVT.2025.3605958
Zhangxiang Shi;Yunlai Ding;Junyu Dong;Tianzhu Zhang
Existing image-text retrieval methods mainly rely on region and word features to measure cross-modal similarities. Thus, dense cross-modal semantic alignment, which matches regions and words, becomes crucial. However, this is non-trivial due to the heterogeneity gap, and the cross-modal attention used to achieve this alignment is inefficient. To solve this problem, we propose a novel framework that goes beyond the previous one-tower and two-tower frameworks to learn cross-modal consensus efficiently. Rather than aligning regions and words directly like existing methods, the proposed framework uses semantic prototypes as a bridge: through semantic decoders, the prototypes attend to contents with the same semantics in different modalities, and cross-modal semantic alignment is thereby naturally achieved. Furthermore, we design a novel plug-and-play self-correction method based on optimal transport to alleviate the drawbacks of incomplete pairwise labels in existing multimodal datasets. On top of various base backbones, we carry out extensive experiments on two benchmark datasets, i.e., Flickr30K and MS-COCO, demonstrating the effectiveness, superiority, and generalization of our method.
{"title":"Beyond One and Two Tower: Cross-Modal Consensus Learning for Image-Text Retrieval","authors":"Zhangxiang Shi;Yunlai Ding;Junyu Dong;Tianzhu Zhang","doi":"10.1109/TCSVT.2025.3605958","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3605958","url":null,"abstract":"Existing image-text retrieval methods mainly rely on region and word features to measure cross-modal similarities. Thus, dense cross-modal semantic alignment which matches regions and words becomes crucial. However, this is non-trivial due to the heterogeneity gap and the cross-modal attention used to achieve this alignment is inefficient. Towards solving this problem, we propose a novel framework that goes beyond the previous one-tower and two-tower frameworks to learn cross-modal consensus efficiently. The proposed framework does not align regions and words directly like existing methods but uses semantic prototypes as a bridge to attend specific contents with the same semantics among different modalities through semantic decoders, through which cross-modal semantic alignment is naturally achieved. Furthermore, we design a novel plug-and-play self-correction method based on optimal transport to alleviate the drawbacks of incomplete pairwise labels in existing multimodal datasets. On top of various base backbones, we carry out extensive experiments on two benchmark datasets, <italic>i.e.</i>, Flickr30K and MS-COCO, demonstrating the effectiveness, superiority and generalization of our method.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 2","pages":"2581-2593"},"PeriodicalIF":11.1,"publicationDate":"2025-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146154402","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-08-29. DOI: 10.1109/TCSVT.2025.3604033
Zhongqing Yu;Xin Liu;Yiu-Ming Cheung;Lei Zhu;Xing Xu;Nannan Wang
Deep cross-modal hashing models generally inherit the vulnerabilities of deep neural networks, making them susceptible to adversarial attacks and thus posing a serious security risk during real-world deployment. Current adversarial attack or defense strategies often establish only a weak correlation between the hash codes and the targeted semantic representations, and there is still a lack of work that simultaneously considers attack and defense for deep cross-modal hashing. To alleviate these concerns, we propose a Fuzzy-Prototype-guided Adversarial Attack and Defense (FPAD) framework to enhance the adversarial robustness of deep cross-modal hashing models. First, an adaptive fuzzy-prototype learning network (FpNet) is presented to efficiently extract a set of fuzzy prototypes that encode the underlying semantic structure of the heterogeneous modalities in both the feature and Hamming spaces. Then, these derived prototypical hash codes are heuristically employed to supervise the generation of high-quality adversarial examples, while a fuzzy-prototype rectification scheme is simultaneously designed to preserve the latent semantic consistency between the adversarial and benign examples. By mixing the adversarial samples with the original training samples as augmented inputs, an efficient fuzzy-prototype-guided adversarial learning framework is proposed to perform collaborative adversarial training and generate robust cross-modal hash codes with strong adversarial defense capabilities, thereby resisting various attacks and benefiting various challenging cross-modal hashing tasks. Extensive experiments on benchmark datasets show that the proposed FPAD framework not only produces high-quality adversarial samples that enhance the adversarial training process, but also exhibits strong adversarial defense capability across various cross-modal hashing tasks. The code is available at: https://github.com/yzq131/FPAD
{"title":"FPAD: Fuzzy-Prototype-Guided Adversarial Attack and Defense for Deep Cross-Modal Hashing","authors":"Zhongqing Yu;Xin Liu;Yiu-Ming Cheung;Lei Zhu;Xing Xu;Nannan Wang","doi":"10.1109/TCSVT.2025.3604033","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3604033","url":null,"abstract":"Deep cross-modal hashing models generally inherit the vulnerabilities of deep neural networks, making them susceptible to adversarial attacks and thus posing a serious security risk during real-world deployment. Current adversarial attack or defense strategies often establish a weak correlation between the hashing codes and the targeted semantic representations, and there is still a lack of related works that simultaneously consider the attack and defense for deep cross-modal hashing. To alleviate these concerns, we propose a Fuzzy-Prototype-guided Adversarial Attack and Defense (FPAD) framework to enhance the adversarial robustness of deep cross-modal hashing models. First, an adaptive fuzzy-prototype learning network (FpNet) is efficiently presented to extract a set of fuzzy-prototypes, aiming to encode the underlying semantic structure of the heterogeneous modalities in both feature and Hamming spaces. Then, these derived prototypical hash codes are heuristically employed to supervise the generation of high-quality adversarial examples, while a fuzzy-prototype rectification scheme is simultaneously designed to preserve the latent semantic consistency between the adversarial and benign examples. By mixing the adversarial samples with the original training samples as the augmented inputs, an efficient fuzzy-prototype-guided adversarial learning framework is proposed to execute the collaborative adversarial training and generate robust cross-modal hash codes with high adversarial defense capabilities, therefore resisting various attacks and benefiting various challenging cross-modal hashing tasks. Extensive experiments evaluated on benchmark datasets show that the proposed FPAD framework not only produces high-quality adversarial samples to enhance the adversarial training process, but also shows its high adversarial defense capability to benefit various cross-modal hashing tasks. The code is available at: <uri>https://github.com/yzq131/FPAD</uri>","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 2","pages":"2568-2580"},"PeriodicalIF":11.1,"publicationDate":"2025-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146154438","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}