Pub Date: 2025-08-29 | DOI: 10.1109/TCSVT.2025.3604034
Zhi Wang;Zixuan Wang;Chao Xu;Shengze Cai
Particle image-based fluid measurement techniques are widely used to study complex flows in nature and industrial processes. Although particle tracking velocimetry (PTV) has shown potential in various experimental applications for quantitatively capturing unsteady flow characteristics, estimating fluid motion with large displacements and high particle densities remains challenging. We propose an artificial-intelligence-enhanced PTV framework to track particle trajectories from consecutive images. The proposed framework, called GOTrack+ (a learning framework with graph optimal transport for particle tracking velocimetry), contains three components: a convolutional neural network-based particle detector for particle recognition and sub-pixel coordinate localization; a graph neural network-based initial displacement predictor for fluid motion estimation; and a graph-based optimal transport particle tracker for continuous particle trajectory linking. Each component of GOTrack+ can be extracted and used independently, not only to enhance classical PTV algorithms but also as a simple, fast, accurate, and robust alternative to traditional PTV programs. Comprehensive evaluations, including numerical simulations and real-world experiments, show that GOTrack+ achieves state-of-the-art performance compared to recent PTV approaches. All code is available at https://github.com/wuwuwuas/GOTrack.git
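The abstract does not show how optimal-transport matching works in practice; the sketch below is a minimal, generic illustration of entropy-regularized optimal transport (Sinkhorn iterations) used to softly assign particles between two frames. The squared-distance cost, the regularization weight `eps`, and the uniform marginals are illustrative assumptions rather than the authors' GOTrack+ implementation (see the linked repository for that).

```python
import numpy as np

def sinkhorn_match(p0, p1, eps=0.01, n_iter=200):
    """Soft-match two particle sets (N x 2 and M x 2 pixel coordinates) with
    entropy-regularized optimal transport (Sinkhorn iterations). Returns an
    N x M transport plan whose row-wise argmax gives a candidate link for
    each particle of the first frame."""
    cost = ((p0[:, None, :] - p1[None, :, :]) ** 2).sum(-1)
    cost = cost / cost.max()                     # normalize for numerical stability
    K = np.exp(-cost / eps)                      # Gibbs kernel
    a = np.full(len(p0), 1.0 / len(p0))          # uniform source marginal
    b = np.full(len(p1), 1.0 / len(p1))          # uniform target marginal
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):                      # alternating scaling updates
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]           # transport plan

# Toy usage: particles displaced by a small random motion between frames.
rng = np.random.default_rng(0)
frame0 = rng.uniform(0, 64, size=(50, 2))
frame1 = frame0 + rng.normal(0.0, 0.5, size=(50, 2))
links = sinkhorn_match(frame0, frame1).argmax(axis=1)   # frame0 -> frame1 indices
```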
{"title":"GOTrack+: A Deep Learning Framework With Graph Optimal Transport for Particle Tracking Velocimetry","authors":"Zhi Wang;Zixuan Wang;Chao Xu;Shengze Cai","doi":"10.1109/TCSVT.2025.3604034","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3604034","url":null,"abstract":"Particle image-based fluid measurement techniques are widely used to study complex flows in nature and industrial processes. Despite that particle tracking velocimetry (PTV) has shown potential in various experimental applications for quantitatively capturing unsteady flow characteristics, estimating fluid motion with long displacement and high particle density remains challenging. We propose an artificial-intelligence-enhanced PTV framework to track particle trajectories from consecutive images. The proposed framework, called GOTrack+ (a learning framework with graph optimal transport for particle tracking velocimetry), contains three components: a convolutional neural network-based particle detector for particle recognition and sub-pixel coordinate localization; a graph neural network-based initial displacement predictor for fluid motion estimation; and a graph-based optimal transport particle tracker for continuous particle trajectory linking. Each component of GOTrack+ can be extracted and used independently, not only to enhance classical PTV algorithms but also as a simple, fast, accurate, and robust alternative to traditional PTV programs. Comprehensive evaluations, including numerical simulations and real-world experiments, have shown that GOTrack+ achieves state-of-the-art performance compared to recent PTV approaches. All the codes are available at <uri>https://github.com/wuwuwuas/GOTrack.git</uri>","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 2","pages":"2358-2371"},"PeriodicalIF":11.1,"publicationDate":"2025-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146154410","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-08-26 | DOI: 10.1109/TCSVT.2025.3603110
Jianxiang Dong;Zhaozheng Yin
Temporal Sentence Grounding (TSG) aims to localize a temporal interval in an untrimmed video that contains the semantics most relevant to a given query sentence. Most existing methods either address the problem in a fully-supervised manner, where temporal boundary annotations are provided, or are dedicated to weakly-supervised TSG without any boundary annotations. However, the former suffers from expensive annotation costs, while the latter yields inferior grounding performance. In this paper, we propose an Annotation-efficient Hybrid Learning (AHL) framework that aims to achieve good TSG performance with less annotation cost by leveraging weakly semi-supervised learning, contrastive learning, and active learning: (1) AHL includes a pseudo-label self-learning module that generates pseudo labels and progressively selects reliable ones to re-train the model; (2) AHL includes a novel self-guided contrastive learning method that performs proposal-level contrastive learning on weakly-labeled data to align the visual and language features; (3) AHL constructs the fully-labeled set by gradually expanding it through active search over informative weakly-labeled samples, considering both difficulty and diversity. We conduct extensive experiments on the ActivityNet and Charades-STA datasets, and the results verify that AHL effectively exploits weakly-labeled data and matches the performance of fully-supervised methods at a much lower annotation cost. Our code is available at https://github.com/DJX1995/AHL
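As a rough illustration of what proposal-level contrastive learning can look like, the sketch below computes an InfoNCE-style loss between a sentence embedding and the proposal features of its video, treating one (pseudo-)positive proposal per query as the positive and the remaining proposals as negatives. The tensor shapes, temperature, and assumption of precomputed features are placeholders, not the authors' self-guided formulation.

```python
import torch
import torch.nn.functional as F

def proposal_contrastive_loss(proposal_feats, query_feats, pos_idx, tau=0.07):
    """InfoNCE-style proposal-level contrastive loss.
    proposal_feats: (B, P, D) proposal-level visual features per video
    query_feats:    (B, D)    sentence embeddings
    pos_idx:        (B,)      index of the positive proposal for each query"""
    p = F.normalize(proposal_feats, dim=-1)
    q = F.normalize(query_feats, dim=-1)
    logits = torch.einsum('bd,bpd->bp', q, p) / tau   # query-to-proposal similarities
    return F.cross_entropy(logits, pos_idx)           # pull positive, push the rest

# Toy usage with random tensors (shapes are illustrative).
loss = proposal_contrastive_loss(torch.randn(4, 16, 256),
                                 torch.randn(4, 256),
                                 torch.randint(0, 16, (4,)))
```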
{"title":"Annotation-Efficient Hybrid Learning for Temporal Sentence Grounding","authors":"Jianxiang Dong;Zhaozheng Yin","doi":"10.1109/TCSVT.2025.3603110","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3603110","url":null,"abstract":"Temporal Sentence Grounding (TSG) aims at localizing a temporal interval in an untrimmed video that contains the most relevant semantics to a given query sentence. Most existing methods either focus on addressing the problem in a fully-supervised manner where the temporal boundary annotations are provided, or are dedicated to weakly-supervised TSG without any boundary annotations. However, the former ones suffer from expensive annotation cost and the latter ones only give inferior grounding performance. In this paper, we propose an Annotation-efficient Hybrid Learning (AHL) framework that aims to achieve good TSG performance with less annotation cost by leveraging weakly semi-supervised learning, contrastive learning and active learning: (1) AHL includes a progressive pseudo-label self-learning module which generates pseudo labels and progressively selects reliable ones to re-train the model in a progressive manner; (2) AHL includes a novel self-guided contrastive learning method that performs proposal-level contrastive learning based on weakly-labeled data to align the visual and language feature; (3) AHL explores the fully-labeled set construction by gradually expanding it via actively searching on the informative weakly-labeled samples, from the aspects of both difficulty and diversity. We conduct extensive experiments on ActivityNet and Charades-STA datasets and results verify the effectiveness of our proposed AHL to exploit the weakly-labeled data and to achieve the same performance as fully-supervised method, with much less annotation cost. Our code is available at <uri>https://github.com/DJX1995/AHL</uri>","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 2","pages":"2594-2606"},"PeriodicalIF":11.1,"publicationDate":"2025-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146154447","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-08-26 | DOI: 10.1109/TCSVT.2025.3602981
Yuxun Qu;Yongqiang Tang;Chenyang Zhang;Wensheng Zhang
Different from the traditional semi-supervised learning paradigm, which is constrained by the closed-world assumption, Generalized Category Discovery (GCD) presumes that the unlabeled dataset contains new categories that do not appear in the labeled set, and aims not only to classify old categories but also to discover new categories in the unlabeled data. Existing studies on GCD typically focus on transferring general knowledge from a self-supervised pretrained model to the target GCD task via fine-tuning strategies such as partial tuning and prompt learning. Nevertheless, these fine-tuning methods fail to strike a sound balance between the generalization capacity of the pretrained backbone and adaptability to the GCD task. To fill this gap, we propose a novel adapter-tuning-based method named AdaptGCD, which is the first work to introduce adapter tuning into the GCD task and provides key insights expected to inform future research. Furthermore, considering the discrepancy in supervision information between the old and new classes, a multi-expert adapter structure equipped with a route assignment constraint is devised, such that data from old and new classes are separated into different expert groups. Extensive experiments are conducted on 7 widely used datasets. The remarkable performance improvements highlight the efficacy of our proposal, which can also be combined with other advanced methods such as SPTNet for further enhancement.
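The abstract describes adapter tuning and the multi-expert structure only at a high level; below is a minimal sketch of a standard bottleneck adapter and a softly routed bank of such adapters, assuming token features from a frozen ViT backbone. The dimensions, expert count, and soft routing are illustrative choices; the paper's route assignment constraint separating old- and new-class data is not reproduced here.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-project -> nonlinearity -> up-project with a residual path; only
    these few parameters are trained while the backbone stays frozen."""
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)      # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x):                   # x: (batch, tokens, dim)
        return x + self.up(self.act(self.down(x)))

class MultiExpertAdapter(nn.Module):
    """A routed bank of adapters: a light router scores the experts and token
    features pass through a soft mixture of them."""
    def __init__(self, dim=768, bottleneck=64, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            BottleneckAdapter(dim, bottleneck) for _ in range(n_experts))
        self.router = nn.Linear(dim, n_experts)

    def forward(self, x):
        weights = self.router(x).softmax(dim=-1)               # (B, T, E)
        outs = torch.stack([e(x) for e in self.experts], -1)   # (B, T, D, E)
        return torch.einsum('bte,btde->btd', weights, outs)

# Toy usage on a batch of ViT-like token features.
tokens = torch.randn(2, 197, 768)
adapted = MultiExpertAdapter()(tokens)
```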
{"title":"AdaptGCD: Multi-Expert Adapter Tuning for Generalized Category Discovery","authors":"Yuxun Qu;Yongqiang Tang;Chenyang Zhang;Wensheng Zhang","doi":"10.1109/TCSVT.2025.3602981","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3602981","url":null,"abstract":"Different from the traditional semi-supervised learning paradigm that is constrained by the close-world assumption, Generalized Category Discovery (GCD) presumes that the unlabeled dataset contains new categories not appearing in the labeled set, and aims to not only classify old categories but also discover new categories in the unlabeled data. Existing studies on GCD typically devote to transferring the general knowledge from the self-supervised pretrained model to the target GCD task via some fine-tuning strategies, such as partial tuning and prompt learning. Nevertheless, these fine-tuning methods fail to make a sound balance between the generalization capacity of pretrained backbone and the adaptability to the GCD task. To fill this gap, in this paper, we propose a novel adapter-tuning-based method named AdaptGCD, which is the first work to introduce the adapter tuning into the GCD task and provides some key insights expected to enlighten future research. Furthermore, considering the discrepancy of supervision information between the old and new classes, a multi-expert adapter structure equipped with a route assignment constraint is elaborately devised, such that the data from old and new classes are separated into different expert groups. Extensive experiments are conducted on 7 widely-used datasets. The remarkable performance improvements highlight the efficacy of our proposal and it can be also combined with other advanced methods like SPTNet for further enhancement.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 2","pages":"2344-2357"},"PeriodicalIF":11.1,"publicationDate":"2025-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146154408","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-08-26 | DOI: 10.1109/TCSVT.2025.3602970
Zimu Lu;Ning Xu;Hongshuo Tian;Lanjun Wang;An-An Liu
Medical Visual Question Answering (Medical VQA) is an essential task that facilitates the automated interpretation of complex clinical imagery with corresponding textual questions, thereby supporting both clinicians and patients in making informed medical decisions. With the rapid progress of Vision-Language Pretraining (VLP) in general domains, the development of medical VLP models has emerged as a rapidly growing interdisciplinary area at the intersection of artificial intelligence (AI) and healthcare. However, few works have evaluated the adversarial robustness of medical VLP models, which faces two primary challenges: 1) the complexity of medical texts, stemming from specialized terminology, makes it difficult for models to comprehend the text for adversarial attack; 2) the diversity of medical images, arising from the variety of anatomical regions depicted, requires models to determine the critical anatomical regions to attack. In this paper, we propose a novel multimodal adversarial attack generator for evaluating the robustness of medical VLP models. Specifically, to handle the complexity of medical texts, we integrate medical knowledge when crafting text adversarial samples, which facilitates terminology understanding and strengthens the attack; to handle the diversity of medical images, we divide medical images into global and local anatomical regions, whose perturbations are weighted by learned balance weights. Our experimental study not only provides a quantitative understanding of the robustness of medical VLP models, but also underscores the critical need for thorough safety evaluations before deploying them in real-world medical applications.
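For context on how an image-side adversarial perturbation of a vision-language model is typically crafted, the sketch below implements a generic projected-gradient-descent (PGD) attack under an L-infinity budget. `model` and `loss_fn` are placeholders; the knowledge-guided text attack and the learned global/local region weighting described in the paper are not reproduced here.

```python
import torch

def pgd_image_attack(model, image, text_tokens, loss_fn,
                     eps=8 / 255, alpha=2 / 255, steps=10):
    """Generic L-infinity PGD attack on the image input of a vision-language
    model: repeatedly take a signed-gradient ascent step on the task loss and
    project back into the eps-ball around the clean image.
    loss_fn must map the model output to a scalar to be maximized."""
    adv = image.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = loss_fn(model(adv, text_tokens))
        grad, = torch.autograd.grad(loss, adv)
        adv = adv.detach() + alpha * grad.sign()         # ascent step
        adv = image + (adv - image).clamp(-eps, eps)     # project to eps-ball
        adv = adv.clamp(0.0, 1.0)                        # keep valid pixel range
    return adv.detach()
```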
{"title":"Medical VLP Model Is Vulnerable: Toward Multimodal Adversarial Attack on Large Medical Vision-Language Models","authors":"Zimu Lu;Ning Xu;Hongshuo Tian;Lanjun Wang;An-An Liu","doi":"10.1109/TCSVT.2025.3602970","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3602970","url":null,"abstract":"Medical Visual Question Answering (Medical VQA) is an essential task that facilitates the automated interpretation of complex clinical imagery with corresponding textual questions, thereby supporting both clinicians and patients in making informed medical decisions. With the rapid progress of Vision-Language Pretraining (VLP) in general domains, the development of medical VLP models has emerged as a rapidly growing interdisciplinary area at the intersection of artificial intelligence (AI) and healthcare. However, few works have been proposed to evaluate the adversarial robustness of medical VLP models, which faces two primary challenges: 1) the complexity of medical texts, stemming from the presence of terminologies, poses significant challenges for models in comprehending the text for adversarial attack; 2) the diversity of medical images arises from the variety of anatomical regions depicted, which requires models to determine critical anatomical regions for attack. In this paper, we propose a novel multimodal adversarial attack generator for evaluating the robustness of medical VLP models. Specifically, for the complexity of medical texts, we integrate medical knowledge when crafting text adversarial samples, which can facilitate the terminologies understanding and adversarial strength; for the diversity of medical images, we divide the anatomical regions into either global or local regions in medical images, which are determined by learned balance weights for perturbations. Our experimental study not only provides a quantitative understanding in medical VLP models, but also underscores the critical need for thorough safety evaluations before implementing them in real-world medical applications.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 2","pages":"2478-2491"},"PeriodicalIF":11.1,"publicationDate":"2025-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146154464","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-08-25 | DOI: 10.1109/TCSVT.2025.3602506
Hao Wang;Junyan Huo;Fei Yang;Shuai Wan;Gaoxing Chen;Kun Yang;Luis Herranz;Fuzheng Yang
With the growing prevalence of screen content images in multimedia communication, efficient compression has become increasingly crucial. Unlike natural scene images, screen content typically contains rich text regions that exhibit unique characteristics and low correlation with surrounding non-text elements. The intricate mixture of text and non-text within images poses significant challenges for existing learned compression networks, as the text and non-text features are severely entangled in the latent domain along the channel dimension, leading to compromised reconstruction quality and suboptimal entropy estimation. In this paper, we propose a novel Disentangled Image Compression Architecture (DICA) that enhances the analysis module and the entropy model of existing compression architectures to address these limitations. First, we introduce a Disentangled Analysis Module (DAM) by augmenting the original analysis modules with an additional text approximation branch and a disentangling network. These work in concert to disentangle latent features into text and non-text classes along the channel dimension, resulting in a more structured feature distribution that better aligns with compression requirements. Second, we propose a Disentangled Channel-Conditional Entropy Model (DCEM) that efficiently leverages the feature distribution bias introduced by DAM, thereby further improving compression performance. Experimental results demonstrate that the proposed DICA, along with DAM and DCEM, can be integrated into various channel-conditional compression backbones, significantly improving their performance in screen content compression, particularly in hard-to-compress text regions. When integrated with an advanced WACNN backbone, our method achieves a 13% overall BD-Rate gain and a 16% BD-Rate gain in text regions on the SIQAD dataset.
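To make the channel-wise disentanglement concrete, here is a toy sketch in which a quantized latent is split along the channel dimension into text and non-text groups, each coded with its own Gaussian entropy parameters. The split sizes and the unit-Gaussian parameters are placeholders; in DCEM the parameters would come from the learned entropy model rather than being fixed.

```python
import torch

def gaussian_rate_bits(y, mu, sigma):
    """Estimated bits for coding an integer latent y under a Gaussian entropy
    model (mu, sigma), using the usual +/- 0.5 interval approximation."""
    d = torch.distributions.Normal(mu, sigma)
    p = d.cdf(y + 0.5) - d.cdf(y - 0.5)
    return -torch.log2(p.clamp_min(1e-9)).sum()

# Hypothetical latent (B, C, H, W) split into text / non-text channel groups.
y = torch.randn(1, 192, 16, 16).round()
y_text, y_nontext = y.split([64, 128], dim=1)    # assumed 64/128 channel split
bits = (gaussian_rate_bits(y_text, torch.zeros_like(y_text), torch.ones_like(y_text)) +
        gaussian_rate_bits(y_nontext, torch.zeros_like(y_nontext), torch.ones_like(y_nontext)))
```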
{"title":"Text and Non-Text Latent Feature Disentanglement for Screen Content Image Compression","authors":"Hao Wang;Junyan Huo;Fei Yang;Shuai Wan;Gaoxing Chen;Kun Yang;Luis Herranz;Fuzheng Yang","doi":"10.1109/TCSVT.2025.3602506","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3602506","url":null,"abstract":"With the growing prevalence of screen content images in multimedia communication, efficient compression has become increasingly crucial. Unlike natural scene images, screen content typically contains rich text regions that exhibit unique characteristics and low correlation with surrounding non-text elements. The intricate mixture of text and non-text within images poses significant challenges for existing learned compression networks, as the text and non-text features are severely entangled in the latent domain along the channel dimension, leading to compromised reconstruction quality and suboptimal entropy estimation. In this paper, we propose a novel <bold>Disentangled Image Compression Architecture (DICA)</b> that enhances the analysis module and the entropy model of existing compression architectures to address these limitations. First, we introduce a <bold>Disentangled Analysis Module (DAM)</b> by augmenting original analysis modules with an additional text approximation branch and a disentangling network. They work in concert to disentangle latent features into text and non-text classes along the channel dimension, resulting in a more structured feature distribution that better aligns with compression requirements. Second, we propose a Disentangled Channel-Conditional Entropy Model (DCEM) that efficiently leverages the feature distribution bias introduced by DAM, thereby further improving compression performance. Experimental results demonstrate that the proposed DICA, along with DAM and DCEM can be integrated into various channel-conditional compression backbones, significantly improving their performance in screen content compression–particularly in hard-to-compress text regions. When integrated with an advanced WACNN backbone, our method achieves a 13% overall BD-Rate gain and a 16% BD-Rate gain in text regions on the SIQAD dataset.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 2","pages":"2505-2519"},"PeriodicalIF":11.1,"publicationDate":"2025-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146154404","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-08-25 | DOI: 10.1109/TCSVT.2025.3602214
Karama Abdelhedi;Faten Chaabane;Walid Wannes;William Puech;Chokri Ben Amar
Today, the popularity of 3D videos is increasing significantly. This trend can be attributed to their immersive appeal and lifelike experience. In an era dominated by the widespread distribution of digital content, data integrity and ownership are of crucial importance. In this context, the practice of traitor tracing, closely related to Digital Rights Management (DRM), facilitates the identification and tracking of unauthorized users who violate copyright by sharing protected content illegally. In this paper, we introduce an innovative traitor tracing approach for 3D video, with a particular focus on the DIBR (Depth Image-Based Rendering) format, which can be vulnerable to an interleaving attack strategy. For this purpose, we develop a new phylogeny tree construction method designed to combat collusion attacks. Our experimental evaluations demonstrate the effectiveness of the proposed approach, particularly when applied to long fingerprinting codes. Compared to Tardos' approach, our method delivers very good results, even for a large number of colluders.
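The interleaving attack targeted by this method is easy to state concretely: each symbol of the pirated fingerprint is copied from a randomly chosen colluder, so no single colluder's codeword survives intact. The sketch below simulates such an attack on binary fingerprint codewords; it is a toy model of the threat, not the authors' phylogeny-based tracing method.

```python
import numpy as np

def interleaving_attack(colluder_codes, rng=None):
    """Simulate an interleaving collusion attack: each symbol of the pirated
    fingerprint is copied from a randomly chosen colluder.
    colluder_codes: (c, m) array of the c colluders' length-m codewords."""
    rng = np.random.default_rng() if rng is None else rng
    c, m = colluder_codes.shape
    picks = rng.integers(0, c, size=m)          # which colluder supplies each symbol
    return colluder_codes[picks, np.arange(m)]

# Example: three colluders mixing their binary fingerprints.
codes = np.random.default_rng(0).integers(0, 2, size=(3, 20))
pirate_copy = interleaving_attack(codes)
```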
{"title":"Phylogeny-Based Traitor Tracing Method for Interleaving Attacks","authors":"Karama Abdelhedi;Faten Chaabane;Walid Wannes;William Puech;Chokri Ben Amar","doi":"10.1109/TCSVT.2025.3602214","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3602214","url":null,"abstract":"Today, the popularity of 3D videos is increasing significantly. This trend can be attributed to their immersive appeal and lifelike experience. In an era dominated by the widespread distribution of digital content, data integrity, and ownership, all of these elements are of crucial importance. In this context, the practice of traitor tracing, closely related to Digital Rights Management (DRM), facilitates the identification and tracking of unauthorized users who have violated copyright in order to share illegal copyright-protected content. In this paper, we propose a solution to this problem, we introduce an innovative traitor tracing approach focused on 3D video, with a particular focus on the DIBR (Depth Image-Based Rendering) format, which can be vulnerable to an Interleaving attack strategy. For this purpose, we develop a new phylogeny tree construction method designed to combat collusion attacks. Our experimental evaluations demonstrate the effectiveness of our proposed approach particularly when applied to long fingerprinting codes. Compared to Tardos’ approach, our method delivers very good results, even for a large number of colluders.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 2","pages":"2623-2634"},"PeriodicalIF":11.1,"publicationDate":"2025-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146154456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-08-25 | DOI: 10.1109/TCSVT.2025.3602391
Wei Luo;Peng Xing;Yunkang Cao;Haiming Yao;Weiming Shen;Zechao Li
Unsupervised anomaly detection plays a pivotal role in industrial defect inspection and medical image analysis, with most methods relying on a reconstruction framework. However, these methods may suffer from over-generalization, which enables them to reconstruct anomalies well and leads to poor detection performance. To address this issue, instead of focusing solely on normality reconstruction, we propose an innovative Uncertainty-Integrated Anomaly Perception and Restoration Attention Network (URA-Net), which explicitly restores abnormal patterns to their corresponding normality. First, unlike traditional image reconstruction methods, we utilize a pre-trained convolutional neural network to extract multi-level semantic features as the reconstruction target. To help URA-Net learn to restore anomalies, we introduce a novel feature-level artificial anomaly synthesis module that generates anomalous samples for training. Subsequently, a novel uncertainty-integrated anomaly perception module based on Bayesian neural networks is introduced to learn the distributions of anomalous and normal features. This facilitates the estimation of anomalous regions and ambiguous boundaries, laying the foundation for subsequent anomaly restoration. Then, we propose a novel restoration attention mechanism that leverages global normal semantic information to restore detected anomalous regions, thereby obtaining defect-free restored features. Finally, we employ residual maps between the input features and the restored features for anomaly detection and localization. The comprehensive experimental results on two industrial datasets, MVTec AD and BTAD, along with a medical image dataset, OCT-2017, unequivocally demonstrate the effectiveness and superiority of the proposed method.
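The final scoring step, residual maps between input and restored features, can be sketched generically as below: a per-level cosine distance between the two feature sets, upsampled to image resolution and accumulated. The cosine distance and bilinear upsampling are common choices assumed for illustration, not necessarily the exact residual definition used in URA-Net.

```python
import torch
import torch.nn.functional as F

def anomaly_map_from_features(feats_in, feats_restored, image_size):
    """Illustrative anomaly localization from multi-level feature residuals.
    feats_in / feats_restored: lists of (B, C, H, W) tensors at several levels.
    Returns a (B, 1, *image_size) map; higher values mean more anomalous."""
    score = 0.0
    for f_in, f_rec in zip(feats_in, feats_restored):
        res = 1.0 - F.cosine_similarity(f_in, f_rec, dim=1, eps=1e-6)  # (B, H, W)
        score = score + F.interpolate(res.unsqueeze(1), size=image_size,
                                      mode='bilinear', align_corners=False)
    return score

# Toy usage with two feature levels of random tensors.
amap = anomaly_map_from_features(
    [torch.randn(1, 64, 56, 56), torch.randn(1, 128, 28, 28)],
    [torch.randn(1, 64, 56, 56), torch.randn(1, 128, 28, 28)],
    image_size=(224, 224))
```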
{"title":"URA-Net: Uncertainty-Integrated Anomaly Perception and Restoration Attention Network for Unsupervised Anomaly Detection","authors":"Wei Luo;Peng Xing;Yunkang Cao;Haiming Yao;Weiming Shen;Zechao Li","doi":"10.1109/TCSVT.2025.3602391","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3602391","url":null,"abstract":"Unsupervised anomaly detection plays a pivotal role in industrial defect inspection and medical image analysis, with most methods relying on the reconstruction framework. However, these methods may suffer from over-generalization, enabling them to reconstruct anomalies well, which leads to poor detection performance. To address this issue, instead of focusing solely on normality reconstruction, we propose an innovative Uncertainty-Integrated Anomaly Perception and Restoration Attention Network (URA-Net), which explicitly restores abnormal patterns to their corresponding normality. First, unlike traditional image reconstruction methods, we utilize a pre-trained convolutional neural network to extract multi-level semantic features as the reconstruction target. To assist the URA-Net learning to restore anomalies, we introduce a novel feature-level artificial anomaly synthesis module to generate anomalous samples for training. Subsequently, a novel uncertainty-integrated anomaly perception module based on Bayesian neural networks is introduced to learn the distributions of anomalous and normal features. This facilitates the estimation of anomalous regions and ambiguous boundaries, laying the foundation for subsequent anomaly restoration. Then, we propose a novel restoration attention mechanism that leverages global normal semantic information to restore detected anomalous regions, thereby obtaining defect-free restored features. Finally, we employ residual maps between input features and restored features for anomaly detection and localization. The comprehensive experimental results on two industrial datasets, MVTec AD and BTAD, along with a medical image dataset, OCT-2017, unequivocally demonstrate the effectiveness and superiority of the proposed method.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 2","pages":"2464-2477"},"PeriodicalIF":11.1,"publicationDate":"2025-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146154448","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-08-22 | DOI: 10.1109/TCSVT.2025.3601718
Shuai Huo;Hewei Liu;Jiawen Gu;Dengchao Jin;Meng Lei;Bo Huang;Chao Zhou
The optimization of block-level quantization parameters (QPs) is critical to improving the performance of practical block-based video encoders, but the extremely large optimization space makes it challenging to solve. Existing solutions, e.g., the HEVC encoder x265, usually impose optimization constraints such as a block-independence assumption and a linear distortion propagation model, which limits the achievable compression efficiency to a certain extent. To address this problem, a deep learning-based, encoder-only adaptive quantization method (DAQ) is proposed in this paper, in which a deep network is designed to adaptively model the joint temporal propagation relationship of quantization among blocks. Specifically, DAQ consists of two phases: in the training phase, considering the heavy search cost of the traditional codec, we introduce a well-designed end-to-end learned block-based video compression network as an effective training proxy for the deep encoder-side network. In the deployment phase, the trained deep network jointly predicts all block QPs in a frame for the traditional encoder. Moreover, our network is deployed only on the encoder side, leaves the standard decoder unchanged, and has very low inference complexity, making it practical to apply. Finally, we deploy DAQ in HEVC and VVC encoders for performance comparison, and the experimental results demonstrate that DAQ significantly outperforms the widely used x265 encoder, with average BD-rate reductions of 15.0% and 10.9% under SSIM and PSNR, respectively, and also achieves 12.5% and 5.0% coding gains over VTM. More broadly, for deploying deep video codecs in practice, this work provides new insight into optimizing encoder parameters over a large search space.
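As a toy illustration of an encoder-only adaptive quantization setup, the sketch below defines a small network that maps a frame to one bounded QP offset per 64x64 block; such a block-level QP map would then be handed to a standard encoder. The architecture, the offset range, and the absence of temporal inputs are illustrative simplifications, not the DAQ network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlockQPPredictor(nn.Module):
    """Toy encoder-side network producing one QP offset per 64x64 block."""
    def __init__(self, max_offset=6.0):
        super().__init__()
        self.max_offset = max_offset
        self.backbone = nn.Sequential(                  # four stride-2 stages: /16
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 3, stride=2, padding=1))

    def forward(self, frame):                           # frame: (B, 3, H, W), H, W % 64 == 0
        x = self.backbone(frame)                        # (B, 1, H/16, W/16)
        x = F.avg_pool2d(x, 4)                          # one value per 64x64 block
        return self.max_offset * torch.tanh(x)          # bounded per-block QP offsets

# Example: a frame padded to multiples of 64 (1088 x 1920) -> a 17 x 30 offset map.
offsets = BlockQPPredictor()(torch.randn(1, 3, 1088, 1920))
```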
{"title":"Deep Network-Based Adaptive Quantization for Practical Video Coding","authors":"Shuai Huo;Hewei Liu;Jiawen Gu;Dengchao Jin;Meng Lei;Bo Huang;Chao Zhou","doi":"10.1109/TCSVT.2025.3601718","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3601718","url":null,"abstract":"The optimization of block-level quantization parameters (QP) is critical to improving the performance of practical block-based video compression encoders, but the extremely large optimization space makes it challenging to solve. Existing solutions, e.g. HEVC encoder x265, usually add some optimization constraints of the block-independent assumption and linear distortion propagation model, which limits compression efficiency improvement to a certain extent. To address this problem, a deep learning-based encoder-only adaptive quantization method (DAQ) is proposed in this paper, where a deep network is designed to adaptively model the joint temporal propagation relationship of quantization among blocks. Specifically, DAQ consists of two phases: in the training phase, considering the heavy searching cost of the traditional codec, we introduce a well-designed end-to-end learned block-based video compression network as an effective training proxy tool for the deep encoder-side network. While in the deployment phase, the trained deep network is applied to jointly predict all block QPs in a frame for the traditional encoder. Besides, our network deploys only on the encoder side without changing the standard decoder and has very low inference complexity, making it able to apply in practice. At last, we deploy DAQ in HEVC and VVC encoder for performance comparison, and the experimental results demonstrate that DAQ significantly outperforms practically used x265 with on average 15.0%, 10.9% BD-rate reduction under the SSIM and PSNR, and also achieves 12.5%, 5.0% coding gain than VTM. Moreover, for deploying deep video codec in practice, this work provides a new insight for optimizing the encoder parameters with a large space.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 2","pages":"2538-2550"},"PeriodicalIF":11.1,"publicationDate":"2025-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146154444","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-08-20 | DOI: 10.1109/TCSVT.2025.3600881
Zhenrong Zhang;Jianan Liu;Yuxuan Xia;Tao Huang;Qing-Long Han;Hongbin Liu
Online Multi-Object Tracking (MOT) plays a pivotal role in autonomous systems. State-of-the-art approaches usually follow the tracking-by-detection paradigm, in which data association plays a critical role. This paper proposes a learning and graph-optimized (LEGO) modular tracker to improve data association performance over the existing literature. The proposed LEGO tracker integrates graph optimization, which efficiently formulates the association score map and facilitates accurate and efficient matching of objects across time frames. A Kalman filter is added to ensure consistent tracking by incorporating temporal coherence into the object states, further enhancing the state update process. Our proposed method, using LiDAR alone, has shown exceptional performance compared to other online tracking approaches, including LiDAR-based and LiDAR-camera fusion-based methods. LEGO ranked 3rd among all trackers (both online and offline) and 2nd among all online trackers on the KITTI MOT benchmark for cars (https://www.cvlibs.net/datasets/kitti/eval_tracking.php) at the time the results were submitted to the KITTI object tracking evaluation board. Moreover, our method also achieves competitive performance on the Waymo Open Dataset benchmark.
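The Kalman filter component mentioned above is standard; the sketch below shows a constant-velocity filter for a single tracked object's 3D center, the usual way temporal coherence is injected into the state update. The time step and noise covariances are arbitrary placeholders, and the real tracker's full state (e.g., box size, heading) is omitted.

```python
import numpy as np

class ConstantVelocityKalman:
    """Minimal constant-velocity Kalman filter over a 3D object center."""
    def __init__(self, xyz, dt=0.1):
        self.x = np.hstack([xyz, np.zeros(3)])       # state: [position, velocity]
        self.P = np.eye(6)                           # state covariance
        self.F = np.eye(6)
        self.F[:3, 3:] = dt * np.eye(3)              # constant-velocity motion model
        self.H = np.hstack([np.eye(3), np.zeros((3, 3))])   # we only measure position
        self.Q = 0.01 * np.eye(6)                    # process noise (placeholder)
        self.R = 0.1 * np.eye(3)                     # measurement noise (placeholder)

    def predict(self):
        """Propagate the state to the next frame; returns the predicted center."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:3]

    def update(self, z):
        """Fuse an associated detection z = (x, y, z) into the state."""
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)     # Kalman gain
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(6) - K @ self.H) @ self.P

# Toy usage: predict, then update with a nearby detection.
kf = ConstantVelocityKalman(np.array([10.0, 2.0, 0.5]))
kf.predict()
kf.update(np.array([10.2, 2.1, 0.5]))
```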
{"title":"LEGO: Learning and Graph-Optimized Modular Tracker for Online Multi-Object Tracking With Point Clouds","authors":"Zhenrong Zhang;Jianan Liu;Yuxuan Xia;Tao Huang;Qing-Long Han;Hongbin Liu","doi":"10.1109/TCSVT.2025.3600881","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3600881","url":null,"abstract":"Online Multi-Object Tracking (MOT) plays a pivotal role in autonomous systems. The state-of-the-art approaches usually employ a tracking-by-detection method, and data association plays a critical role. This paper proposes a learning and graph-optimized (LEGO) modular tracker to improve data association performance in the existing literature. The proposed LEGO tracker integrates graph optimization, which efficiently formulates the association score map, facilitating the accurate and efficient matching of objects across time frames. To further enhance the state update process, the Kalman filter is added to ensure consistent tracking by incorporating temporal coherence in the object states to further enhance the state update process. Our proposed method, utilising LiDAR alone, has shown exceptional performance compared to other online tracking approaches, including LiDAR-based and LiDAR-camera fusion-based methods. LEGO ranked <inline-formula> <tex-math>$3^{rd}$ </tex-math></inline-formula> among all trackers (both online and offline) and <inline-formula> <tex-math>$2^{nd}$ </tex-math></inline-formula> among all online trackers in the KITTI MOT benchmark for cars, (<uri>https://www.cvlibs.net/datasets/kitti/eval_tracking.php</uri>) at the time of submitting results to KITTI object tracking evaluation ranking board. Moreover, our method also achieves competitive performance on the Waymo open dataset benchmark.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 2","pages":"2419-2432"},"PeriodicalIF":11.1,"publicationDate":"2025-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146154399","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-08-18 | DOI: 10.1109/TCSVT.2025.3599239
Feng Xing;Yingwen Zhang;Meng Wang;Hengyu Man;Yongbing Zhang;Shiqi Wang;Xiaopeng Fan;Wen Gao
Template-generated videos have recently become significantly more popular on social media platforms. In general, videos from the same template share similar temporal characteristics, which are unfortunately ignored in current compression schemes. In view of this, we examine how such temporal priors from templates can be effectively utilized when compressing template-generated videos. First, a comprehensive statistical analysis is conducted, revealing that coding decisions, including the merge, non-affine, and motion information, are strongly correlated across template-generated videos. Subsequently, leveraging such correlations as prior knowledge, a simple yet effective prior-driven compression scheme for template-generated videos is proposed. In particular, a mode decision pruning algorithm is devised to dynamically skip unnecessary advanced motion vector prediction (AMVP) or affine AMVP decisions. Moreover, an improved AMVP motion estimation algorithm is applied to further accelerate reference frame selection and the motion estimation process. Experimental results on the versatile video coding (VVC) platform VTM-23.0 demonstrate that the proposed scheme achieves moderate time reductions of 14.31% and 14.99% under the Low-Delay P (LDP) and Low-Delay B (LDB) configurations, respectively, while maintaining negligible Bjøntegaard Delta Rate (BD-Rate) increases of 0.15% and 0.18%, respectively.
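The BD-Rate figures quoted above follow the standard Bjøntegaard metric; for readers unfamiliar with it, the sketch below computes BD-Rate from four rate-PSNR points per curve using the customary cubic fit of log-rate against quality. The sample rate and PSNR values are made up for illustration.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjoentegaard Delta Rate: average bitrate difference (%) between a test
    and an anchor rate-distortion curve over their common quality range."""
    la, lt = np.log(rate_anchor), np.log(rate_test)
    pa = np.polyfit(psnr_anchor, la, 3)      # log-rate as a cubic in PSNR
    pt = np.polyfit(psnr_test, lt, 3)
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(pa), hi) - np.polyval(np.polyint(pa), lo)
    int_t = np.polyval(np.polyint(pt), hi) - np.polyval(np.polyint(pt), lo)
    avg_log_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_log_diff) - 1.0) * 100.0   # percent rate change (negative = savings)

# Example with four rate points per curve (illustrative numbers only).
print(bd_rate([1000, 1800, 3000, 5000], [34.0, 36.1, 38.0, 39.8],
              [ 980, 1750, 2920, 4850], [34.0, 36.1, 38.0, 39.8]))
```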
{"title":"Mining Temporal Priors for Template-Generated Video Compression","authors":"Feng Xing;Yingwen Zhang;Meng Wang;Hengyu Man;Yongbing Zhang;Shiqi Wang;Xiaopeng Fan;Wen Gao","doi":"10.1109/TCSVT.2025.3599239","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3599239","url":null,"abstract":"The popularity of template-generated videos has recently experienced a significant increase on social media platforms. In general, videos from the same template share similar temporal characteristics, which are unfortunately ignored in the current compression schemes. In view of this, we aim to examine how such temporal priors from templates can be effectively utilized during the compression process for template-generated videos. First, a comprehensive statistical analysis is conducted, revealing that the coding decisions, including the merge, non-affine, and motion information, across template-generated videos are strongly correlated. Subsequently, leveraging such correlations as prior knowledge, a simple yet effective prior-driven compression scheme for template-generated videos is proposed. In particular, a mode decision pruning algorithm is devised to dynamically skip unnecessarily advanced motion vector prediction (AMVP) or affine AMVP decisions. Moreover, an improved AMVP motion estimation algorithm is applied to further accelerate reference frame selection and the motion estimation process. Experimental results on the versatile video coding (VVC) platform VTM-23.0 demonstrate that the proposed scheme achieves moderate time reductions of 14.31% and 14.99% under the Low-Delay P (LDP) and Low-Delay B (LDB) configurations, respectively, while maintaining negligible increases in Bjøntegaard Delta Rate (BD-Rate) of 0.15% and 0.18%, respectively.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 1","pages":"1160-1172"},"PeriodicalIF":11.1,"publicationDate":"2025-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146049298","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}