
Latest Publications: IEEE Transactions on Circuits and Systems for Video Technology

DRFC: An End-to-End Deep Dynamic RF Signal Compression Framework
IF 11.1 | Zone 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-08-07 | DOI: 10.1109/TCSVT.2025.3596840
Xihua Sheng;Peilin Chen;Shiqi Wang;Dapeng Oliver Wu
Radio frequency (RF) signals have gained widespread adoption in intelligent perception systems due to their unique advantages, including non-line-of-sight propagation capability, robustness in low-light environments, and inherent privacy preservation. However, their substantial data volumes, generated by the dual-polarization direction characteristic, result in significant challenges to data storage and transmission. To address this, we propose the first end-to-end deep dynamic RF signal compression (DRFC) framework, which primarily focuses on exploiting cross-directional correlation in dynamic RF signals. The proposed framework incorporates four key innovations: (1) a mask-guided RF motion estimation module that leverages Doppler shifts and electromagnetic noise characteristics to identify regions of significant motion using a threshold-based mask, significantly improving motion estimation accuracy; (2) a cross-directional RF motion entropy model that utilizes cross-directional RF motion latent priors to refine the probability distribution for motion entropy coding; (3) a cross-directional RF context mining module that predicts RF contexts from temporal and cross-directional reference signals, adaptively fusing these contexts with confidence maps to maximize complementary information utilization; and (4) a cross-directional RF contextual entropy model that incorporates cross-directional RF contextual latent priors to optimize contextual entropy modeling. Experimental results demonstrate the superiority of our framework over existing codecs. Our DRFC framework achieves significant bitrate savings on benchmark datasets, establishing a strong baseline for future research in this field.
{"title":"DRFC: An End-to-End Deep Dynamic RF Signal Compression Framework","authors":"Xihua Sheng;Peilin Chen;Shiqi Wang;Dapeng Oliver Wu","doi":"10.1109/TCSVT.2025.3596840","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3596840","url":null,"abstract":"Radio frequency (RF) signals have gained widespread adoption in intelligent perception systems due to their unique advantages, including non-line-of-sight propagation capability, robustness in low-light environments, and inherent privacy preservation. However, their substantial data volumes, generated by the dual-polarization direction characteristic, result in significant challenges to data storage and transmission. To address this, we propose the first end-to-end deep dynamic RF signal compression (DRFC) framework, which primarily focuses on exploiting cross-directional correlation in dynamic RF signals. The proposed framework incorporates four key innovations: (1) a mask-guided RF motion estimation module that leverages Doppler shifts and electromagnetic noise characteristics to identify regions of significant motion using a threshold-based mask, significantly improving motion estimation accuracy; (2) a cross-directional RF motion entropy model that utilizes cross-directional RF motion latent priors to refine the probability distribution for motion entropy coding; (3) a cross-directional RF context mining module that predicts RF contexts from temporal and cross-directional reference signals, adaptively fusing these contexts with confidence maps to maximize complementary information utilization; and (4) a cross-directional RF contextual entropy model that incorporates cross-directional RF contextual latent priors to optimize contextual entropy modeling. Experimental results demonstrate the superiority of our framework over existing codecs. Our DRFC framework achieves significant bitrate savings on benchmark datasets, establishing a strong baseline for future research in this field.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 1","pages":"1104-1116"},"PeriodicalIF":11.1,"publicationDate":"2025-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146049301","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
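As an editorial aside on the DRFC abstract above: the paper's mask-guided motion estimation module is not reproduced here, but the basic idea of a threshold-based motion mask can be illustrated in a few lines. The sketch below (plain NumPy, with the array shapes, normalization rule, and threshold value chosen purely for illustration) flags cells of an RF heat-map sequence whose magnitude changes noticeably between consecutive frames.

```python
import numpy as np

def motion_mask(prev_frame: np.ndarray, curr_frame: np.ndarray,
                threshold: float = 0.1) -> np.ndarray:
    """Flag cells whose magnitude changed noticeably between two RF frames.

    prev_frame, curr_frame: (H, W) magnitude maps for one polarization
    direction. Returns a boolean mask of "significant motion" cells.
    Illustrative stand-in only; not the DRFC module itself.
    """
    diff = np.abs(curr_frame - prev_frame)
    # Normalize by the current frame's dynamic range so the threshold is
    # scale-free (an assumption made for this sketch, not from the paper).
    scale = curr_frame.max() - curr_frame.min() + 1e-8
    return (diff / scale) > threshold

# Toy usage with random data standing in for two consecutive RF frames.
rng = np.random.default_rng(0)
prev = rng.random((64, 64))
curr = prev.copy()
curr[20:30, 20:30] += 0.5   # simulate a moving reflector
print(int(motion_mask(prev, curr).sum()), "cells flagged")
```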
IEEE Circuits and Systems Society Information
IF 11.1 | Zone 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-08-05 | DOI: 10.1109/TCSVT.2025.3592055
{"title":"IEEE Circuits and Systems Society Information","authors":"","doi":"10.1109/TCSVT.2025.3592055","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3592055","url":null,"abstract":"","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 8","pages":"C3-C3"},"PeriodicalIF":11.1,"publicationDate":"2025-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11114434","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144782148","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Pixel-Level Just Noticeable Difference in Sonar Images: Modeling and Applications
IF 11.1 | Zone 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-08-05 | DOI: 10.1109/TCSVT.2025.3596153
Weiling Chen;Weiming Lin;Qianxue Feng;Rongxin Zhang;Tiesong Zhao
Sonar images are vital in ocean explorations but face transmission challenges due to limited bandwidth and unstable channels. The Just Noticeable Difference (JND) represents the minimum distortion detectable by human observers. By eliminating perceptual redundancy, JND offers a solution for efficient compression and accurate Image Quality Assessment (IQA) to enable reliable transmission. However, existing JND models prove inadequate for sonar images due to their unique redundancy distributions and the absence of pixel-level annotated data. To bridge these gaps, we propose the first sonar-specific, picture-level JND dataset and a weakly supervised JND model that infers pixel-level JND from picture-level annotations. Our approach starts with pretraining a perceptually lossy/lossless predictor, which collaborates with sonar image properties to drive an unsupervised generator producing Critically Distorted Images (CDIs). These CDIs maximize pixel differences while preserving perceptual fidelity, enabling precise JND map derivation. Furthermore, we systematically investigate JND-guided optimization for sonar image compression and IQA algorithms, demonstrating favorable performance enhancements.
{"title":"Pixel-Level Just Noticeable Difference in Sonar Images: Modeling and Applications","authors":"Weiling Chen;Weiming Lin;Qianxue Feng;Rongxin Zhang;Tiesong Zhao","doi":"10.1109/TCSVT.2025.3596153","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3596153","url":null,"abstract":"Sonar images are vital in ocean explorations but face transmission challenges due to limited bandwidth and unstable channels. The Just Noticeable Difference (JND) represents the minimum distortion detectable by human observers. By eliminating perceptual redundancy, JND offers a solution for efficient compression and accurate Image Quality Assessment (IQA) to enable reliable transmission. However, existing JND models prove inadequate for sonar images due to their unique redundancy distributions and the absence of pixel-level annotated data. To bridge these gaps, we propose the first sonar-specific, picture-level JND dataset and a weakly supervised JND model that infers pixel-level JND from picture-level annotations. Our approach starts with pretraining a perceptually lossy/lossless predictor, which collaborates with sonar image properties to drive an unsupervised generator producing Critically Distorted Images (CDIs). These CDIs maximize pixel differences while preserving perceptual fidelity, enabling precise JND map derivation. Furthermore, we systematically investigate JND-guided optimization for sonar image compression and IQA algorithms, demonstrating favorable performance enhancements.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 1","pages":"1173-1184"},"PeriodicalIF":11.1,"publicationDate":"2025-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146049273","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
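A brief note on the JND concept used in the abstract above: a pixel-level JND map gives, for each pixel, the largest distortion assumed to be invisible to a human observer. The sketch below is not the paper's model; assuming such a map is already available, it only shows two typical ways it is consumed, checking whether a distorted sonar image is perceptually lossless and zeroing residuals that fall below the threshold before compression.

```python
import numpy as np

def perceptually_lossless(original: np.ndarray, distorted: np.ndarray,
                          jnd_map: np.ndarray) -> bool:
    """True if no pixel's distortion exceeds its per-pixel JND threshold."""
    return bool(np.all(np.abs(distorted - original) <= jnd_map))

def suppress_imperceptible_residual(original, distorted, jnd_map):
    """Zero residuals below the JND threshold (a common JND-guided
    pre-processing step; this particular rule is illustrative only)."""
    residual = distorted - original
    residual[np.abs(residual) <= jnd_map] = 0.0
    return original + residual

rng = np.random.default_rng(1)
sonar = rng.random((32, 32))
noisy = sonar + rng.normal(scale=0.02, size=sonar.shape)
jnd = np.full_like(sonar, 0.05)   # toy constant JND map
print(perceptually_lossless(sonar, noisy, jnd))
```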
A Comprehensive Survey on Video Summarization: Challenges and Advances
IF 11.1 | Zone 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-08-05 | DOI: 10.1109/TCSVT.2025.3596006
Hongxi Li;Yubo Zhu;Zirui Shang;Ziyi Wang;Xinxiao Wu
Video data is growing exponentially daily due to the popularity of video-sharing platforms and the proliferation of video capture devices. The video summarization task has been proposed to remove redundancy while maintaining as many critical parts of the video as possible so that users can browse and process videos more effectively, which has received increasing attention from researchers. The existing research addresses the challenges faced by video summarization methods from various perspectives, such as temporal dependency, data scarcity, user preference, and high precision. This paper reviews representative and state-of-the-art methods, analyzes recent research advances, datasets, and performance evaluations, and discusses future directions. We hope this survey can help future research explore the potential directions of video summarization methods.
{"title":"A Comprehensive Survey on Video Summarization: Challenges and Advances","authors":"Hongxi Li;Yubo Zhu;Zirui Shang;Ziyi Wang;Xinxiao Wu","doi":"10.1109/TCSVT.2025.3596006","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3596006","url":null,"abstract":"Video data is growing exponentially daily due to the popularity of video-sharing platforms and the proliferation of video capture devices. The video summarization task has been proposed to remove redundancy while maintaining as many critical parts of the video as possible so that users can browse and process videos more effectively, which has received increasing attention from researchers. The existing research addresses the challenges faced by video summarization methods from various perspectives, such as temporal dependency, data scarcity, user preference, and high precision. This paper reviews representative and state-of-the-art methods, analyzes recent research advances, datasets, and performance evaluations, and discusses future directions. We hope this survey can help future research explore the potential directions of video summarization methods.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 1","pages":"1216-1233"},"PeriodicalIF":11.1,"publicationDate":"2025-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146049295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
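The survey above is an overview rather than a single method, but the core objective it describes (dropping redundant frames while keeping the critical ones) can be made concrete with a deliberately simplistic baseline. The sketch below keeps a frame only when it differs enough from the last kept frame; the learned summarizers reviewed in the survey replace this pixel-difference score with learned importance scores, and none of them is reproduced here.

```python
import numpy as np

def greedy_keyframes(frames: np.ndarray, min_change: float = 0.1) -> list:
    """Indices of frames that differ enough from the last kept frame.

    frames: (T, H, W) grayscale video. "Enough" means a mean absolute
    pixel difference above min_change, a toy importance score used only
    to illustrate redundancy removal.
    """
    kept = [0]
    for t in range(1, len(frames)):
        if np.mean(np.abs(frames[t] - frames[kept[-1]])) > min_change:
            kept.append(t)
    return kept

rng = np.random.default_rng(2)
video = np.repeat(rng.random((4, 24, 24)), 10, axis=0)  # 4 scenes x 10 frames
print(greedy_keyframes(video))  # roughly one kept frame per scene
```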
Hyperspectral Image Compression With Spectral-Spatial Coupling and Group-Wise Context Modeling
IF 11.1 | Zone 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-08-05 | DOI: 10.1109/TCSVT.2025.3596061
Wei Wei;Chenxu Zhao;Shuyi Zhao;Lei Zhang;Yanning Zhang
The rich spectral information within hyperspectral images (HSIs) results in large data volumes. Thus, finding a compact representation for HSIs while maintaining reconstruction quality is a fundamental task for numerous applications. Although existing learning-based compression methods and context models have shown strong rate-distortion (RD) performance, they attend only to spatial redundancy and ignore the spectral redundancy of HSIs, which impedes further improvement of their performance on HSIs. Moreover, the strictly sequential autoregressive nature of context models leads to inefficiency, further limiting their practical applications. In this paper, leveraging the spectral priors unique to HSIs, we propose a hybrid Transformer-CNN architecture to find compact latent representations of HSIs. Specifically, we construct a Spectral-Spatial Coupling Transformer Group (SSCTG) to cooperatively extract spatial and spectral features of HSIs. Additionally, we propose a Group-wise Context Model (GCM) to further enhance the parallel processing capability of autoregression within context models, significantly improving coding efficiency. Extensive experiments demonstrate the effectiveness of the proposed method, which achieves superior RD performance compared to state-of-the-art methods while maintaining high codec efficiency.
{"title":"Hyperspectral Image Compression With Spectral-Spatial Coupling and Group-Wise Context Modeling","authors":"Wei Wei;Chenxu Zhao;Shuyi Zhao;Lei Zhang;Yanning Zhang","doi":"10.1109/TCSVT.2025.3596061","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3596061","url":null,"abstract":"The rich spectral information within hyperspectral images (HSIs) results in large data volumes. Thus finding a compact representation for HSIs while maintaining reconstruction quality is a fundamental task for numerous applications. Though the existing learning-based compression methods and context models have shown strong rate-distortion (RD) performance, these methods only pay their attention on spatial redundancy without considering the spectral redundancy of HSIs, which thus impedes further improvement of their performance on HSI. Moreover, the strictly sequential autoregressive nature of context models leads to inefficiency, further limiting their practical applications. In this paper, leveraging the spectral priors unique to HSIs, we propose a hybrid Transformer-CNN architecture to find compact latent representations of HSIs. In specific, we construct Spectral-Spatial Coupling Transformer Group (SSCTG) to cooperatively extract spatial and spectral features of HSIs. Additionally, we propose Group-wise Context Model (GCM) to further enhance the parallel processing capability of autoregression within context models, significantly improving the coding efficiency. Extensive experiments demonstrate the effectiveness of the proposed method, achieving superior RD performance compared to state-of-the-art methods while maintaining high efficiency of codecs.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 1","pages":"1130-1142"},"PeriodicalIF":11.1,"publicationDate":"2025-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146049293","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
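For readers outside the compression field, the "rate-distortion (RD) performance" optimized in the abstract above refers to the standard learned-compression objective: a weighted sum of the estimated bitrate of the quantized latents and the reconstruction distortion. The sketch below spells out that generic objective on toy data; the paper's Transformer-CNN architecture, SSCTG, and group-wise context model are not reproduced, and all shapes and values are illustrative.

```python
import numpy as np

def rate_distortion_loss(likelihoods: np.ndarray, original: np.ndarray,
                         reconstruction: np.ndarray, lam: float = 0.01) -> float:
    """Generic learned-compression objective R + lam * D (illustrative).

    likelihoods: probabilities the entropy model assigns to the quantized
    latent symbols; their negative log2 sum divided by the pixel count
    estimates the bitrate in bits per pixel. Distortion is plain MSE.
    """
    rate_bpp = -np.sum(np.log2(likelihoods + 1e-12)) / original.size
    distortion = np.mean((original - reconstruction) ** 2)
    return float(rate_bpp + lam * distortion)

rng = np.random.default_rng(3)
hsi = rng.random((8, 16, 16))                    # toy (bands, H, W) cube
recon = hsi + rng.normal(scale=0.01, size=hsi.shape)
latent_probs = rng.uniform(0.2, 0.9, size=512)   # toy entropy-model outputs
print(round(rate_distortion_loss(latent_probs, hsi, recon), 4))
```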
HAhb-KG: Hierarchical Augmented Knowledge Graph for Human Behavior Assisting Cross-Modal Learning Action Detection
IF 11.1 | Zone 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-08-04 | DOI: 10.1109/TCSVT.2025.3595145
Xiaochen Wang;Dehui Kong;Jinghua Li;Jing Wang;Baocai Yin
Action detection in untrimmed, densely annotated video datasets is a challenging task due to the presence of composite actions and co-occurring actions in videos. To facilitate action detection in such intricate scenarios, the two most important clues are leveraging ample prior information from the data and comprehending the context of actions in the video. Specifically, the co-occurrence probability of actions can effectively capture the temporal relationships and associations among actions, aiding the model in recognizing multiple actions occurring simultaneously. Additionally, aggregating action information from different levels of the data into a comprehensive graph and describing human actions from various semantic layers can significantly reduce ambiguities in action detection. Based on this, a novel knowledge graph, the Hierarchical Augmented Knowledge Graph for human behavior (HAhb-KG), is proposed, which brings together action-related prior knowledge on different levels into a unified hierarchical graph. The graph describes human behavior from various semantic aspects by defining diversified graph nodes, and augments the nodes and relationships with corresponding images and co-occurrence probabilities, respectively, to introduce textual modality information and weigh the associations between actions. To mine the knowledge related to the input video in the knowledge graph, an HAhb-KG-oriented knowledge understanding framework is proposed to embed multi-modal knowledge as a valuable supplement to visual information. Built on this framework, a cross-modal learning action detection model is designed to achieve high accuracy in action detection tasks, which validates the effectiveness of HAhb-KG. Our method achieves gains of 1.45 mAP and 2.28 mAP in action detection experiments on the Charades and TSU datasets, respectively, showing that the proposed method outperforms existing knowledge-based action detection methods.
{"title":"HAhb-KG: Hierarchical Augmented Knowledge Graph for Human Behavior Assisting Cross-Modal Learning Action Detection","authors":"Xiaochen Wang;Dehui Kong;Jinghua Li;Jing Wang;Baocai Yin","doi":"10.1109/TCSVT.2025.3595145","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3595145","url":null,"abstract":"Action detection in untrimmed, densely annotated video datasets is a challenging task due to the presence of composite actions and co-occurring actions in videos. To facilitate action detection in such intricate scenarios, leveraging ample prior information from the data and comprehending the context of actions in the video are the most important two clues. Specifically, the co-occurrence probability of actions can effectively capture the temporal relationships and associations among actions, aiding the model in recognizing multiple actions occurring simultaneously. Additionally, aggregating action information from different levels of the data into a comprehensive graph and describing human actions from various semantic layers can significantly reduce ambiguities in action detection. Based on this, a novel knowledge graph, Hierarchical Augmented Knowledge Graph for human behaviour (HAhb-KG), is proposed, which brings together action-related prior knowledge on different levels into a unified hierarchical graph. The graph describes human behaviour from various semantic aspects by defining diversified graph nodes, and augments the nodes and relationships with corresponding images and probability of co-occurrence respectively, to introduce textual modality information and weigh the associations between actions. In order to mine the knowledge related to the input video in the knowledge graph, HAhb-KG oriented knowledge understanding framework is proposed to embed multi-modal knowledge as a valuable supplement to visual information. Incorporated with the framework, a cross-modal learning action detection model is designed to achieve high accuracy in action detection tasks, which validates the effectiveness of HAhb-KG. Our method achieves gains of 1.45(mAP) and 2.28(mAP) in action detection experiments on the Charades and TSU datasets, respectively, which show that the proposed method outperforms existing knowledge-based action detection methods.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 1","pages":"1045-1060"},"PeriodicalIF":11.1,"publicationDate":"2025-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146082060","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
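The co-occurrence probability that HAhb-KG uses to weight action relationships is, at its core, a statistic over the training annotations. The sketch below computes a toy version of it, the conditional probability that action j appears in a clip given that action i does, from a binary multi-label matrix; the knowledge-graph construction and cross-modal detection model themselves are not reproduced, and the interpretation as edge weights is an assumption made only for illustration.

```python
import numpy as np

def cooccurrence_probs(labels: np.ndarray) -> np.ndarray:
    """Estimate P(action j present | action i present) from annotations.

    labels: (num_clips, num_actions) binary multi-label matrix.
    Returns a (num_actions, num_actions) matrix whose row i holds the
    conditional probabilities, used here as illustrative edge weights.
    """
    counts = labels.T @ labels                    # joint occurrence counts
    per_action = np.diag(counts).astype(float)    # how often each action occurs
    return counts / np.maximum(per_action[:, None], 1.0)

labels = np.array([[1, 1, 0],   # clip 1: actions 0 and 1 co-occur
                   [1, 1, 0],
                   [0, 0, 1],
                   [1, 0, 1]])
print(np.round(cooccurrence_probs(labels), 2))
```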
Hierarchical Topology Meets Temporal Occupancy: A Comprehensive Model for Multi-Person Pose Tracking
IF 11.1 | Zone 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-08-04 | DOI: 10.1109/TCSVT.2025.3595104
Muyu Li;Henan Hu;Yingfeng Wang;Sen Qiu;Xudong Zhao
Existing approaches to multi-person pose tracking often suffer from low-confidence detections due to inter-instance and intra-instance occlusions, as well as non-canonical poses. In this work, we propose a novel solution by addressing two critical aspects: incomplete joint temporal dependencies and spatio-temporal voxelization. First, we introduce a method for extracting hierarchical relationships between joints based on human dynamics, enabling the model to reason about occlusions within the spatial topology of the human body. This hierarchical approach tackles incomplete joint visibility by leveraging the interdependencies between joints in both space and time. Second, we present a spatio-temporal occupancy network for multi-person pose tracking. By stacking 2D pose data over time to create a spatio-temporal voxel grid, the model captures temporal relationships between instances and joints, enhancing spatio-temporal correlations and learning keypoint distributions under occlusions or non-canonical poses. Extensive experiments on the PoseTrack2017, PoseTrack2018, and PoseTrack21 datasets demonstrate that our method improves multi-person pose tracking performance, achieving state-of-the-art mAP.
{"title":"Hierarchical Topology Meets Temporal Occupancy: A Comprehensive Model for Multi-Person Pose Tracking","authors":"Muyu Li;Henan Hu;Yingfeng Wang;Sen Qiu;Xudong Zhao","doi":"10.1109/TCSVT.2025.3595104","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3595104","url":null,"abstract":"Existing approaches to multi-person pose tracking often suffer from low-confidence detections due to inter-instance and intra-instance occlusions, as well as non-canonical poses. In this work, we propose a novel solution by addressing two critical aspects: incomplete joint temporal dependencies and spatio-temporal voxelization. First, we introduce a method for extracting hierarchical relationships between joints based on human dynamics, enabling the model to reason about occlusions within the spatial topology of the human body. This hierarchical approach tackles incomplete joint visibility by leveraging the interdependencies between joints in both space and time. Second, we present a spatio-temporal occupancy network for multi-person pose tracking. By stacking 2D pose data over time to create a spatio-temporal voxel grid, the model captures temporal relationships between instances and joints, enhancing spatio-temporal correlations and learning keypoint distributions under occlusions or non-canonical poses. Extensive experiments on the PoseTrack2017, PoseTrack2018, and PoseTrack21 dataset demonstrate that our method improves multi-person pose tracking performance, achieving state-of-the-art mAP.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 1","pages":"1061-1074"},"PeriodicalIF":11.1,"publicationDate":"2025-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146082012","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
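The "spatio-temporal voxel grid" in the abstract above comes from stacking per-frame 2D pose information along the time axis. The sketch below gives a shape-level illustration only: it rasterizes tracked keypoints into a (T, H, W, J) occupancy volume, with the grid size and the integer keypoint format assumed for the example; the paper's occupancy network and hierarchical joint topology are not shown.

```python
import numpy as np

def pose_voxel_grid(keypoints: np.ndarray, height: int, width: int) -> np.ndarray:
    """Rasterize tracked 2D keypoints into a spatio-temporal occupancy grid.

    keypoints: (T, J, 2) integer (x, y) locations of J joints over T frames.
    Returns a (T, height, width, J) binary volume where cell (t, y, x, j)
    is 1 when joint j occupies pixel (x, y) at frame t.
    """
    T, J, _ = keypoints.shape
    grid = np.zeros((T, height, width, J), dtype=np.uint8)
    for t in range(T):
        for j in range(J):
            x, y = keypoints[t, j]
            if 0 <= x < width and 0 <= y < height:
                grid[t, y, x, j] = 1
    return grid

rng = np.random.default_rng(4)
kps = rng.integers(0, 32, size=(5, 15, 2))   # 5 frames, 15 joints
print(pose_voxel_grid(kps, 32, 32).shape)    # (5, 32, 32, 15)
```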
Spatial–Temporal Correlation Information-Based Rate Control for Versatile Video Coding
IF 11.1 | Zone 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-08-04 | DOI: 10.1109/TCSVT.2025.3595555
Zeming Zhao;Xiaohai He;Shuhua Xiong;Meng Wang;Shiqi Wang
Although $\lambda $-domain-based rate control is widely used in video encoders, developing an efficient rate control scheme for Coding Tree Units (CTUs) under the rate-distortion (R-D) principle remains a significant challenge. In this paper, we propose a spatial-temporal correlation information-based rate control scheme for Versatile Video Coding (VVC), aiming to improve coding performance. We introduce a weight estimation network to establish a CTU-level bit allocation strategy that fully exploits spatial-temporal contextual information. Moreover, the CTU-level coding parameter $\lambda $ is adaptively optimized based on a dependency factor derived from distortion dependency information in both the spatial and temporal domains. Experimental results demonstrate that, compared to the default VVC rate control, the proposed scheme achieves BD-Rate savings of 6.48%, 17.33% and 13.75% in terms of the Peak Signal-to-Noise Ratio (PSNR), the Multi-Scale Structural Similarity Index (MS-SSIM) and the Video Multimethod Assessment Fusion (VMAF), respectively, under the Low Delay_P (LDP) configuration in the VVC Test Model (VTM) 19.0. Furthermore, the proposed method outperforms other state-of-the-art rate control schemes.
{"title":"Spatial–Temporal Correlation Information-Based Rate Control for Versatile Video Coding","authors":"Zeming Zhao;Xiaohai He;Shuhua Xiong;Meng Wang;Shiqi Wang","doi":"10.1109/TCSVT.2025.3595555","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3595555","url":null,"abstract":"Although lambda-domain-based rate control is widely used in video encoders, developing an efficient rate control scheme for Coding Tree Units (CTUs) under the rate-distortion (R-D) principle remains a significant challenge. In this paper, we propose a spatial-temporal correlation information-based rate control scheme for Versatile Video Coding (VVC), aiming to improve coding performance. We introduce a weight estimation network to establish a CTU-level bit allocation strategy that fully exploits spatial-temporal contextual information. Moreover, the CTU-level coding parameter <inline-formula> <tex-math>$lambda $ </tex-math></inline-formula> is adaptively optimized based on a dependency factor derived from distortion dependency information in both the spatial and temporal domains. Experimental results demonstrate that, compared to the default VVC rate control, the proposed scheme achieves BD-Rate savings of 6.48%, 17.33% and 13.75% in terms of the Peak Signal-to-Noise Ratio (PSNR), the Multi-Scale Structural Similarity Index (MS-SSIM) and the Video Multimethod Assessment Fusion (VMAF), respectively, under the Low Delay_P (LDP) configuration in the VVC Test Model (VTM) 19.0. Furthermore, the proposed method outperforms other state-of-the-art rate control schemes.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 1","pages":"1117-1129"},"PeriodicalIF":11.1,"publicationDate":"2025-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146049283","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
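Some context for the "λ-domain" rate control the abstract above builds on: in the R-λ model used by the HEVC/VVC reference software, the Lagrange multiplier of a picture or CTU is derived from its target bits per pixel as λ = α · bpp^β, and a QP is then obtained from λ; α and β are updated after each unit is coded. The sketch below shows only that baseline mapping with commonly cited starting constants; the paper's weight-estimation network and distortion-dependency factor are not reproduced.

```python
import numpy as np

def lambda_from_bpp(target_bits: float, num_pixels: int,
                    alpha: float = 3.2003, beta: float = -1.367) -> float:
    """Baseline R-lambda mapping: lambda = alpha * bpp ** beta.

    alpha and beta are commonly cited starting values that the encoder
    adapts per frame/CTU after coding; they are shown here only to make
    the relationship concrete.
    """
    bpp = target_bits / num_pixels
    return alpha * bpp ** beta

def qp_from_lambda(lam: float) -> float:
    """Lambda-to-QP relation commonly paired with the R-lambda model."""
    return 4.2005 * np.log(lam) + 13.7122

lam = lambda_from_bpp(target_bits=50_000, num_pixels=1920 * 1080)
print(round(lam, 2), round(qp_from_lambda(lam), 1))
```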
Beyond Inserting: Learning Subject Embedding for Semantic-Fidelity Personalized Diffusion Generation
IF 11.1 | Zone 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-07-22 | DOI: 10.1109/TCSVT.2025.3588882
Yang Li;Songlin Yang;Wei Wang;Jing Dong
Text-to-Image (T2I) personalization based on advanced diffusion models (e.g., Stable Diffusion), which aims to generate images of target subjects given various prompts, has drawn huge attention. However, when users require personalized image generation for specific subjects such as themselves or their pet cat, the T2I models fail to accurately generate their subject-preserved images. The main problem is that pre-trained T2I models do not learn the T2I mapping between the target subjects and their corresponding visual contents. Even if multiple target subject images are provided, previous personalization methods either failed to accurately fit the subject region or lost the interactive generative ability with other existing concepts in T2I model space. For example, they are unable to generate T2I-aligned and semantic-fidelity images for the given prompts with other concepts such as scenes (“Eiffel Tower”), actions (“holding a basketball”), and facial attributes (“eyes closed”). In this paper, we focus on inserting accurate and interactive subject embedding into the Stable Diffusion Model for semantic-fidelity personalized generation using one image. We address this challenge from two perspectives: subject-wise attention loss and semantic-fidelity token optimization. Specifically, we propose a subject-wise attention loss to guide the subject embedding onto a manifold with high subject identity similarity and diverse interactive generative ability. Then, we optimize one subject representation as multiple per-stage tokens, and each token contains two disentangled features. This expansion of the textual conditioning space enhances the semantic control, thereby improving semantic-fidelity. We conduct extensive experiments on the most challenging subjects, face identities, to validate that our results exhibit superior subject accuracy and fine-grained manipulation ability. We further validate the generalization of our methods on various non-face subjects.
{"title":"Beyond Inserting: Learning Subject Embedding for Semantic-Fidelity Personalized Diffusion Generation","authors":"Yang Li;Songlin Yang;Wei Wang;Jing Dong","doi":"10.1109/TCSVT.2025.3588882","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3588882","url":null,"abstract":"Text-to-Image (T2I) personalization based on advanced diffusion models (e.g., Stable Diffusion), which aims to generate images of target subjects given various prompts, has drawn huge attention. However, when users require personalized image generation for specific subjects such as themselves or their pet cat, the T2I models fail to accurately generate their subject-preserved images. The main problem is that pre-trained T2I models do not learn the T2I mapping between the target subjects and their corresponding visual contents. Even if multiple target subject images are provided, previous personalization methods either failed to accurately fit the subject region or lost the interactive generative ability with other existing concepts in T2I model space. For example, they are unable to generate T2I-aligned and semantic-fidelity images for the given prompts with other concepts such as scenes (“Eiffel Tower”), actions (“holding a basketball”), and facial attributes (“eyes closed”). In this paper, we focus on inserting accurate and interactive subject embedding into the Stable Diffusion Model for semantic-fidelity personalized generation using one image. We address this challenge from two perspectives: subject-wise attention loss and semantic-fidelity token optimization. Specifically, we propose a subject-wise attention loss to guide the subject embedding onto a manifold with high subject identity similarity and diverse interactive generative ability. Then, we optimize one subject representation as multiple per-stage tokens, and each token contains two disentangled features. This expansion of the textual conditioning space enhances the semantic control, thereby improving semantic-fidelity. We conduct extensive experiments on the most challenging subjects, face identities, to validate that our results exhibit superior subject accuracy and fine-grained manipulation ability. We further validate the generalization of our methods on various non-face subjects.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 12","pages":"12607-12621"},"PeriodicalIF":11.1,"publicationDate":"2025-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674742","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
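The "subject-wise attention loss" above constrains where the learned subject token attends, but its exact form is not given in the abstract. As a hedged illustration of how such a constraint is typically written, the sketch below penalizes the cross-attention mass that a subject token places outside a binary subject mask; treat the formulation, the mask, and the shapes as assumptions rather than the paper's loss.

```python
import numpy as np

def subject_attention_loss(attn: np.ndarray, subject_mask: np.ndarray) -> float:
    """Penalize attention the subject token spends outside its region.

    attn: (H*W,) cross-attention map of the subject token over image
    locations. subject_mask: (H*W,) binary mask of the subject region.
    The loss is the attention mass falling off the subject; this generic
    form is an assumption, not the paper's exact formulation.
    """
    attn = attn / (attn.sum() + 1e-12)   # normalize to a distribution
    return float(np.sum(attn * (1.0 - subject_mask)))

rng = np.random.default_rng(5)
attn_map = rng.random(64)
mask = np.zeros(64)
mask[16:32] = 1.0                        # toy subject region
print(round(subject_attention_loss(attn_map, mask), 3))
```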
DRLN: Disparity-Aware Rescaling Learning Network for Multi-View Video Coding Optimization
IF 11.1 | Zone 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-07-15 | DOI: 10.1109/TCSVT.2025.3588516
Shiwei Wang;Liquan Shen;Peiying Wu;Zhaoyi Tian;Feifeng Wang
Efficient compression of multi-view video data is a critical challenge for various applications due to the large volume of data involved. Although multi-view video coding (MVC) has introduced inter-view prediction techniques to reduce video redundancies, further reduction can be achieved by encoding a subset of views at a lower resolution through asymmetric rescaling, achieving higher compression efficiency. However, existing network-based rescaling approaches are designed solely for single-viewpoint videos. These methods neglect inter-view characteristics inherent in multi-view videos, resulting in suboptimal performance. To address this issue, we first propose a Disparity-aware Rescaling Learning Network (DRLN) that integrates disparity-aware feature extraction and multi-resolution adaptive rescaling to enhance MVC efficiency by minimizing both self- and inter-view redundancies. On the one hand, during the encoding stage, our method leverages the non-local correlation of multi-view contexts and performs adaptive downscaling with an early-exit mechanism, resulting in substantial multi-view bitrate savings. On the other hand, during the decoding stage, a dynamic aggregation strategy is proposed to facilitate effective interaction with inter-view features, utilizing the inter-view and cross-scale information to reconstruct fine-grained multi-view videos. Extensive experiments show that our network achieves a significant 26.31% BD-Rate reduction compared to the 3D-HEVC standard baseline, offering state-of-the-art coding performance.
{"title":"DRLN: Disparity-Aware Rescaling Learning Network for Multi-View Video Coding Optimization","authors":"Shiwei Wang;Liquan Shen;Peiying Wu;Zhaoyi Tian;Feifeng Wang","doi":"10.1109/TCSVT.2025.3588516","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3588516","url":null,"abstract":"Efficient compression of multi-view video data is a critical challenge for various applications due to the large volume of data involved. Although multi-view video coding (MVC) has introduced inter-view prediction techniques to reduce video redundancies, further reduction can be achieved by encoding a subset of views at a lower resolution through asymmetric rescaling, achieving higher compression efficiency. However, existing network-based rescaling approaches are designed solely for single-viewpoint videos. These methods neglect inter-view characteristics inherent in multi-view videos, resulting in suboptimal performance. To address this issue, we first propose a Disparity-aware Rescaling Learning Network (DRLN) that integrates disparity-aware feature extraction and multi-resolution adaptive rescaling to enhance MVC efficiency by minimizing both self- and inter-view redundancies. On the one hand, during the encoding stage, our method leverages the non-local correlation of multi-view contexts and performs adaptive downscaling with an early-exit mechanism, resulting in substantial multi-view bitrate savings. On the other hand, during the decoding stage, a dynamic aggregation strategy is proposed to facilitate effective interaction with inter-view features, utilizing the inter-view and cross-scale information to reconstruct fine-grained multi-view videos. Extensive experiments show that our network achieves a significant 26.31% BD-Rate reduction compared to the 3D-HEVC standard baseline, offering state of-the-art coding performance.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 12","pages":"12788-12801"},"PeriodicalIF":11.1,"publicationDate":"2025-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674762","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
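The "26.31% BD-Rate reduction" quoted above is the Bjøntegaard delta rate: the average bitrate difference between two codecs at equal quality, computed by fitting each rate-quality curve with a cubic polynomial in log-rate and integrating the gap over the overlapping quality range. The sketch below implements that standard metric on made-up rate/PSNR points (they are not the paper's numbers).

```python
import numpy as np

def bd_rate(rate_ref, psnr_ref, rate_test, psnr_test) -> float:
    """Bjontegaard delta rate (%): average bitrate change of 'test' versus
    'ref' at equal PSNR. Negative values mean the test codec saves bits."""
    log_ref, log_test = np.log(rate_ref), np.log(rate_test)
    p_ref = np.polyfit(psnr_ref, log_ref, 3)    # log-rate as a cubic in PSNR
    p_test = np.polyfit(psnr_test, log_test, 3)
    lo = max(min(psnr_ref), min(psnr_test))     # overlapping PSNR range
    hi = min(max(psnr_ref), max(psnr_test))
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    avg_log_diff = (int_test - int_ref) / (hi - lo)
    return float((np.exp(avg_log_diff) - 1.0) * 100.0)

# Made-up rate (kbps) / PSNR (dB) points for a reference and a test codec.
r_ref, q_ref = [1000, 2000, 4000, 8000], [32.0, 35.0, 38.0, 41.0]
r_test, q_test = [900, 1800, 3600, 7200], [32.2, 35.2, 38.2, 41.2]
print(round(bd_rate(r_ref, q_ref, r_test, q_test), 2), "% BD-Rate")
```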