Pub Date: 2025-08-07, DOI: 10.1109/TCSVT.2025.3596840
Xihua Sheng;Peilin Chen;Shiqi Wang;Dapeng Oliver Wu
Radio frequency (RF) signals have gained widespread adoption in intelligent perception systems due to their unique advantages, including non-line-of-sight propagation capability, robustness in low-light environments, and inherent privacy preservation. However, the substantial data volumes generated by their dual polarization directions pose significant challenges for data storage and transmission. To address this, we propose the first end-to-end deep dynamic RF signal compression (DRFC) framework, which primarily focuses on exploiting cross-directional correlation in dynamic RF signals. The proposed framework incorporates four key innovations: (1) a mask-guided RF motion estimation module that leverages Doppler shifts and electromagnetic noise characteristics to identify regions of significant motion using a threshold-based mask, substantially improving motion estimation accuracy; (2) a cross-directional RF motion entropy model that utilizes cross-directional RF motion latent priors to refine the probability distribution for motion entropy coding; (3) a cross-directional RF context mining module that predicts RF contexts from temporal and cross-directional reference signals, adaptively fusing these contexts with confidence maps to maximize complementary information utilization; and (4) a cross-directional RF contextual entropy model that incorporates cross-directional RF contextual latent priors to optimize contextual entropy modeling. Experimental results demonstrate the superiority of our framework over existing codecs. Our DRFC framework achieves significant bitrate savings on benchmark datasets, establishing a strong baseline for future research in this field.
{"title":"DRFC: An End-to-End Deep Dynamic RF Signal Compression Framework","authors":"Xihua Sheng;Peilin Chen;Shiqi Wang;Dapeng Oliver Wu","doi":"10.1109/TCSVT.2025.3596840","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3596840","url":null,"abstract":"Radio frequency (RF) signals have gained widespread adoption in intelligent perception systems due to their unique advantages, including non-line-of-sight propagation capability, robustness in low-light environments, and inherent privacy preservation. However, their substantial data volumes, generated by the dual-polarization direction characteristic, result in significant challenges to data storage and transmission. To address this, we propose the first end-to-end deep dynamic RF signal compression (DRFC) framework, which primarily focuses on exploiting cross-directional correlation in dynamic RF signals. The proposed framework incorporates four key innovations: (1) a mask-guided RF motion estimation module that leverages Doppler shifts and electromagnetic noise characteristics to identify regions of significant motion using a threshold-based mask, significantly improving motion estimation accuracy; (2) a cross-directional RF motion entropy model that utilizes cross-directional RF motion latent priors to refine the probability distribution for motion entropy coding; (3) a cross-directional RF context mining module that predicts RF contexts from temporal and cross-directional reference signals, adaptively fusing these contexts with confidence maps to maximize complementary information utilization; and (4) a cross-directional RF contextual entropy model that incorporates cross-directional RF contextual latent priors to optimize contextual entropy modeling. Experimental results demonstrate the superiority of our framework over existing codecs. Our DRFC framework achieves significant bitrate savings on benchmark datasets, establishing a strong baseline for future research in this field.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 1","pages":"1104-1116"},"PeriodicalIF":11.1,"publicationDate":"2025-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146049301","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-08-05, DOI: 10.1109/TCSVT.2025.3592055
{"title":"IEEE Circuits and Systems Society Information","authors":"","doi":"10.1109/TCSVT.2025.3592055","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3592055","url":null,"abstract":"","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 8","pages":"C3-C3"},"PeriodicalIF":11.1,"publicationDate":"2025-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11114434","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144782148","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sonar images are vital in ocean explorations but face transmission challenges due to limited bandwidth and unstable channels. The Just Noticeable Difference (JND) represents the minimum distortion detectable by human observers. By eliminating perceptual redundancy, JND offers a solution for efficient compression and accurate Image Quality Assessment (IQA) to enable reliable transmission. However, existing JND models prove inadequate for sonar images due to their unique redundancy distributions and the absence of pixel-level annotated data. To bridge these gaps, we propose the first sonar-specific, picture-level JND dataset and a weakly supervised JND model that infers pixel-level JND from picture-level annotations. Our approach starts with pretraining a perceptually lossy/lossless predictor, which collaborates with sonar image properties to drive an unsupervised generator producing Critically Distorted Images (CDIs). These CDIs maximize pixel differences while preserving perceptual fidelity, enabling precise JND map derivation. Furthermore, we systematically investigate JND-guided optimization for sonar image compression and IQA algorithms, demonstrating favorable performance enhancements.
{"title":"Pixel-Level Just Noticeable Difference in Sonar Images: Modeling and Applications","authors":"Weiling Chen;Weiming Lin;Qianxue Feng;Rongxin Zhang;Tiesong Zhao","doi":"10.1109/TCSVT.2025.3596153","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3596153","url":null,"abstract":"Sonar images are vital in ocean explorations but face transmission challenges due to limited bandwidth and unstable channels. The Just Noticeable Difference (JND) represents the minimum distortion detectable by human observers. By eliminating perceptual redundancy, JND offers a solution for efficient compression and accurate Image Quality Assessment (IQA) to enable reliable transmission. However, existing JND models prove inadequate for sonar images due to their unique redundancy distributions and the absence of pixel-level annotated data. To bridge these gaps, we propose the first sonar-specific, picture-level JND dataset and a weakly supervised JND model that infers pixel-level JND from picture-level annotations. Our approach starts with pretraining a perceptually lossy/lossless predictor, which collaborates with sonar image properties to drive an unsupervised generator producing Critically Distorted Images (CDIs). These CDIs maximize pixel differences while preserving perceptual fidelity, enabling precise JND map derivation. Furthermore, we systematically investigate JND-guided optimization for sonar image compression and IQA algorithms, demonstrating favorable performance enhancements.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 1","pages":"1173-1184"},"PeriodicalIF":11.1,"publicationDate":"2025-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146049273","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Video data is growing exponentially due to the popularity of video-sharing platforms and the proliferation of video capture devices. Video summarization has been proposed to remove redundancy while retaining as many critical parts of a video as possible, so that users can browse and process videos more effectively; the task has therefore received increasing attention from researchers. Existing research addresses the challenges faced by video summarization methods from various perspectives, such as temporal dependency, data scarcity, user preference, and high precision. This paper reviews representative and state-of-the-art methods, analyzes recent research advances, datasets, and performance evaluations, and discusses future directions. We hope this survey helps future research explore promising directions for video summarization.
{"title":"A Comprehensive Survey on Video Summarization: Challenges and Advances","authors":"Hongxi Li;Yubo Zhu;Zirui Shang;Ziyi Wang;Xinxiao Wu","doi":"10.1109/TCSVT.2025.3596006","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3596006","url":null,"abstract":"Video data is growing exponentially daily due to the popularity of video-sharing platforms and the proliferation of video capture devices. The video summarization task has been proposed to remove redundancy while maintaining as many critical parts of the video as possible so that users can browse and process videos more effectively, which has received increasing attention from researchers. The existing research addresses the challenges faced by video summarization methods from various perspectives, such as temporal dependency, data scarcity, user preference, and high precision. This paper reviews representative and state-of-the-art methods, analyzes recent research advances, datasets, and performance evaluations, and discusses future directions. We hope this survey can help future research explore the potential directions of video summarization methods.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 1","pages":"1216-1233"},"PeriodicalIF":11.1,"publicationDate":"2025-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146049295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The rich spectral information within hyperspectral images (HSIs) results in large data volumes. Thus, finding a compact representation for HSIs while maintaining reconstruction quality is a fundamental task for numerous applications. Although existing learning-based compression methods and context models have shown strong rate-distortion (RD) performance, they focus only on spatial redundancy and neglect the spectral redundancy of HSIs, which impedes further performance improvement on HSIs. Moreover, the strictly sequential autoregressive nature of context models leads to inefficiency, further limiting their practical applications. In this paper, leveraging the spectral priors unique to HSIs, we propose a hybrid Transformer-CNN architecture to find compact latent representations of HSIs. Specifically, we construct a Spectral-Spatial Coupling Transformer Group (SSCTG) to cooperatively extract spatial and spectral features of HSIs. Additionally, we propose a Group-wise Context Model (GCM) to enhance the parallel processing capability of autoregression within context models, significantly improving coding efficiency. Extensive experiments demonstrate the effectiveness of the proposed method, which achieves superior RD performance compared to state-of-the-art methods while maintaining high codec efficiency.
{"title":"Hyperspectral Image Compression With Spectral-Spatial Coupling and Group-Wise Context Modeling","authors":"Wei Wei;Chenxu Zhao;Shuyi Zhao;Lei Zhang;Yanning Zhang","doi":"10.1109/TCSVT.2025.3596061","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3596061","url":null,"abstract":"The rich spectral information within hyperspectral images (HSIs) results in large data volumes. Thus finding a compact representation for HSIs while maintaining reconstruction quality is a fundamental task for numerous applications. Though the existing learning-based compression methods and context models have shown strong rate-distortion (RD) performance, these methods only pay their attention on spatial redundancy without considering the spectral redundancy of HSIs, which thus impedes further improvement of their performance on HSI. Moreover, the strictly sequential autoregressive nature of context models leads to inefficiency, further limiting their practical applications. In this paper, leveraging the spectral priors unique to HSIs, we propose a hybrid Transformer-CNN architecture to find compact latent representations of HSIs. In specific, we construct Spectral-Spatial Coupling Transformer Group (SSCTG) to cooperatively extract spatial and spectral features of HSIs. Additionally, we propose Group-wise Context Model (GCM) to further enhance the parallel processing capability of autoregression within context models, significantly improving the coding efficiency. Extensive experiments demonstrate the effectiveness of the proposed method, achieving superior RD performance compared to state-of-the-art methods while maintaining high efficiency of codecs.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 1","pages":"1130-1142"},"PeriodicalIF":11.1,"publicationDate":"2025-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146049293","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Action detection in untrimmed, densely annotated video datasets is a challenging task due to the presence of composite actions and co-occurring actions in videos. To facilitate action detection in such intricate scenarios, leveraging ample prior information from the data and comprehending the context of actions in the video are the two most important clues. Specifically, the co-occurrence probability of actions can effectively capture the temporal relationships and associations among actions, aiding the model in recognizing multiple actions occurring simultaneously. Additionally, aggregating action information from different levels of the data into a comprehensive graph and describing human actions from various semantic layers can significantly reduce ambiguities in action detection. Based on this, a novel knowledge graph, the Hierarchical Augmented Knowledge Graph for human behaviour (HAhb-KG), is proposed, which brings together action-related prior knowledge at different levels into a unified hierarchical graph. The graph describes human behaviour from various semantic aspects by defining diversified graph nodes, and augments the nodes and relationships with corresponding images and co-occurrence probabilities, respectively, to introduce textual modality information and weigh the associations between actions. To mine the knowledge related to the input video in the knowledge graph, an HAhb-KG-oriented knowledge understanding framework is proposed to embed multi-modal knowledge as a valuable supplement to visual information. Incorporated with this framework, a cross-modal learning action detection model is designed to achieve high accuracy in action detection tasks, which validates the effectiveness of HAhb-KG. Our method achieves gains of 1.45 mAP and 2.28 mAP in action detection experiments on the Charades and TSU datasets, respectively, showing that the proposed method outperforms existing knowledge-based action detection methods.
{"title":"HAhb-KG: Hierarchical Augmented Knowledge Graph for Human Behavior Assisting Cross-Modal Learning Action Detection","authors":"Xiaochen Wang;Dehui Kong;Jinghua Li;Jing Wang;Baocai Yin","doi":"10.1109/TCSVT.2025.3595145","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3595145","url":null,"abstract":"Action detection in untrimmed, densely annotated video datasets is a challenging task due to the presence of composite actions and co-occurring actions in videos. To facilitate action detection in such intricate scenarios, leveraging ample prior information from the data and comprehending the context of actions in the video are the most important two clues. Specifically, the co-occurrence probability of actions can effectively capture the temporal relationships and associations among actions, aiding the model in recognizing multiple actions occurring simultaneously. Additionally, aggregating action information from different levels of the data into a comprehensive graph and describing human actions from various semantic layers can significantly reduce ambiguities in action detection. Based on this, a novel knowledge graph, Hierarchical Augmented Knowledge Graph for human behaviour (HAhb-KG), is proposed, which brings together action-related prior knowledge on different levels into a unified hierarchical graph. The graph describes human behaviour from various semantic aspects by defining diversified graph nodes, and augments the nodes and relationships with corresponding images and probability of co-occurrence respectively, to introduce textual modality information and weigh the associations between actions. In order to mine the knowledge related to the input video in the knowledge graph, HAhb-KG oriented knowledge understanding framework is proposed to embed multi-modal knowledge as a valuable supplement to visual information. Incorporated with the framework, a cross-modal learning action detection model is designed to achieve high accuracy in action detection tasks, which validates the effectiveness of HAhb-KG. Our method achieves gains of 1.45(mAP) and 2.28(mAP) in action detection experiments on the Charades and TSU datasets, respectively, which show that the proposed method outperforms existing knowledge-based action detection methods.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 1","pages":"1045-1060"},"PeriodicalIF":11.1,"publicationDate":"2025-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146082060","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Existing approaches to multi-person pose tracking often suffer from low-confidence detections due to inter-instance and intra-instance occlusions, as well as non-canonical poses. In this work, we propose a novel solution by addressing two critical aspects: incomplete joint temporal dependencies and spatio-temporal voxelization. First, we introduce a method for extracting hierarchical relationships between joints based on human dynamics, enabling the model to reason about occlusions within the spatial topology of the human body. This hierarchical approach tackles incomplete joint visibility by leveraging the interdependencies between joints in both space and time. Second, we present a spatio-temporal occupancy network for multi-person pose tracking. By stacking 2D pose data over time to create a spatio-temporal voxel grid, the model captures temporal relationships between instances and joints, enhancing spatio-temporal correlations and learning keypoint distributions under occlusions or non-canonical poses. Extensive experiments on the PoseTrack2017, PoseTrack2018, and PoseTrack21 datasets demonstrate that our method improves multi-person pose tracking performance, achieving state-of-the-art mAP.
{"title":"Hierarchical Topology Meets Temporal Occupancy: A Comprehensive Model for Multi-Person Pose Tracking","authors":"Muyu Li;Henan Hu;Yingfeng Wang;Sen Qiu;Xudong Zhao","doi":"10.1109/TCSVT.2025.3595104","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3595104","url":null,"abstract":"Existing approaches to multi-person pose tracking often suffer from low-confidence detections due to inter-instance and intra-instance occlusions, as well as non-canonical poses. In this work, we propose a novel solution by addressing two critical aspects: incomplete joint temporal dependencies and spatio-temporal voxelization. First, we introduce a method for extracting hierarchical relationships between joints based on human dynamics, enabling the model to reason about occlusions within the spatial topology of the human body. This hierarchical approach tackles incomplete joint visibility by leveraging the interdependencies between joints in both space and time. Second, we present a spatio-temporal occupancy network for multi-person pose tracking. By stacking 2D pose data over time to create a spatio-temporal voxel grid, the model captures temporal relationships between instances and joints, enhancing spatio-temporal correlations and learning keypoint distributions under occlusions or non-canonical poses. Extensive experiments on the PoseTrack2017, PoseTrack2018, and PoseTrack21 dataset demonstrate that our method improves multi-person pose tracking performance, achieving state-of-the-art mAP.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 1","pages":"1061-1074"},"PeriodicalIF":11.1,"publicationDate":"2025-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146082012","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-08-04, DOI: 10.1109/TCSVT.2025.3595555
Zeming Zhao;Xiaohai He;Shuhua Xiong;Meng Wang;Shiqi Wang
Although lambda-domain-based rate control is widely used in video encoders, developing an efficient rate control scheme for Coding Tree Units (CTUs) under the rate-distortion (R-D) principle remains a significant challenge. In this paper, we propose a spatial-temporal correlation information-based rate control scheme for Versatile Video Coding (VVC), aiming to improve coding performance. We introduce a weight estimation network to establish a CTU-level bit allocation strategy that fully exploits spatial-temporal contextual information. Moreover, the CTU-level coding parameter $\lambda$ is adaptively optimized based on a dependency factor derived from distortion dependency information in both the spatial and temporal domains. Experimental results demonstrate that, compared to the default VVC rate control, the proposed scheme achieves BD-Rate savings of 6.48%, 17.33%, and 13.75% in terms of the Peak Signal-to-Noise Ratio (PSNR), the Multi-Scale Structural Similarity Index (MS-SSIM), and the Video Multimethod Assessment Fusion (VMAF), respectively, under the Low Delay_P (LDP) configuration in the VVC Test Model (VTM) 19.0. Furthermore, the proposed method outperforms other state-of-the-art rate control schemes.
{"title":"Spatial–Temporal Correlation Information-Based Rate Control for Versatile Video Coding","authors":"Zeming Zhao;Xiaohai He;Shuhua Xiong;Meng Wang;Shiqi Wang","doi":"10.1109/TCSVT.2025.3595555","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3595555","url":null,"abstract":"Although lambda-domain-based rate control is widely used in video encoders, developing an efficient rate control scheme for Coding Tree Units (CTUs) under the rate-distortion (R-D) principle remains a significant challenge. In this paper, we propose a spatial-temporal correlation information-based rate control scheme for Versatile Video Coding (VVC), aiming to improve coding performance. We introduce a weight estimation network to establish a CTU-level bit allocation strategy that fully exploits spatial-temporal contextual information. Moreover, the CTU-level coding parameter <inline-formula> <tex-math>$lambda $ </tex-math></inline-formula> is adaptively optimized based on a dependency factor derived from distortion dependency information in both the spatial and temporal domains. Experimental results demonstrate that, compared to the default VVC rate control, the proposed scheme achieves BD-Rate savings of 6.48%, 17.33% and 13.75% in terms of the Peak Signal-to-Noise Ratio (PSNR), the Multi-Scale Structural Similarity Index (MS-SSIM) and the Video Multimethod Assessment Fusion (VMAF), respectively, under the Low Delay_P (LDP) configuration in the VVC Test Model (VTM) 19.0. Furthermore, the proposed method outperforms other state-of-the-art rate control schemes.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 1","pages":"1117-1129"},"PeriodicalIF":11.1,"publicationDate":"2025-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146049283","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-07-22, DOI: 10.1109/TCSVT.2025.3588882
Yang Li;Songlin Yang;Wei Wang;Jing Dong
Text-to-Image (T2I) personalization based on advanced diffusion models (e.g., Stable Diffusion), which aims to generate images of target subjects given various prompts, has drawn considerable attention. However, when users require personalized image generation for specific subjects such as themselves or their pet cat, the T2I models fail to accurately generate their subject-preserved images. The main problem is that pre-trained T2I models do not learn the T2I mapping between the target subjects and their corresponding visual contents. Even if multiple target subject images are provided, previous personalization methods either failed to accurately fit the subject region or lost the interactive generative ability with other existing concepts in the T2I model space. For example, they are unable to generate T2I-aligned and semantic-fidelity images for the given prompts with other concepts such as scenes (“Eiffel Tower”), actions (“holding a basketball”), and facial attributes (“eyes closed”). In this paper, we focus on inserting an accurate and interactive subject embedding into the Stable Diffusion Model for semantic-fidelity personalized generation using one image. We address this challenge from two perspectives: subject-wise attention loss and semantic-fidelity token optimization. Specifically, we propose a subject-wise attention loss to guide the subject embedding onto a manifold with high subject identity similarity and diverse interactive generative ability. Then, we optimize one subject representation as multiple per-stage tokens, and each token contains two disentangled features. This expansion of the textual conditioning space enhances the semantic control, thereby improving semantic fidelity. We conduct extensive experiments on the most challenging subjects, face identities, to validate that our results exhibit superior subject accuracy and fine-grained manipulation ability. We further validate the generalization of our methods on various non-face subjects.
{"title":"Beyond Inserting: Learning Subject Embedding for Semantic-Fidelity Personalized Diffusion Generation","authors":"Yang Li;Songlin Yang;Wei Wang;Jing Dong","doi":"10.1109/TCSVT.2025.3588882","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3588882","url":null,"abstract":"Text-to-Image (T2I) personalization based on advanced diffusion models (e.g., Stable Diffusion), which aims to generate images of target subjects given various prompts, has drawn huge attention. However, when users require personalized image generation for specific subjects such as themselves or their pet cat, the T2I models fail to accurately generate their subject-preserved images. The main problem is that pre-trained T2I models do not learn the T2I mapping between the target subjects and their corresponding visual contents. Even if multiple target subject images are provided, previous personalization methods either failed to accurately fit the subject region or lost the interactive generative ability with other existing concepts in T2I model space. For example, they are unable to generate T2I-aligned and semantic-fidelity images for the given prompts with other concepts such as scenes (“Eiffel Tower”), actions (“holding a basketball”), and facial attributes (“eyes closed”). In this paper, we focus on inserting accurate and interactive subject embedding into the Stable Diffusion Model for semantic-fidelity personalized generation using one image. We address this challenge from two perspectives: subject-wise attention loss and semantic-fidelity token optimization. Specifically, we propose a subject-wise attention loss to guide the subject embedding onto a manifold with high subject identity similarity and diverse interactive generative ability. Then, we optimize one subject representation as multiple per-stage tokens, and each token contains two disentangled features. This expansion of the textual conditioning space enhances the semantic control, thereby improving semantic-fidelity. We conduct extensive experiments on the most challenging subjects, face identities, to validate that our results exhibit superior subject accuracy and fine-grained manipulation ability. We further validate the generalization of our methods on various non-face subjects.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 12","pages":"12607-12621"},"PeriodicalIF":11.1,"publicationDate":"2025-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674742","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-07-15, DOI: 10.1109/TCSVT.2025.3588516
Shiwei Wang;Liquan Shen;Peiying Wu;Zhaoyi Tian;Feifeng Wang
Efficient compression of multi-view video data is a critical challenge for various applications due to the large volume of data involved. Although multi-view video coding (MVC) has introduced inter-view prediction techniques to reduce video redundancies, further reduction can be achieved by encoding a subset of views at a lower resolution through asymmetric rescaling, achieving higher compression efficiency. However, existing network-based rescaling approaches are designed solely for single-viewpoint videos. These methods neglect inter-view characteristics inherent in multi-view videos, resulting in suboptimal performance. To address this issue, we first propose a Disparity-aware Rescaling Learning Network (DRLN) that integrates disparity-aware feature extraction and multi-resolution adaptive rescaling to enhance MVC efficiency by minimizing both self- and inter-view redundancies. On the one hand, during the encoding stage, our method leverages the non-local correlation of multi-view contexts and performs adaptive downscaling with an early-exit mechanism, resulting in substantial multi-view bitrate savings. On the other hand, during the decoding stage, a dynamic aggregation strategy is proposed to facilitate effective interaction with inter-view features, utilizing the inter-view and cross-scale information to reconstruct fine-grained multi-view videos. Extensive experiments show that our network achieves a significant 26.31% BD-Rate reduction compared to the 3D-HEVC standard baseline, offering state-of-the-art coding performance.
{"title":"DRLN: Disparity-Aware Rescaling Learning Network for Multi-View Video Coding Optimization","authors":"Shiwei Wang;Liquan Shen;Peiying Wu;Zhaoyi Tian;Feifeng Wang","doi":"10.1109/TCSVT.2025.3588516","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3588516","url":null,"abstract":"Efficient compression of multi-view video data is a critical challenge for various applications due to the large volume of data involved. Although multi-view video coding (MVC) has introduced inter-view prediction techniques to reduce video redundancies, further reduction can be achieved by encoding a subset of views at a lower resolution through asymmetric rescaling, achieving higher compression efficiency. However, existing network-based rescaling approaches are designed solely for single-viewpoint videos. These methods neglect inter-view characteristics inherent in multi-view videos, resulting in suboptimal performance. To address this issue, we first propose a Disparity-aware Rescaling Learning Network (DRLN) that integrates disparity-aware feature extraction and multi-resolution adaptive rescaling to enhance MVC efficiency by minimizing both self- and inter-view redundancies. On the one hand, during the encoding stage, our method leverages the non-local correlation of multi-view contexts and performs adaptive downscaling with an early-exit mechanism, resulting in substantial multi-view bitrate savings. On the other hand, during the decoding stage, a dynamic aggregation strategy is proposed to facilitate effective interaction with inter-view features, utilizing the inter-view and cross-scale information to reconstruct fine-grained multi-view videos. Extensive experiments show that our network achieves a significant 26.31% BD-Rate reduction compared to the 3D-HEVC standard baseline, offering state of-the-art coding performance.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 12","pages":"12788-12801"},"PeriodicalIF":11.1,"publicationDate":"2025-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674762","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}