Existing image steganography schemes often introduce noticeable modification traces into the cover image, creating a risk of secret information leakage. To address this issue, this paper proposes an end-to-end framework for joint makeup style transfer and image steganography that achieves imperceptible, higher-capacity data hiding. In the scheme, a Parsing-guided Semantic Feature Alignment (PSFA) module is designed to transfer the style of a makeup image to a target non-makeup image, generating a content-style integrated feature matrix. Meanwhile, a Multi-Scale Feature Fusion and Data Embedding (MFFDE) module is devised to encode the secret image into latent features and fuse them with the content-style integrated feature matrix, as well as the non-makeup image features, across multiple scales to produce the makeup-stego image. As a result, the style of the makeup image is transferred and the secret image is imperceptibly embedded simultaneously, without directly modifying the pixels of the original non-makeup image. Additionally, a Residual-aware Information Compensation Network (RICN) is developed to compensate for the loss of secret-image information arising from the multilevel data embedding, further enhancing the quality of the reconstructed secret image. Experimental results show that the proposed scheme achieves superior steganalysis resistance and visual quality in both the makeup-stego images and the recovered secret images, compared with other state-of-the-art schemes.
{"title":"An End-to-End Framework for Joint Makeup Style Transfer and Image Steganography","authors":"Meihong Yang;Ziyi Feng;Bin Ma;Jian Xu;Yongjin Xian;Linna Zhou","doi":"10.1109/TCSVT.2025.3599551","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3599551","url":null,"abstract":"Existing image steganography schemes always introduce obvious modification traces to the cover image, resulting in the risk of secret information leakage. To address this issue, an end-to-end framework for joint makeup style transfer and image steganography is proposed in this paper to achieve imperceptible higher-capacity data hiding. In the scheme, a Parsing-guided Semantic Feature Alignment (PSFA) module is designed to transfer the style of a makeup image to an object non-makeup image, thereby generating a content-style integrated feature matrix. Meanwhile, a Multi-Scale Feature Fusion and Data Embedding (MFFDE) module was devised to encode the secret image into its latent features and fuse them with the generated content-style integrated feature matrix, as well as the non-makeup image features across multiple scales, to achieve the makeup-stego image. As a result, the style of the makeup image is well transformed and the secret image is imperceptibly embedded simultaneously without directly modifying the pixels of the original non-makeup image. Additionally, a Residual-aware Information Compensation Network (RICN) is developed to compensate the loss of the secret image arising from the multilevel data embedding, thereby further enhancing the quality of the reconstructed secret image. Experimental results show that the proposed scheme achieves superior steganalysis resistance capability and visual quality in both makeup-stego images and recovered secret images, compared with other state-of-the-art schemes.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 1","pages":"1293-1308"},"PeriodicalIF":11.1,"publicationDate":"2025-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146049267","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-08-18; DOI: 10.1109/TCSVT.2025.3599856
You Wu;Yongxin Li;Mengyuan Liu;Xucheng Wang;Xiangyang Yang;Hengzhou Ye;Dan Zeng;Qijun Zhao;Shuiwang Li
Transformer-based models have improved visual tracking, but most still cannot run in real time on resource-limited devices, especially for unmanned aerial vehicle (UAV) tracking. To achieve a better balance between performance and efficiency, we propose AVTrack, an adaptive computation tracking framework that activates transformer blocks through an Activation Module (AM), which dynamically optimizes the ViT architecture by selectively engaging relevant components. To address extreme viewpoint variations, we propose learning view-invariant representations via mutual information (MI) maximization. In addition, we propose AVTrack-MD, an enhanced tracker incorporating a novel MI maximization-based multi-teacher knowledge distillation framework. Leveraging multiple off-the-shelf AVTrack models as teachers, we maximize the MI between their aggregated softened features and the corresponding softened feature of the student model, improving the student's generalization and performance, especially under noisy conditions. Extensive experiments show that AVTrack-MD achieves performance comparable to that of AVTrack while reducing model complexity and boosting average tracking speed by over 17%. Code is available at https://github.com/wuyou3474/AVTrack
{"title":"Learning an Adaptive and View-Invariant Vision Transformer for Real-Time UAV Tracking","authors":"You Wu;Yongxin Li;Mengyuan Liu;Xucheng Wang;Xiangyang Yang;Hengzhou Ye;Dan Zeng;Qijun Zhao;Shuiwang Li","doi":"10.1109/TCSVT.2025.3599856","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3599856","url":null,"abstract":"Transformer-based models have improved visual tracking, but most still cannot run in real time on resource-limited devices, especially for unmanned aerial vehicle (UAV) tracking. To achieve a better balance between performance and efficiency, we propose AVTrack, an adaptive computation tracking framework that adaptively activates transformer blocks through an Activation Module (AM), which dynamically optimizes the ViT architecture by selectively engaging relevant components. To address extreme viewpoint variations, we propose to learn view-invariant representations via mutual information (MI) maximization. In addition, we propose AVTrack-MD, an enhanced tracker incorporating a novel MI maximization-based multi-teacher knowledge distillation framework. Leveraging multiple off-the-shelf AVTrack models as teachers, we maximize the MI between their aggregated softened features and the corresponding softened feature of the student model, improving the generalization and performance of the student, especially under noisy conditions. Extensive experiments show that AVTrack-MD achieves performance comparable to AVTrack’s performance while reducing model complexity and boosting average tracking speed by over 17%. Codes is available at <uri>https://github.com/wuyou3474/AVTrack</uri>","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 2","pages":"2403-2418"},"PeriodicalIF":11.1,"publicationDate":"2025-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146154436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Generating high-quality facial photos from fine-detailed sketches is a long-standing research topic that remains unsolved. The scarcity of large-scale paired data, due to the cost of acquiring hand-drawn sketches, poses a major challenge. Existing methods either lose identity information with oversimplified representations or rely on costly inversion and strict alignment when using StyleGAN-based priors, limiting their practical applicability. Our primary finding in this work is that a discrete codebook and decoder trained through self-reconstruction in the photo domain can learn rich priors, helping to reduce ambiguity in cross-domain mapping even with current small-scale paired datasets. Based on this, a cross-domain mapping network can be constructed directly. However, empirical findings indicate that using the discrete codebook for cross-domain mapping often results in unrealistic textures and distorted spatial layouts. Therefore, we propose a Hierarchical Adaptive Texture-Spatial Correction (HATSC) module to correct these flaws in texture and spatial layout. In addition, we introduce a Saliency-based Key Details Enhancement (SKDE) module to further enhance synthesis quality. Overall, we present a “reconstruct-cross-enhance” pipeline for synthesizing facial photos from fine-detailed sketches. Experiments demonstrate that our method generates high-quality facial photos and significantly outperforms previous approaches across a wide range of challenging benchmarks. The code is publicly available at: https://github.com/Gardenia-chen/DECP
{"title":"Fine-Detailed Facial Sketch-to-Photo Synthesis With Detail-Enhanced Codebook Priors","authors":"Mingrui Zhu;Jianhang Chen;Xin Wei;Nannan Wang;Xinbo Gao","doi":"10.1109/TCSVT.2025.3598016","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3598016","url":null,"abstract":"Generating high-quality facial photos from fine-detailed sketches is a long-standing research topic that remains unsolved. The scarcity of large-scale paired data due to the cost of acquiring hand-drawn sketches poses a major challenge. Existing methods either lose identity information with oversimplified representations, or rely on costly inversion and strict alignment when using StyleGAN-based priors, limiting their practical applicability. Our primary finding in this work is that the discrete codebook and decoder trained through self-reconstruction in the photo domain can learn rich priors, helping to reduce ambiguity in cross-domain mapping even with current small-scale paired datasets. Based on this, a cross-domain mapping network can be directly constructed. However, empirical findings indicate that using the discrete codebook for cross-domain mapping often results in unrealistic textures and distorted spatial layouts. Therefore, we propose a Hierarchical Adaptive Texture-Spatial Correction (HATSC) module to correct the flaws in texture and spatial layouts. Besides, we introduce a Saliency-based Key Details Enhancement (SKDE) module to further enhance the synthesis quality. Overall, we present a “reconstruct-cross-enhance” pipeline for synthesizing facial photos from fine-detailed sketches. Experiments demonstrate that our method generates high-quality facial photos and significantly outperforms previous approaches across a wide range of challenging benchmarks. The code is publicly available at: <uri>https://github.com/Gardenia-chen/DECP</uri>","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 1","pages":"1075-1088"},"PeriodicalIF":11.1,"publicationDate":"2025-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146049263","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Point cloud attribute compression is challenged by the need to fit attribute signals that live on irregular geometric structures. Existing methods, whether based on handcrafted transforms or deep learning, cannot achieve a compact multiscale representation for high-fidelity reconstruction. In this paper, we propose a novel geometry-aware lifting-based multiscale network built on a spatial-channel lifting scheme for point cloud attribute compression. The proposed network cascades geometry-aware spatial lifting, which reduces spatial redundancy by adaptively capturing irregular geometric structures, with progressive channel lifting, which progressively reduces channel-wise redundancy in the multiscale representation. Furthermore, we design the split, predict, and update operations for geometry-aware spatial lifting to fully exploit the geometry information representing irregular structures. We develop a geometry-aware adaptive split that evenly partitions the input points using significance scores indicating their dependencies, and propose geometry-aware cross-attention filtering for the predict and update operations to decorrelate features based on geometry information. To the best of our knowledge, this paper presents the first lifting-based learned transform for point cloud compression that enjoys reversibility guarantees for the multiscale representation, enhancing rate-distortion performance. Experimental results show that the proposed framework achieves state-of-the-art performance on extensive point cloud datasets, outperforming the latest MPEG G-PCC standard and recent deep learning-based methods.
{"title":"Point Cloud Attribute Compression With Geometry-Aware Lifting-Based Multiscale Networks","authors":"Xin Li;Shaohui Li;Wenrui Dai;Han Li;Nuowen Kan;Chenglin Li;Junni Zou;Hongkai Xiong","doi":"10.1109/TCSVT.2025.3597448","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3597448","url":null,"abstract":"Point cloud attribute compression is challenged by fitting the attribute signals living on irregular geometric structures. Existing methods cannot achieve compact multiscale representation for high-fidelity reconstruction using the handcrafted transforms or deep learning-based techniques. In this paper, we propose a novel geometry-aware lifting-based multiscale network via spatial-channel lifting scheme for point cloud attribute compression. The proposed network cascades geometry-aware spatial lifting to reduce spatial redundancy by adaptively capturing irregular geometric structures and progressive channel lifting to progressively reduce channel-wise redundancy in multiscale representation. Furthermore, we design the split, predict, and update operations for geometry-aware spatial lifting to fully exploit the geometry information representing irregular structures. We develop geometry-aware adaptive split to equally split input points with significance scores indicating their dependencies, and propose geometry-aware cross-attention filtering for the predict and update operations for decorrelation based on geometry information. To our best knowledge, this paper achieves the first lifting-based learned transform for point cloud compression that enjoys reversibility guarantees of multiscale representation to enhance rate-distortion performance. Experimental results show that the proposed framework achieves state-of-the-art performance on extensive point cloud datasets, and outperforms latest MPEG G-PCC standard and most recent deep learning based methods.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 1","pages":"1143-1159"},"PeriodicalIF":11.1,"publicationDate":"2025-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146049270","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Document images are vulnerable to tampering attacks from image editing tools and deep models. Therefore, the Document Tampering Localization (DTL) task has received increasing attention in recent years. However, given the wide variety of document types (e.g., contracts, certificates, ID cards), our analysis shows that existing DTL methods struggle with document images containing diverse background colors and varying semantic contents. Further analysis and experiments verify that varying background colors and semantic contents interfere with the forensic feature extraction process of existing DTL methods. To address this issue, we propose two disentanglement modules that mitigate such interference and improve the detection of forgery traces. First, we design a Color Disentanglement (CD) module that applies disentangled representation learning to forensic features. The CD module, grounded in real-world prior knowledge, effectively decouples color information from forensic features, thereby improving robustness against varying background colors. Second, we propose the Semantic Disentanglement (SD) module, which performs image-level clustering on the tampering probability map during inference. The SD module focuses on the tampering probability of each pixel while discarding local semantic information (e.g., font, location, and shape), leading to strong robustness against variations in document content. Evaluations demonstrate that our CD-SD method outperforms existing methods by 45.12% or 0.162 on the F1 metric in cross-dataset tests. Ablation studies show that the CD and SD modules improve the F1 score by 7.98% and 13.38%, respectively, across different backbones. Our method delivers consistent and stable improvements across various experimental protocols. Moreover, it is compatible with many DTL methods in a plug-and-play fashion.
{"title":"Generalized Document Tampering Localization via Color and Semantic Disentanglement","authors":"Shiqiang Zheng;Changsheng Chen;Shen Chen;Taiping Yao;Shouhong Ding;Bin Li;Jiwu Huang","doi":"10.1109/TCSVT.2025.3597602","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3597602","url":null,"abstract":"Document images are vulnerable to tampering attacks from image editing tools and deep models. Therefore, the Document Tampering Localization (DTL) task has received increasing attention in recent years. However, given the wide variety of document types (<italic>e.g.</i>, contracts, certificates, ID cards), our analysis shows that existing DTL methods struggle with document images containing diverse background colors and varying semantic contents. Further analysis and experiments verify that the varying background color and semantic contents interfere with the forensic feature extraction process in the existing DTL methods. To address this issue, we propose two disentanglement modules to mitigate such interference and improve the ability of forgery trace detection. First, we design a Color Disentanglement (CD) module that applies disentangled learning representation to forensic features. The CD module, grounded in real-world prior knowledge, effectively decouples color information from forensic features, thereby improving robustness against varying background colors. Second, we propose the Semantic Disentanglement (SD) module, which performs image-level clustering on the tampering probability map during the inference process. The SD module focuses on tampering probabilities for each pixel, while discarding local semantic information (<italic>e.g.</i>, font, location, and shape). It leads to strong robustness against variations in document content. The evaluations demonstrate that our CD-SD method outperforms existing methods by 45.12% or 0.162 on the F1 metric in cross-dataset tests. Ablation studies show that the CD and SD modules improve the F1 score by 7.98% and 13.38%, respectively, across different backbones. Our method delivers consistent and stable improvements across various experimental protocols. Moreover, it is compatible with many DTL methods in a plug-and-play fashion.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 1","pages":"1279-1292"},"PeriodicalIF":11.1,"publicationDate":"2025-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146049290","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-08-11; DOI: 10.1109/TCSVT.2025.3597604
Likun Gao;Xinhui Xue;Haowen Zheng
Hyperspectral band selection seeks to identify a compact subset of informative spectral channels that preserves task-relevant information while mitigating the storage, transmission, and computational burdens imposed by high-dimensional data. Yet prevailing techniques face two pervasive limitations: (i) scoring- or ranking-based methods assess bands independently, overlooking the joint dependencies that determine their true utility; and (ii) combinatorial search approaches, though theoretically exhaustive, require prohibitive enumeration that is incompatible with the scale and end-to-end nature of modern deep-learning pipelines. We recast band selection as a combinatorial inference problem and propose a task-agnostic framework that embeds a learnable Band Selection Layer equipped with an Expectation-Maximization-driven Sparsity Loss. The E-step efficiently enumerates the expected likelihood of all k-out-of-B band subsets via dynamic programming, thereby making implicit dependencies explicit; the M-step optimises band importances toward a provably k-sparse solution without post-hoc thresholding. Comprehensive theoretical analysis proves the absence of spurious local maxima and guarantees convergence to an exact sparse optimum. Extensive experiments on three public benchmarks (KSC, HT2013, HT2018), two auxiliary tasks (anomaly and target detection), and six classifiers demonstrate that the proposed method consistently surpasses state-of-the-art baselines. The results confirm that EM-guided sparsification not only stabilises the sparsity pattern but also yields interpretable inter-band dependency structures, making the framework a robust and broadly applicable tool for hyperspectral analysis and other sparsity-oriented vision problems.
{"title":"Sparse Hyperspectral Band Selection Based on Expectation Maximization","authors":"Likun Gao;Xinhui Xue;Haowen Zheng","doi":"10.1109/TCSVT.2025.3597604","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3597604","url":null,"abstract":"Hyperspectral band selection seeks to identify a compact subset of informative spectral channels that preserves task–relevant information while mitigating the storage, transmission, and computational burdens imposed by high–dimensional data. Yet prevailing techniques face two pervasive limitations: (i) scoring- or ranking-based methods assess bands independently, overlooking the joint dependency that determine their true utility; and (ii) combinatorial search approaches, though theoretically exhaustive, require prohibitive enumeration that is incompatible with the scale and end-to-end nature of modern deep-learning pipelines. We recast band selection as a combinatorial inference problem and propose a task-agnostic framework that embeds a learnable Band Selection Layer equipped with an Expectation–Maximization–driven Sparsity Loss The E-step efficiently enumerates the expected likelihood of all <italic>k</i>-out-of-<italic>B</i> band subsets via dynamic programming, thereby making implicit dependencies explicit; the M-step optimises band importances toward a provably <italic>k</i>-sparse solution without post-hoc thresholding. Comprehensive theoretical analysis proves the absence of spurious local maxima and guarantees convergence to an exact sparse optimum. Extensive experiments on three public benchmarks (KSC, HT2013, HT2018), two auxiliary tasks (anomaly and target detection), and six classifiers demonstrate that the proposed method consistently surpasses state-of-the-art baselines. The results confirm that EM-guided sparsification not only stabilises the sparsity pattern but also yields interpretable inter-band dependency structures, making the framework a robust and broadly applicable tool for hyperspectral analysis and other sparsity-oriented vision problems.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 1","pages":"1265-1278"},"PeriodicalIF":11.1,"publicationDate":"2025-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146049302","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-08-08; DOI: 10.1109/TCSVT.2025.3597097
Hengchang Wang;Li Liu;Huaxiang Zhang;Lei Zhu;Xiaojun Chang;Hao Du
Image-text matching, a fundamental cross-modal understanding task, presents unique challenges in weakly aligned scenarios. Such data typically feature highly abstract textual captions with sparse entity references, creating a significant semantic gap with the visual content. Current mainstream methods, primarily designed for strongly aligned data pairs, employ dynamic modeling or multi-dimensional similarity computation to map features into a shared space. However, they struggle with information asymmetry and modal heterogeneity in weakly aligned cases. To address this, we propose a Visual Perception Knowledge Enhancement (VPKE) framework. Unlike existing methods based on strong alignment assumptions, this framework mines latent image semantics through vision-language models and generates auxiliary captions, overcoming the information bottleneck of traditional text modalities. Its core innovation lies in an adaptive knowledge distillation mechanism that combines retrieval-augmented generation (RAG) with key entity extraction. This mechanism effectively filters noise when introducing external knowledge while optimizing cross-modal feature integration. The framework employs multi-level similarity evaluation to dynamically adjust fusion weights among the original text, key entities, and auxiliary captions, enabling adaptive integration of diverse semantic features and significantly improving model flexibility. Additionally, multi-scale feature extraction further enhances cross-modal representation capabilities. Experimental results show that the proposed method performs excellently in image-text retrieval tasks on the MSCOCO and Flickr30K datasets, validating its effectiveness.
{"title":"VisualRAG: Knowledge-Guided Retrieval Augmentation for Image-Text Matching","authors":"Hengchang Wang;Li Liu;Huaxiang Zhang;Lei Zhu;Xiaojun Chang;Hao Du","doi":"10.1109/TCSVT.2025.3597097","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3597097","url":null,"abstract":"Image-text matching as a fundamental cross-modal understanding task presents unique challenges in weakly-aligned scenarios. Such data typically feature highly abstract textual captions with sparse entity References, creating a significant semantic gap with visual content. Current mainstream methods, primarily designed for strongly aligned data pairs, employ dynamic modeling or multi-dimensional similarity computation to achieve feature space mapping. However, they struggle with information asymmetry and modal heterogeneity in weakly aligned cases. To address this, we propose a Visual Perception Knowledge Enhancement (VPKE) framework. Unlike existing methods based on strong alignment assumptions, this framework mines latent image semantics through vision-language models and generates auxiliary captions, overcoming the information bottleneck of traditional text modalities. Its core innovation lies in an adaptive knowledge distillation mechanism that combines retrieval-augmented generation (RAG) with key entity extraction. This mechanism effectively filters noise when introducing external knowledge while optimizing cross-modal feature integration. The framework employs multi-level similarity evaluation to dynamically adjust fusion weights among original text, key entities, and auxiliary captions, enabling adaptive integration of diverse semantic features and significantly improving model flexibility. Additionally, multi-scale feature extraction further enhances cross-modal representation capabilities. Experimental results show that the proposed method performs excellently in image-text retrieval tasks on the MSCOCO and Flickr30K datasets, validating its effectiveness.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 1","pages":"1234-1248"},"PeriodicalIF":11.1,"publicationDate":"2025-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146049291","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-08-07; DOI: 10.1109/TCSVT.2025.3596636
Nuowen Kan;Chenglin Li;Yuankun Jiang;Wenrui Dai;Junni Zou;Hongkai Xiong;Laura Toni
Adaptive bitrate (ABR) streaming is a popular technique for improving the quality of experience (QoE) of users who watch videos online; for example, it can provide smoother video playback by dynamically adjusting the requested video quality and associated bitrate according to constrained yet diverse network conditions. Recently, learning-based ABR algorithms have achieved notable performance gains with lower inference overhead than conventional heuristic or model-based baselines. However, their performance may degrade significantly in an unseen network environment with time-varying and heterogeneous throughput dynamics. For better generalization, in this paper, we propose a meta-reinforcement learning (meta-RL)-based neural ABR algorithm that is able to quickly adapt its policy to these unseen throughput dynamics. Specifically, we propose a model-free system framework comprising an inference network and a policy network. The inference network infers the distribution of the latent representation of the underlying dynamics based on the recent throughput context, while the policy network is trained to quickly adapt to the changing throughput dynamics with the sampled latent representation. To effectively learn the inference network and meta-policy on the mixed dynamics of practical ABR scenarios, we further design a variational information bottleneck theory-based loss function for training the inference and policy networks, whose objective is to strike a trade-off between the brevity of the latent representation and the expressiveness of the meta-policy. We also derive a theoretically necessary condition for the bitrate versions that yield higher long-term QoE, based on which a dynamic action pruning strategy is further developed for practical implementation. This pruning strategy not only prevents unsafe policy outputs in the midst of unseen throughput dynamics, but may also reduce the computational complexity of model-based ABR algorithms. Finally, the meta-training and meta-adaptation procedures of the proposed algorithm are implemented across a range of throughput dynamics. Empirical evaluations on various datasets containing real-world network traces verify that our algorithm surpasses state-of-the-art ABR algorithms, particularly in terms of average chunk QoE and fast adaptation to out-of-distribution throughput traces.
{"title":"MERINA+: Improving Generalization for Neural Video Adaptation via Information-Theoretic Meta-Reinforcement Learning","authors":"Nuowen Kan;Chenglin Li;Yuankun Jiang;Wenrui Dai;Junni Zou;Hongkai Xiong;Laura Toni","doi":"10.1109/TCSVT.2025.3596636","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3596636","url":null,"abstract":"Adaptive bitrate (ABR) streaming is a popular technique used to improve the quality of experience (QoE) for users who watch videos online, which, for example, can provide a smoother video playback by dynamically adjusting the requested video quality with associated bitrate according to the constrained yet diverse network conditions. Recently, learning-based ABR algorithms have achieved a notable performance gain with lower inference overhead than the conventional heuristic or model-based baselines. However, their performance may degrade significantly in an unseen network environment with time-varying and heterogeneous throughput dynamics. For a better generalization, in this paper, we propose a meta-reinforcement learning (meta-RL)-based neural ABR algorithm that is able to quickly adapt its policy to these unseen throughput dynamics. Specifically, we propose a model-free system framework comprising an inference network and a policy network. The inference network infers distribution of the latent representation for underlying dynamics based on the recent throughout context, while the policy network is trained to quickly adapt to the changing throughout dynamics with the sampled latent representation. To effectively learn the inference network and meta-policy on mixed dynamics of the practical ABR scenarios, we further design a variational information bottleneck theory-based loss function for training the inference and policy networks, whose objective is to strike a trade-off between brevity of the latent representation and expressiveness of the meta-policy. We also derive a theoretically necessary condition for the bitrate versions that yield higher long-term QoE, based on which a dynamic action pruning strategy is further developed for practical implementation. This pruning strategy can not only prevent unsafe policy outputs in midst of unseen throughput dynamics, but may also reduce the computational complexity of model-based ABR algorithms. Finally, the meta-training and meta-adaptation procedures of our proposed algorithm are implemented across a range of throughput dynamics. The empirical evaluations on various datasets containing real-world network traces verify that our algorithm surpasses the state-of-the-art ABR algorithms, particularly in terms of the average chunk QoE and fast adaptation across out-of-distribution throughput traces.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 1","pages":"1185-1202"},"PeriodicalIF":11.1,"publicationDate":"2025-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146049281","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-08-07; DOI: 10.1109/TCSVT.2025.3596815
Shanzhi Yin;Bolin Chen;Shiqi Wang;Yan Ye
In this paper, we propose a novel Multi-granularity Temporal Trajectory Factorization (MTTF) framework for generative human video compression, which holds great potential for bandwidth-constrained human-centric video communication. In particular, the proposed multi-granularity feature factorization strategy implicitly characterizes the high-dimensional visual signal as compact motion vectors for representation compactness and further transforms these vectors into fine-grained fields for motion expressibility. As such, the coded bit-stream can carry sufficient visual motion information at the lowest representation cost. Meanwhile, a resolution-expandable generative module is developed with enhanced background stability, such that the proposed framework can be optimized toward higher reconstruction robustness and more flexible resolution adaptation. Experimental results show that the proposed method outperforms the latest generative models and the state-of-the-art video coding standard Versatile Video Coding (VVC) on both talking-face and moving-body videos in terms of both objective and subjective quality. The project page can be found at https://github.com/xyzysz/Extreme-Human-Video-Compression-with-MTTF
{"title":"Generative Human Video Compression With Multi-Granularity Temporal Trajectory Factorization","authors":"Shanzhi Yin;Bolin Chen;Shiqi Wang;Yan Ye","doi":"10.1109/TCSVT.2025.3596815","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3596815","url":null,"abstract":"In this paper, we propose a novel Multi-granularity Temporal Trajectory Factorization (MTTF) framework for generative human video compression, which holds great potential for bandwidth-constrained human-centric video communication. In particular, the proposed multi-granularity feature factorization strategy can facilitate to implicitly characterize the high-dimensional visual signal into compact motion vectors for representation compactness and further transform these vectors into fine-grained fields for motion expressibility. As such, the coded bit-stream can be entailed with enough visual motion information at the lowest representation cost. Meanwhile, a resolution-expandable generative module is developed with enhanced background stability, such that the proposed framework can be optimized towards higher reconstruction robustness and more flexible resolution adaptation. Experimental results show that proposed method outperforms latest generative models and the state-of-the-art video coding standard Versatile Video Coding (VVC) on both talking-face videos and moving-body videos in terms of both objective and subjective quality. The project page can be found at <uri>https://github.com/xyzysz/Extreme-Human-Video-Compression-with-MTTF</uri>","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 1","pages":"1089-1103"},"PeriodicalIF":11.1,"publicationDate":"2025-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146049271","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Medical image analysis plays a key role in computer-aided diagnosis, where segmentation and classification are essential and interconnected tasks. While multi-task learning (MTL) has been widely explored to leverage inter-task synergies, effectively guiding knowledge transfer to prevent task conflict and negative transfer remains a key challenge, particularly in anatomically complex diagnostic scenarios. This paper presents LTRMTL-Net, a novel multi-task learning framework for medical image analysis that simultaneously addresses segmentation and classification tasks, guided by lesion regions and the spatial relationships of tissues. The proposed architecture integrates an Enhanced Lesion Region Fusion (ELRF) module that leverages GradCAM-guided attention mechanisms to precisely locate and enhance lesion regions, providing critical prior knowledge for both tasks. A Tissue Space Structure Prediction (TSSP) component captures local-global spatial dependencies through contrastive learning, establishing effective anatomical context modeling. The core encoder employs Hybrid Wavelet-State Attention blocks that combine modulated wavelet transform convolutions with structured state space models to extract multi-scale features while maintaining computational efficiency. Dual-stream inputs with a symmetric architecture accommodate single-source scenarios across diverse medical imaging applications. Experimental results on mammography and breast ultrasound datasets demonstrate that the proposed method captures fine-grained lesion boundary details while providing accurate malignancy classification. Harnessing cooperative knowledge transfer between segmentation and classification, guided by anatomical priors, boosts diagnostic performance and provides comprehensive, interpretable clinical insights.
{"title":"Multi-Task Learning Network for Medical Image Analysis Guided by Lesion Regions and Spatial Relationships of Tissues","authors":"Guowei Dai;Duwei Dai;Chaoyu Wang;Qingfeng Tang;Matthew Hamilton;Hu Chen;Yi Zhang","doi":"10.1109/TCSVT.2025.3596803","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3596803","url":null,"abstract":"Medical image analysis plays key role in computer-aided diagnosis, where segmentation and classification are essential and interconnected tasks. While multi-task learning (MTL) has been widely explored to leverage inter-task synergies, effectively guiding knowledge transfer to prevent task conflict and negative transfer remains a key challenge, particularly in anatomically complex diagnostic scenarios. This paper presents LTRMTL-Net, a novel multi-task learning framework for medical image analysis that simultaneously addresses segmentation and classification tasks guided by lesion regions and spatial relationships of tissues. The proposed architecture integrates an Enhanced Lesion Region Fusion (ELRF) module that leverages GradCAM-guided attention mechanisms to precisely locate and enhance lesion regions, providing critical prior knowledge for both tasks. Tissue Space Structure Prediction (TSSP) component captures local-global spatial dependencies through contrastive learning, establishing effective anatomical context modeling. The core encoder employs Hybrid Wavelet-State Attention blocks that combine modulated wavelet transform convolutions with structured state space models to extract multi-scale features while maintaining computational efficiency. Dual-stream inputs with symmetric architecture accommodate single-source scenarios across diverse medical imaging applications. Experimental results on mammography and breast ultrasound datasets demonstrate that the proposed method captures fine-grained lesion boundary details while providing accurate malignancy classification. Harnessing cooperative knowledge transfer between segmentation and classification, guided by anatomical priors, boosts diagnostic performance and provides comprehensive, interpretable clinical insights.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 1","pages":"1249-1264"},"PeriodicalIF":11.1,"publicationDate":"2025-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146049279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}