
Latest Publications: IEEE Transactions on Circuits and Systems for Video Technology

An End-to-End Framework for Joint Makeup Style Transfer and Image Steganography
IF 11.1 | CAS Zone 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-08-18 | DOI: 10.1109/TCSVT.2025.3599551
Meihong Yang;Ziyi Feng;Bin Ma;Jian Xu;Yongjin Xian;Linna Zhou
Existing image steganography schemes often introduce obvious modification traces into the cover image, creating a risk of secret information leakage. To address this issue, an end-to-end framework for joint makeup style transfer and image steganography is proposed in this paper to achieve imperceptible, higher-capacity data hiding. In the scheme, a Parsing-guided Semantic Feature Alignment (PSFA) module is designed to transfer the style of a makeup image to a target non-makeup image, thereby generating a content-style integrated feature matrix. Meanwhile, a Multi-Scale Feature Fusion and Data Embedding (MFFDE) module is devised to encode the secret image into latent features and fuse them with the generated content-style integrated feature matrix, as well as the non-makeup image features across multiple scales, to produce the makeup-stego image. As a result, the style of the makeup image is faithfully transferred and the secret image is imperceptibly embedded at the same time, without directly modifying the pixels of the original non-makeup image. Additionally, a Residual-aware Information Compensation Network (RICN) is developed to compensate for the loss of the secret image arising from the multilevel data embedding, thereby further enhancing the quality of the reconstructed secret image. Experimental results show that the proposed scheme achieves superior steganalysis resistance and visual quality in both makeup-stego images and recovered secret images, compared with other state-of-the-art schemes.
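The abstract describes the MFFDE module only at a high level. As a hedged, hypothetical illustration of the multi-scale fusion idea (not the authors' implementation), the PyTorch sketch below resizes a secret-image latent to each feature scale and fuses it with cover-side features by concatenation and convolution; the class name, channel sizes, and the fusion-by-concatenation design are assumptions.

```python
# Hedged sketch (not the authors' code): fusing a secret-image latent with
# content-style features at multiple scales, in the spirit of the MFFDE module.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusionEmbed(nn.Module):
    def __init__(self, channels=(64, 128, 256), secret_dim=64):
        super().__init__()
        # One 3x3 conv per scale fuses cover features with the resized secret latent.
        self.fuse = nn.ModuleList(
            nn.Conv2d(c + secret_dim, c, kernel_size=3, padding=1) for c in channels
        )

    def forward(self, cover_feats, secret_latent):
        # cover_feats: list of tensors [B, C_i, H_i, W_i] at decreasing resolution
        # secret_latent: tensor [B, secret_dim, h, w] encoding the secret image
        fused = []
        for conv, feat in zip(self.fuse, cover_feats):
            s = F.interpolate(secret_latent, size=feat.shape[-2:], mode="bilinear",
                              align_corners=False)          # match spatial size
            fused.append(conv(torch.cat([feat, s], dim=1)))  # embed secret into features
        return fused

# Toy usage with random tensors.
feats = [torch.randn(1, c, 64 // (2 ** i), 64 // (2 ** i))
         for i, c in enumerate((64, 128, 256))]
secret = torch.randn(1, 64, 16, 16)
out = MultiScaleFusionEmbed()(feats, secret)
print([o.shape for o in out])
```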
IEEE Transactions on Circuits and Systems for Video Technology, vol. 36, no. 1, pp. 1293-1308.
Citations: 0
Learning an Adaptive and View-Invariant Vision Transformer for Real-Time UAV Tracking
IF 11.1 | CAS Zone 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-08-18 | DOI: 10.1109/TCSVT.2025.3599856
You Wu;Yongxin Li;Mengyuan Liu;Xucheng Wang;Xiangyang Yang;Hengzhou Ye;Dan Zeng;Qijun Zhao;Shuiwang Li
Transformer-based models have improved visual tracking, but most still cannot run in real time on resource-limited devices, especially for unmanned aerial vehicle (UAV) tracking. To achieve a better balance between performance and efficiency, we propose AVTrack, an adaptive computation tracking framework that adaptively activates transformer blocks through an Activation Module (AM), which dynamically optimizes the ViT architecture by selectively engaging relevant components. To address extreme viewpoint variations, we propose to learn view-invariant representations via mutual information (MI) maximization. In addition, we propose AVTrack-MD, an enhanced tracker incorporating a novel MI-maximization-based multi-teacher knowledge distillation framework. Leveraging multiple off-the-shelf AVTrack models as teachers, we maximize the MI between their aggregated softened features and the corresponding softened feature of the student model, improving the generalization and performance of the student, especially under noisy conditions. Extensive experiments show that AVTrack-MD achieves performance comparable to that of AVTrack while reducing model complexity and boosting average tracking speed by over 17%. Code is available at https://github.com/wuyou3474/AVTrack
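As a hedged illustration of the adaptive-computation idea behind the Activation Module (a sketch under assumed details, not the AVTrack code), the snippet below gates a standard transformer encoder layer with a small learned head: during training the block output is blended according to the predicted activation probability, and at inference low-probability blocks are skipped outright. The gate design, dimensions, and threshold are assumptions.

```python
# Hedged sketch (assumption, not the AVTrack code): gating each transformer block
# with a tiny learned "activation module" so blocks can be skipped at inference.
import torch
import torch.nn as nn

class GatedBlock(nn.Module):
    def __init__(self, dim=192, heads=3):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                                batch_first=True)
        self.gate = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 1))

    def forward(self, x, threshold=0.5):
        # Decide from the mean token whether this block is worth running.
        p = torch.sigmoid(self.gate(x.mean(dim=1)))          # [B, 1] activation prob.
        if self.training:
            return p.unsqueeze(1) * self.block(x) + (1 - p.unsqueeze(1)) * x
        # At inference, hard-skip inactive blocks to save computation.
        return self.block(x) if p.mean().item() > threshold else x

tokens = torch.randn(2, 196, 192)                             # [batch, tokens, dim]
layer = GatedBlock().eval()
with torch.no_grad():
    out = layer(tokens)
print(out.shape)
```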
IEEE Transactions on Circuits and Systems for Video Technology, vol. 36, no. 2, pp. 2403-2418.
Citations: 0
Fine-Detailed Facial Sketch-to-Photo Synthesis With Detail-Enhanced Codebook Priors
IF 11.1 | CAS Zone 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-08-12 | DOI: 10.1109/TCSVT.2025.3598016
Mingrui Zhu;Jianhang Chen;Xin Wei;Nannan Wang;Xinbo Gao
Generating high-quality facial photos from fine-detailed sketches is a long-standing research topic that remains unsolved. The scarcity of large-scale paired data, due to the cost of acquiring hand-drawn sketches, poses a major challenge. Existing methods either lose identity information through oversimplified representations or rely on costly inversion and strict alignment when using StyleGAN-based priors, limiting their practical applicability. Our primary finding in this work is that a discrete codebook and decoder trained through self-reconstruction in the photo domain can learn rich priors, helping to reduce ambiguity in cross-domain mapping even with current small-scale paired datasets. Based on this, a cross-domain mapping network can be constructed directly. However, empirical findings indicate that using the discrete codebook for cross-domain mapping often results in unrealistic textures and distorted spatial layouts. Therefore, we propose a Hierarchical Adaptive Texture-Spatial Correction (HATSC) module to correct the flaws in texture and spatial layout. In addition, we introduce a Saliency-based Key Details Enhancement (SKDE) module to further enhance synthesis quality. Overall, we present a “reconstruct-cross-enhance” pipeline for synthesizing facial photos from fine-detailed sketches. Experiments demonstrate that our method generates high-quality facial photos and significantly outperforms previous approaches across a wide range of challenging benchmarks. The code is publicly available at: https://github.com/Gardenia-chen/DECP
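For readers unfamiliar with codebook priors, the hedged sketch below shows the basic operation such priors rely on: nearest-neighbor lookup of continuous encoder features in a learned discrete codebook with a straight-through gradient. Codebook size, feature dimension, and the class name are assumptions; the paper's HATSC and SKDE modules are not modeled here.

```python
# Hedged sketch (assumption): the nearest-neighbor lookup at the heart of a
# discrete codebook prior, mapping continuous features to learned code entries.
import torch
import torch.nn as nn

class CodebookLookup(nn.Module):
    def __init__(self, num_codes=1024, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)  # learned in the photo domain

    def forward(self, feats):
        # feats: [B, C, H, W] continuous features from the encoder
        b, c, h, w = feats.shape
        flat = feats.permute(0, 2, 3, 1).reshape(-1, c)            # [B*H*W, C]
        dists = torch.cdist(flat, self.codebook.weight)            # L2 to every code
        idx = dists.argmin(dim=1)                                  # nearest code index
        quant = self.codebook(idx).reshape(b, h, w, c).permute(0, 3, 1, 2)
        # Straight-through estimator keeps gradients flowing to the encoder.
        return feats + (quant - feats).detach(), idx.reshape(b, h, w)

x = torch.randn(1, 256, 16, 16)
quantized, indices = CodebookLookup()(x)
print(quantized.shape, indices.shape)
```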
IEEE Transactions on Circuits and Systems for Video Technology, vol. 36, no. 1, pp. 1075-1088.
Citations: 0
Point Cloud Attribute Compression With Geometry-Aware Lifting-Based Multiscale Networks
IF 11.1 | CAS Zone 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-08-11 | DOI: 10.1109/TCSVT.2025.3597448
Xin Li;Shaohui Li;Wenrui Dai;Han Li;Nuowen Kan;Chenglin Li;Junni Zou;Hongkai Xiong
Point cloud attribute compression is challenged by the need to fit attribute signals that live on irregular geometric structures. Existing methods, whether based on handcrafted transforms or deep learning, cannot achieve a compact multiscale representation for high-fidelity reconstruction. In this paper, we propose a novel geometry-aware lifting-based multiscale network built on a spatial-channel lifting scheme for point cloud attribute compression. The proposed network cascades geometry-aware spatial lifting, which reduces spatial redundancy by adaptively capturing irregular geometric structures, with progressive channel lifting, which progressively reduces channel-wise redundancy in the multiscale representation. Furthermore, we design the split, predict, and update operations of geometry-aware spatial lifting to fully exploit the geometry information representing irregular structures. We develop a geometry-aware adaptive split that equally splits input points using significance scores indicating their dependencies, and propose geometry-aware cross-attention filtering for the predict and update operations to decorrelate based on geometry information. To the best of our knowledge, this paper achieves the first lifting-based learned transform for point cloud compression that enjoys reversibility guarantees of the multiscale representation to enhance rate-distortion performance. Experimental results show that the proposed framework achieves state-of-the-art performance on extensive point cloud datasets, outperforming the latest MPEG G-PCC standard and recent deep learning-based methods.
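As background for the split, predict, and update operations mentioned above, the sketch below shows a plain 1-D lifting step on a regular signal together with its exact inverse, illustrating the reversibility property the multiscale representation relies on. The geometry-aware split, cross-attention prediction, and update operators of the actual network are not reproduced; this is a generic textbook lifting step.

```python
# Hedged sketch (assumption): a plain 1-D lifting step (split, predict, update),
# which the paper generalizes to geometry-aware lifting on point cloud attributes.
import numpy as np

def lifting_forward(signal):
    even, odd = signal[0::2], signal[1::2]      # split into two cosets
    detail = odd - even                          # predict odd samples from even ones
    approx = even + 0.5 * detail                 # update preserves the running mean
    return approx, detail

def lifting_inverse(approx, detail):
    even = approx - 0.5 * detail                 # undo the update
    odd = even + detail                          # undo the prediction
    out = np.empty(even.size + odd.size)
    out[0::2], out[1::2] = even, odd             # merge cosets back
    return out

x = np.arange(8, dtype=float)
a, d = lifting_forward(x)
assert np.allclose(lifting_inverse(a, d), x)     # perfect reconstruction (reversibility)
print(a, d)
```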
IEEE Transactions on Circuits and Systems for Video Technology, vol. 36, no. 1, pp. 1143-1159.
Citations: 0
Generalized Document Tampering Localization via Color and Semantic Disentanglement
IF 11.1 | CAS Zone 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-08-11 | DOI: 10.1109/TCSVT.2025.3597602
Shiqiang Zheng;Changsheng Chen;Shen Chen;Taiping Yao;Shouhong Ding;Bin Li;Jiwu Huang
Document images are vulnerable to tampering attacks from image editing tools and deep models. Therefore, the Document Tampering Localization (DTL) task has received increasing attention in recent years. However, given the wide variety of document types (e.g., contracts, certificates, ID cards), our analysis shows that existing DTL methods struggle with document images containing diverse background colors and varying semantic contents. Further analysis and experiments verify that varying background colors and semantic contents interfere with the forensic feature extraction process of existing DTL methods. To address this issue, we propose two disentanglement modules that mitigate such interference and improve the detection of forgery traces. First, we design a Color Disentanglement (CD) module that applies disentangled representation learning to forensic features. The CD module, grounded in real-world prior knowledge, effectively decouples color information from forensic features, thereby improving robustness against varying background colors. Second, we propose a Semantic Disentanglement (SD) module, which performs image-level clustering on the tampering probability map during inference. The SD module focuses on per-pixel tampering probabilities while discarding local semantic information (e.g., font, location, and shape), which leads to strong robustness against variations in document content. Evaluations demonstrate that our CD-SD method outperforms existing methods by 45.12% or 0.162 on the F1 metric in cross-dataset tests. Ablation studies show that the CD and SD modules improve the F1 score by 7.98% and 13.38%, respectively, across different backbones. Our method delivers consistent and stable improvements across various experimental protocols. Moreover, it is compatible with many DTL methods in a plug-and-play fashion.
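As a hedged sketch of what image-level clustering of a tampering probability map can look like (the SD module's exact procedure is not given in the abstract), the snippet below runs a two-centroid 1-D k-means over per-pixel probabilities and returns a binary tampering mask; the initialization and iteration count are assumptions.

```python
# Hedged sketch (assumption): two-way clustering of a tampering probability map,
# in the spirit of the image-level clustering used by the SD module at inference.
import numpy as np

def cluster_probability_map(prob_map, iters=20):
    # prob_map: [H, W] per-pixel tampering probabilities in [0, 1]
    p = prob_map.ravel()
    lo, hi = p.min(), p.max()                    # initial 1-D k-means centroids
    for _ in range(iters):
        assign = np.abs(p - lo) > np.abs(p - hi) # True -> closer to the "tampered" centre
        if assign.any() and (~assign).any():
            lo, hi = p[~assign].mean(), p[assign].mean()
    return assign.reshape(prob_map.shape)        # binary tampering mask

rng = np.random.default_rng(0)
pm = np.clip(rng.normal(0.1, 0.05, (64, 64)), 0, 1)
pm[20:30, 20:40] = np.clip(rng.normal(0.9, 0.05, (10, 20)), 0, 1)  # a tampered patch
mask = cluster_probability_map(pm)
print(mask.sum(), "pixels flagged as tampered")
```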
IEEE Transactions on Circuits and Systems for Video Technology, vol. 36, no. 1, pp. 1279-1292.
Citations: 0
Sparse Hyperspectral Band Selection Based on Expectation Maximization
IF 11.1 | CAS Zone 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-08-11 | DOI: 10.1109/TCSVT.2025.3597604
Likun Gao;Xinhui Xue;Haowen Zheng
Hyperspectral band selection seeks to identify a compact subset of informative spectral channels that preserves task-relevant information while mitigating the storage, transmission, and computational burdens imposed by high-dimensional data. Yet prevailing techniques face two pervasive limitations: (i) scoring- or ranking-based methods assess bands independently, overlooking the joint dependencies that determine their true utility; and (ii) combinatorial search approaches, though theoretically exhaustive, require prohibitive enumeration that is incompatible with the scale and end-to-end nature of modern deep-learning pipelines. We recast band selection as a combinatorial inference problem and propose a task-agnostic framework that embeds a learnable Band Selection Layer equipped with an Expectation-Maximization-driven sparsity loss. The E-step efficiently enumerates the expected likelihood of all k-out-of-B band subsets via dynamic programming, thereby making implicit dependencies explicit; the M-step optimizes band importances toward a provably k-sparse solution without post-hoc thresholding. Comprehensive theoretical analysis proves the absence of spurious local maxima and guarantees convergence to an exact sparse optimum. Extensive experiments on three public benchmarks (KSC, HT2013, HT2018), two auxiliary tasks (anomaly and target detection), and six classifiers demonstrate that the proposed method consistently surpasses state-of-the-art baselines. The results confirm that EM-guided sparsification not only stabilizes the sparsity pattern but also yields interpretable inter-band dependency structures, making the framework a robust and broadly applicable tool for hyperspectral analysis and other sparsity-oriented vision problems.
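The E-step's key trick, summing a per-band score over all k-out-of-B subsets without enumerating them, can be illustrated with the standard dynamic program for elementary symmetric polynomials, sketched below under assumed per-band weights. Only the O(Bk) recursion is the point; the actual likelihood the paper accumulates is not reproduced.

```python
# Hedged sketch (assumption): the dynamic program that sums a product score over all
# k-out-of-B band subsets (an elementary symmetric polynomial), the kind of
# enumeration the E-step needs without listing subsets explicitly.
import numpy as np

def subset_score_dp(weights, k):
    # weights: per-band non-negative scores w_1..w_B
    # returns S[j] = sum over all subsets of size j of the product of their weights
    B = len(weights)
    S = np.zeros(k + 1)
    S[0] = 1.0
    for w in weights:                                # add bands one at a time
        for j in range(min(k, B), 0, -1):            # descending j avoids reusing w
            S[j] += w * S[j - 1]
    return S

w = np.array([0.5, 2.0, 1.0, 0.25])
print(subset_score_dp(w, 2)[2])                      # sum over all 6 pairs of products
# brute-force check: 0.5*2 + 0.5*1 + 0.5*0.25 + 2*1 + 2*0.25 + 1*0.25 = 4.375
```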
IEEE Transactions on Circuits and Systems for Video Technology, vol. 36, no. 1, pp. 1265-1278.
Citations: 0
VisualRAG: Knowledge-Guided Retrieval Augmentation for Image-Text Matching
IF 11.1 | CAS Zone 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-08-08 | DOI: 10.1109/TCSVT.2025.3597097
Hengchang Wang;Li Liu;Huaxiang Zhang;Lei Zhu;Xiaojun Chang;Hao Du
Image-text matching, as a fundamental cross-modal understanding task, presents unique challenges in weakly aligned scenarios. Such data typically feature highly abstract textual captions with sparse entity references, creating a significant semantic gap with the visual content. Current mainstream methods, primarily designed for strongly aligned data pairs, employ dynamic modeling or multi-dimensional similarity computation to achieve feature-space mapping. However, they struggle with information asymmetry and modal heterogeneity in weakly aligned cases. To address this, we propose a Visual Perception Knowledge Enhancement (VPKE) framework. Unlike existing methods based on strong alignment assumptions, this framework mines latent image semantics through vision-language models and generates auxiliary captions, overcoming the information bottleneck of traditional text modalities. Its core innovation lies in an adaptive knowledge distillation mechanism that combines retrieval-augmented generation (RAG) with key entity extraction. This mechanism effectively filters noise when introducing external knowledge while optimizing cross-modal feature integration. The framework employs multi-level similarity evaluation to dynamically adjust fusion weights among the original text, key entities, and auxiliary captions, enabling adaptive integration of diverse semantic features and significantly improving model flexibility. Additionally, multi-scale feature extraction further enhances cross-modal representation capabilities. Experimental results show that the proposed method performs excellently in image-text retrieval tasks on the MSCOCO and Flickr30K datasets, validating its effectiveness.
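To make the adaptive fusion idea concrete, the hedged sketch below weights the original-text, key-entity, and auxiliary-caption embeddings by their cosine similarity to the image embedding and fuses them with a softmax. The temperature, dimensionality, and function name are illustrative assumptions rather than the paper's definition of multi-level similarity evaluation.

```python
# Hedged sketch (assumption): similarity-driven fusion of original text, key-entity,
# and auxiliary-caption embeddings, illustrating adaptive fusion weights.
import torch
import torch.nn.functional as F

def fuse_text_features(image_emb, text_emb, entity_emb, caption_emb, temperature=0.07):
    # image_emb: [B, D]; the three text-side embeddings: [B, D] each
    streams = torch.stack([text_emb, entity_emb, caption_emb], dim=1)   # [B, 3, D]
    streams = F.normalize(streams, dim=-1)
    img = F.normalize(image_emb, dim=-1).unsqueeze(1)                    # [B, 1, D]
    sims = (streams * img).sum(dim=-1) / temperature                     # [B, 3]
    weights = sims.softmax(dim=-1).unsqueeze(-1)                         # fusion weights
    return (weights * streams).sum(dim=1)                                # [B, D]

B, D = 4, 512
fused = fuse_text_features(torch.randn(B, D), torch.randn(B, D),
                           torch.randn(B, D), torch.randn(B, D))
print(fused.shape)
```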
IEEE Transactions on Circuits and Systems for Video Technology, vol. 36, no. 1, pp. 1234-1248.
Citations: 0
MERINA+: Improving Generalization for Neural Video Adaptation via Information-Theoretic Meta-Reinforcement Learning
IF 11.1 | CAS Zone 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-08-07 | DOI: 10.1109/TCSVT.2025.3596636
Nuowen Kan;Chenglin Li;Yuankun Jiang;Wenrui Dai;Junni Zou;Hongkai Xiong;Laura Toni
Adaptive bitrate (ABR) streaming is a popular technique for improving the quality of experience (QoE) of users who watch videos online; for example, it can provide smoother video playback by dynamically adjusting the requested video quality and associated bitrate according to constrained yet diverse network conditions. Recently, learning-based ABR algorithms have achieved a notable performance gain with lower inference overhead than conventional heuristic or model-based baselines. However, their performance may degrade significantly in an unseen network environment with time-varying and heterogeneous throughput dynamics. For better generalization, in this paper we propose a meta-reinforcement learning (meta-RL)-based neural ABR algorithm that is able to quickly adapt its policy to these unseen throughput dynamics. Specifically, we propose a model-free system framework comprising an inference network and a policy network. The inference network infers the distribution of the latent representation of the underlying dynamics from the recent throughput context, while the policy network is trained to quickly adapt to the changing throughput dynamics given the sampled latent representation. To effectively learn the inference network and meta-policy on the mixed dynamics of practical ABR scenarios, we further design a variational information bottleneck theory-based loss function for training the inference and policy networks, whose objective is to strike a trade-off between the brevity of the latent representation and the expressiveness of the meta-policy. We also derive a theoretically necessary condition for the bitrate versions that yield higher long-term QoE, based on which a dynamic action pruning strategy is further developed for practical implementation. This pruning strategy not only prevents unsafe policy outputs amid unseen throughput dynamics, but may also reduce the computational complexity of model-based ABR algorithms. Finally, the meta-training and meta-adaptation procedures of the proposed algorithm are implemented across a range of throughput dynamics. Empirical evaluations on various datasets containing real-world network traces verify that our algorithm surpasses state-of-the-art ABR algorithms, particularly in terms of average chunk QoE and fast adaptation across out-of-distribution throughput traces.
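The stated trade-off can be written as a loss of the form L = L_policy + beta * KL(q(z | context) || N(0, I)). The hedged sketch below implements that information-bottleneck term for a diagonal-Gaussian inference network; the beta value and tensor shapes are illustrative assumptions, not the paper's exact objective.

```python
# Hedged sketch (assumption): an information-bottleneck-style training loss that
# trades policy performance against the KL "brevity" of the inferred latent.
import torch

def vib_meta_loss(policy_loss, mu, logvar, beta=1e-3):
    # mu, logvar: parameters of q(z | context), a diagonal Gaussian per task
    # KL( q(z|context) || N(0, I) ), summed over latent dims, averaged over the batch
    kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=-1).mean()
    return policy_loss + beta * kl               # smaller beta -> more expressive latent

mu = torch.zeros(8, 16, requires_grad=True)
logvar = torch.zeros(8, 16)
loss = vib_meta_loss(policy_loss=torch.tensor(1.23), mu=mu, logvar=logvar)
loss.backward()
print(loss.item())
```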
IEEE Transactions on Circuits and Systems for Video Technology, vol. 36, no. 1, pp. 1185-1202.
Citations: 0
Generative Human Video Compression With Multi-Granularity Temporal Trajectory Factorization
IF 11.1 | CAS Zone 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-08-07 | DOI: 10.1109/TCSVT.2025.3596815
Shanzhi Yin;Bolin Chen;Shiqi Wang;Yan Ye
In this paper, we propose a novel Multi-granularity Temporal Trajectory Factorization (MTTF) framework for generative human video compression, which holds great potential for bandwidth-constrained human-centric video communication. In particular, the proposed multi-granularity feature factorization strategy implicitly characterizes the high-dimensional visual signal as compact motion vectors for representation compactness, and further transforms these vectors into fine-grained fields for motion expressibility. As such, the coded bit-stream can carry sufficient visual motion information at the lowest representation cost. Meanwhile, a resolution-expandable generative module with enhanced background stability is developed, such that the proposed framework can be optimized toward higher reconstruction robustness and more flexible resolution adaptation. Experimental results show that the proposed method outperforms the latest generative models and the state-of-the-art video coding standard Versatile Video Coding (VVC) on both talking-face and moving-body videos in terms of both objective and subjective quality. The project page can be found at https://github.com/xyzysz/Extreme-Human-Video-Compression-with-MTTF
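As a rough, hypothetical illustration of decoding a compact motion representation into a dense field (the MTTF decoder itself is not specified in the abstract), the sketch below maps a small set of assumed trajectory vectors to a coarse flow and upsamples it with a lightweight refinement head; all sizes and names are placeholders.

```python
# Hedged sketch (assumption): decoding a handful of compact motion vectors into a
# dense 2-D motion field, illustrating the trajectory-factorization idea.
import torch
import torch.nn as nn

class MotionFieldDecoder(nn.Module):
    def __init__(self, num_vectors=16, vec_dim=4, out_size=64):
        super().__init__()
        self.fc = nn.Linear(num_vectors * vec_dim, 2 * 8 * 8)   # coarse 8x8 flow
        self.refine = nn.Sequential(
            nn.Upsample(size=(out_size, out_size), mode="bilinear", align_corners=False),
            nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 2, 3, padding=1),                     # (dx, dy) per pixel
        )

    def forward(self, vectors):
        # vectors: [B, num_vectors, vec_dim] compact motion representation
        coarse = self.fc(vectors.flatten(1)).view(-1, 2, 8, 8)
        return self.refine(coarse)                              # [B, 2, H, W] dense field

field = MotionFieldDecoder()(torch.randn(1, 16, 4))
print(field.shape)   # torch.Size([1, 2, 64, 64])
```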
IEEE Transactions on Circuits and Systems for Video Technology, vol. 36, no. 1, pp. 1089-1103.
Citations: 0
Multi-Task Learning Network for Medical Image Analysis Guided by Lesion Regions and Spatial Relationships of Tissues
IF 11.1 | CAS Zone 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-08-07 | DOI: 10.1109/TCSVT.2025.3596803
Guowei Dai;Duwei Dai;Chaoyu Wang;Qingfeng Tang;Matthew Hamilton;Hu Chen;Yi Zhang
Medical image analysis plays a key role in computer-aided diagnosis, where segmentation and classification are essential and interconnected tasks. While multi-task learning (MTL) has been widely explored to leverage inter-task synergies, effectively guiding knowledge transfer to prevent task conflict and negative transfer remains a key challenge, particularly in anatomically complex diagnostic scenarios. This paper presents LTRMTL-Net, a novel multi-task learning framework for medical image analysis that simultaneously addresses segmentation and classification tasks guided by lesion regions and the spatial relationships of tissues. The proposed architecture integrates an Enhanced Lesion Region Fusion (ELRF) module that leverages GradCAM-guided attention mechanisms to precisely locate and enhance lesion regions, providing critical prior knowledge for both tasks. A Tissue Space Structure Prediction (TSSP) component captures local-global spatial dependencies through contrastive learning, establishing effective anatomical context modeling. The core encoder employs Hybrid Wavelet-State Attention blocks that combine modulated wavelet transform convolutions with structured state space models to extract multi-scale features while maintaining computational efficiency. Dual-stream inputs with a symmetric architecture accommodate single-source scenarios across diverse medical imaging applications. Experimental results on mammography and breast ultrasound datasets demonstrate that the proposed method captures fine-grained lesion boundary details while providing accurate malignancy classification. Harnessing cooperative knowledge transfer between segmentation and classification, guided by anatomical priors, boosts diagnostic performance and provides comprehensive, interpretable clinical insights.
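Since the ELRF module is described as GradCAM-guided, the hedged sketch below shows the basic Grad-CAM computation on a toy two-class CNN: channel weights from pooled gradients, a ReLU-weighted activation map, and normalization into an attention map. The backbone, classification head, and class semantics are placeholders, not the LTRMTL-Net architecture.

```python
# Hedged sketch (assumption): the basic Grad-CAM computation used to highlight
# lesion-relevant regions, which GradCAM-guided attention builds on.
import torch
import torch.nn as nn
import torch.nn.functional as F

backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
classifier = nn.Linear(32, 2)                               # benign vs. malignant (assumed)

x = torch.randn(1, 3, 128, 128)
act = backbone(x)                                           # [1, 32, 128, 128] activations
act.retain_grad()
logits = classifier(act.mean(dim=(2, 3)))                   # global average pooling head
logits[0, logits.argmax()].backward()                       # backprop the top class score

weights = act.grad.mean(dim=(2, 3), keepdim=True)           # pooled gradients per channel
cam = F.relu((weights * act).sum(dim=1, keepdim=True))      # weighted activation map
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)    # normalized lesion attention
print(cam.shape)                                            # [1, 1, 128, 128]
```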
IEEE Transactions on Circuits and Systems for Video Technology, vol. 36, no. 1, pp. 1249-1264.
Citations: 0