
Latest publications in IEEE Transactions on Multimedia

Long-Tailed Continual Learning For Visual Food Recognition.
IF 9.7 CAS Tier 1 Computer Science Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-12-03 DOI: 10.1109/tmm.2025.3632640
Jiangpeng He, Xiaoyan Zhang, Luotao Lin, Jack Ma, Heather A Eicher-Miller, Fengqing Zhu

Deep learning-based food recognition has made significant progress in predicting food types from eating occasion images. However, two key challenges hinder real-world deployment: (1) continuously learning new food classes without forgetting previously learned ones, and (2) handling the long-tailed distribution of food images, where a few classes are common and many more are rare. To address these, food recognition methods should focus on long-tailed continual learning. In this work, we introduce a dataset that encompasses 186 American foods along with comprehensive annotations. We also introduce three new benchmark datasets, VFN186-LT, VFN186-INSULIN and VFN186-T2D, which reflect real-world food consumption for healthy populations, insulin takers, and individuals with type 2 diabetes who do not take insulin. We propose a novel end-to-end framework that improves the generalization ability for instance-rare food classes using a knowledge distillation-based predictor to avoid misalignment of representations during continual learning. Additionally, we introduce an augmentation technique that integrates class-activation-map (CAM) and CutMix to improve generalization on instance-rare food classes. Our method, evaluated on Food101-LT, VFN-LT, VFN186-LT, VFN186-INSULIN, and VFN186-T2DM, shows significant improvements over existing methods. An ablation study highlights further performance enhancements, demonstrating its potential for real-world food recognition applications.
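The abstract mentions an augmentation that integrates class activation maps with CutMix for instance-rare classes. The sketch below is one plausible reading of that idea rather than the paper's exact recipe: the most-activated patch of a rare-class image, located from its CAM, is pasted into a base image and the labels are mixed by patch area. The function name and the `patch_frac` parameter are illustrative assumptions.

```python
# A minimal sketch (assumed, not the paper's exact method) of CAM-guided CutMix.
import torch
import torch.nn.functional as F

def cam_guided_cutmix(base_img, rare_img, rare_cam, patch_frac=0.4):
    """base_img, rare_img: (C, H, W) float tensors; rare_cam: (h, w) activation map."""
    _, H, W = base_img.shape
    cam = F.interpolate(rare_cam[None, None], size=(H, W), mode="bilinear",
                        align_corners=False)[0, 0]
    ph, pw = int(H * patch_frac), int(W * patch_frac)
    # Slide a window over the CAM and pick the location with the highest response.
    scores = F.avg_pool2d(cam[None, None], kernel_size=(ph, pw), stride=1)[0, 0]
    idx = torch.argmax(scores)
    top = (idx // scores.shape[1]).item()
    left = (idx % scores.shape[1]).item()
    mixed = base_img.clone()
    mixed[:, top:top + ph, left:left + pw] = rare_img[:, top:top + ph, left:left + pw]
    lam = 1.0 - (ph * pw) / (H * W)  # label weight of the base image
    return mixed, lam

# Usage (hypothetical training step):
#   mixed, lam = cam_guided_cutmix(img_a, img_b, cam_b)
#   loss = lam * ce(model(mixed), label_a) + (1 - lam) * ce(model(mixed), label_b)
```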

Citations: 0
Retain, Blend, and Exchange: A Quality-Aware Spatial-Stereo Fusion Approach for Event Stream Recognition
IF 9.7 CAS Tier 1 Computer Science Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-11-12 DOI: 10.1109/TMM.2025.3607771
Lan Chen;Dong Li;Xiao Wang;Pengpeng Shao;Wei Zhang;Yaowei Wang;Yonghong Tian;Jin Tang
Current event stream-based pattern recognition models typically represent the event stream as point clouds, voxels, images, and the like, and build multiple deep neural networks to extract their features. Although considerable results can be achieved in simple cases, the performance of the model might be restricted by monotonous modality expressions, sub-optimal fusion, and readout mechanisms. In this article, we put forward a novel dual-stream framework for event stream-based pattern recognition through differentiated fusion, called EFV++. It models two common event representations simultaneously, i.e., event images and event voxels. The spatial and three-dimensional stereo information are learned separately by a Transformer and a Graph Neural Network (GNN). We believe the features of each representation still contain both informative and redundant components, and a sub-optimal solution may be obtained if they are fused directly without differentiation. Thus, we divide each feature into three levels and retain high-quality features, blend medium-quality features, and exchange low-quality features. The enhanced dual features are provided to the fusion Transformer together with bottleneck features. In addition, we introduce a novel hybrid interaction readout mechanism to enhance the diversity of features as final representations. Comprehensive experiments validate that the proposed framework attains cutting-edge performance on a variety of widely used event stream-based classification datasets. In particular, we achieve a new state-of-the-art result of 90.51% on the Bullying10K dataset, outpacing the runner-up by +2.21%.
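The retain / blend / exchange idea can be illustrated with a small sketch. The assumptions here are mine: per-token quality scores come from some learned scorer not shown, the three levels are equal thirds, blending is a simple average, and exchanging is a straight swap between the two branches.

```python
# A minimal sketch (assumed mechanics) of quality-aware retain / blend / exchange.
import torch

def retain_blend_exchange(feat_a, feat_b, score_a, score_b):
    """feat_*: (N, D) token features from two branches; score_*: (N,) quality scores."""
    N = feat_a.shape[0]
    k = N // 3
    out_a, out_b = feat_a.clone(), feat_b.clone()

    order_a = torch.argsort(score_a, descending=True)
    order_b = torch.argsort(score_b, descending=True)
    mid_a, low_a = order_a[k:2 * k], order_a[2 * k:]
    mid_b, low_b = order_b[k:2 * k], order_b[2 * k:]

    # Blend: medium-quality tokens take the average of the two branches.
    blended = 0.5 * (feat_a[mid_a] + feat_b[mid_b])
    out_a[mid_a], out_b[mid_b] = blended, blended
    # Exchange: low-quality tokens are replaced by the other branch's tokens.
    out_a[low_a], out_b[low_b] = feat_b[low_b], feat_a[low_a]
    # Retain: the top-k high-quality tokens in each branch stay untouched.
    return out_a, out_b
```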
Citations: 0
MIP-CLIP: Multimodal Independent Prompt CLIP for Action Recognition
IF 9.7 CAS Tier 1 Computer Science Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-10-20 DOI: 10.1109/TMM.2025.3618557
Xiong Gao;Zhaobin Chang;Dongyi Kong;Huiyu Zhou;Yonggang Lu
Recently, the Contrastive Language Image Pre-training (CLIP) model has shown significant generalizability by optimizing the distance between visual and text features. Mainstream CLIP-based action recognition methods mitigate the low “zero-shot” generalization of the 1-of-N paradigm but also lead to a significant degradation in supervised performance. Therefore, powerful supervision and competitive “zero-shot” generalization need to be effectively traded off. In this work, a Multimodal Independent Prompt CLIP (MIP-CLIP) model is proposed to address this challenge. On the visual side, we propose a novel Video Motion Prompt (VMP) to empower the visual encoder with motion perception, performing short- and long-term motion modelling via a temporal difference operation. Next, a visual classification branch is introduced to improve the discrimination of visual features. Specifically, the temporal difference and visual classification operations of the 1-of-N paradigm are extended to CLIP to satisfy the need for strong supervised performance. On the text side, we design a Class-Agnostic text prompt Template (CAT) under the constraint of a Semantic Alignment (SA) module to solve the label semantic dependency problem. Finally, a Dual-branch Feature Reconstruction (DFR) module, which takes the class confidence of the visual classification branch as input, is proposed to complete cross-modal interactions for better feature matching. Experiments are conducted on four widely used benchmarks (HMDB-51, UCF-101, Jester, and Kinetics-400). The results demonstrate that our method achieves excellent supervised performance while preserving competitive generalizability.
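The short- and long-term temporal difference operation mentioned in the abstract can be sketched as simple feature subtractions across frames. This is my reading of "temporal difference", not the paper's exact VMP module; the stride and zero-padding choices are illustrative assumptions.

```python
# A minimal sketch of short- and long-term temporal differences over frame features.
import torch

def temporal_differences(feats, long_stride=4):
    """feats: (T, D) per-frame features; returns short- and long-term difference streams."""
    short = feats[1:] - feats[:-1]                     # adjacent-frame motion cue
    long = feats[long_stride:] - feats[:-long_stride]  # motion across a longer gap
    # Zero-pad so both streams have T entries and can be fused with the originals.
    short = torch.cat([torch.zeros_like(feats[:1]), short], dim=0)
    long = torch.cat([torch.zeros_like(feats[:long_stride]), long], dim=0)
    return short, long
```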
Citations: 0
FUNet: Frequency-Aware and Uncertainty-Guiding Network for Rain-Hazy Image Restoration
IF 9.7 CAS Tier 1 Computer Science Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-10-07 DOI: 10.1109/TMM.2025.3618545
Mengkun Liu;Tao Gao;Yao Liu;Yuhan Cao;Licheng Jiao
Restoring rain-hazy images is vital for intelligent decision-making in autonomous driving and outdoor surveillance systems, yet it is a challenging ill-posed problem due to the irreversible nature of image degradation. Despite remarkable success achieved through deep learning, current algorithms are primarily evaluated on a given kind of degraded image, and most approaches insufficiently explore texture details and frequency-domain information, which greatly limits model performance. To alleviate these challenges, the frequency-aware and uncertainty-guiding network (FUNet) is proposed for rain-hazy image restoration. FUNet consists of an end-to-end encoder-decoder architecture with uncertainty-guided feature refinement (UGFR) and a confidence feature feedback (CFF) module. First, the UGFR is designed with uncertainty estimation (UE), an uncertainty local-global feature extraction module (ULG), and frequency component decomposition and fusion (FCDF), which learn abundant intermediate information in detail for clear image restoration. Second, in order to adequately learn rich semantic features, the CFF module is proposed to provide feedback and guidance for the learning process of the decoder. Third, a frequency-based loss function is designed to ensure training stability and effectively preserve the spatial and spectral details of images. Experiments on seven synthetic outdoor datasets and the real-world dataset DQA demonstrate the superiority of the proposed model quantitatively and qualitatively.
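A frequency-based restoration loss of the kind mentioned in the abstract typically combines a pixel-domain term with a term on the Fourier spectrum. The form below is an assumed, generic variant, not necessarily FUNet's exact definition; the `lambda_freq` weight is an illustrative parameter.

```python
# A minimal sketch of a frequency-aware loss: pixel-domain L1 plus L1 on FFT amplitudes.
import torch

def frequency_loss(pred, target, lambda_freq=0.1):
    """pred, target: (B, C, H, W) restored and clean images."""
    spatial = torch.mean(torch.abs(pred - target))
    pred_amp = torch.abs(torch.fft.rfft2(pred, norm="ortho"))
    target_amp = torch.abs(torch.fft.rfft2(target, norm="ortho"))
    spectral = torch.mean(torch.abs(pred_amp - target_amp))
    return spatial + lambda_freq * spectral
```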
Citations: 0
HEVC Video Steganalysis Based on Centralized Error and Attention Mechanism
IF 9.7 CAS Tier 1 Computer Science Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-09-22 DOI: 10.1109/TMM.2025.3613171
Haojun Dai;Dawen Xu;Lin Yang;Rangding Wang
With high embedding capacity and security, transform coefficient-based video steganography has become an important branch of video steganography. However, existing steganalysis methods against transform coefficient-based steganography give insufficient consideration to the prediction process of HEVC compression, which makes the steganalysis less direct and fails to effectively detect adaptive steganography methods in low-embedding-rate scenarios. In this paper, an HEVC video steganalysis method based on a centralized error and attention mechanism against transform coefficient-based steganography is proposed. Firstly, the centralized error phenomenon brought about by distortion compensation-based steganography is analyzed, and prediction error maps are constructed for steganalysis to achieve a higher signal-to-noise ratio (SNR). Secondly, a video steganalysis network called CESNet (Centralized Error Steganalysis Network) is proposed. The network takes the prediction error maps as input, and four types of convolutional modules are designed to adapt to different stages of feature extraction. To address the intra-frame sparsity of adaptive steganography, CEA (Centralized Error Attention) modules based on spatial and channel attention mechanisms are proposed to adaptively enhance the steganographic region. Finally, after extracting the feature vectors of each frame, the detection of steganographic video is completed using a self-attention mechanism. Experimental results show that compared with existing transform coefficient-based video steganalysis methods, the proposed method can effectively detect multiple transform coefficient-based steganography algorithms and achieves higher detection performance in low-payload scenarios.
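The "decoded minus predicted" construction behind a prediction error map can be illustrated with a crude stand-in predictor. The real method would use HEVC's own intra/inter predictions; here each 8x8 block is predicted by the mean of its top and left reconstructed border pixels (a DC-style predictor), purely to show the shape of the computation.

```python
# A minimal sketch of a prediction error map with an assumed DC-style block predictor.
import numpy as np

def prediction_error_map(frame, block=8):
    """frame: (H, W) luma plane as float; returns the per-pixel prediction error."""
    H, W = frame.shape
    error = np.zeros_like(frame, dtype=np.float64)
    for y in range(0, H, block):
        for x in range(0, W, block):
            top = frame[y - 1, x:x + block] if y > 0 else np.array([])
            left = frame[y:y + block, x - 1] if x > 0 else np.array([])
            border = np.concatenate([top, left])
            dc = border.mean() if border.size else 128.0   # fallback for the first block
            blk = frame[y:y + block, x:x + block]
            error[y:y + block, x:x + block] = blk - dc
    return error
```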
Citations: 0
SPDQ: Synergetic Prompts as Disentanglement Queries for Compositional Zero-Shot Learning
IF 9.7 CAS Tier 1 Computer Science Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-09-09 DOI: 10.1109/TMM.2025.3607726
Han Jiang;Xiaoshan Yang;Chaofan Chen;Changsheng Xu
Compositional zero-shot learning (CZSL) aims to identify novel compositions formed by known primitives (attributes and objects). Motivated by recent advancements in pre-trained vision-language models such as CLIP, many methods attempt to fine-tune CLIP for CZSL and achieve remarkable performance. However, existing CLIP-based CZSL methods focus mainly on text prompt tuning, which lacks the flexibility to dynamically adapt both modalities. To solve this issue, an intuitive solution is to additionally introduce visual prompt tuning. This is not trivial to achieve, because effectively learning prompts for CZSL involves the challenge of entanglement between visual primitives as well as appearance shifts across different compositions. In this paper, we propose a novel Synergetic Prompts as Disentanglement Queries (SPDQ) framework for CZSL. It can disentangle primitive features based on synergetic prompts to jointly alleviate these challenges. Specifically, we first design a low-rank primitive modulator to produce synergetic adaptive attribute and object prompts based on prior knowledge of each instance for model adaptation. Then, we additionally utilize text prefix prompts to construct synergetic prompt queries, which are used to resample corresponding visual features from local visual patches. Comprehensive experiments conducted on three benchmarks demonstrate that our SPDQ approach achieves state-of-the-art results.
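Resampling local patch features with prompt queries is commonly realized as cross-attention, with prompts as queries and patches as keys/values. The sketch below shows that generic mechanism under my own assumptions; it is not SPDQ's exact module.

```python
# A minimal sketch of prompt-query resampling via single-head cross-attention.
import torch
import torch.nn.functional as F

def resample_with_prompts(prompt_queries, patch_feats, scale=None):
    """prompt_queries: (Q, D); patch_feats: (P, D); returns (Q, D) resampled features."""
    D = prompt_queries.shape[-1]
    scale = scale or D ** -0.5
    attn = F.softmax(prompt_queries @ patch_feats.T * scale, dim=-1)  # (Q, P) weights
    return attn @ patch_feats                                         # (Q, D) output
```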
Citations: 0
Multi-Layer Transfer Learning for Cross-Domain Recommendation Based on Graph Node Representation Enhancement
IF 9.7 CAS Tier 1 Computer Science Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-09-09 DOI: 10.1109/TMM.2025.3607706
Xin Ni;Jie Nie;Niantai Jing;Jianliang Xu;Xiaodong Wang;Xuesong Gao;MingXing Jiang;Chi-Hung Chi;Zhiqiang Wei
Effectively representing and transferring user preferences across various domains presents a significant challenge in cross-domain recommendation (CDR). Some approaches utilize graph neural networks that use interaction behavior to establish relationships between entities, providing a comprehensive understanding of user interests. However, the impact on user preferences of consistent semantics across various types, fields, and perspectives of social media information, i.e., the multi-dimensional consistency of user preferences, is overlooked. This oversight results in graph node representations that inadequately reflect user preferences. To address these limitations, we propose a multi-layer transfer learning network (MTLG) for CDR based on graph node representation enhancement via multi-dimensional consistent user preferences. Firstly, the model introduces a set of globally shared semantic units to perform semantic alignment of multiple media information at different granularities without clear alignment boundaries, thereby modeling multi-dimensional consistent user preference features. These features are then seamlessly integrated with the initial high-order graph structure embedding features, significantly improving the quality of graph node representation. Secondly, the model innovatively designs a multi-layer transfer learning network that hierarchically aligns the domain distribution differences. It calculates the similarity between domains to derive layer weights for more precise transfer learning, thereby mitigating the accumulation of information errors resulting from inaccurate feature aggregation. We conducted extensive experiments on three scenarios, covering 7,954,943 ratings from the Amazon dataset. The results indicate that MTLG's recommendation accuracy surpasses that of state-of-the-art methods.
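Deriving per-layer transfer weights from domain similarity can be sketched as follows. The specific rule here (cosine similarity between mean source and target activations per layer, normalized with a softmax) is my assumption, not necessarily MTLG's formulation; `temperature` is an illustrative parameter.

```python
# A minimal sketch of domain-similarity-based layer weights for transfer learning.
import torch
import torch.nn.functional as F

def layer_transfer_weights(source_feats, target_feats, temperature=1.0):
    """source_feats, target_feats: lists of (N_i, D_l) per-layer feature matrices."""
    sims = []
    for s, t in zip(source_feats, target_feats):
        # Compare mean source and target activations of this layer.
        sims.append(F.cosine_similarity(s.mean(dim=0), t.mean(dim=0), dim=0))
    sims = torch.stack(sims)
    return F.softmax(sims / temperature, dim=0)  # one weight per layer, summing to 1
```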
Citations: 0
Like Humans to Few-Shot Learning Through Knowledge Permeation of Visual and Language
IF 9.7 CAS Tier 1 Computer Science Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-09-08 DOI: 10.1109/TMM.2025.3604977
Yuyu Jia;Qing Zhou;Junyu Gao;Qiang Li;Qi Wang
Few-shot learning aims to generalize the recognizer from seen categories to an entirely novel scenario. With only a few support samples, several advanced methods have begun to introduce class names as prior knowledge for identifying novel classes. However, a comprehensive understanding of how to harness the mutual advantages of visual and textual knowledge is still lacking. In this paper, we set out to fill this gap via a coherent Bidirectional Knowledge Permeation strategy called BiKop, which is grounded in human intuition: a class name description offers a more general representation, whereas an image captures the specificity of individuals. BiKop primarily establishes a hierarchical joint general-specific representation through bidirectional knowledge permeation. Moreover, considering the bias of the joint representation towards the base set, we disentangle base-class-relevant semantics during training, thereby alleviating the suppression of potentially novel-class-relevant information. Experiments on four challenging benchmarks demonstrate the remarkable superiority of BiKop, particularly outperforming previous methods by a substantial margin in the 1-shot setting (improving accuracy by 7.58% on miniImageNet).
Citations: 0
PrimePSegter: Progressively Combined Diffusion for 3D Panoptic Segmentation With Multi-Modal BEV Refinement
IF 9.7 CAS Tier 1 Computer Science Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-09-02 DOI: 10.1109/TMM.2025.3604903
Hongqi Yu;Sixian Chan;Xiaolong Zhou;Xiaoqin Zhang
Effective and robust 3D panoptic segmentation is crucial for scene perception in autonomous driving. Modern methods widely adopt multi-modal fusion based on simple feature concatenation to enhance 3D scene understanding, so the generated multi-modal representations typically lack comprehensive semantic and geometric information. These methods, which focus on panoptic prediction in a single step, also lack the capability to progressively refine panoptic predictions under varying noise levels, which is essential for enhancing model robustness. To address these limitations, we first utilize BEV space to unify the semantic-geometric perceptual representation, allowing for a more effective integration of LiDAR and camera data. Then, we propose PrimePSegter, a progressively combined diffusion 3D panoptic segmentation model that is conditioned on BEV maps to iteratively refine predictions by denoising samples generated from a Gaussian distribution. PrimePSegter adopts a conditional encoder-decoder architecture for fine-grained panoptic predictions. Specifically, a multi-modal conditional encoder is equipped with a BEV fusion network to integrate semantic and geometric information from LiDAR and camera streams into a unified BEV space. Additionally, a diffusion transformer decoder operates on multi-modal BEV features with varying noise levels to guide the training of the diffusion model, refining the BEV panoptic representations enriched with semantics and geometry in a progressive way. PrimePSegter achieves state-of-the-art performance on nuScenes and competitive results on SemanticKITTI. Moreover, PrimePSegter demonstrates superior robustness across various scenarios, outperforming leading methods.
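Conditioning a denoising model on BEV features follows the usual diffusion training recipe. The sketch below is standard DDPM-style noise-prediction training under assumed interfaces (the `model(noisy, t, bev_cond)` signature and the panoptic target tensor layout are hypothetical), not the paper's implementation.

```python
# A minimal sketch of one diffusion training step conditioned on fused BEV features.
import torch
import torch.nn.functional as F

def diffusion_training_step(model, bev_cond, target, alphas_cumprod):
    """model(noisy, t, bev_cond) -> predicted noise; target: clean panoptic tensor (B, ...)."""
    B = target.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (B,), device=target.device)
    a_bar = alphas_cumprod[t].view(B, *([1] * (target.dim() - 1)))
    noise = torch.randn_like(target)
    # Forward process: mix the clean target with Gaussian noise at timestep t.
    noisy = a_bar.sqrt() * target + (1 - a_bar).sqrt() * noise
    # Train the conditional model to predict the injected noise.
    return F.mse_loss(model(noisy, t, bev_cond), noise)
```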
Citations: 0
Crafting More Transferable Adversarial Examples via Quality-Aware Transformation Combination
IF 9.7 CAS Tier 1 Computer Science Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-09-01 DOI: 10.1109/TMM.2025.3604967
Junlin Liu;Xinchen Lyu;Chenshan Ren;Qimei Cui
Input diversity is an effective technique for crafting transferable adversarial examples that can deceive unknown AI models. Existing input-diversity-based methods typically use a single input transformation, limiting targeted transferability and robustness against defenses. Combining different transformation types is challenging, as continually adding types would degrade semantic information and targeted transferability. This paper proposes a quality-aware transformation combination attack (TCA) that selects high-quality transformation combinations. The quality-aware selection enables the expansion of transformation types, enhances input diversity, and hence improves targeted transferability and robustness against defenses. We first design a quality-evaluation framework to quantify the effectiveness of transformation combinations, which jointly considers convergence, transferability, and robustness. Only a small group (up to 10) of images is required for computation-efficient quality evaluation. Experiments validate TCA's superiority over state-of-the-art baselines in adversarial transferability and robustness. When models are protected by defenses, the average targeted success rate of TCA with four transformation types (i.e., TCA-t4) outperforms the best baseline by 26%–42% on ImageNet.
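The selection step described in the abstract can be sketched as scoring candidate combinations on a small image batch and keeping the best. The actual TCA quality score jointly measures convergence, transferability, and robustness; here `quality_fn` is a hypothetical placeholder, and combinations are simple unordered subsets of size `k`.

```python
# A minimal sketch of quality-aware selection of transformation combinations.
from itertools import combinations

def select_combinations(transforms, images, quality_fn, k=4, top_n=3):
    """transforms: dict name -> callable; quality_fn(list_of_transforms, images) -> float."""
    scored = []
    for names in combinations(transforms, k):       # enumerate candidate subsets
        combo = [transforms[n] for n in names]
        scored.append((quality_fn(combo, images), names))
    scored.sort(reverse=True)                        # highest quality first
    return [names for _, names in scored[:top_n]]    # keep the top-scoring combinations
```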
Citations: 0