
Latest Publications in IEEE Transactions on Multimedia

Rethinking Class-Incremental Learning From a Dynamic Imbalanced Learning Perspective
IF 9.7 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-11-14 | DOI: 10.1109/TMM.2025.3632688
Leyuan Wang;Liuyu Xiang;Yunlong Wang;Huijia Wu;Huafeng Yang;Jingqian Liu;Zhaofeng He
Deep neural networks suffer from catastrophic forgetting when continually learning new concepts. In this paper, we analyze this problem from a data imbalance point of view. We argue that the imbalance between old task and new task data contributes to forgetting of the old tasks. Moreover, the increasing imbalance ratio during incremental learning further aggravates the problem. To address the dynamic imbalance issue, we propose Uniform Prototype Contrastive Learning (UPCL), where uniform and compact features are learned. Specifically, we generate a set of non-learnable uniform prototypes before each task starts. Then we assign these uniform prototypes to each class and guide the feature learning through prototype contrastive learning. We also dynamically adjust the relative margin between old and new classes so that the feature distribution remains balanced and compact. Finally, we demonstrate through extensive experiments that the proposed method achieves state-of-the-art performance on several benchmarks, including CIFAR-100, ImageNet-100, TinyImageNet, Food-101, and CUB-200. Experimental results show that our approach not only effectively addresses the issue of imbalanced old data in memory but also tackles the problem of imbalanced new data distributions.
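The prototype mechanism is concrete enough for a brief sketch. The PyTorch snippet below is a minimal illustration, not the authors' implementation: the prototype-generation procedure, the additive margin on the positive logit, and names such as `uniform_prototypes` and `prototype_contrastive_loss` are assumptions made for illustration only.

```python
import torch
import torch.nn.functional as F

def uniform_prototypes(num_classes: int, dim: int, steps: int = 500) -> torch.Tensor:
    """Spread unit vectors apart by minimizing their largest pairwise cosine
    similarity; the result is frozen (non-learnable) before each task starts."""
    protos = torch.randn(num_classes, dim, requires_grad=True)
    opt = torch.optim.SGD([protos], lr=0.1)
    eye = torch.eye(num_classes, dtype=torch.bool)
    for _ in range(steps):
        p = F.normalize(protos, dim=1)
        sim = p @ p.t()
        loss = sim.masked_fill(eye, -1.0).max(dim=1).values.mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return F.normalize(protos.detach(), dim=1)

def prototype_contrastive_loss(feats, labels, protos, temperature=0.1, margin=0.0):
    """Pull each feature toward its assigned prototype and away from the rest;
    `margin` is subtracted from the positive logit and could be set differently
    for old vs. new classes to mimic the dynamic relative margin."""
    feats = F.normalize(feats, dim=1)
    logits = feats @ protos.t() / temperature
    logits = logits - margin * F.one_hot(labels, protos.size(0))
    return F.cross_entropy(logits, labels)
```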
{"title":"Rethinking Class-Incremental Learning From a Dynamic Imbalanced Learning Perspective","authors":"Leyuan Wang;Liuyu Xiang;Yunlong Wang;Huijia Wu;Huafeng Yang;Jingqian Liu;Zhaofeng He","doi":"10.1109/TMM.2025.3632688","DOIUrl":"https://doi.org/10.1109/TMM.2025.3632688","url":null,"abstract":"Deep neural networks suffer from catastrophic forgetting when continually learning new concepts. In this paper, we analyze this problem from a data imbalance point of view. We argue that the imbalance between old task and new task data contributes to forgetting of the old tasks. Moreover, the increasing imbalance ratio during incremental learning further aggravates the problem. To address the dynamic imbalance issue, we propose Uniform Prototype Contrastive Learning (UPCL), where uniform and compact features are learned. Specifically, we generate a set of non-learnable uniform prototypes before each task starts. Then we assign these uniform prototypes to each class and guide the feature learning through prototype contrastive learning. We also dynamically adjust the relative margin between old and new classes so that the feature distribution will be maintained balanced and compact. Finally, we demonstrate through extensive experiments that the proposed method achieves state-of-the-art performance on several benchmark including CIFAR-100, ImageNet-100, TinyImageNet, Food-101, and CUB-200. Experimental results show that our approach not only effectively addresses the issue of imbalanced old data in memory but also tackles the problem of imbalanced new data distributions.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"825-836"},"PeriodicalIF":9.7,"publicationDate":"2025-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145929423","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
DiffW: Multi-Encoder Based on Conditional Diffusion Model for Robust Image Watermarking
IF 9.7 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-11-14 | DOI: 10.1109/TMM.2025.3632631
Ting Luo;Renzhi Hu;Zhouyan He;Gangyi Jiang;Haiyong Xu;Yang Song;Chin-Chen Chang
Existing deep-learning-based robust watermarking models generally apply a discriminator to form a generative adversarial network (GAN) that improves the quality of encoded images, and adopt a single encoder to embed the watermark. However, GAN training is unstable, and a single encoder cannot fully adjust the watermarking distribution, which limits watermarking performance. To address these limitations, this paper presents a multi-encoder based on a conditional diffusion model (CDM) for robust image watermarking, namely DiffW. To enhance stability, the CDM-based multi-encoder structure replaces the GAN and optimizes the watermarking distribution iteratively. Specifically, the operation at each timestep of the forward and reverse diffusion processes of the CDM is regarded as an encoder, overcoming the shortcomings of the single-encoder structure. At the training stage, under the guidance of the conditional noisy image, the forward process trains each encoder to fuse the image and watermark to generate high-quality encoded images. During the testing stage, only a small number of trained encoders of the forward process are used, so as to reduce the time complexity. Furthermore, to improve watermarking robustness, a channel attention module (CAM) is designed to extract the main watermark features by mining channel correlations for multi-layer fusion, so that the watermark can be embedded into imperceptible, textured areas. Experimental results reveal that, compared with existing watermarking models, the proposed DiffW achieves better watermarking invisibility and robustness.
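To make the "one diffusion timestep = one encoder" idea concrete, here is a heavily simplified PyTorch sketch. The module layout, the residual update, the `TimestepEncoder`/`encode_watermark` names, and the way the conditional noisy image is formed are all illustrative assumptions, not the DiffW architecture.

```python
import torch
import torch.nn as nn

class TimestepEncoder(nn.Module):
    """One diffusion timestep treated as a watermark encoder: it refines the
    current encoded image given a conditioning (noisy) image and the message."""
    def __init__(self, channels=3, msg_len=64):
        super().__init__()
        self.msg_fc = nn.Linear(msg_len, channels)
        self.net = nn.Sequential(
            nn.Conv2d(3 * channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, channels, 3, padding=1),
        )

    def forward(self, encoded, cond, msg):
        b, c, h, w = encoded.shape
        m = self.msg_fc(msg).view(b, c, 1, 1).expand(b, c, h, w)
        return encoded + self.net(torch.cat([encoded, cond, m], dim=1))

def encode_watermark(encoders, cover, msg, num_steps=4):
    """Chain a small number of per-timestep encoders, as at test time; the
    conditional noisy image is approximated here by the cover plus noise."""
    x = cover.clone()
    for enc in encoders[:num_steps]:
        cond = cover + 0.1 * torch.randn_like(cover)  # stand-in conditioning
        x = enc(x, cond, msg)
    return x
```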
{"title":"DiffW: Multi-Encoder Based on Conditional Diffusion Model for Robust Image Watermarking","authors":"Ting Luo;Renzhi Hu;Zhouyan He;Gangyi Jiang;Haiyong Xu;Yang Song;Chin-Chen Chang","doi":"10.1109/TMM.2025.3632631","DOIUrl":"https://doi.org/10.1109/TMM.2025.3632631","url":null,"abstract":"The existing deep-learning based robust watermarking model generally applies a discriminator to form generative adversarial network (GAN) for increasing the quality of encoded images, and adopts a single encoder to embed watermark. However, GAN training is unstable, and the single encoder cannot fully adjust the watermarking distribution, thus affecting the watermarking performance. To address those limitations, this paper presents the multi-encoder based on conditional diffusion model (CDM) for robust image watermarking, namely, DiffW. To enhance the stability, the multi-encoder structure based on CDM replaces GAN for optimizing the watermarking distribution iteratively. Specifically, the operation of each timestep in the forward and reverse diffusion processes of the CDM is regarded as an encoder to overcome the shortcomings of the single encoder structure. At the training stage, under the guidance of the conditional noisy image, the forward process trains each encoder to fuse the image and watermark to generate high-quality encoded images. During the testing stage, only a small number of trained encoders of the forward process are used, so as to reduce the time complexity. Furthermore, to improve watermarking robustness, the channel attention module (CAM) is designed to extract main watermark features by mining channel correlations for multi-layer fusion, so that watermark can be embedded into imperceptible and texture areas. The experimental results reveal that compared with the existing watermarking model, the proposed DiffW can achieve better results in terms of watermarking invisibility and robustness.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"837-852"},"PeriodicalIF":9.7,"publicationDate":"2025-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145929454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Retain, Blend, and Exchange: A Quality-Aware Spatial-Stereo Fusion Approach for Event Stream Recognition
IF 9.7 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-11-12 | DOI: 10.1109/TMM.2025.3607771
Lan Chen;Dong Li;Xiao Wang;Pengpeng Shao;Wei Zhang;Yaowei Wang;Yonghong Tian;Jin Tang
Current event stream-based pattern recognition models typically represent the event stream as point clouds, voxels, images, and the like, and formulate multiple deep neural networks to acquire their features. Although considerable results can be achieved in simple cases, the performance of the model might be restricted by monotonous modality expressions, sub-optimal fusion, and readout mechanisms. In this article, we put forward a novel dual-stream framework for event stream-based pattern recognition through differentiated fusion, called EFV++. It models two common event representations simultaneously, i.e., event images and event voxels. The spatial and three-dimensional stereo information can be separately learned by making use of a Transformer and a Graph Neural Network (GNN). We believe the features of each representation still contain both efficient and redundant components, and a sub-optimal solution may be obtained if we fuse them directly without differentiation. Thus, we divide each feature into three levels and retain high-quality features, blend medium-quality features, and exchange low-quality features. The enhanced dual features are provided to the fusion Transformer together with bottleneck features. In addition, we introduce a novel hybrid interaction readout mechanism to enhance the diversity of features as final representations. Comprehensive experiments validate that the proposed framework attains cutting-edge performance on a variety of widely used event stream-based classification datasets. In particular, we achieve a new state-of-the-art accuracy of 90.51% on the Bullying10K dataset, outpacing the runner-up by +2.21%.
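The retain/blend/exchange rule lends itself to a short sketch. The following PyTorch function is a minimal illustration under stated assumptions: the per-token quality scores, the thresholds `hi`/`lo`, the quality-weighted blend, and the function name are invented for illustration and are not the EFV++ implementation.

```python
import torch

def retain_blend_exchange(fa, fb, qa, qb, hi=0.7, lo=0.3):
    """fa, fb: (N, D) token features from the two branches (e.g. image / voxel);
    qa, qb: (N,) per-token quality scores in [0, 1].
    High-quality tokens are retained, medium-quality ones are blended with the
    other branch, and low-quality ones are exchanged for the other branch's."""
    out_a, out_b = fa.clone(), fb.clone()

    # blend: quality-weighted average of the two branches
    mid_a = (qa >= lo) & (qa < hi)
    w = (qa[mid_a] / (qa[mid_a] + qb[mid_a] + 1e-6)).unsqueeze(1)
    out_a[mid_a] = w * fa[mid_a] + (1 - w) * fb[mid_a]

    mid_b = (qb >= lo) & (qb < hi)
    w = (qb[mid_b] / (qa[mid_b] + qb[mid_b] + 1e-6)).unsqueeze(1)
    out_b[mid_b] = w * fb[mid_b] + (1 - w) * fa[mid_b]

    # exchange: low-quality tokens take the other branch's features
    low_a, low_b = qa < lo, qb < lo
    out_a[low_a], out_b[low_b] = fb[low_a], fa[low_b]

    return out_a, out_b  # tokens with quality >= hi are retained untouched
```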
{"title":"Retain, Blend, and Exchange: A Quality-Aware Spatial-Stereo Fusion Approach for Event Stream Recognition","authors":"Lan Chen;Dong Li;Xiao Wang;Pengpeng Shao;Wei Zhang;Yaowei Wang;Yonghong Tian;Jin Tang","doi":"10.1109/TMM.2025.3607771","DOIUrl":"https://doi.org/10.1109/TMM.2025.3607771","url":null,"abstract":"Current event stream-based pattern recognition models typically present the event stream as the point cloud, voxel, image, and the like, and formulate multiple deep neural networks to acquire their features. Although considerable results can be achieved in simple cases, however, the performance of the model might be restricted by monotonous modality expressions, sub-optimal fusion, and readout mechanisms. In this article, we put forward a novel dual-stream framework for event stream-based pattern recognition through differentiated fusion, which is called EFV++. It models two common event representations simultaneously, i.e., event images and event voxels. The spatial and three-dimensional stereo information can be separately learned by making use of Transformer and Graph Neural Network (GNN). We believe the features of each representation still contain both efficient and redundant features and a sub-optimal solution may be obtained if we directly fuse them without differentiation. Thus, we divide each feature into three levels and retain high-quality features, blend medium-quality features, and exchange low-quality features. The enhanced dual features will be provided to the fusion Transformer together with bottleneck features. In addition, we introduce a novel hybrid interaction readout mechanism to enhance the diversity of features as final representations. Comprehensive experiments validate that the framework we have proposed attains cutting-edge performance on a variety of extensively utilized event stream-based classification datasets. Particularly, we have realized a freshly pioneering performance on the Bullying10 k dataset, precisely 90.51%, and this outpaces the runner-up by <inline-formula><tex-math>$+2.21%$</tex-math></inline-formula>.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"8926-8939"},"PeriodicalIF":9.7,"publicationDate":"2025-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145510158","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Knowledge-Enhanced Graph Contrastive Learning for Recommendations
IF 9.7 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-10-31 | DOI: 10.1109/TMM.2025.3626976
Xiaofeng Wang;Zhengjie Zhang;Yuanyuan Qi;Guodong Shen;Shuaiming Lai;Yuntao Chen;Fang Zhou;Daying Quan
Graph contrastive learning (GCL), which captures essential features from augmented graphs to address data sparsity issues, has recently demonstrated promising potential in improving recommendation performance. Most GCL-based recommendation methods learn consistent entity representations from user-item bipartite graphs through structural perturbations. However, these approaches impose an additional computational cost and have been shown to be insensitive to various graph augmentations, resulting in limited improvements in long-tail recommendation scenarios. To address this issue, we propose a novel framework for recommendation, Knowledge-Enhanced graph Contrastive Learning (KECL), which adopts knowledge graph-based embedding augmentation instead of graph enhancement to construct views for GCL. Specifically, we introduce a knowledge aggregation module with a heterogeneous attentive aggregator to capture relation heterogeneity in the knowledge graph. Furthermore, we propose a knowledge-based augmentation GCL model that adds knowledge-aware embeddings to the learned representations for more efficient representation-level augmentation. Extensive experiments on real-world datasets demonstrate that the knowledge-based augmentation approach effectively enhances recommendation performance and shows superiority over state-of-the-art methods.
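The core idea, building contrastive views at the representation level from knowledge-aware embeddings instead of perturbing the graph, can be sketched briefly. The snippet below is a hedged illustration: the mixing weight `alpha`, the additive fusion, and the helper names are assumptions, and the heterogeneous attentive aggregator that produces `kg_emb` is not shown.

```python
import torch
import torch.nn.functional as F

def kg_augmented_views(item_emb, kg_emb, alpha=0.2):
    """Two representation-level views for contrastive learning: the base item
    embedding and the same embedding shifted toward its knowledge-aware
    embedding, in place of structural graph perturbation."""
    view1 = F.normalize(item_emb, dim=1)
    view2 = F.normalize(item_emb + alpha * kg_emb, dim=1)
    return view1, view2

def info_nce(z1, z2, tau=0.2):
    """Standard InfoNCE: row i of view 1 is positive with row i of view 2."""
    logits = z1 @ z2.t() / tau
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)
```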
{"title":"Knowledge-Enhanced Graph Contrastive Learning for Recommendations","authors":"Xiaofeng Wang;Zhengjie Zhang;Yuanyuan Qi;Guodong Shen;Shuaiming Lai;Yuntao Chen;Fang Zhou;Daying Quan","doi":"10.1109/TMM.2025.3626976","DOIUrl":"https://doi.org/10.1109/TMM.2025.3626976","url":null,"abstract":"Graph contrastive learning (GCL), which captures essential features from augmented graphs to address data sparsity issues, has recently demonstrated promising potential in improving recommendation performance. Most GCL-based recommendation methods learn consistent entity representations from user-item bipartite graphs through structural perturbations. However, these approaches impose an additional computational cost and have been shown to be insensitive to various graph augmentations, resulting in limited improvements in long-tail recommendation scenarios. To address this issue, we propose a novel framework for recommendation, <bold>K</b>nowledge-<bold>E</b>nhanced graph <bold>C</b>ontrastive <bold>L</b>earning (KECL), which adopts knowledge graph-based embedding augmentation instead of graph enhancement to construct views for GCL. Specifically, we introduce a knowledge aggregation module with a heterogeneous attentive aggregator to capture relation heterogeneity in the knowledge graph. Furthermore, we propose a knowledge-based augmentation GCL model that adds knowledge-aware embeddings to the learned representations for more efficient representation-level augmentation. Extensive experiments on real-world datasets demonstrate that the knowledge-based augmentation approach effectively enhances recommendation performance and shows superiority over state-of-the-art methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"684-699"},"PeriodicalIF":9.7,"publicationDate":"2025-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145982351","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
SingingHead: A Large-Scale 4D Dataset for Singing Head Animation
IF 9.7 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-10-20 | DOI: 10.1109/TMM.2025.3623560
Sijing Wu;Yunhao Li;Weitian Zhang;Jun Jia;Yucheng Zhu;Yichao Yan;Guangtao Zhai;Xiaokang Yang
Singing, a common facial activity second only to talking, can be regarded as a universal language across ethnicities and cultures and plays an important role in emotional communication, art, and entertainment. However, it is often overlooked in the field of audio-driven 3D facial animation due to the lack of singing head datasets and the domain gap between singing and talking in rhythm and amplitude. To this end, we collect a large-scale, high-quality multi-modal singing head dataset, SingingHead, which consists of more than 27 hours of synchronized singing video, 3D facial motion, singing audio, and background music from 76 individuals and 8 types of music. Along with the SingingHead dataset, we benchmark existing audio-driven 3D facial animation methods and 2D talking head methods on the singing task and find that they fail to produce satisfactory singing results. Focusing on 3D singing head animation, we first utilize the proposed singing-specific dataset to retrain the 3D facial animation methods, resulting in substantial performance improvements. Besides, considering that existing methods ignore background music and generate slowly, we propose a simple but efficient non-autoregressive VAE-based framework that takes background music as an additional input signal to generate diverse and accurate 3D singing facial motions in real time. Extensive experiments demonstrate the significance of the SingingHead dataset in promoting the development of singing head animation.
{"title":"SingingHead: A Large-Scale 4D Dataset for Singing Head Animation","authors":"Sijing Wu;Yunhao Li;Weitian Zhang;Jun Jia;Yucheng Zhu;Yichao Yan;Guangtao Zhai;Xiaokang Yang","doi":"10.1109/TMM.2025.3623560","DOIUrl":"https://doi.org/10.1109/TMM.2025.3623560","url":null,"abstract":"Singing, as a common facial movement second only to talking, can be regarded as a universal language across ethnicities and cultures, plays an important role in emotional communication, art, and entertainment. However, it is often overlooked in the field of audio-driven 3D facial animation due to the lack of singing head datasets and the domain gap between singing and talking in rhythm and amplitude. To this end, we collect a large-scale high-quality multi-modal singing head dataset, <bold>SingingHead</b>, which consists of more than 27 hours of synchronized singing video, 3D facial motion, singing audio, and background music from 76 individuals and 8 types of music. Along with the SingingHead dataset, we benchmark existing audio-driven 3D facial animation methods and 2D talking head methods on the singing task. Existing 3D facial animation methods and 2D talking head methods fail to produce satisfactory singing results. Focusing on the 3D singing head animation, we first utilize the proposed singing-specific dataset to retrain the 3D facial animation methods, resulting in substantial performance improvements. Besides, considering the absence of background music and the slow generation speed of existing methods, we propose a simple but efficient non-autoregressive VAE-based framework with background music as an input signal to generate diverse and accurate 3D singing facial motions in real time. Extensive experiments demonstrate the significance of the SingingHead dataset in promoting the development of singing head animation.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"700-714"},"PeriodicalIF":9.7,"publicationDate":"2025-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145982349","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Temporal Prompt Learning With Depth Memory for Video Mirror Detection
IF 9.7 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-10-20 | DOI: 10.1109/TMM.2025.3623544
Zhaohu Xing;Tian Ye;Xin Yang;Sixiang Chen;Huazhu Fu;Yan Nei Law;Lei Zhu
Mirror detection in dynamic scenes plays a crucial role in ensuring safety for various applications, such as drone tracking and robot navigation. However, current mirror detection models often fail in areas with mirrors that have a similar visual and color appearance to their surrounding objects. They also struggle to generalize well in complex cases, primarily due to limited annotated datasets. In this work, we propose a novel temporal prompt learning network with depth memory (TPD-Net) to address these critical challenges. Our approach includes several key components. First, we introduce a Temporal Prompt Generator (TPG) to learn temporal prompt features. Then, we devise Multi-layer Depth-aware Adaptor (MDA) modules to progressively adapt prompt features from the TPG, thereby learning mirror-related features by embedding temporal depth information as guidance. Moreover, we further refine these mirror-related features by constructing a depth memory and a Depth Memory Read module to read the temporal depths stored in the memory, boosting video mirror detection. Experimental results on a benchmark dataset show that our TPD-Net significantly outperforms 22 state-of-the-art methods in video mirror detection tasks.
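The depth-memory read step, attending from current-frame features to a bank of stored temporal depths, can be sketched as below. This is a minimal, assumption-laden PyTorch illustration: the class name `DepthMemoryRead`, the FIFO write rule, and the single-head attention read are hypothetical stand-ins for TPD-Net's actual modules.

```python
import torch
import torch.nn.functional as F

class DepthMemoryRead(torch.nn.Module):
    """Attend from current-frame features (queries) to a bank of past depth
    features (keys/values) so temporal depth cues guide mirror detection."""
    def __init__(self, dim, mem_size=8):
        super().__init__()
        self.q = torch.nn.Linear(dim, dim)
        self.k = torch.nn.Linear(dim, dim)
        self.v = torch.nn.Linear(dim, dim)
        self.register_buffer("memory", torch.zeros(mem_size, dim))

    def write(self, depth_feat):                  # depth_feat: (N, dim)
        # FIFO update: drop the oldest slot, append the newest depth feature
        self.memory = torch.cat([self.memory[1:], depth_feat.mean(0, keepdim=True)], dim=0)

    def forward(self, frame_feat):                # frame_feat: (N, dim)
        attn = F.softmax(self.q(frame_feat) @ self.k(self.memory).t()
                         / frame_feat.size(-1) ** 0.5, dim=-1)
        return frame_feat + attn @ self.v(self.memory)
```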
{"title":"Temporal Prompt Learning With Depth Memory for Video Mirror Detection","authors":"Zhaohu Xing;Tian Ye;Xin Yang;Sixiang Chen;Huazhu Fu;Yan Nei Law;Lei Zhu","doi":"10.1109/TMM.2025.3623544","DOIUrl":"https://doi.org/10.1109/TMM.2025.3623544","url":null,"abstract":"Mirror detection in dynamic scenes plays a crucial role in ensuring safety for various applications, such as drone tracking and robot navigation. However, current mirror detection models often fail in areas with mirrors that have a similar visual and color appearance to their surrounding objects. They also struggle to generalize well in complex cases, primarily due to limited annotated datasets. In this work, we propose a novel temporal prompt learning network with depth memory (TPD-Net) to address these critical challenges. Our approach includes several key components. First, we introduce a Temporal Prompt Generator (TPG) to learn temporal prompt features. Then, we devise Multi-layer Depth-aware Adaptor (MDA) modules to progressively adapt prompt features from the TPG, thereby learning mirror-related features by embedding temporal depth information as guidance. Moreover, we further refine these mirror-related features by constructing a depth memory and a Depth Memory Read module to read the temporal depths stored in the memory, boosting video mirror detection. Experimental results on a benchmark dataset show that our TPD-Net significantly outperforms 22 state-of-the-art methods in video mirror detection tasks.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"715-725"},"PeriodicalIF":9.7,"publicationDate":"2025-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145982357","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
MIP-CLIP: Multimodal Independent Prompt CLIP for Action Recognition
IF 9.7 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-10-20 | DOI: 10.1109/TMM.2025.3618557
Xiong Gao;Zhaobin Chang;Dongyi Kong;Huiyu Zhou;Yonggang Lu
Recently, the Contrastive Language Image Pre-training (CLIP) model has shown significant generalizability by optimizing the distance between visual and text features. Mainstream CLIP-based action recognition methods mitigate the low "zero-shot" generalization of the 1-of-N paradigm but also lead to a significant degradation in supervised performance. Therefore, powerful supervision and competitive "zero-shot" generalization need to be effectively traded off. In this work, a Multimodal Independent Prompt CLIP (MIP-CLIP) model is proposed to address this challenge. On the visual side, we propose a novel Video Motion Prompt (VMP) to empower the visual encoder with motion perception, performing short- and long-term motion modelling via temporal difference operations. Next, a visual classification branch is introduced to improve the discrimination of visual features. Specifically, the temporal difference and visual classification operations of the 1-of-N paradigm are extended to CLIP to satisfy the need for strong supervised performance. On the text side, we design a Class-Agnostic text prompt Template (CAT) under the constraint of a Semantic Alignment (SA) module to solve the label semantic dependency problem. Finally, a Dual-branch Feature Reconstruction (DFR) module, which uses the class confidence of the visual classification branch as input, is proposed to complete cross-modal interactions for better feature matching. Experiments are conducted on four widely used benchmarks (HMDB-51, UCF-101, Jester, and Kinetics-400). The results demonstrate that our method achieves excellent supervised performance while preserving competitive generalizability.
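The short- and long-term temporal difference idea behind the motion prompt can be illustrated with a few lines of PyTorch. This is only a sketch: the pooling into a prompt vector and the function name `video_motion_prompt` are assumptions, not the VMP module itself.

```python
import torch

def video_motion_prompt(frames):
    """frames: (B, T, C, H, W) clip tensor (raw frames or frame features).
    Short-term motion = frame-to-frame differences; long-term motion =
    deviation of each frame from the clip mean. Both are pooled into a
    compact per-clip motion descriptor."""
    short = frames[:, 1:] - frames[:, :-1]                # (B, T-1, C, H, W)
    long = frames - frames.mean(dim=1, keepdim=True)      # (B, T,   C, H, W)
    pooled = torch.cat([short.abs().mean(dim=(1, 3, 4)),
                        long.abs().mean(dim=(1, 3, 4))], dim=-1)  # (B, 2C)
    return pooled
```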
{"title":"MIP-CLIP: Multimodal Independent Prompt CLIP for Action Recognition","authors":"Xiong Gao;Zhaobin Chang;Dongyi Kong;Huiyu Zhou;Yonggang Lu","doi":"10.1109/TMM.2025.3618557","DOIUrl":"https://doi.org/10.1109/TMM.2025.3618557","url":null,"abstract":"Recently, the Contrastive Language Image Pre-training (CLIP) model has shown significant generalizability by optimizing the distance between visual and text features. The mainstream CLIP-based action recognition methods mitigate the low “zero-shot” generalization of the 1-of-N paradigm but also lead to a significant degradation in supervised performance. Therefore, powerful supervision and competitive “zero-shot” need to be effectively traded off. In this work, a Multimodal Independent Prompt CLIP (MIP-CLIP) model is proposed to address this challenge. On the visual side, we propose novel Video Motion Prompt (VMP) to empower the visual encoder with motion perception, which performs short- and long-term motion modelling via temporal difference operation. Next, the visual classification branch is introduced to improve the discrimination of visual features. Specifically, the temporal difference and visual classification operations of the 1-of-N paradigm are extended to CLIP to satisfy the need for strong supervised performance. On the text side, we design Class-Agnostic text prompt Template (CAT) under the constraint of Semantic Alignment (SA) module to solve the label semantic dependency problem. Finally, a Dual-branch Feature Reconstruction (DFR) module is proposed to complete cross-modal interactions for better feature matching, which uses the class confidence of the visual classification branch as input. The experiments are conducted on four widely used benchmarks (HMDB-51, UCF-101, Jester, and Kinetics-400). The results demonstrate that our method achieves excellent supervised performance while preserving competitive generalizability.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"9918-9930"},"PeriodicalIF":9.7,"publicationDate":"2025-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145778148","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
FUNet: Frequency-Aware and Uncertainty-Guiding Network for Rain-Hazy Image Restoration
IF 9.7 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-10-07 | DOI: 10.1109/TMM.2025.3618545
Mengkun Liu;Tao Gao;Yao Liu;Yuhan Cao;Licheng Jiao
Restoring rain-hazy images, a challenging ill-posed problem due to the irreversible nature of image degradation, is vital for intelligent decision-making in autonomous driving and outdoor surveillance systems. Despite remarkable success achieved through deep learning, current algorithms are primarily evaluated on a given type of images, and texture details and frequency-domain information are insufficiently explored in most approaches, which greatly limits model performance. To alleviate these challenges, the frequency-aware and uncertainty-guiding network (FUNet) is proposed for rain-hazy image restoration. FUNet consists of an end-to-end encoder-decoder architecture with uncertainty-guided feature refinement (UGFR) and a confidence feature feedback module (CFF). First, the UGFR is designed with uncertainty estimation (UE), an uncertainty local-global feature extraction module (ULG), and frequency component decomposition and fusion (FCDF), which learn abundant intermediate information in detail for clear image restoration. Second, in order to adequately learn rich semantic features, the CFF module is proposed to provide feedback and guidance on the learning process of the decoder. Third, a frequency-based loss function is designed to ensure training stability, which effectively preserves the spatial and spectral details of images. Experiments on seven synthetic outdoor datasets and the real-world dataset DQA demonstrate the superiority of the proposed model quantitatively and qualitatively.
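A frequency-based restoration loss is commonly implemented as an L1 distance between FFT spectra; the sketch below shows one such form for context. It is a generic, simplified example (unwrapped phase, equal weighting), not the specific loss defined in the FUNet paper.

```python
import torch

def frequency_loss(pred, target):
    """L1 distance between the FFT amplitude and phase spectra of the restored
    image and the ground truth, used as a frequency-domain constraint.
    pred, target: (B, C, H, W) real-valued images."""
    fp, ft = torch.fft.rfft2(pred), torch.fft.rfft2(target)
    amp = (fp.abs() - ft.abs()).abs().mean()                    # amplitude term
    pha = (torch.angle(fp) - torch.angle(ft)).abs().mean()      # (unwrapped) phase term
    return amp + pha
```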
{"title":"FUNet: Frequency-Aware and Uncertainty-Guiding Network for Rain-Hazy Image Restoration","authors":"Mengkun Liu;Tao Gao;Yao Liu;Yuhan Cao;Licheng Jiao","doi":"10.1109/TMM.2025.3618545","DOIUrl":"https://doi.org/10.1109/TMM.2025.3618545","url":null,"abstract":"Restoring rain-hazy images is vital for intelligent decision-making in autonomous driving and outdoor surveillance systems, which is a challenging ill-posed problem due to the irreversible nature of image degradation. Despite remarkable success achieved through deep learning, current algorithms are primarily evaluated using given kind of images, and the texture details and frequency domain information are insufficiently explored in most approaches, which greatly limits the performance of the model. To alleviate the above challenges, the frequency-aware and uncertainty-guiding network (FUNet) is proposed for rain-hazy image restoration. The FUNet consists of an end-to-end encoder-decoder architecture with the uncertainty-guided feature refinement (UGFR) and the confidence feature feedback module (CFF). First, the UGFR is designed with the uncertainty estimation (UE), uncertainty local global feature extraction module (ULG), and the frequency component decomposition and fusion (FCDF), which learns the abundant intermediate information in detail for clear image restoration. Second, in order to adequately learn rich semantic features, the CFF module is proposed to provide feedback and guidance on the learning process of the decoder. Third, the frequency-based loss function is designed to ensure training stability, which effectively guarantees the spatial and spectral details of images. Experiments on seven synthetic outdoor datasets and the real-world dataset DQA demonstrate the superiority of the proposed model quantitatively and qualitatively.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"9902-9917"},"PeriodicalIF":9.7,"publicationDate":"2025-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145778118","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Multimodal Classification and Out-of-Distribution Detection for Multimodal Intent Understanding
IF 9.7 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-10-07 | DOI: 10.1109/TMM.2025.3618541
Hanlei Zhang;Qianrui Zhou;Hua Xu;Jianhua Su;Roberto Evans;Kai Gao
Multimodal intent understanding is a significant research area that requires effectively leveraging multiple modalities to analyze human language. Existing methods face two main challenges in this domain. Firstly, they have limitations in capturing nuanced and high-level semantics underlying complex in-distribution (ID) multimodal intents. Secondly, they exhibit poor generalization when confronted with unseen out-of-distribution (OOD) data in real-world scenarios. To address these issues, we propose a novel method for both ID classification and OOD detection (MIntOOD). We first introduce a weighted feature fusion network that models multimodal representations effectively. This network dynamically learns the importance of each modality, adapting to multimodal contexts. To develop discriminative representations that are conducive to both tasks, we synthesize pseudo-OOD data from convex combinations of ID data and engage in multimodal representation learning from both coarse-grained and fine-grained perspectives. The coarse-grained perspective focuses on distinguishing between ID and OOD binary classes, while the fine-grained perspective enhances the understanding of ID data, achieving a progressive learning process that addresses tasks of increasing complexity. Additionally, the fine-grained perspective captures instance-level interactions between ID and OOD samples, promoting proximity among similar instances and separation from dissimilar ones. We establish baselines for three multimodal intent datasets and build an OOD benchmark. Extensive experiments on these datasets demonstrate that our method significantly improves OOD detection performance with a 3-10% increase in AUROC scores while achieving new state-of-the-art results in ID classification.
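Synthesizing pseudo-OOD data from convex combinations of ID samples is the most mechanical step described above, and a minimal PyTorch version is shown below. The Beta-distributed mixing weight, the restriction to cross-class pairs, and the function name are assumptions made for illustration, not the MIntOOD recipe.

```python
import torch

def synthesize_pseudo_ood(feats, labels, beta=2.0):
    """Convex combinations of ID samples from different classes serve as
    pseudo-OOD points for a coarse-grained ID-vs-OOD objective.
    feats: (N, D) ID features; labels: (N,) class labels on the same device."""
    perm = torch.randperm(feats.size(0), device=feats.device)
    cross = labels != labels[perm]                 # keep cross-class pairs only
    n = int(cross.sum())
    lam = torch.distributions.Beta(beta, beta).sample((n, 1)).to(feats.device)
    return lam * feats[cross] + (1.0 - lam) * feats[perm][cross]
```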
{"title":"Multimodal Classification and Out-of-Distribution Detection for Multimodal Intent Understanding","authors":"Hanlei Zhang;Qianrui Zhou;Hua Xu;Jianhua Su;Roberto Evans;Kai Gao","doi":"10.1109/TMM.2025.3618541","DOIUrl":"https://doi.org/10.1109/TMM.2025.3618541","url":null,"abstract":"Multimodal intent understanding is a significant research area that requires effectively leveraging multiple modalities to analyze human language. Existing methods face two main challenges in this domain. Firstly, they have limitations in capturing nuanced and high-level semantics underlying complex in-distribution (ID) multimodal intents. Secondly, they exhibit poor generalization when confronted with unseen out-of-distribution (OOD) data in real-world scenarios. To address these issues, we propose a novel method for both ID classification and OOD detection (MIntOOD). We first introduce a weighted feature fusion network that models multimodal representations effectively. This network dynamically learns the importance of each modality, adapting to multimodal contexts. To develop discriminative representations that are conducive to both tasks, we synthesize pseudo-OOD data from convex combinations of ID data and engage in multimodal representation learning from both coarse-grained and fine-grained perspectives. The coarse-grained perspective focuses on distinguishing between ID and OOD binary classes, while the fine-grained perspective enhances the understanding of ID data, achieving a progressive learning process that addresses tasks of increasing complexity. Additionally, the fine-grained perspective captures instance-level interactions between ID and OOD samples, promoting proximity among similar instances and separation from dissimilar ones. We establish baselines for three multimodal intent datasets and build an OOD benchmark. Extensive experiments on these datasets demonstrate that our method significantly improves OOD detection performance with a 3-10% increase in AUROC scores while achieving new state-of-the-art results in ID classification.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"9887-9901"},"PeriodicalIF":9.7,"publicationDate":"2025-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145830824","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Boosting Dataset Distillation With the Assistance of Crucial Samples for Visual Learning
IF 9.7 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-10-07 | DOI: 10.1109/TMM.2025.3618578
Xiaodan Li;Yao Zhu;Yuefeng Chen;Cen Chen;Jianmei Guo;Shuhui Wang
In recent years, massive datasets have significantly driven the advancement of visual learning, such as multi-modal large models, at the expense of high computational costs and extensive storage requirements. Dataset distillation (DD) aims to address this challenge by learning a small synthetic dataset such that a model trained on it can achieve test performance comparable to that of a model trained on the original dataset. This task can be formulated as a bi-level learning problem in which the outer loop optimizes the learned dataset and the inner loop updates the model parameters based on the distilled data. Different from previous studies that focus primarily on optimizing the inner loop of this bi-level problem, we delve into dataset distillation from the perspective of sample cruciality. We find that discarding easy samples in the outer loop and keeping the hard ones that are difficult to represent with the learned synthetic samples can be beneficial for DD. Motivated by this observation, we further develop an Infinite Semantic Augmentation (ISA) based dataset distillation algorithm, which discards some easier samples and implicitly enriches harder ones in the semantic space through continuous interpolation between two target feature vectors. Through detailed mathematical derivation, the joint contribution of all interpolated feature points to the training loss is expressed as an analytical closed-form solution of an integral that can be optimized with almost no extra computational cost. Experimental results on several benchmark datasets demonstrate the effectiveness of our approach in reducing dataset size while preserving model accuracy. Furthermore, we show that high-quality distilled data can also benefit downstream applications, such as continual learning and membership inference defense.
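The two ingredients named above, keeping crucial (hard) samples in the outer loop and augmenting along feature interpolations, can be sketched as follows. Both functions are hedged illustrations under stated assumptions: the loss-based hardness score, the `keep_ratio` threshold, and the Monte-Carlo averaging over interpolation weights are stand-ins; the paper replaces such sampling with a closed-form integral that is not reproduced here.

```python
import torch
import torch.nn.functional as F

def crucial_sample_filter(model, x_real, y_real, keep_ratio=0.5):
    """Outer-loop filtering: score real samples by their loss under the model
    trained on the synthetic set, drop the easiest ones, and keep the hard
    (crucial) ones for the distillation objective."""
    with torch.no_grad():
        losses = F.cross_entropy(model(x_real), y_real, reduction="none")
    k = max(1, int(keep_ratio * x_real.size(0)))
    idx = losses.topk(k).indices                  # highest loss = hardest samples
    return x_real[idx], y_real[idx]

def isa_loss_mc(classifier, feats, target_feats, labels, n_samples=8):
    """Monte-Carlo stand-in for Infinite Semantic Augmentation: average the
    classification loss over interpolations between each feature and a target
    feature (the closed-form integral in the paper removes this sampling)."""
    lam = torch.rand(n_samples, 1, 1, device=feats.device)                    # (S, 1, 1)
    mixed = lam * feats.unsqueeze(0) + (1 - lam) * target_feats.unsqueeze(0)  # (S, B, D)
    logits = classifier(mixed.flatten(0, 1))                                  # (S*B, C)
    return F.cross_entropy(logits, labels.repeat(n_samples))
```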
{"title":"Boosting Dataset Distillation With the Assistance of Crucial Samples for Visual Learning","authors":"Xiaodan Li;Yao Zhu;Yuefeng Chen;Cen Chen;Jianmei Guo;Shuhui Wang","doi":"10.1109/TMM.2025.3618578","DOIUrl":"https://doi.org/10.1109/TMM.2025.3618578","url":null,"abstract":"In recent years, massive datasets have significantly driven the advancement of visual learning such as multi-modal large model at the expense of high computational costs and extensive storage requirements. Dataset distillation (DD) aims to address this challenge by learning a small synthetic dataset such that a model trained on it can achieve a test performance comparable to that of the model trained on the original dataset. This task can be formulated as a bi-level learning problem where the outer loop optimizes the learned dataset and the inner loop updates the model parameters based on the distilled data. Different from previous studies that focus primarily on optimizing the inner loop in this bi-level problem, we delve into the task of dataset distillation from the perspective of sample cruciality. We find that discarding easy samples and keeping the hard ones that are difficult to be represented by the learned synthetic samples in the outer loop can be beneficial for DD. Motivated by this observation, we further develop an Infinite Semantic Augmentation (ISA) based dataset distillation algorithm, which discards some easier samples and implicitly enriches harder ones in the semantic space through continuous interpolation between two target feature vectors. Through detailed mathematical derivation, the joint contribution to the training loss of all interpolated feature points is formed into an analytical closed-form solution of an integral that can be optimized with almost no extra computational cost. Experimental results on several benchmark datasets demonstrate the effectiveness of our approach in reducing the dataset size while preserving the accuracy of the model. Furthermore, we show that high-quality distilled data can also benefit downstream applications, such as continual learning and membership inference defense.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"9873-9886"},"PeriodicalIF":9.7,"publicationDate":"2025-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145886604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0