Rethinking Class-Incremental Learning From a Dynamic Imbalanced Learning Perspective
Pub Date: 2025-11-14 | DOI: 10.1109/TMM.2025.3632688
Leyuan Wang;Liuyu Xiang;Yunlong Wang;Huijia Wu;Huafeng Yang;Jingqian Liu;Zhaofeng He
Deep neural networks suffer from catastrophic forgetting when continually learning new concepts. In this paper, we analyze this problem from a data imbalance point of view. We argue that the imbalance between old-task and new-task data contributes to the forgetting of old tasks. Moreover, the increasing imbalance ratio during incremental learning further aggravates the problem. To address the dynamic imbalance issue, we propose Uniform Prototype Contrastive Learning (UPCL), where uniform and compact features are learned. Specifically, we generate a set of non-learnable uniform prototypes before each task starts. We then assign these uniform prototypes to each class and guide the feature learning through prototype contrastive learning. We also dynamically adjust the relative margin between old and new classes so that the feature distribution remains balanced and compact. Finally, we demonstrate through extensive experiments that the proposed method achieves state-of-the-art performance on several benchmarks, including CIFAR-100, ImageNet-100, TinyImageNet, Food-101, and CUB-200. Experimental results show that our approach not only effectively addresses the issue of imbalanced old data in memory but also tackles the problem of imbalanced new data distributions.
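To make the prototype idea concrete, the sketch below shows one generic way to build non-learnable, well-separated class prototypes on the unit hypersphere and to train features against them with a prototype contrastive loss. It is a minimal illustration only: the prototype construction, the class assignment, and the dynamic margin scheduling of UPCL are not reproduced here, and all function names and dimensions are invented for the example.

```python
import torch
import torch.nn.functional as F

def make_fixed_prototypes(num_classes: int, feat_dim: int, seed: int = 0) -> torch.Tensor:
    """Non-learnable class prototypes on the unit hypersphere.

    For illustration we take columns of a random orthogonal matrix (assumes
    num_classes <= feat_dim), which gives mutually orthogonal, hence
    well-separated, unit vectors."""
    g = torch.Generator().manual_seed(seed)
    a = torch.randn(feat_dim, feat_dim, generator=g)
    q, _ = torch.linalg.qr(a)           # orthonormal columns
    protos = q[:, :num_classes].t()     # (num_classes, feat_dim), unit norm
    return protos                       # kept frozen: no requires_grad

def prototype_contrastive_loss(feats: torch.Tensor, labels: torch.Tensor,
                               protos: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Cross-entropy over cosine similarities between features and fixed prototypes."""
    feats = F.normalize(feats, dim=1)
    logits = feats @ protos.t() / tau   # (batch, num_classes)
    return F.cross_entropy(logits, labels)

# toy usage
protos = make_fixed_prototypes(num_classes=10, feat_dim=128)
feats = torch.randn(32, 128, requires_grad=True)
labels = torch.randint(0, 10, (32,))
loss = prototype_contrastive_loss(feats, labels, protos)
loss.backward()
```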
IEEE Transactions on Multimedia, vol. 28, pp. 825-836.
DiffW: Multi-Encoder Based on Conditional Diffusion Model for Robust Image Watermarking
Pub Date: 2025-11-14 | DOI: 10.1109/TMM.2025.3632631
Ting Luo;Renzhi Hu;Zhouyan He;Gangyi Jiang;Haiyong Xu;Yang Song;Chin-Chen Chang
Existing deep-learning-based robust watermarking models generally apply a discriminator to form a generative adversarial network (GAN) for increasing the quality of encoded images, and adopt a single encoder to embed the watermark. However, GAN training is unstable, and a single encoder cannot fully adjust the watermark distribution, which limits watermarking performance. To address these limitations, this paper presents a multi-encoder robust image watermarking model based on the conditional diffusion model (CDM), named DiffW. To enhance stability, the CDM-based multi-encoder structure replaces the GAN and optimizes the watermark distribution iteratively. Specifically, the operation at each timestep of the forward and reverse diffusion processes of the CDM is regarded as an encoder, overcoming the shortcomings of the single-encoder structure. At the training stage, under the guidance of the conditional noisy image, the forward process trains each encoder to fuse the image and watermark and generate high-quality encoded images. At the testing stage, only a small number of the trained forward-process encoders are used, reducing the time complexity. Furthermore, to improve watermarking robustness, a channel attention module (CAM) is designed to extract the main watermark features by mining channel correlations for multi-layer fusion, so that the watermark can be embedded into imperceptible and textured areas. Experimental results reveal that, compared with existing watermarking models, the proposed DiffW achieves better watermarking invisibility and robustness.
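As a rough illustration of the "one encoder per diffusion timestep" pattern the abstract describes, the sketch below chains a few small per-timestep blocks that fuse the cover image, a spatial watermark map, and a conditional guide image, and applies only a subset of them at test time. The layers, step count, and fusion rule are placeholders and do not follow the actual DiffW architecture or its diffusion formulation.

```python
import torch
import torch.nn as nn

class TimestepEncoder(nn.Module):
    """One per-timestep embedding block: fuses the current image state with the
    watermark under the guidance of a conditional image (illustrative layers only)."""
    def __init__(self, channels: int = 3, wm_channels: int = 1, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * channels + wm_channels, hidden, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 3, padding=1),
        )

    def forward(self, x, watermark, condition):
        residual = self.net(torch.cat([x, condition, watermark], dim=1))
        return x + 0.1 * residual  # small per-step adjustment of the embedding

class MultiEncoderEmbedder(nn.Module):
    """Chain of per-timestep encoders, mirroring the 'one encoder per forward
    diffusion step' idea; only a few of them are applied at test time."""
    def __init__(self, steps: int = 4):
        super().__init__()
        self.encoders = nn.ModuleList(TimestepEncoder() for _ in range(steps))

    def forward(self, cover, watermark, condition, steps=None):
        x = cover
        for enc in self.encoders[: steps or len(self.encoders)]:
            x = enc(x, watermark, condition)
        return x  # encoded (watermarked) image

# toy usage
embedder = MultiEncoderEmbedder()
cover = torch.rand(2, 3, 64, 64)
wm = torch.rand(2, 1, 64, 64)     # watermark expanded to a spatial map
cond = torch.rand(2, 3, 64, 64)   # conditional noisy image used as guidance
encoded = embedder(cover, wm, cond, steps=2)  # use only a few trained encoders
```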
IEEE Transactions on Multimedia, vol. 28, pp. 837-852.
Retain, Blend, and Exchange: A Quality-Aware Spatial-Stereo Fusion Approach for Event Stream Recognition
Pub Date: 2025-11-12 | DOI: 10.1109/TMM.2025.3607771
Lan Chen;Dong Li;Xiao Wang;Pengpeng Shao;Wei Zhang;Yaowei Wang;Yonghong Tian;Jin Tang
Current event stream-based pattern recognition models typically represent the event stream as a point cloud, voxels, images, and the like, and formulate multiple deep neural networks to acquire their features. Although considerable results can be achieved in simple cases, the performance of the model may be restricted by monotonous modality expressions, sub-optimal fusion, and readout mechanisms. In this article, we put forward a novel dual-stream framework for event stream-based pattern recognition through differentiated fusion, called EFV++. It models two common event representations simultaneously, i.e., event images and event voxels. The spatial and three-dimensional stereo information are learned separately using a Transformer and a Graph Neural Network (GNN). We believe the features of each representation still contain both efficient and redundant components, and a sub-optimal solution may be obtained if we fuse them directly without differentiation. Thus, we divide each representation's features into three levels and retain high-quality features, blend medium-quality features, and exchange low-quality features. The enhanced dual features are then provided to the fusion Transformer together with bottleneck features. In addition, we introduce a novel hybrid interaction readout mechanism to enhance the diversity of the final representations. Comprehensive experiments validate that the proposed framework attains cutting-edge performance on a variety of widely used event stream-based classification datasets. In particular, we achieve a new state-of-the-art performance of 90.51% on the Bullying10K dataset, outpacing the runner-up by +2.21%.
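The retain/blend/exchange idea can be illustrated with a small quality-aware fusion routine: each branch ranks its own tokens by a simple quality proxy, keeps its best third, averages its middle third with the other branch, and swaps its worst third for the other branch's tokens. This is only a sketch under assumed shapes and scoring; the quality estimation and fusion Transformer of EFV++ are not reproduced.

```python
import torch

def retain_blend_exchange(feat_a: torch.Tensor, feat_b: torch.Tensor):
    """Quality-aware fusion of two aligned token sets of shape (batch, tokens, dim).

    Each branch ranks its own tokens by a quality proxy (here the feature norm):
    the top third is retained, the middle third is blended with the other branch
    at the same positions, and the bottom third is replaced by the other branch's
    tokens. Illustrative only."""
    def process(src, other):
        b, n, d = src.shape
        k = n // 3
        score = src.norm(dim=-1)                        # (b, n) quality proxy
        order = score.argsort(dim=1, descending=True)   # best tokens first
        out = src.clone()
        mid = order[:, k:2 * k].unsqueeze(-1).expand(-1, -1, d)   # medium positions
        low = order[:, 2 * k:].unsqueeze(-1).expand(-1, -1, d)    # low positions
        out.scatter_(1, mid, 0.5 * (src.gather(1, mid) + other.gather(1, mid)))  # blend
        out.scatter_(1, low, other.gather(1, low))                               # exchange
        return out

    return process(feat_a, feat_b), process(feat_b, feat_a)

# toy usage: two branches with 12 tokens of dimension 64
a, b = torch.randn(2, 12, 64), torch.randn(2, 12, 64)
fused_a, fused_b = retain_blend_exchange(a, b)
```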
IEEE Transactions on Multimedia, vol. 27, pp. 8926-8939.
Knowledge-Enhanced Graph Contrastive Learning for Recommendations
Pub Date: 2025-10-31 | DOI: 10.1109/TMM.2025.3626976
Xiaofeng Wang;Zhengjie Zhang;Yuanyuan Qi;Guodong Shen;Shuaiming Lai;Yuntao Chen;Fang Zhou;Daying Quan
Graph contrastive learning (GCL), which captures essential features from augmented graphs to address data sparsity issues, has recently demonstrated promising potential in improving recommendation performance. Most GCL-based recommendation methods learn consistent entity representations from user-item bipartite graphs through structural perturbations. However, these approaches impose an additional computational cost and have been shown to be insensitive to various graph augmentations, resulting in limited improvements in long-tail recommendation scenarios. To address this issue, we propose a novel framework for recommendation, Knowledge-Enhanced graph Contrastive Learning (KECL), which adopts knowledge graph-based embedding augmentation instead of graph enhancement to construct views for GCL. Specifically, we introduce a knowledge aggregation module with a heterogeneous attentive aggregator to capture relation heterogeneity in the knowledge graph. Furthermore, we propose a knowledge-based augmentation GCL model that adds knowledge-aware embeddings to the learned representations for more efficient representation-level augmentation. Extensive experiments on real-world datasets demonstrate that the knowledge-based augmentation approach effectively enhances recommendation performance and shows superiority over state-of-the-art methods.
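A minimal sketch of representation-level, knowledge-based augmentation is given below: instead of perturbing the bipartite graph, a second contrastive view is created by adding scaled knowledge-aware embeddings to the learned entity embeddings, and the two views are pulled together with a standard InfoNCE loss. The aggregator that produces the knowledge embeddings and KECL's exact objective are not shown; all names and the scaling factor are illustrative.

```python
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.2) -> torch.Tensor:
    """Standard InfoNCE between two aligned views (matching rows are positives)."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau
    targets = torch.arange(z1.size(0), device=z1.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def knowledge_augmented_views(entity_emb: torch.Tensor, knowledge_emb: torch.Tensor,
                              alpha: float = 0.3):
    """Representation-level augmentation: one view is the learned embedding itself,
    the other adds a scaled knowledge-aware embedding instead of perturbing the graph."""
    return entity_emb, entity_emb + alpha * knowledge_emb

# toy usage: 256 items with 64-d embeddings from the recommender and the KG aggregator
item_emb = torch.randn(256, 64, requires_grad=True)
kg_emb = torch.randn(256, 64)
v1, v2 = knowledge_augmented_views(item_emb, kg_emb)
loss = info_nce(v1, v2)
loss.backward()
```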
IEEE Transactions on Multimedia, vol. 28, pp. 684-699.
SingingHead: A Large-Scale 4D Dataset for Singing Head Animation
Pub Date: 2025-10-20 | DOI: 10.1109/TMM.2025.3623560
Sijing Wu;Yunhao Li;Weitian Zhang;Jun Jia;Yucheng Zhu;Yichao Yan;Guangtao Zhai;Xiaokang Yang
Singing, a facial movement second only to talking in frequency, can be regarded as a universal language across ethnicities and cultures and plays an important role in emotional communication, art, and entertainment. However, it is often overlooked in the field of audio-driven 3D facial animation due to the lack of singing head datasets and the domain gap between singing and talking in rhythm and amplitude. To this end, we collect a large-scale, high-quality multi-modal singing head dataset, SingingHead, which consists of more than 27 hours of synchronized singing video, 3D facial motion, singing audio, and background music from 76 individuals and 8 types of music. Along with the SingingHead dataset, we benchmark existing audio-driven 3D facial animation methods and 2D talking head methods on the singing task and find that they fail to produce satisfactory singing results. Focusing on 3D singing head animation, we first use the proposed singing-specific dataset to retrain the 3D facial animation methods, resulting in substantial performance improvements. Besides, considering the absence of background music as an input in existing methods and their slow generation speed, we propose a simple but efficient non-autoregressive VAE-based framework that takes background music as an additional input signal to generate diverse and accurate 3D singing facial motions in real time. Extensive experiments demonstrate the significance of the SingingHead dataset in promoting the development of singing head animation.
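The following skeleton illustrates what a non-autoregressive, music-conditioned VAE for facial motion can look like: the whole clip and its audio/music conditions are encoded into a single latent, and the full motion sequence is decoded in one pass rather than frame by frame. Layer choices, feature dimensions, and the motion parameterisation are assumptions for the example and are not taken from the paper.

```python
import torch
import torch.nn as nn

class NonAutoregressiveVAE(nn.Module):
    """Minimal conditional VAE: encodes a whole motion clip with its audio/music
    conditions into one latent, then decodes the full motion sequence in a single
    pass (no frame-by-frame autoregression). Purely illustrative dimensions/layers."""
    def __init__(self, motion_dim=64, audio_dim=128, music_dim=128, latent_dim=32):
        super().__init__()
        cond_dim = audio_dim + music_dim
        self.encoder = nn.GRU(motion_dim + cond_dim, 128, batch_first=True)
        self.to_mu = nn.Linear(128, latent_dim)
        self.to_logvar = nn.Linear(128, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + cond_dim, 256), nn.ReLU(),
            nn.Linear(256, motion_dim),
        )

    def forward(self, motion, audio, music):
        cond = torch.cat([audio, music], dim=-1)              # (B, T, cond_dim)
        _, h = self.encoder(torch.cat([motion, cond], dim=-1))
        mu, logvar = self.to_mu(h[-1]), self.to_logvar(h[-1])
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterisation
        z_seq = z.unsqueeze(1).expand(-1, motion.size(1), -1)
        recon = self.decoder(torch.cat([z_seq, cond], dim=-1))  # all frames at once
        return recon, mu, logvar

# toy usage: 2 clips of 150 frames
model = NonAutoregressiveVAE()
motion = torch.randn(2, 150, 64)
audio = torch.randn(2, 150, 128)
music = torch.randn(2, 150, 128)
recon, mu, logvar = model(motion, audio, music)
```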
IEEE Transactions on Multimedia, vol. 28, pp. 700-714.
Temporal Prompt Learning With Depth Memory for Video Mirror Detection
Pub Date: 2025-10-20 | DOI: 10.1109/TMM.2025.3623544
Zhaohu Xing;Tian Ye;Xin Yang;Sixiang Chen;Huazhu Fu;Yan Nei Law;Lei Zhu
Mirror detection in dynamic scenes plays a crucial role in ensuring safety for various applications, such as drone tracking and robot navigation. However, current mirror detection models often fail in areas with mirrors that have a similar visual and color appearance to their surrounding objects. They also struggle to generalize well in complex cases, primarily due to limited annotated datasets. In this work, we propose a novel temporal prompt learning network with depth memory (TPD-Net) to address these critical challenges. Our approach includes several key components. First, we introduce a Temporal Prompt Generator (TPG) to learn temporal prompt features. Then, we devise Multi-layer Depth-aware Adaptor (MDA) modules to progressively adapt prompt features from the TPG, thereby learning mirror-related features by embedding temporal depth information as guidance. Moreover, we further refine these mirror-related features by constructing a depth memory and a Depth Memory Read module to read the temporal depths stored in the memory, boosting video mirror detection. Experimental results on a benchmark dataset show that our TPD-Net significantly outperforms 22 state-of-the-art methods in video mirror detection tasks.
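A depth-memory read can be sketched as a cross-attention step in which the current frame's features query a bank of depth features stored from earlier frames. The sketch below shows this generic pattern with placeholder dimensions; it is not the TPD-Net module itself.

```python
import torch
import torch.nn as nn

class DepthMemoryRead(nn.Module):
    """Reads a bank of stored temporal depth features with cross-attention:
    the current frame's features act as queries, the memory as keys/values.
    A generic sketch of a memory-read step, not the module used in TPD-Net."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_feats, depth_memory):
        # frame_feats:  (B, N, dim) tokens of the current frame
        # depth_memory: (B, M, dim) depth features accumulated from earlier frames
        read, _ = self.attn(frame_feats, depth_memory, depth_memory)
        return self.norm(frame_feats + read)  # residual refinement with the memory

# toy usage
reader = DepthMemoryRead()
feats = torch.randn(2, 196, 256)       # e.g. 14x14 tokens of the current frame
memory = torch.randn(2, 3 * 196, 256)  # depth tokens stored from 3 past frames
refined = reader(feats, memory)
```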
IEEE Transactions on Multimedia, vol. 28, pp. 715-725.
MIP-CLIP: Multimodal Independent Prompt CLIP for Action Recognition
Pub Date: 2025-10-20 | DOI: 10.1109/TMM.2025.3618557
Xiong Gao;Zhaobin Chang;Dongyi Kong;Huiyu Zhou;Yonggang Lu
Recently, the Contrastive Language-Image Pre-training (CLIP) model has shown significant generalizability by optimizing the distance between visual and text features. Mainstream CLIP-based action recognition methods mitigate the low "zero-shot" generalization of the 1-of-N paradigm but also lead to a significant degradation in supervised performance. Therefore, strong supervised performance and competitive "zero-shot" generalization need to be traded off effectively. In this work, a Multimodal Independent Prompt CLIP (MIP-CLIP) model is proposed to address this challenge. On the visual side, we propose a novel Video Motion Prompt (VMP) to empower the visual encoder with motion perception, performing short- and long-term motion modelling via a temporal difference operation. Next, a visual classification branch is introduced to improve the discrimination of visual features. Specifically, the temporal difference and visual classification operations of the 1-of-N paradigm are extended to CLIP to satisfy the need for strong supervised performance. On the text side, we design a Class-Agnostic text prompt Template (CAT) under the constraint of a Semantic Alignment (SA) module to solve the label semantic dependency problem. Finally, a Dual-branch Feature Reconstruction (DFR) module is proposed to complete cross-modal interactions for better feature matching, using the class confidence of the visual classification branch as input. Experiments are conducted on four widely used benchmarks (HMDB-51, UCF-101, Jester, and Kinetics-400). The results demonstrate that our method achieves excellent supervised performance while preserving competitive generalizability.
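The temporal difference operation behind a motion prompt can be illustrated as follows: adjacent-frame feature differences capture short-term motion, differences to the first frame capture longer-term motion, and a projection turns the pooled result into a prompt vector. Shapes and the pooling/projection choices are assumptions for the example; the actual VMP design is not reproduced.

```python
import torch
import torch.nn as nn

class TemporalDifferencePrompt(nn.Module):
    """Builds a motion-aware prompt from per-frame features via temporal differences:
    adjacent-frame differences give short-term motion, differences to the first
    frame give longer-term motion. A sketch of the idea only, not the exact VMP."""
    def __init__(self, dim: int = 512, prompt_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(2 * dim, prompt_dim)

    def forward(self, frame_feats):
        # frame_feats: (B, T, dim) per-frame embeddings from the visual encoder
        short = frame_feats[:, 1:] - frame_feats[:, :-1]       # (B, T-1, dim)
        long = frame_feats[:, 1:] - frame_feats[:, :1]         # (B, T-1, dim)
        motion = torch.cat([short, long], dim=-1).mean(dim=1)  # pool over time
        return self.proj(motion)                               # (B, prompt_dim)

# toy usage: 8-frame clips with 512-d frame features
vmp = TemporalDifferencePrompt()
prompt = vmp(torch.randn(4, 8, 512))
```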
IEEE Transactions on Multimedia, vol. 27, pp. 9918-9930.
FUNet: Frequency-Aware and Uncertainty-Guiding Network for Rain-Hazy Image Restoration
Pub Date: 2025-10-07 | DOI: 10.1109/TMM.2025.3618545
Mengkun Liu;Tao Gao;Yao Liu;Yuhan Cao;Licheng Jiao
Restoring rain-hazy images is vital for intelligent decision-making in autonomous driving and outdoor surveillance systems, and it is a challenging ill-posed problem due to the irreversible nature of image degradation. Despite the remarkable success achieved through deep learning, current algorithms are primarily evaluated on a given kind of image, and most approaches insufficiently explore texture details and frequency-domain information, which greatly limits model performance. To alleviate these challenges, the frequency-aware and uncertainty-guiding network (FUNet) is proposed for rain-hazy image restoration. FUNet consists of an end-to-end encoder-decoder architecture with uncertainty-guided feature refinement (UGFR) and a confidence feature feedback (CFF) module. First, UGFR is designed with uncertainty estimation (UE), an uncertainty local-global feature extraction module (ULG), and frequency component decomposition and fusion (FCDF), which learn rich intermediate information in detail for clear image restoration. Second, to adequately learn rich semantic features, the CFF module is proposed to provide feedback and guidance on the learning process of the decoder. Third, a frequency-based loss function is designed to ensure training stability, which effectively preserves the spatial and spectral details of images. Experiments on seven synthetic outdoor datasets and the real-world DQA dataset demonstrate the superiority of the proposed model both quantitatively and qualitatively.
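Frequency-domain losses of the kind mentioned above are commonly built from the 2D Fourier transforms of the restored and clean images; the sketch below penalises both the amplitude gap and the complex spectrum gap. It is one standard formulation and may differ from the exact loss used in FUNet.

```python
import torch

def frequency_loss(restored: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """L1 distance between the 2D Fourier amplitudes of the restored and clean
    images, plus the magnitude of the complex spectrum difference. One common
    way to build a frequency-domain loss; not necessarily FUNet's definition."""
    fr = torch.fft.fft2(restored, norm="ortho")
    ft = torch.fft.fft2(target, norm="ortho")
    amp = (fr.abs() - ft.abs()).abs().mean()  # amplitude gap
    spec = (fr - ft).abs().mean()             # full complex spectrum gap
    return amp + spec

# toy usage
pred = torch.rand(2, 3, 128, 128, requires_grad=True)
clean = torch.rand(2, 3, 128, 128)
loss = frequency_loss(pred, clean)
loss.backward()
```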
IEEE Transactions on Multimedia, vol. 27, pp. 9902-9917.
Multimodal Classification and Out-of-Distribution Detection for Multimodal Intent Understanding
Pub Date: 2025-10-07 | DOI: 10.1109/TMM.2025.3618541
Hanlei Zhang;Qianrui Zhou;Hua Xu;Jianhua Su;Roberto Evans;Kai Gao
Multimodal intent understanding is a significant research area that requires effectively leveraging multiple modalities to analyze human language. Existing methods face two main challenges in this domain. Firstly, they have limitations in capturing nuanced and high-level semantics underlying complex in-distribution (ID) multimodal intents. Secondly, they exhibit poor generalization when confronted with unseen out-of-distribution (OOD) data in real-world scenarios. To address these issues, we propose a novel method for both ID classification and OOD detection (MIntOOD). We first introduce a weighted feature fusion network that models multimodal representations effectively. This network dynamically learns the importance of each modality, adapting to multimodal contexts. To develop discriminative representations that are conducive to both tasks, we synthesize pseudo-OOD data from convex combinations of ID data and engage in multimodal representation learning from both coarse-grained and fine-grained perspectives. The coarse-grained perspective focuses on distinguishing between ID and OOD binary classes, while the fine-grained perspective enhances the understanding of ID data, achieving a progressive learning process that addresses tasks of increasing complexity. Additionally, the fine-grained perspective captures instance-level interactions between ID and OOD samples, promoting proximity among similar instances and separation from dissimilar ones. We establish baselines for three multimodal intent datasets and build an OOD benchmark. Extensive experiments on these datasets demonstrate that our method significantly improves OOD detection performance with a 3-10% increase in AUROC scores while achieving new state-of-the-art results in ID classification.
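The pseudo-OOD construction can be illustrated with a mixup-style routine that forms convex combinations of ID features whose mixing partners come from different classes, so the mixtures fall between class regions. The coefficient sampling and cross-class filtering here are assumptions for the example; MIntOOD's exact synthesis procedure is not reproduced.

```python
import torch

def synthesize_pseudo_ood(features: torch.Tensor, labels: torch.Tensor,
                          beta_alpha: float = 2.0) -> torch.Tensor:
    """Create pseudo-OOD points as convex combinations of ID features whose
    partners belong to different classes. Representation-level sketch only."""
    perm = torch.randperm(features.size(0))
    diff_class = labels != labels[perm]  # keep only cross-class mixtures
    lam = torch.distributions.Beta(beta_alpha, beta_alpha).sample((features.size(0), 1))
    mixed = lam * features + (1.0 - lam) * features[perm]
    return mixed[diff_class]             # pseudo-OOD samples

# toy usage: 128 fused multimodal features with 20 intent classes
feats = torch.randn(128, 256)
labels = torch.randint(0, 20, (128,))
pseudo_ood = synthesize_pseudo_ood(feats, labels)
```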
IEEE Transactions on Multimedia, vol. 27, pp. 9887-9901.
Boosting Dataset Distillation With the Assistance of Crucial Samples for Visual Learning
Pub Date: 2025-10-07 | DOI: 10.1109/TMM.2025.3618578
Xiaodan Li;Yao Zhu;Yuefeng Chen;Cen Chen;Jianmei Guo;Shuhui Wang
In recent years, massive datasets have significantly driven the advancement of visual learning, such as multi-modal large models, at the expense of high computational costs and extensive storage requirements. Dataset distillation (DD) aims to address this challenge by learning a small synthetic dataset such that a model trained on it can achieve test performance comparable to that of a model trained on the original dataset. This task can be formulated as a bi-level learning problem in which the outer loop optimizes the learned dataset and the inner loop updates the model parameters based on the distilled data. Different from previous studies that focus primarily on optimizing the inner loop of this bi-level problem, we approach dataset distillation from the perspective of sample cruciality. We find that discarding easy samples and keeping the hard ones that are difficult to represent with the learned synthetic samples in the outer loop can be beneficial for DD. Motivated by this observation, we further develop an Infinite Semantic Augmentation (ISA) based dataset distillation algorithm, which discards some easier samples and implicitly enriches harder ones in the semantic space through continuous interpolation between two target feature vectors. Through detailed mathematical derivation, the joint contribution of all interpolated feature points to the training loss is expressed as an analytical closed-form solution of an integral, which can be optimized with almost no extra computational cost. Experimental results on several benchmark datasets demonstrate the effectiveness of our approach in reducing the dataset size while preserving model accuracy. Furthermore, we show that high-quality distilled data can also benefit downstream applications such as continual learning and membership inference defense.
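To see how an integral over a continuum of interpolated targets can collapse into a closed form, consider the analogous computation for a plain squared loss between a feature h and the interpolation between two target vectors t_1 and t_2, with the mixing coefficient lambda uniform on [0, 1]:

```latex
\int_{0}^{1} \left\| h - \bigl(\lambda t_{1} + (1-\lambda)\, t_{2}\bigr) \right\|_{2}^{2} \, \mathrm{d}\lambda
  \;=\; \left\| h - \frac{t_{1} + t_{2}}{2} \right\|_{2}^{2} \;+\; \frac{1}{12}\,\left\| t_{1} - t_{2} \right\|_{2}^{2}
```

That is, the infinite family of interpolated points contributes only through its midpoint and spread, so no per-lambda computation has to be materialised. This toy squared-loss case merely mirrors the structure of the argument; the actual ISA derivation applies the same idea to the distillation training loss used in the paper.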
IEEE Transactions on Multimedia, vol. 27, pp. 9873-9886.