Pub Date: 2025-10-07 | DOI: 10.1109/TMM.2025.3618578
Xiaodan Li;Yao Zhu;Yuefeng Chen;Cen Chen;Jianmei Guo;Shuhui Wang
In recent years, massive datasets have significantly driven the advancement of visual learning, such as multi-modal large models, at the expense of high computational costs and extensive storage requirements. Dataset distillation (DD) aims to address this challenge by learning a small synthetic dataset such that a model trained on it can achieve test performance comparable to that of a model trained on the original dataset. This task can be formulated as a bi-level learning problem in which the outer loop optimizes the learned dataset and the inner loop updates the model parameters based on the distilled data. Different from previous studies that focus primarily on optimizing the inner loop of this bi-level problem, we approach dataset distillation from the perspective of sample cruciality. We find that discarding easy samples and keeping the hard ones that are difficult to represent with the learned synthetic samples in the outer loop is beneficial for DD. Motivated by this observation, we further develop an Infinite Semantic Augmentation (ISA) based dataset distillation algorithm, which discards some easier samples and implicitly enriches harder ones in the semantic space through continuous interpolation between two target feature vectors. Through detailed mathematical derivation, the joint contribution of all interpolated feature points to the training loss is expressed as an analytical closed-form solution of an integral that can be optimized with almost no extra computational cost. Experimental results on several benchmark datasets demonstrate the effectiveness of our approach in reducing the dataset size while preserving model accuracy. Furthermore, we show that high-quality distilled data can also benefit downstream applications such as continual learning and membership inference defense.
{"title":"Boosting Dataset Distillation With the Assistance of Crucial Samples for Visual Learning","authors":"Xiaodan Li;Yao Zhu;Yuefeng Chen;Cen Chen;Jianmei Guo;Shuhui Wang","doi":"10.1109/TMM.2025.3618578","DOIUrl":"https://doi.org/10.1109/TMM.2025.3618578","url":null,"abstract":"In recent years, massive datasets have significantly driven the advancement of visual learning such as multi-modal large model at the expense of high computational costs and extensive storage requirements. Dataset distillation (DD) aims to address this challenge by learning a small synthetic dataset such that a model trained on it can achieve a test performance comparable to that of the model trained on the original dataset. This task can be formulated as a bi-level learning problem where the outer loop optimizes the learned dataset and the inner loop updates the model parameters based on the distilled data. Different from previous studies that focus primarily on optimizing the inner loop in this bi-level problem, we delve into the task of dataset distillation from the perspective of sample cruciality. We find that discarding easy samples and keeping the hard ones that are difficult to be represented by the learned synthetic samples in the outer loop can be beneficial for DD. Motivated by this observation, we further develop an Infinite Semantic Augmentation (ISA) based dataset distillation algorithm, which discards some easier samples and implicitly enriches harder ones in the semantic space through continuous interpolation between two target feature vectors. Through detailed mathematical derivation, the joint contribution to the training loss of all interpolated feature points is formed into an analytical closed-form solution of an integral that can be optimized with almost no extra computational cost. Experimental results on several benchmark datasets demonstrate the effectiveness of our approach in reducing the dataset size while preserving the accuracy of the model. Furthermore, we show that high-quality distilled data can also benefit downstream applications, such as continual learning and membership inference defense.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"9873-9886"},"PeriodicalIF":9.7,"publicationDate":"2025-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145886604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-06 | DOI: 10.1109/TMM.2025.3618542
Peirong Ma;Wu Ran;Zhiquan He;Jian Pu;Hong Lu
Open-vocabulary multi-label classification (OV-MLC) aims to leverage the rich multi-modal knowledge of vision-language pre-training (VLP) models to further improve the recognition of unseen (novel) classes beyond the training set in multi-label scenarios. Existing OV-MLC methods only perform predictions on single hierarchical regions and aggregate the prediction scores of these regions through simple top-k mean pooling. This fails to unleash the potential of rich hierarchical region clues in multi-label images and does not fully exploit the discriminative information from all regions in the image, resulting in sub-optimal performance. In this work, we propose a novel OV-MLC framework to fully harness the power of multiple hierarchical region clues. Specifically, we first design a hierarchical clue gathering (HCG) module to gather different hierarchical clues, enabling more precise recognition of multiple object categories of different sizes in a multi-label image. Then, by viewing multi-label classification as single-label classification of each region within the image, we present a novel hierarchical score aggregation (HSA) approach, thereby better utilizing the predictions of each image region for each class. We also utilize a well-designed region selection strategy (RSS) to eliminate noise or background regions in an image that are irrelevant to classification, achieving higher multi-label classification accuracy. In addition, we propose a hybrid prompt learning (HPL) strategy to enhance visual-semantic consistency while preserving the generalization capability of label embeddings for unseen classes. Extensive experiments on public benchmark datasets demonstrate that our method significantly outperforms the current state-of-the-art.
{"title":"Unleashing the Potential of Hierarchical Region Clues for Open-Vocabulary Multi-Label Classification","authors":"Peirong Ma;Wu Ran;Zhiquan He;Jian Pu;Hong Lu","doi":"10.1109/TMM.2025.3618542","DOIUrl":"https://doi.org/10.1109/TMM.2025.3618542","url":null,"abstract":"Open-vocabulary multi-label classification (OV- MLC) aims to leverage the rich multi-modal knowledge from Vision-language pre-training (VLP) models to further improve the recognition ability for unseen (novel) classes beyond the training set in multi-label scenarios. Existing OV-MLC methods only perform predictions on single hierarchical regions, and aggregate the prediction scores of these regions through simple <italic>top-k</i> mean pooling. This fails to unleash the potential of rich hierarchical region clues in multi-label images and does not fully exploit the discriminative information from all regions in the image, resulting in sub-optimal performance. In this work, we propose a novel OV-MLC framework to fully harness the power of multiple hierarchical region clues. Specifically, we first design a hierarchical clue gathering (HCG) module to gather different hierarchical clues, enabling more precise recognition of multiple object categories with different sizes in a multi-label image. Then, by viewing multi-label classification as single-label classification of each region within the image, we present a novel hierarchical score aggregation (HSA) approach, thereby better utilizing the predictions of each image region for each class. We also utilize a well-designed region selection strategy (RSS) to eliminate noise or background regions in an image that are irrelevant to classification, achieving higher multi-label classification accuracy. In addition, we propose a hybrid prompt learning (HPL) strategy to enhance visual-semantic consistency while preserving the generalization capability of label embeddings for unseen classes. Extensive experiments on public benchmark datasets demonstrate that our method significantly outperforms the current state-of-the-art.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"9832-9846"},"PeriodicalIF":9.7,"publicationDate":"2025-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145885034","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-06 | DOI: 10.1109/TMM.2025.3618537
Kaichen Chi;Wei Jing;Junjie Li;Qiang Li;Qi Wang
Shadows are dark regions that typically exhibit low illumination intensity. Admittedly, infrared images can provide robust illumination cues that visible images lack, but existing methods ignore the collaboration between these heterogeneous modalities. To fill this gap, we propose a weakly supervised shadow removal network with a spherical feature space, dubbed S2-ShadowNet, to explore the best of both worlds for the visible and infrared modalities. Specifically, we employ a modal translation (visible-to-infrared) model to learn the cross-domain mapping and thus generate realistic infrared samples. Then, a Swin Transformer is utilized to extract strongly representational visible/infrared features. Simultaneously, the extracted features are mapped to a smooth spherical manifold, which alleviates the domain shift through regularization. A well-designed similarity loss and an orthogonality loss are embedded in the spherical space, prompting the separation of private visible/infrared features and the alignment of shared visible/infrared features through constraints on both representation content and orientation. This design encourages implicit reciprocity between the modalities, providing a novel insight into shadow removal. Notably, ground truth is not available in practice, so S2-ShadowNet is trained by cropping shadow and shadow-free patches from the shadow image itself, avoiding stereotypical and strict paired data acquisition. More importantly, we contribute a large-scale weakly supervised shadow removal benchmark that makes shadow removal independent of specific scenario constraints. Extensive experiments demonstrate that S2-ShadowNet outperforms state-of-the-art methods in both qualitative and quantitative comparisons.
{"title":"Cross-Modal Spherical Aggregation for Weakly Supervised Remote Sensing Shadow Removal","authors":"Kaichen Chi;Wei Jing;Junjie Li;Qiang Li;Qi Wang","doi":"10.1109/TMM.2025.3618537","DOIUrl":"https://doi.org/10.1109/TMM.2025.3618537","url":null,"abstract":"Shadows are dark areas, typically rendering low illumination intensity. Admittedly, the infrared image can provide robust illumination cues that the visible image lacks, but existing methods ignore the collaboration between heterogeneous modalities. To fill this gap, we propose a weakly supervised shadow removal network with a spherical feature space, dubbed S2-ShadowNet, to explore the best of both worlds for visible and infrared modalities. Specifically, we employ a modal translation (visible-to-infrared) model to learn the cross-domain mapping, thus generating realistic infrared samples. Then, Swin Transformer is utilized to extract strong representational visible/infrared features. Simultaneously, the extracted features are mapped to the smooth spherical manifold, which alleviates the domain shift through regularization. Well-designed similarity loss and orthogonality loss are embedded into the spherical space, prompting the separation of private visible/infrared features and the alignment of shared visible/infrared features through constraints on both representation content and orientation. Such a manner encourages implicit reciprocity between modalities, thus providing a novel insight into shadow removal. Notably, ground truth is not available in practice, thus S2-ShadowNet is trained by cropping shadow and shadow-free patches from the shadow image itself, avoiding stereotypical and strict pair data acquisition. More importantly, we contribute a large-scale weakly supervised shadow removal benchmark that makes shadow removal independent of specific scenario constraints possible. Extensive experiments demonstrate that S2-ShadowNet outperforms state-of-the-art methods in both qualitative and quantitative comparisons.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"813-824"},"PeriodicalIF":9.7,"publicationDate":"2025-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145929529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-06 | DOI: 10.1109/TMM.2025.3618533
Ziyi Liu;Caiyun Xie;Wenbing Ding;Dengpan Ye;Long Tang;Qian Wang
Adversarial attacks have become a critical focus in visual object tracking (VOT) research. Small, carefully crafted adversarial perturbations to video frames can easily disrupt a visual object tracker, leading to tracking failure. Therefore, studying adversarial attacks contributes to the development of more robust and reliable trackers. Since the internals of a tracker are unknown in real-world scenarios, research on decision-based black-box attacks is both practical and directly applicable. However, existing decision-based black-box attacks neither comprehensively analyze the unique characteristics of object tracking nor sufficiently consider the imperceptibility of adversarial perturbations. In this paper, we propose invisible local attack (ILA), a novel decision-based adversarial attack designed specifically for VOT with imperceptible perturbations. We assume that a significant number of pixels in a frame that are irrelevant to the tracked object do not substantially contribute to the functioning mechanism of a deep tracker. Based on this consideration, we propose a search algorithm to identify the set of pixels the tracker focuses on during object tracking. The adversarial noise is then confined to these pixels and iteratively optimized through ILA's heuristic algorithm. By perturbing only the key pixels, ILA significantly enhances both attack performance and imperceptibility when applied to visual object trackers. Extensive experiments demonstrate that our ILA method achieves a 121% increase in the robustness metric and a 137% improvement in the structural similarity index measure (SSIM) across multiple datasets and trackers compared with the state-of-the-art (SOTA) method.
{"title":"Towards Invisible Decision-Based Adversarial Attacks Against Visual Object Tracking","authors":"Ziyi Liu;Caiyun Xie;Wenbing Ding;Dengpan Ye;Long Tang;Qian Wang","doi":"10.1109/TMM.2025.3618533","DOIUrl":"https://doi.org/10.1109/TMM.2025.3618533","url":null,"abstract":"Adversarial attacks have become a critical focus in visual object tracking (VOT) research. Small, carefully crafted adversarial perturbations to video frames can easily disrupt the visual object tracker, leading to tracking failure. Therefore, studying adversarial attacks contributes to the development of more robust and reliable trackers. Considering that trackers are agnostic in real-world scenarios, research on decision-based black-box attacks is straightforward and practical. However, existing decision-based black-box attacks neither comprehensively analyze the unique characteristics of object tracking nor sufficiently consider the imperceptibility of adversarial perturbations. In this paper, we propose invisible local attack (ILA), a novel decision-based adversarial attack specifically for VOT with imperceptible perturbations. We assume that a significant number of pixels in a frame, irrelevant to the tracked object, do not substantially contribute to the functioning mechanism of a deep tracker. Based on this consideration, we propose a search algorithm to identify the pixel set focused on by the tracker during object tracking. The adversarial noise is then confined to these pixels and iteratively optimized through a heuristic algorithm of ILA. By perturbing only the key pixels, ILA significantly enhances both the attack performance and imperceptibility when it is applied to visual object trackers. Extensive experiments demonstrate that our ILA method achieves a 121% increase in the robustness metric and a 137% improvement in the structural similarity index measure (SSIM) across multiple datasets for various trackers compared with the state-of-the-art (SOTA) method.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"9861-9872"},"PeriodicalIF":9.7,"publicationDate":"2025-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145886608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-06 | DOI: 10.1109/TMM.2025.3618573
Yanpeng Jia;Fengkui Cao;Ting Wang;Yandong Tang;Shiliang Shao;Lianqing Liu
Most LiDAR odometry and SLAM systems construct maps as point clouds, which are discrete and sparse when zoomed in, making them not directly suitable for navigation. Mesh maps provide a dense and continuous map format with low memory consumption and can approximate complex structures with simple elements, attracting significant attention from researchers in recent years. However, most existing methods operate under a static-environment assumption; in effect, moving objects cause ghosting and degrade the quality of meshing. To address these issues, we propose a plug-and-play meshing module that adapts to dynamic environments and can be easily integrated with various LiDAR odometry systems to improve their pose estimation accuracy. In our meshing module, a novel two-stage coarse-to-fine dynamic removal method is designed to effectively filter dynamic objects, generating consistent, accurate, and dense mesh maps. To the best of our knowledge, this is the first mesh construction method with explicit dynamic removal. Additionally, sliding window-based keyframe aggregation and adaptive downsampling strategies are used to ensure the uniformity of the point cloud, benefiting the Gaussian process used in mesh construction. We evaluate the localization and mapping accuracy on six publicly available datasets. Extensive experiments demonstrate the superiority of our method compared with state-of-the-art algorithms. The code and introduction video are publicly available at https://yaepiii.github.io/CAD-Mesher/.
{"title":"CAD-Mesher: A Convenient, Accurate, Dense Mesh-Based Mapping Module in SLAM for Dynamic Environments","authors":"Yanpeng Jia;Fengkui Cao;Ting Wang;Yandong Tang;Shiliang Shao;Lianqing Liu","doi":"10.1109/TMM.2025.3618573","DOIUrl":"https://doi.org/10.1109/TMM.2025.3618573","url":null,"abstract":"Most LiDAR odometry and SLAM systems construct maps in point clouds, which are discrete and sparse when zoomed in, making them not directly suitable for navigation. Mesh maps represent a dense and continuous map format with low memory consumption, which can approximate complex structures with simple elements, attracting significant attention of researchers in recent years. However, most existing methods operate under a static environment assumption. In effect, moving objects cause ghosting, degrading the quality of meshing. To address these issues, we propose a plug-and-play meshing module adapting to dynamic environments, which can easily integrate with various LiDAR odometry to generally improve the pose estimation accuracy of odometry. In our meshing module, a novel two-stage coarse-to-fine dynamic removal method is designed to effectively filter dynamic objects, generating consistent, accurate, and dense mesh maps. To the best of our knowledge, this is the first mesh construction method with explicit dynamic removal. Additionally, sliding window-based keyframe aggregation and adaptive downsampling strategies are used to ensure the uniformity of point cloud, benefiting for Gaussian process in mesh construction. We evaluate the localization and mapping accuracy on six publicly available datasets. Extensive experiments demonstrate the superiority of our method compared with the state-of-the-art algorithms. The code and introduction video are publicly available at <uri>https://yaepiii.github.io/CAD-Mesher/</uri>.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"1025-1036"},"PeriodicalIF":9.7,"publicationDate":"2025-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146175626","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-22 | DOI: 10.1109/TMM.2025.3613171
Haojun Dai;Dawen Xu;Lin Yang;Rangding Wang
With high embedding capacity and security, transform coefficient-based video steganography has become an important branch of video steganography. However, existing steganalysis methods against transform coefficient-based steganography give insufficient consideration to the prediction process of HEVC compression, which makes steganalysis less direct and fails to effectively detect adaptive steganography methods in low embedding rate scenarios. In this paper, an HEVC video steganalysis method based on centralized error and an attention mechanism against transform coefficient-based steganography is proposed. Firstly, the centralized error phenomenon introduced by distortion compensation-based steganography is analyzed, and prediction error maps are constructed for steganalysis to achieve a higher SNR (signal-to-noise ratio). Secondly, a video steganalysis network called CESNet (Centralized Error Steganalysis Network) is proposed. The network takes the prediction error maps as input, and four types of convolutional modules are designed to suit different stages of feature extraction. To address the intra-frame sparsity of adaptive steganography, CEA (Centralized Error Attention) modules based on spatial and channel attention mechanisms are proposed to adaptively enhance the steganographic region. Finally, after extracting the feature vectors of each frame, the detection of steganographic video is completed using a self-attention mechanism. Experimental results show that, compared with existing transform coefficient-based video steganalysis methods, the proposed method can effectively detect multiple transform coefficient-based steganography algorithms and achieves higher detection performance in low payload scenarios.
{"title":"HEVC Video Steganalysis Based on Centralized Error and Attention Mechanism","authors":"Haojun Dai;Dawen Xu;Lin Yang;Rangding Wang","doi":"10.1109/TMM.2025.3613171","DOIUrl":"https://doi.org/10.1109/TMM.2025.3613171","url":null,"abstract":"With high embedding capacity and security, transform coefficient-based video steganography has become an important branch of video steganography. However, existing steganalysis methods against transform coefficient-based steganography provide insufficient consideration to the prediction process of HEVC compression, which results in steganalysis that is not straightforward and fail to effectively detect adaptive steganography methods in low embedding rate scenarios. In this paper, an HEVC video steganalysis method based on centralized error and attention mechanism against transform coefficient-based steganography is proposed. Firstly, the centralized error phenomenon brought by distortion compensation-based steganography is analyzed, and prediction error maps is constructed for steganalysis to achieve higher SNR(signal-to-noise ratio). Secondly, a video steganalysis network called CESNet (Centralized Error Steganalysis Network) is proposed. The network takes the prediction error maps as input and four types of convolutional modules are designed to adapt to different stages of feature extraction. To address the intra-frame sparsity of adaptive steganography, CEA (Centralized Error Attention) modules based on spatial and channel attention mechanisms are proposed to adaptively enhance the steganographic region. Finally, after extracting the feature vectors of each frame, the detection of steganographic video is completed using the self-attention mechanism. Experimental results show that compared with the existing transform coefficient-based video steganalysis methods, the proposed method can effectively detect multiple transform coefficient-based steganography algorithms and achieve higher detection performance in low payload scenarios.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"8914-8925"},"PeriodicalIF":9.7,"publicationDate":"2025-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145510152","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-09 | DOI: 10.1109/TMM.2025.3607726
Han Jiang;Xiaoshan Yang;Chaofan Chen;Changsheng Xu
Compositional zero-shot learning (CZSL) aims to identify novel compositions formed by known primitives (attributes and objects). Motivated by recent advancements in pre-trained vision-language models such as CLIP, many methods attempt to fine-tune CLIP for CZSL and achieve remarkable performance. However, existing CLIP-based CZSL methods focus mainly on text prompt tuning, which lacks the flexibility to dynamically adapt both modalities. An intuitive solution is to additionally introduce visual prompt tuning, but this is not trivial to realize because effectively learning prompts for CZSL involves the challenges of entanglement between visual primitives and appearance shifts across different compositions. In this paper, we propose a novel Synergetic Prompts as Disentanglement Queries (SPDQ) framework for CZSL. It can disentangle primitive features based on synergetic prompts to jointly alleviate these challenges. Specifically, we first design a low-rank primitive modulator to produce synergetic adaptive attribute and object prompts based on prior knowledge of each instance for model adaptation. Then, we additionally utilize text prefix prompts to construct synergetic prompt queries, which are used to resample corresponding visual features from local visual patches. Comprehensive experiments conducted on three benchmarks demonstrate that our SPDQ approach achieves state-of-the-art results.
{"title":"SPDQ: Synergetic Prompts as Disentanglement Queries for Compositional Zero-Shot Learning","authors":"Han Jiang;Xiaoshan Yang;Chaofan Chen;Changsheng Xu","doi":"10.1109/TMM.2025.3607726","DOIUrl":"https://doi.org/10.1109/TMM.2025.3607726","url":null,"abstract":"Compositional zero-shot learning (CZSL) aims to identify novel compositions formed by known primitives (attributes and objects). Motivated by recent advancements in pre-trained vision-language models such as CLIP, many methods attempt to fine-tune CLIP for CZSL and achieve remarkable performance. However, the existing CLIP-based CZSL methods focus mainly on text prompt tuning, which lacks the flexibility to dynamically adapt both modalities. To solve this issue, an intuitive solution is to additionally introduce visual prompt tuning. This insight is not trivial to achieve because effectively learning prompts for CZSL involves the challenge of entanglement between visual primitives as well as appearance shifts in different compositions. In this paper, we propose a novel Synergetic Prompts as Disentanglement Queries (SPDQ) framework for CZSL. It can disentangle primitive features based on synergetic prompts to jointly alleviate these challenges. Specifically, we first design a low-rank primitive modulator to produce synergetic adaptive attribute and object prompts based on prior knowledge of each instance for model adaptation. Then, we additionally utilize text prefix prompts to construct synergetic prompt queries, which are used to resample corresponding visual features from local visual patches. Comprehensive experiments conducted on three benchmarks demonstrate that our SPDQ approach achieves state-of-the-art results.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"8888-8899"},"PeriodicalIF":9.7,"publicationDate":"2025-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145510101","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-09 | DOI: 10.1109/TMM.2025.3607706
Xin Ni;Jie Nie;Niantai Jing;Jianliang Xu;Xiaodong Wang;Xuesong Gao;MingXing Jiang;Chi-Hung Chi;Zhiqiang Wei
Effectively representing and transferring user preferences across various domains presents a significant challenge in cross-domain recommendation (CDR). Some approaches utilize graph neural networks that use interaction behavior to establish relationships between entities, providing a comprehensive understanding of user interests. However, the impact on user preferences of consistent semantics across various types, fields, and perspectives of social media information, i.e., the multi-dimensional consistency of user preferences, is overlooked. This oversight results in graph node representations that inadequately reflect user preferences. To address these limitations, we propose a multi-layer transfer learning network (MTLG) for CDR based on graph node representation enhancement via multi-dimensional consistent user preferences. Firstly, the model introduces a set of globally shared semantic units to perform semantic alignment of multiple media sources at different granularities without clear alignment boundaries, thereby modeling multi-dimensional consistent user preference features. These features are then seamlessly integrated with the initial high-order graph structure embedding features, significantly improving the quality of graph node representation. Secondly, the model designs a multi-layer transfer learning network that hierarchically aligns the domain distribution differences. It calculates the similarity between domains to derive layer weights for more precise transfer learning, thereby mitigating the accumulation of errors caused by inaccurate feature aggregation. We conducted extensive experiments on three scenarios comprising 7,954,943 ratings from the Amazon dataset. The results indicate that MTLG's recommendation accuracy surpasses that of state-of-the-art methods.
{"title":"Multi-Layer Transfer Learning for Cross-Domain Recommendation Based on Graph Node Representation Enhancement","authors":"Xin Ni;Jie Nie;Niantai Jing;Jianliang Xu;Xiaodong Wang;Xuesong Gao;MingXing Jiang;Chi-Hung Chi;Zhiqiang Wei","doi":"10.1109/TMM.2025.3607706","DOIUrl":"https://doi.org/10.1109/TMM.2025.3607706","url":null,"abstract":"Effectively representing and transferring user preferences across various domains presents a significant challenge in cross-domain recommendation (CDR). Some approaches utilize graph neural networks that use interaction behavior to establish relationships between entities, providing a comprehensive understanding of user interests. However, the impact of consistent semantics across various types, fields, and perspectives of social media information on user preferences is overlooked, i.e. the multidimensional consistency of user preferences. This oversight results in graph node representations that inadequately reflect user preferences. To address these limitations, we propose a multi-layer transfer learning network (MTLG) for CDR based on graph node representation enhancement via multi-dimensional consistent user preferences. Firstly, the model introduces a set of globally shared semantic units to perform different-grained semantic alignment of multiple media information without clear alignment boundaries, thereby modeling multi-dimensional consistent user preference features. These features are then seamlessly integrated with the initial high-order graph structure embedding features, thus significantly improving the quality of graph node representation. Secondly, the model innovatively designs a multi-layer transfer learning network that hierarchically aligns the domain distribution differences. It calculates the similarity between domains to derive layer weights for more precise transfer learning, thereby mitigating the possibility of information error accumulation resulting from inaccurate feature aggregation processes. We conducted numerous experiments on 3 scenarios, including 7,954,943 rating information from the Amazon dataset. The results indicate that MTLG’s recommendation accuracy surpasses those of state-of-the-art methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"8940-8953"},"PeriodicalIF":9.7,"publicationDate":"2025-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145510159","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-08 | DOI: 10.1109/TMM.2025.3604977
Yuyu Jia;Qing Zhou;Junyu Gao;Qiang Li;Qi Wang
Few-shot learning aims to generalize the recognizer from seen categories to an entirely novel scenario. With only a few support samples, several advanced methods initially introduce class names as prior knowledge for identifying novel classes. However, obstacles still impede achieving a comprehensive understanding of how to harness the mutual advantages of visual and textual knowledge. In this paper, we set out to fill this gap via a coherent Bidirectional Knowledge Permeation strategy called BiKop, which is grounded in human intuition: a class name description offers a more general representation, whereas an image captures the specificity of individuals. BiKop primarily establishes a hierarchical joint general-specific representation through bidirectional knowledge permeation. On the other hand, considering the bias of joint representation towards the base set, we disentangle base-class-relevant semantics during training, thereby alleviating the suppression of potential novel-class-relevant information. Experiments on four challenging benchmarks demonstrate the remarkable superiority of BiKop, particularly outperforming previous methods by a substantial margin in the 1-shot setting (improving the accuracy by 7.58% on miniImageNet).
{"title":"Like Humans to Few-Shot Learning Through Knowledge Permeation of Visual and Language","authors":"Yuyu Jia;Qing Zhou;Junyu Gao;Qiang Li;Qi Wang","doi":"10.1109/TMM.2025.3604977","DOIUrl":"https://doi.org/10.1109/TMM.2025.3604977","url":null,"abstract":"Few-shot learning aims to generalize the recognizer from seen categories to an entirely novel scenario. With only a few support samples, several advanced methods initially introduce class names as prior knowledge for identifying novel classes. However, obstacles still impede achieving a comprehensive understanding of how to harness the mutual advantages of visual and textual knowledge. In this paper, we set out to fill this gap via a coherent Bidirectional Knowledge Permeation strategy called BiKop, which is grounded in human intuition: a class name description offers a more <italic>general</i> representation, whereas an image captures the <italic>specificity</i> of individuals. BiKop primarily establishes a hierarchical joint general-specific representation through bidirectional knowledge permeation. On the other hand, considering the bias of joint representation towards the base set, we disentangle base-class-relevant semantics during training, thereby alleviating the suppression of potential novel-class-relevant information. Experiments on four challenging benchmarks demonstrate the remarkable superiority of BiKop, particularly outperforming previous methods by a substantial margin in the 1-shot setting (improving the accuracy by 7.58% on <italic>mini</i>ImageNet).","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"7905-7916"},"PeriodicalIF":9.7,"publicationDate":"2025-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145351925","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-02 | DOI: 10.1109/TMM.2025.3604903
Hongqi Yu;Sixian Chan;Xiaolong Zhou;Xiaoqin Zhang
Effective and robust 3D panoptic segmentation is crucial for scene perception in autonomous driving. Modern methods widely adopt simple feature concatenation for multi-modal fusion to enhance 3D scene understanding, so the resulting multi-modal representations typically lack comprehensive semantic and geometric information. These methods, which produce panoptic predictions in a single step, also lack the capability to progressively refine predictions under varying noise levels, which is essential for enhancing model robustness. To address these limitations, we first utilize the BEV space to unify the semantic-geometric perceptual representation, allowing a more effective integration of LiDAR and camera data. Then, we propose PrimePSegter, a progressively combined diffusion 3D panoptic segmentation model that is conditioned on BEV maps and iteratively refines predictions by denoising samples generated from a Gaussian distribution. PrimePSegter adopts a conditional encoder-decoder architecture for fine-grained panoptic prediction. Specifically, a multi-modal conditional encoder is equipped with a BEV fusion network to integrate semantic and geometric information from the LiDAR and camera streams into a unified BEV space. Additionally, a diffusion transformer decoder operates on multi-modal BEV features with varying noise levels to guide the training of the diffusion model, progressively refining the BEV panoptic representations enriched with semantics and geometry. PrimePSegter achieves state-of-the-art performance on nuScenes and competitive results on SemanticKITTI. Moreover, PrimePSegter demonstrates superior robustness in various scenarios, outperforming leading methods.
{"title":"PrimePSegter: Progressively Combined Diffusion for 3D Panoptic Segmentation With Multi-Modal BEV Refinement","authors":"Hongqi Yu;Sixian Chan;Xiaolong Zhou;Xiaoqin Zhang","doi":"10.1109/TMM.2025.3604903","DOIUrl":"https://doi.org/10.1109/TMM.2025.3604903","url":null,"abstract":"Effective and robust 3D panoptic segmentation is crucial for scene perception in autonomous driving. Modern methods widely adopt multi-modal fusion based simple feature concatenation to enhance 3D scene understanding, resulting in generated multi-modal representations typically lack comprehensive semantic and geometry information. These methods focused on panoptic prediction in a single step also limit the capability to progressively refine panoptic predictions under varying noise levels, which is essential for enhancing model robustness. To address these limitations, we first utilize BEV space to unify semantic-geometry perceptual representation, allowing for a more effective integration of LiDAR and camera data. Then, we propose PrimePSegter, a progressively combined diffusion 3D panoptic segmentation model that is conditioned on BEV maps to iteratively refine predictions by denoising samples generated from Gaussian distribution. PrimePSegter adopts a conditional encoder-decoder architecture for fine-grained panoptic predictions. Specifically, a multi-modal conditional encoder is equipped with BEV fusion network to integrate semantic and geometric information from LiDAR and camera streams into unified BEV space. Additionally, a diffusion transformer decoder operates on multi-modal BEV features with varying noise levels to guide the training of diffusion model, refining the BEV panoptic representations enriched with semantics and geometry in a progressive way. PrimePSegter achieves state-of-the-art performance on the nuScenes and competitive results on the SemanticKITTI, respectively. Moreover, PrimePSegter demonstrates superior robustness towards various scenarios, outperforming leading methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"7891-7904"},"PeriodicalIF":9.7,"publicationDate":"2025-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145351954","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}