Pub Date: 2025-10-06 | DOI: 10.1109/TMM.2025.3618542
Peirong Ma;Wu Ran;Zhiquan He;Jian Pu;Hong Lu
Open-vocabulary multi-label classification (OV-MLC) aims to leverage the rich multi-modal knowledge from vision-language pre-training (VLP) models to further improve the recognition of unseen (novel) classes beyond the training set in multi-label scenarios. Existing OV-MLC methods only perform predictions on regions at a single hierarchical level and aggregate the prediction scores of these regions through simple top-k mean pooling. This fails to unleash the potential of rich hierarchical region clues in multi-label images and does not fully exploit the discriminative information from all regions in the image, resulting in sub-optimal performance. In this work, we propose a novel OV-MLC framework to fully harness the power of multiple hierarchical region clues. Specifically, we first design a hierarchical clue gathering (HCG) module to gather clues at different hierarchical levels, enabling more precise recognition of multiple object categories of different sizes in a multi-label image. Then, by viewing multi-label classification as single-label classification of each region within the image, we present a novel hierarchical score aggregation (HSA) approach, thereby better utilizing the predictions of each image region for each class. We also employ a well-designed region selection strategy (RSS) to eliminate noisy or background regions in an image that are irrelevant to classification, achieving higher multi-label classification accuracy. In addition, we propose a hybrid prompt learning (HPL) strategy to enhance visual-semantic consistency while preserving the generalization capability of label embeddings for unseen classes. Extensive experiments on public benchmark datasets demonstrate that our method significantly outperforms the current state-of-the-art.
{"title":"Unleashing the Potential of Hierarchical Region Clues for Open-Vocabulary Multi-Label Classification","authors":"Peirong Ma;Wu Ran;Zhiquan He;Jian Pu;Hong Lu","doi":"10.1109/TMM.2025.3618542","DOIUrl":"https://doi.org/10.1109/TMM.2025.3618542","url":null,"abstract":"Open-vocabulary multi-label classification (OV- MLC) aims to leverage the rich multi-modal knowledge from Vision-language pre-training (VLP) models to further improve the recognition ability for unseen (novel) classes beyond the training set in multi-label scenarios. Existing OV-MLC methods only perform predictions on single hierarchical regions, and aggregate the prediction scores of these regions through simple <italic>top-k</i> mean pooling. This fails to unleash the potential of rich hierarchical region clues in multi-label images and does not fully exploit the discriminative information from all regions in the image, resulting in sub-optimal performance. In this work, we propose a novel OV-MLC framework to fully harness the power of multiple hierarchical region clues. Specifically, we first design a hierarchical clue gathering (HCG) module to gather different hierarchical clues, enabling more precise recognition of multiple object categories with different sizes in a multi-label image. Then, by viewing multi-label classification as single-label classification of each region within the image, we present a novel hierarchical score aggregation (HSA) approach, thereby better utilizing the predictions of each image region for each class. We also utilize a well-designed region selection strategy (RSS) to eliminate noise or background regions in an image that are irrelevant to classification, achieving higher multi-label classification accuracy. In addition, we propose a hybrid prompt learning (HPL) strategy to enhance visual-semantic consistency while preserving the generalization capability of label embeddings for unseen classes. Extensive experiments on public benchmark datasets demonstrate that our method significantly outperforms the current state-of-the-art.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"9832-9846"},"PeriodicalIF":9.7,"publicationDate":"2025-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145885034","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-06 | DOI: 10.1109/TMM.2025.3618537
Kaichen Chi;Wei Jing;Junjie Li;Qiang Li;Qi Wang
Shadows are dark areas that typically exhibit low illumination intensity. Admittedly, the infrared image can provide robust illumination cues that the visible image lacks, but existing methods ignore the collaboration between heterogeneous modalities. To fill this gap, we propose a weakly supervised shadow removal network with a spherical feature space, dubbed S2-ShadowNet, to explore the best of both worlds for visible and infrared modalities. Specifically, we employ a modal translation (visible-to-infrared) model to learn the cross-domain mapping, thus generating realistic infrared samples. Then, a Swin Transformer is utilized to extract strongly representational visible/infrared features. Simultaneously, the extracted features are mapped onto a smooth spherical manifold, which alleviates the domain shift through regularization. A well-designed similarity loss and an orthogonality loss are imposed in the spherical space, promoting the separation of private visible/infrared features and the alignment of shared visible/infrared features through constraints on both representation content and orientation. This design encourages implicit reciprocity between modalities, providing a novel perspective on shadow removal. Notably, since ground truth is not available in practice, S2-ShadowNet is trained by cropping shadow and shadow-free patches from the shadow image itself, avoiding the strict acquisition of paired data. More importantly, we contribute a large-scale weakly supervised shadow removal benchmark that makes shadow removal independent of specific scene constraints. Extensive experiments demonstrate that S2-ShadowNet outperforms state-of-the-art methods in both qualitative and quantitative comparisons.
{"title":"Cross-Modal Spherical Aggregation for Weakly Supervised Remote Sensing Shadow Removal","authors":"Kaichen Chi;Wei Jing;Junjie Li;Qiang Li;Qi Wang","doi":"10.1109/TMM.2025.3618537","DOIUrl":"https://doi.org/10.1109/TMM.2025.3618537","url":null,"abstract":"Shadows are dark areas, typically rendering low illumination intensity. Admittedly, the infrared image can provide robust illumination cues that the visible image lacks, but existing methods ignore the collaboration between heterogeneous modalities. To fill this gap, we propose a weakly supervised shadow removal network with a spherical feature space, dubbed S2-ShadowNet, to explore the best of both worlds for visible and infrared modalities. Specifically, we employ a modal translation (visible-to-infrared) model to learn the cross-domain mapping, thus generating realistic infrared samples. Then, Swin Transformer is utilized to extract strong representational visible/infrared features. Simultaneously, the extracted features are mapped to the smooth spherical manifold, which alleviates the domain shift through regularization. Well-designed similarity loss and orthogonality loss are embedded into the spherical space, prompting the separation of private visible/infrared features and the alignment of shared visible/infrared features through constraints on both representation content and orientation. Such a manner encourages implicit reciprocity between modalities, thus providing a novel insight into shadow removal. Notably, ground truth is not available in practice, thus S2-ShadowNet is trained by cropping shadow and shadow-free patches from the shadow image itself, avoiding stereotypical and strict pair data acquisition. More importantly, we contribute a large-scale weakly supervised shadow removal benchmark that makes shadow removal independent of specific scenario constraints possible. Extensive experiments demonstrate that S2-ShadowNet outperforms state-of-the-art methods in both qualitative and quantitative comparisons.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"813-824"},"PeriodicalIF":9.7,"publicationDate":"2025-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145929529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-06 | DOI: 10.1109/TMM.2025.3618533
Ziyi Liu;Caiyun Xie;Wenbing Ding;Dengpan Ye;Long Tang;Qian Wang
Adversarial attacks have become a critical focus in visual object tracking (VOT) research. Small, carefully crafted adversarial perturbations to video frames can easily disrupt a visual object tracker, leading to tracking failure. Therefore, studying adversarial attacks contributes to the development of more robust and reliable trackers. Since the tracker's internals are unknown in real-world scenarios, research on decision-based black-box attacks is both straightforward and practical. However, existing decision-based black-box attacks neither comprehensively analyze the unique characteristics of object tracking nor sufficiently consider the imperceptibility of adversarial perturbations. In this paper, we propose the invisible local attack (ILA), a novel decision-based adversarial attack designed specifically for VOT with imperceptible perturbations. We assume that a significant number of pixels in a frame, irrelevant to the tracked object, do not substantially contribute to the functioning of a deep tracker. Based on this consideration, we propose a search algorithm to identify the set of pixels that the tracker focuses on during object tracking. The adversarial noise is then confined to these pixels and iteratively optimized through ILA's heuristic algorithm. By perturbing only the key pixels, ILA significantly enhances both attack performance and imperceptibility when applied to visual object trackers. Extensive experiments demonstrate that our ILA method achieves a 121% increase in the robustness metric and a 137% improvement in the structural similarity index measure (SSIM) across multiple datasets for various trackers compared with the state-of-the-art (SOTA) method.
{"title":"Towards Invisible Decision-Based Adversarial Attacks Against Visual Object Tracking","authors":"Ziyi Liu;Caiyun Xie;Wenbing Ding;Dengpan Ye;Long Tang;Qian Wang","doi":"10.1109/TMM.2025.3618533","DOIUrl":"https://doi.org/10.1109/TMM.2025.3618533","url":null,"abstract":"Adversarial attacks have become a critical focus in visual object tracking (VOT) research. Small, carefully crafted adversarial perturbations to video frames can easily disrupt the visual object tracker, leading to tracking failure. Therefore, studying adversarial attacks contributes to the development of more robust and reliable trackers. Considering that trackers are agnostic in real-world scenarios, research on decision-based black-box attacks is straightforward and practical. However, existing decision-based black-box attacks neither comprehensively analyze the unique characteristics of object tracking nor sufficiently consider the imperceptibility of adversarial perturbations. In this paper, we propose invisible local attack (ILA), a novel decision-based adversarial attack specifically for VOT with imperceptible perturbations. We assume that a significant number of pixels in a frame, irrelevant to the tracked object, do not substantially contribute to the functioning mechanism of a deep tracker. Based on this consideration, we propose a search algorithm to identify the pixel set focused on by the tracker during object tracking. The adversarial noise is then confined to these pixels and iteratively optimized through a heuristic algorithm of ILA. By perturbing only the key pixels, ILA significantly enhances both the attack performance and imperceptibility when it is applied to visual object trackers. Extensive experiments demonstrate that our ILA method achieves a 121% increase in the robustness metric and a 137% improvement in the structural similarity index measure (SSIM) across multiple datasets for various trackers compared with the state-of-the-art (SOTA) method.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"9861-9872"},"PeriodicalIF":9.7,"publicationDate":"2025-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145886608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-22 | DOI: 10.1109/TMM.2025.3613171
Haojun Dai;Dawen Xu;Lin Yang;Rangding Wang
With high embedding capacity and security, transform coefficient-based video steganography has become an important branch of video steganography. However, existing steganalysis methods against transform coefficient-based steganography give insufficient consideration to the prediction process of HEVC compression, which makes the steganalysis indirect and fails to effectively detect adaptive steganography methods in low embedding rate scenarios. In this paper, an HEVC video steganalysis method based on centralized error and an attention mechanism against transform coefficient-based steganography is proposed. Firstly, the centralized error phenomenon introduced by distortion compensation-based steganography is analyzed, and prediction error maps are constructed for steganalysis to achieve a higher signal-to-noise ratio (SNR). Secondly, a video steganalysis network called CESNet (Centralized Error Steganalysis Network) is proposed. The network takes the prediction error maps as input, and four types of convolutional modules are designed to adapt to different stages of feature extraction. To address the intra-frame sparsity of adaptive steganography, CEA (Centralized Error Attention) modules based on spatial and channel attention mechanisms are proposed to adaptively enhance the steganographic region. Finally, after extracting the feature vectors of each frame, the detection of steganographic video is completed using a self-attention mechanism. Experimental results show that, compared with existing transform coefficient-based video steganalysis methods, the proposed method can effectively detect multiple transform coefficient-based steganography algorithms and achieves higher detection performance in low payload scenarios.
{"title":"HEVC Video Steganalysis Based on Centralized Error and Attention Mechanism","authors":"Haojun Dai;Dawen Xu;Lin Yang;Rangding Wang","doi":"10.1109/TMM.2025.3613171","DOIUrl":"https://doi.org/10.1109/TMM.2025.3613171","url":null,"abstract":"With high embedding capacity and security, transform coefficient-based video steganography has become an important branch of video steganography. However, existing steganalysis methods against transform coefficient-based steganography provide insufficient consideration to the prediction process of HEVC compression, which results in steganalysis that is not straightforward and fail to effectively detect adaptive steganography methods in low embedding rate scenarios. In this paper, an HEVC video steganalysis method based on centralized error and attention mechanism against transform coefficient-based steganography is proposed. Firstly, the centralized error phenomenon brought by distortion compensation-based steganography is analyzed, and prediction error maps is constructed for steganalysis to achieve higher SNR(signal-to-noise ratio). Secondly, a video steganalysis network called CESNet (Centralized Error Steganalysis Network) is proposed. The network takes the prediction error maps as input and four types of convolutional modules are designed to adapt to different stages of feature extraction. To address the intra-frame sparsity of adaptive steganography, CEA (Centralized Error Attention) modules based on spatial and channel attention mechanisms are proposed to adaptively enhance the steganographic region. Finally, after extracting the feature vectors of each frame, the detection of steganographic video is completed using the self-attention mechanism. Experimental results show that compared with the existing transform coefficient-based video steganalysis methods, the proposed method can effectively detect multiple transform coefficient-based steganography algorithms and achieve higher detection performance in low payload scenarios.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"8914-8925"},"PeriodicalIF":9.7,"publicationDate":"2025-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145510152","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-09 | DOI: 10.1109/TMM.2025.3607726
Han Jiang;Xiaoshan Yang;Chaofan Chen;Changsheng Xu
Compositional zero-shot learning (CZSL) aims to identify novel compositions formed by known primitives (attributes and objects). Motivated by recent advancements in pre-trained vision-language models such as CLIP, many methods attempt to fine-tune CLIP for CZSL and achieve remarkable performance. However, the existing CLIP-based CZSL methods focus mainly on text prompt tuning, which lacks the flexibility to dynamically adapt both modalities. To solve this issue, an intuitive solution is to additionally introduce visual prompt tuning. This is not trivial to achieve, because effectively learning prompts for CZSL involves the challenge of entanglement between visual primitives as well as appearance shifts across different compositions. In this paper, we propose a novel Synergetic Prompts as Disentanglement Queries (SPDQ) framework for CZSL. It disentangles primitive features based on synergetic prompts to jointly alleviate these challenges. Specifically, we first design a low-rank primitive modulator to produce synergetic adaptive attribute and object prompts based on prior knowledge of each instance for model adaptation. Then, we additionally utilize text prefix prompts to construct synergetic prompt queries, which are used to resample corresponding visual features from local visual patches. Comprehensive experiments conducted on three benchmarks demonstrate that our SPDQ approach achieves state-of-the-art results.
{"title":"SPDQ: Synergetic Prompts as Disentanglement Queries for Compositional Zero-Shot Learning","authors":"Han Jiang;Xiaoshan Yang;Chaofan Chen;Changsheng Xu","doi":"10.1109/TMM.2025.3607726","DOIUrl":"https://doi.org/10.1109/TMM.2025.3607726","url":null,"abstract":"Compositional zero-shot learning (CZSL) aims to identify novel compositions formed by known primitives (attributes and objects). Motivated by recent advancements in pre-trained vision-language models such as CLIP, many methods attempt to fine-tune CLIP for CZSL and achieve remarkable performance. However, the existing CLIP-based CZSL methods focus mainly on text prompt tuning, which lacks the flexibility to dynamically adapt both modalities. To solve this issue, an intuitive solution is to additionally introduce visual prompt tuning. This insight is not trivial to achieve because effectively learning prompts for CZSL involves the challenge of entanglement between visual primitives as well as appearance shifts in different compositions. In this paper, we propose a novel Synergetic Prompts as Disentanglement Queries (SPDQ) framework for CZSL. It can disentangle primitive features based on synergetic prompts to jointly alleviate these challenges. Specifically, we first design a low-rank primitive modulator to produce synergetic adaptive attribute and object prompts based on prior knowledge of each instance for model adaptation. Then, we additionally utilize text prefix prompts to construct synergetic prompt queries, which are used to resample corresponding visual features from local visual patches. Comprehensive experiments conducted on three benchmarks demonstrate that our SPDQ approach achieves state-of-the-art results.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"8888-8899"},"PeriodicalIF":9.7,"publicationDate":"2025-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145510101","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-09 | DOI: 10.1109/TMM.2025.3607706
Xin Ni;Jie Nie;Niantai Jing;Jianliang Xu;Xiaodong Wang;Xuesong Gao;MingXing Jiang;Chi-Hung Chi;Zhiqiang Wei
Effectively representing and transferring user preferences across various domains presents a significant challenge in cross-domain recommendation (CDR). Some approaches utilize graph neural networks that use interaction behavior to establish relationships between entities, providing a comprehensive understanding of user interests. However, the impact of consistent semantics across various types, fields, and perspectives of social media information on user preferences, i.e., the multi-dimensional consistency of user preferences, is overlooked. This oversight results in graph node representations that inadequately reflect user preferences. To address these limitations, we propose a multi-layer transfer learning network (MTLG) for CDR based on graph node representation enhancement via multi-dimensional consistent user preferences. Firstly, the model introduces a set of globally shared semantic units to perform semantic alignment of multiple media information at different granularities without clear alignment boundaries, thereby modeling multi-dimensional consistent user preference features. These features are then seamlessly integrated with the initial high-order graph structure embedding features, significantly improving the quality of the graph node representations. Secondly, the model designs a multi-layer transfer learning network that hierarchically aligns the domain distribution differences. It calculates the similarity between domains to derive layer weights for more precise transfer learning, thereby mitigating the accumulation of errors caused by inaccurate feature aggregation. We conducted extensive experiments on 3 scenarios, including 7,954,943 ratings from the Amazon dataset. The results indicate that MTLG's recommendation accuracy surpasses that of state-of-the-art methods.
{"title":"Multi-Layer Transfer Learning for Cross-Domain Recommendation Based on Graph Node Representation Enhancement","authors":"Xin Ni;Jie Nie;Niantai Jing;Jianliang Xu;Xiaodong Wang;Xuesong Gao;MingXing Jiang;Chi-Hung Chi;Zhiqiang Wei","doi":"10.1109/TMM.2025.3607706","DOIUrl":"https://doi.org/10.1109/TMM.2025.3607706","url":null,"abstract":"Effectively representing and transferring user preferences across various domains presents a significant challenge in cross-domain recommendation (CDR). Some approaches utilize graph neural networks that use interaction behavior to establish relationships between entities, providing a comprehensive understanding of user interests. However, the impact of consistent semantics across various types, fields, and perspectives of social media information on user preferences is overlooked, i.e. the multidimensional consistency of user preferences. This oversight results in graph node representations that inadequately reflect user preferences. To address these limitations, we propose a multi-layer transfer learning network (MTLG) for CDR based on graph node representation enhancement via multi-dimensional consistent user preferences. Firstly, the model introduces a set of globally shared semantic units to perform different-grained semantic alignment of multiple media information without clear alignment boundaries, thereby modeling multi-dimensional consistent user preference features. These features are then seamlessly integrated with the initial high-order graph structure embedding features, thus significantly improving the quality of graph node representation. Secondly, the model innovatively designs a multi-layer transfer learning network that hierarchically aligns the domain distribution differences. It calculates the similarity between domains to derive layer weights for more precise transfer learning, thereby mitigating the possibility of information error accumulation resulting from inaccurate feature aggregation processes. We conducted numerous experiments on 3 scenarios, including 7,954,943 rating information from the Amazon dataset. The results indicate that MTLG’s recommendation accuracy surpasses those of state-of-the-art methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"8940-8953"},"PeriodicalIF":9.7,"publicationDate":"2025-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145510159","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-08 | DOI: 10.1109/TMM.2025.3604977
Yuyu Jia;Qing Zhou;Junyu Gao;Qiang Li;Qi Wang
Few-shot learning aims to generalize the recognizer from seen categories to an entirely novel scenario. With only a few support samples, several advanced methods introduce class names as prior knowledge for identifying novel classes. However, obstacles still impede a comprehensive understanding of how to harness the mutual advantages of visual and textual knowledge. In this paper, we set out to fill this gap via a coherent Bidirectional Knowledge Permeation strategy called BiKop, which is grounded in human intuition: a class name description offers a more general representation, whereas an image captures the specificity of individuals. BiKop primarily establishes a hierarchical joint general-specific representation through bidirectional knowledge permeation. On the other hand, considering the bias of the joint representation towards the base set, we disentangle base-class-relevant semantics during training, thereby alleviating the suppression of potentially novel-class-relevant information. Experiments on four challenging benchmarks demonstrate the remarkable superiority of BiKop, which particularly outperforms previous methods by a substantial margin in the 1-shot setting (improving accuracy by 7.58% on miniImageNet).
{"title":"Like Humans to Few-Shot Learning Through Knowledge Permeation of Visual and Language","authors":"Yuyu Jia;Qing Zhou;Junyu Gao;Qiang Li;Qi Wang","doi":"10.1109/TMM.2025.3604977","DOIUrl":"https://doi.org/10.1109/TMM.2025.3604977","url":null,"abstract":"Few-shot learning aims to generalize the recognizer from seen categories to an entirely novel scenario. With only a few support samples, several advanced methods initially introduce class names as prior knowledge for identifying novel classes. However, obstacles still impede achieving a comprehensive understanding of how to harness the mutual advantages of visual and textual knowledge. In this paper, we set out to fill this gap via a coherent Bidirectional Knowledge Permeation strategy called BiKop, which is grounded in human intuition: a class name description offers a more <italic>general</i> representation, whereas an image captures the <italic>specificity</i> of individuals. BiKop primarily establishes a hierarchical joint general-specific representation through bidirectional knowledge permeation. On the other hand, considering the bias of joint representation towards the base set, we disentangle base-class-relevant semantics during training, thereby alleviating the suppression of potential novel-class-relevant information. Experiments on four challenging benchmarks demonstrate the remarkable superiority of BiKop, particularly outperforming previous methods by a substantial margin in the 1-shot setting (improving the accuracy by 7.58% on <italic>mini</i>ImageNet).","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"7905-7916"},"PeriodicalIF":9.7,"publicationDate":"2025-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145351925","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-02 | DOI: 10.1109/TMM.2025.3604903
Hongqi Yu;Sixian Chan;Xiaolong Zhou;Xiaoqin Zhang
Effective and robust 3D panoptic segmentation is crucial for scene perception in autonomous driving. Modern methods widely adopt multi-modal fusion based on simple feature concatenation to enhance 3D scene understanding, so the generated multi-modal representations typically lack comprehensive semantic and geometric information. These methods, which produce panoptic predictions in a single step, also lack the capability to progressively refine predictions under varying noise levels, which is essential for enhancing model robustness. To address these limitations, we first utilize the BEV space to unify the semantic-geometric perceptual representation, allowing for a more effective integration of LiDAR and camera data. Then, we propose PrimePSegter, a progressively combined diffusion 3D panoptic segmentation model that is conditioned on BEV maps to iteratively refine predictions by denoising samples drawn from a Gaussian distribution. PrimePSegter adopts a conditional encoder-decoder architecture for fine-grained panoptic prediction. Specifically, a multi-modal conditional encoder is equipped with a BEV fusion network to integrate semantic and geometric information from the LiDAR and camera streams into a unified BEV space. Additionally, a diffusion transformer decoder operates on multi-modal BEV features with varying noise levels to guide the training of the diffusion model, progressively refining the BEV panoptic representations enriched with semantics and geometry. PrimePSegter achieves state-of-the-art performance on nuScenes and competitive results on SemanticKITTI. Moreover, PrimePSegter demonstrates superior robustness across various scenarios, outperforming leading methods.
{"title":"PrimePSegter: Progressively Combined Diffusion for 3D Panoptic Segmentation With Multi-Modal BEV Refinement","authors":"Hongqi Yu;Sixian Chan;Xiaolong Zhou;Xiaoqin Zhang","doi":"10.1109/TMM.2025.3604903","DOIUrl":"https://doi.org/10.1109/TMM.2025.3604903","url":null,"abstract":"Effective and robust 3D panoptic segmentation is crucial for scene perception in autonomous driving. Modern methods widely adopt multi-modal fusion based simple feature concatenation to enhance 3D scene understanding, resulting in generated multi-modal representations typically lack comprehensive semantic and geometry information. These methods focused on panoptic prediction in a single step also limit the capability to progressively refine panoptic predictions under varying noise levels, which is essential for enhancing model robustness. To address these limitations, we first utilize BEV space to unify semantic-geometry perceptual representation, allowing for a more effective integration of LiDAR and camera data. Then, we propose PrimePSegter, a progressively combined diffusion 3D panoptic segmentation model that is conditioned on BEV maps to iteratively refine predictions by denoising samples generated from Gaussian distribution. PrimePSegter adopts a conditional encoder-decoder architecture for fine-grained panoptic predictions. Specifically, a multi-modal conditional encoder is equipped with BEV fusion network to integrate semantic and geometric information from LiDAR and camera streams into unified BEV space. Additionally, a diffusion transformer decoder operates on multi-modal BEV features with varying noise levels to guide the training of diffusion model, refining the BEV panoptic representations enriched with semantics and geometry in a progressive way. PrimePSegter achieves state-of-the-art performance on the nuScenes and competitive results on the SemanticKITTI, respectively. Moreover, PrimePSegter demonstrates superior robustness towards various scenarios, outperforming leading methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"7891-7904"},"PeriodicalIF":9.7,"publicationDate":"2025-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145351954","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-01 | DOI: 10.1109/TMM.2025.3604967
Junlin Liu;Xinchen Lyu;Chenshan Ren;Qimei Cui
Input diversity is an effective technique for crafting transferable adversarial examples that can deceive unknown AI models. Existing input-diversity-based methods typically use a single input transformation, limiting targeted transferability and robustness against defenses. Combining different transformation types is challenging, because continually increasing the number of types degrades semantic information and targeted transferability. This paper proposes a quality-aware transformation combination attack (TCA) that selects high-quality transformation combinations. The quality-aware selection enables the expansion of transformation types, enhances input diversity, and hence improves targeted transferability and robustness against defenses. We first design a quality-evaluation framework to quantify the effectiveness of transformation combinations, which jointly considers convergence, transferability, and robustness. Only a small group of images (up to 10) is required for computation-efficient quality evaluation. Experiments validate TCA's superiority over state-of-the-art baselines in adversarial transferability and robustness. When defenses are deployed, the average targeted success rate of TCA with four transformation types (i.e., TCA-t4) outperforms the best baseline by 26% to 42% on ImageNet.
{"title":"Crafting More Transferable Adversarial Examples via Quality-Aware Transformation Combination","authors":"Junlin Liu;Xinchen Lyu;Chenshan Ren;Qimei Cui","doi":"10.1109/TMM.2025.3604967","DOIUrl":"https://doi.org/10.1109/TMM.2025.3604967","url":null,"abstract":"Input diversity is an effective technique for crafting transferable adversarial examples that can deceive unknown AI models. Existing input-diversity-based methods typically use single input transformation, limiting targeted transferability and defense robustness. Combining different transformation types is challenging, as keeping increasing types would degrade semantic information and targeted transferability. This paper proposes a quality-aware <underline>t</u>ransformation <underline>c</u>ombination <underline>a</u>ttack (TCA) that selects high-quality transformation combinations. The quality-aware selection enables expansion of transformation types, enhances input diversity, and hence improves targeted transferability and defense robustness. We first design a quality-evaluation framework to quantify the effectiveness of transformation combinations, which jointly considers convergence, transferability, and robustness. Only a small group (up to 10) of images are required for computation-efficient quality evaluation. Experiments validate TCA’s superiority over state-of-the-art baselines in adversarial transferability and robustness. When defenses are secured, the average targeted success rate of TCA with four transformation types (i.e., TCA-t4) outperforms the best baseline by 26%<inline-formula><tex-math>$sim$</tex-math></inline-formula>42% on ImageNet.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"7917-7929"},"PeriodicalIF":9.7,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145351943","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-07-28 | DOI: 10.1109/TMM.2025.3590920
Huixin Hu;Feng Shao;Hangwei Chen;Xiongli Chai;Qiuping Jiang
Virtual reality technology is advancing rapidly and becoming increasingly mature, and omnidirectional images have become part of many people's daily lives. However, these images are susceptible to irreversible distortion during encoding and transmission. Given the unique deformation and distortion characteristics of omnidirectional images, developing a dedicated quality assessment method is crucial. To ensure that our network not only delivers efficient and stable performance but also maintains a minimal parameter count, we integrate the concept of knowledge distillation into our network. This involves utilizing a full-reference (FR) teacher network to guide the training of a no-reference (NR) student network through cross-projection knowledge distillation. To implement this method, a Dual Projection Format Fusion (DPFF) module is specifically designed to complement and integrate the two projection formats of omnidirectional images. In the design of the knowledge distillation process and the loss function, we introduce a review mechanism to enhance the performance and efficiency of response-based knowledge, and utilize intermediate fusion features to improve the effectiveness of feature-based knowledge. These components are combined to formulate the final loss function. Experimental results on four omnidirectional image databases validate the superiority of the proposed model over existing FR and NR methods, highlighting its effectiveness in elevating the quality assessment of omnidirectional images.
{"title":"Cross-Projection Distilling Knowledge for Omnidirectional Image Quality Assessment","authors":"Huixin Hu;Feng Shao;Hangwei Chen;Xiongli Chai;Qiuping Jiang","doi":"10.1109/TMM.2025.3590920","DOIUrl":"https://doi.org/10.1109/TMM.2025.3590920","url":null,"abstract":"Nowadays, virtual reality technology is advancing rapidly and becoming increasingly matured. Omnidirectional images have integrated into the daily lives of many individuals. However, these images are susceptible to irreversible distortion during the encoding and transmission processes. Given the unique characteristics of deformation and distortion in omnidirectional images, the development of a quality assessment method is crucial. To ensure that our network not only delivers efficient and stable performance but also maintains a minimal parameter count, we have integrated the concept of knowledge distillation into our network. This involves utilizing a full-reference (FR) teacher network to guide the training of a no-reference (NR) student network by cross-projection distilling knowledge. To specifically implement this method, a Dual Projection Format Fusion (DPFF) module is specifically designed to complement and integrate the mutual fusion of the two projection formats of omnidirectional images. In the design of our knowledge distillation process and loss function, we have introduced a review mechanism to enhance the performance and efficiency of response-based knowledge, as well as utilized intermediate fusion features to improve the effectiveness of feature-based knowledge. These components are combined to formulate the final loss function. Experimental results validate the superiority of our proposed model over existing FR and NR methods when evaluated on four omnidirectional image databases. This highlights the effectiveness of our proposed model in elevating the quality assessment of omnidirectional images.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"6752-6765"},"PeriodicalIF":9.7,"publicationDate":"2025-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141602","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}