
Latest Publications in IEEE Transactions on Multimedia

Unleashing the Potential of Hierarchical Region Clues for Open-Vocabulary Multi-Label Classification
IF 9.7 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-10-06 DOI: 10.1109/TMM.2025.3618542
Peirong Ma;Wu Ran;Zhiquan He;Jian Pu;Hong Lu
Open-vocabulary multi-label classification (OV-MLC) aims to leverage the rich multi-modal knowledge from vision-language pre-training (VLP) models to further improve the recognition ability for unseen (novel) classes beyond the training set in multi-label scenarios. Existing OV-MLC methods only perform predictions on single hierarchical regions, and aggregate the prediction scores of these regions through simple top-k mean pooling. This fails to unleash the potential of rich hierarchical region clues in multi-label images and does not fully exploit the discriminative information from all regions in the image, resulting in sub-optimal performance. In this work, we propose a novel OV-MLC framework to fully harness the power of multiple hierarchical region clues. Specifically, we first design a hierarchical clue gathering (HCG) module to gather different hierarchical clues, enabling more precise recognition of multiple object categories with different sizes in a multi-label image. Then, by viewing multi-label classification as single-label classification of each region within the image, we present a novel hierarchical score aggregation (HSA) approach, thereby better utilizing the predictions of each image region for each class. We also utilize a well-designed region selection strategy (RSS) to eliminate noise or background regions in an image that are irrelevant to classification, achieving higher multi-label classification accuracy. In addition, we propose a hybrid prompt learning (HPL) strategy to enhance visual-semantic consistency while preserving the generalization capability of label embeddings for unseen classes. Extensive experiments on public benchmark datasets demonstrate that our method significantly outperforms the current state-of-the-art.
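As context for the top-k mean pooling baseline that the abstract contrasts with HSA, here is a minimal sketch of aggregating per-region class scores by top-k mean pooling; the region scores, the number of regions, and the value of k are illustrative assumptions, not details from the paper.

```python
import torch

def topk_mean_pooling(region_scores: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Aggregate per-region class scores by averaging the top-k regions per class.

    region_scores: (R, C) similarity scores between R image regions and C classes.
    Returns a (C,) image-level score vector.
    """
    k = min(k, region_scores.shape[0])
    topk_scores, _ = region_scores.topk(k, dim=0)  # (k, C): best-k regions per class
    return topk_scores.mean(dim=0)                 # (C,): simple mean over those regions

# toy example: 12 candidate regions scored against 4 candidate classes
scores = torch.randn(12, 4)
print(topk_mean_pooling(scores, k=3))
```

The paper's HSA approach instead treats each region as a single-label classification problem and aggregates all regions' predictions per class; the pooling above is only the baseline it improves upon.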
IEEE Transactions on Multimedia, vol. 27, pp. 9832–9846.
Citations: 0
Cross-Modal Spherical Aggregation for Weakly Supervised Remote Sensing Shadow Removal
IF 9.7 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-10-06 DOI: 10.1109/TMM.2025.3618537
Kaichen Chi;Wei Jing;Junjie Li;Qiang Li;Qi Wang
Shadows are dark areas, typically rendering low illumination intensity. Admittedly, the infrared image can provide robust illumination cues that the visible image lacks, but existing methods ignore the collaboration between heterogeneous modalities. To fill this gap, we propose a weakly supervised shadow removal network with a spherical feature space, dubbed S2-ShadowNet, to explore the best of both worlds for visible and infrared modalities. Specifically, we employ a modal translation (visible-to-infrared) model to learn the cross-domain mapping, thus generating realistic infrared samples. Then, Swin Transformer is utilized to extract strong representational visible/infrared features. Simultaneously, the extracted features are mapped to the smooth spherical manifold, which alleviates the domain shift through regularization. Well-designed similarity loss and orthogonality loss are embedded into the spherical space, prompting the separation of private visible/infrared features and the alignment of shared visible/infrared features through constraints on both representation content and orientation. Such a manner encourages implicit reciprocity between modalities, thus providing a novel insight into shadow removal. Notably, ground truth is not available in practice, thus S2-ShadowNet is trained by cropping shadow and shadow-free patches from the shadow image itself, avoiding stereotypical and strict pair data acquisition. More importantly, we contribute a large-scale weakly supervised shadow removal benchmark that makes shadow removal independent of specific scenario constraints possible. Extensive experiments demonstrate that S2-ShadowNet outperforms state-of-the-art methods in both qualitative and quantitative comparisons.
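To make the spherical feature space idea concrete, below is a minimal sketch of mapping features onto the unit hypersphere together with similarity and orthogonality constraints of the kind described above; the exact loss forms (1 - cosine for shared features, squared cosine for private versus shared) and the tensor shapes are assumptions for illustration, not the paper's definitions.

```python
import torch
import torch.nn.functional as F

def to_sphere(x: torch.Tensor) -> torch.Tensor:
    """Project feature vectors onto the unit hypersphere (L2 normalization)."""
    return F.normalize(x, p=2, dim=-1)

def similarity_loss(shared_vis: torch.Tensor, shared_ir: torch.Tensor) -> torch.Tensor:
    """Pull shared visible/infrared features together on the sphere (1 - cosine)."""
    return (1.0 - (to_sphere(shared_vis) * to_sphere(shared_ir)).sum(dim=-1)).mean()

def orthogonality_loss(private_feat: torch.Tensor, shared_feat: torch.Tensor) -> torch.Tensor:
    """Push private features away from shared ones by penalizing squared cosine."""
    cos = (to_sphere(private_feat) * to_sphere(shared_feat)).sum(dim=-1)
    return (cos ** 2).mean()

vis_shared, ir_shared = torch.randn(8, 256), torch.randn(8, 256)
vis_private = torch.randn(8, 256)
print(similarity_loss(vis_shared, ir_shared).item(),
      orthogonality_loss(vis_private, vis_shared).item())
```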
IEEE Transactions on Multimedia, vol. 28, pp. 813–824.
Citations: 0
Towards Invisible Decision-Based Adversarial Attacks Against Visual Object Tracking
IF 9.7 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-10-06 DOI: 10.1109/TMM.2025.3618533
Ziyi Liu;Caiyun Xie;Wenbing Ding;Dengpan Ye;Long Tang;Qian Wang
Adversarial attacks have become a critical focus in visual object tracking (VOT) research. Small, carefully crafted adversarial perturbations to video frames can easily disrupt the visual object tracker, leading to tracking failure. Therefore, studying adversarial attacks contributes to the development of more robust and reliable trackers. Considering that trackers are agnostic in real-world scenarios, research on decision-based black-box attacks is straightforward and practical. However, existing decision-based black-box attacks neither comprehensively analyze the unique characteristics of object tracking nor sufficiently consider the imperceptibility of adversarial perturbations. In this paper, we propose invisible local attack (ILA), a novel decision-based adversarial attack specifically for VOT with imperceptible perturbations. We assume that a significant number of pixels in a frame, irrelevant to the tracked object, do not substantially contribute to the functioning mechanism of a deep tracker. Based on this consideration, we propose a search algorithm to identify the pixel set focused on by the tracker during object tracking. The adversarial noise is then confined to these pixels and iteratively optimized through a heuristic algorithm of ILA. By perturbing only the key pixels, ILA significantly enhances both the attack performance and imperceptibility when it is applied to visual object trackers. Extensive experiments demonstrate that our ILA method achieves a 121% increase in the robustness metric and a 137% improvement in the structural similarity index measure (SSIM) across multiple datasets for various trackers compared with the state-of-the-art (SOTA) method.
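A minimal sketch of the decision-based, mask-restricted attack idea described above: noise is confined to a set of key pixels and refined by a simple accept-if-it-hurts heuristic. The `tracker` callable, the random-sign proposal, the step count, and `eps` are illustrative assumptions; the paper's key-pixel search algorithm and its heuristic optimizer are not reproduced here.

```python
import torch

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-8)

def masked_decision_attack(frame, key_mask, tracker, gt_box, steps=200, eps=8 / 255):
    """Decision-based heuristic: perturb only the key pixels (key_mask == 1) and keep
    any candidate frame that further lowers the tracker's IoU with the true box."""
    adv = frame.clone()
    best = iou(tracker(adv), gt_box)
    for _ in range(steps):
        proposal = (adv + eps * torch.sign(torch.randn_like(frame)) * key_mask).clamp(0, 1)
        score = iou(tracker(proposal), gt_box)
        if score < best:          # decision-based: only the predicted box is observed
            adv, best = proposal, score
    return adv, best
```

Here `tracker` stands for any black-box function mapping a frame to a predicted bounding box.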
IEEE Transactions on Multimedia, vol. 27, pp. 9861–9872.
Citations: 0
HEVC Video Steganalysis Based on Centralized Error and Attention Mechanism
IF 9.7 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-09-22 DOI: 10.1109/TMM.2025.3613171
Haojun Dai;Dawen Xu;Lin Yang;Rangding Wang
With high embedding capacity and security, transform coefficient-based video steganography has become an important branch of video steganography. However, existing steganalysis methods against transform coefficient-based steganography give insufficient consideration to the prediction process of HEVC compression, which makes steganalysis less direct and fails to effectively detect adaptive steganography methods in low embedding rate scenarios. In this paper, an HEVC video steganalysis method based on a centralized error and attention mechanism against transform coefficient-based steganography is proposed. Firstly, the centralized error phenomenon introduced by distortion compensation-based steganography is analyzed, and prediction error maps are constructed for steganalysis to achieve a higher SNR (signal-to-noise ratio). Secondly, a video steganalysis network called CESNet (Centralized Error Steganalysis Network) is proposed. The network takes the prediction error maps as input, and four types of convolutional modules are designed to adapt to different stages of feature extraction. To address the intra-frame sparsity of adaptive steganography, CEA (Centralized Error Attention) modules based on spatial and channel attention mechanisms are proposed to adaptively enhance the steganographic region. Finally, after extracting the feature vectors of each frame, the detection of steganographic video is completed using the self-attention mechanism. Experimental results show that compared with existing transform coefficient-based video steganalysis methods, the proposed method can effectively detect multiple transform coefficient-based steganography algorithms and achieves higher detection performance in low payload scenarios.
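Two of the ingredients named above lend themselves to a short illustration: forming a prediction error map as the residual between a decoded frame and its HEVC prediction, and a channel-attention block loosely in the spirit of the CEA module. The squeeze-and-excitation form, the reduction ratio, and the shapes are assumptions for the sketch, not the paper's architecture.

```python
import torch
import torch.nn as nn

def prediction_error_map(reconstructed: torch.Tensor, predicted: torch.Tensor) -> torch.Tensor:
    """Residual between a decoded frame and its HEVC intra/inter prediction, where
    changes from coefficient-domain embedding concentrate (the "centralized error")."""
    return reconstructed - predicted

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention, an illustrative stand-in for CEA."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (B, C, H, W)
        weights = self.fc(x.mean(dim=(2, 3)))               # global average pool -> (B, C)
        return x * weights[:, :, None, None]                # re-weight feature channels

feat = torch.randn(2, 16, 32, 32)
print(ChannelAttention(16)(feat).shape)   # torch.Size([2, 16, 32, 32])
```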
IEEE Transactions on Multimedia, vol. 27, pp. 8914–8925.
Citations: 0
SPDQ: Synergetic Prompts as Disentanglement Queries for Compositional Zero-Shot Learning
IF 9.7 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-09-09 DOI: 10.1109/TMM.2025.3607726
Han Jiang;Xiaoshan Yang;Chaofan Chen;Changsheng Xu
Compositional zero-shot learning (CZSL) aims to identify novel compositions formed by known primitives (attributes and objects). Motivated by recent advancements in pre-trained vision-language models such as CLIP, many methods attempt to fine-tune CLIP for CZSL and achieve remarkable performance. However, the existing CLIP-based CZSL methods focus mainly on text prompt tuning, which lacks the flexibility to dynamically adapt both modalities. To solve this issue, an intuitive solution is to additionally introduce visual prompt tuning. This insight is not trivial to achieve because effectively learning prompts for CZSL involves the challenge of entanglement between visual primitives as well as appearance shifts in different compositions. In this paper, we propose a novel Synergetic Prompts as Disentanglement Queries (SPDQ) framework for CZSL. It can disentangle primitive features based on synergetic prompts to jointly alleviate these challenges. Specifically, we first design a low-rank primitive modulator to produce synergetic adaptive attribute and object prompts based on prior knowledge of each instance for model adaptation. Then, we additionally utilize text prefix prompts to construct synergetic prompt queries, which are used to resample corresponding visual features from local visual patches. Comprehensive experiments conducted on three benchmarks demonstrate that our SPDQ approach achieves state-of-the-art results.
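A minimal sketch of the resampling step described above, where prompt-derived queries attend over local visual patch tokens via cross-attention; the use of `nn.MultiheadAttention`, the number of queries (here two, loosely standing in for attribute and object), and the dimensions are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class PromptQueryResampler(nn.Module):
    """Learnable prompt queries attend over local visual patch tokens via
    cross-attention to pull out primitive-specific (attribute/object) features."""
    def __init__(self, dim: int = 512, num_queries: int = 2, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:   # (B, N, D)
        q = self.queries.unsqueeze(0).expand(patch_tokens.size(0), -1, -1)
        out, _ = self.attn(q, patch_tokens, patch_tokens)             # queries attend to patches
        return out                                                    # (B, num_queries, D)

patches = torch.randn(4, 196, 512)            # e.g. 14x14 ViT patch tokens from a CLIP-like encoder
print(PromptQueryResampler()(patches).shape)  # torch.Size([4, 2, 512])
```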
IEEE Transactions on Multimedia, vol. 27, pp. 8888–8899.
Citations: 0
Multi-Layer Transfer Learning for Cross-Domain Recommendation Based on Graph Node Representation Enhancement
IF 9.7 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-09-09 DOI: 10.1109/TMM.2025.3607706
Xin Ni;Jie Nie;Niantai Jing;Jianliang Xu;Xiaodong Wang;Xuesong Gao;MingXing Jiang;Chi-Hung Chi;Zhiqiang Wei
Effectively representing and transferring user preferences across various domains presents a significant challenge in cross-domain recommendation (CDR). Some approaches utilize graph neural networks that use interaction behavior to establish relationships between entities, providing a comprehensive understanding of user interests. However, the impact of consistent semantics across various types, fields, and perspectives of social media information on user preferences is overlooked, i.e. the multidimensional consistency of user preferences. This oversight results in graph node representations that inadequately reflect user preferences. To address these limitations, we propose a multi-layer transfer learning network (MTLG) for CDR based on graph node representation enhancement via multi-dimensional consistent user preferences. Firstly, the model introduces a set of globally shared semantic units to perform different-grained semantic alignment of multiple media information without clear alignment boundaries, thereby modeling multi-dimensional consistent user preference features. These features are then seamlessly integrated with the initial high-order graph structure embedding features, thus significantly improving the quality of graph node representation. Secondly, the model innovatively designs a multi-layer transfer learning network that hierarchically aligns the domain distribution differences. It calculates the similarity between domains to derive layer weights for more precise transfer learning, thereby mitigating the possibility of information error accumulation resulting from inaccurate feature aggregation processes. We conducted numerous experiments on 3 scenarios, including 7,954,943 rating information from the Amazon dataset. The results indicate that MTLG’s recommendation accuracy surpasses those of state-of-the-art methods.
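The layer-weighting idea, computing inter-domain similarity per layer and turning it into transfer weights, can be sketched as follows; using mean feature embeddings, cosine similarity, and a temperature softmax is an assumption for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def layer_transfer_weights(source_feats, target_feats, temperature: float = 0.1) -> torch.Tensor:
    """Derive per-layer transfer weights from source/target domain similarity.

    source_feats / target_feats: lists of per-layer feature matrices, (N_s, D_l) and (N_t, D_l).
    A layer's weight grows with the cosine similarity of its mean domain embeddings.
    """
    sims = []
    for s, t in zip(source_feats, target_feats):
        sims.append(F.cosine_similarity(s.mean(0, keepdim=True),
                                        t.mean(0, keepdim=True)).squeeze())
    return F.softmax(torch.stack(sims) / temperature, dim=0)  # weights sum to 1 over layers

src = [torch.randn(100, 64), torch.randn(100, 128), torch.randn(100, 256)]
tgt = [torch.randn(80, 64), torch.randn(80, 128), torch.randn(80, 256)]
print(layer_transfer_weights(src, tgt))
```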
IEEE Transactions on Multimedia, vol. 27, pp. 8940–8953.
Citations: 0
Like Humans to Few-Shot Learning Through Knowledge Permeation of Visual and Language
IF 9.7 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-09-08 DOI: 10.1109/TMM.2025.3604977
Yuyu Jia;Qing Zhou;Junyu Gao;Qiang Li;Qi Wang
Few-shot learning aims to generalize the recognizer from seen categories to an entirely novel scenario. With only a few support samples, several advanced methods initially introduce class names as prior knowledge for identifying novel classes. However, obstacles still impede achieving a comprehensive understanding of how to harness the mutual advantages of visual and textual knowledge. In this paper, we set out to fill this gap via a coherent Bidirectional Knowledge Permeation strategy called BiKop, which is grounded in human intuition: a class name description offers a more general representation, whereas an image captures the specificity of individuals. BiKop primarily establishes a hierarchical joint general-specific representation through bidirectional knowledge permeation. On the other hand, considering the bias of joint representation towards the base set, we disentangle base-class-relevant semantics during training, thereby alleviating the suppression of potential novel-class-relevant information. Experiments on four challenging benchmarks demonstrate the remarkable superiority of BiKop, particularly outperforming previous methods by a substantial margin in the 1-shot setting (improving the accuracy by 7.58% on miniImageNet).
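To illustrate a joint general-specific representation at a high level, the sketch below fuses a class-name (text) prototype with a support-image prototype and classifies queries by nearest prototype; the simple convex combination with weight `alpha` and cosine-similarity classification are assumptions standing in for BiKop's bidirectional permeation, not its actual mechanism.

```python
import torch
import torch.nn.functional as F

def joint_prototypes(support_feats, support_labels, text_embeds, alpha: float = 0.5):
    """Fuse a general (class-name text) prototype with a specific (support image)
    prototype for every class, then return the joint prototypes."""
    protos = []
    for c in range(text_embeds.size(0)):
        vis_proto = support_feats[support_labels == c].mean(dim=0)
        protos.append(alpha * F.normalize(text_embeds[c], dim=-1)
                      + (1 - alpha) * F.normalize(vis_proto, dim=-1))
    return torch.stack(protos)

def classify(query_feats, prototypes):
    """Nearest-prototype classification by cosine similarity."""
    sims = F.normalize(query_feats, dim=-1) @ F.normalize(prototypes, dim=-1).T
    return sims.argmax(dim=-1)

support, labels = torch.randn(5, 512), torch.arange(5)   # a 5-way 1-shot episode
text, queries = torch.randn(5, 512), torch.randn(10, 512)
print(classify(queries, joint_prototypes(support, labels, text)))
```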
IEEE Transactions on Multimedia, vol. 27, pp. 7905–7916.
Citations: 0
PrimePSegter: Progressively Combined Diffusion for 3D Panoptic Segmentation With Multi-Modal BEV Refinement
IF 9.7 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-09-02 DOI: 10.1109/TMM.2025.3604903
Hongqi Yu;Sixian Chan;Xiaolong Zhou;Xiaoqin Zhang
Effective and robust 3D panoptic segmentation is crucial for scene perception in autonomous driving. Modern methods widely adopt multi-modal fusion based on simple feature concatenation to enhance 3D scene understanding, so the generated multi-modal representations typically lack comprehensive semantic and geometric information. These methods, which produce panoptic predictions in a single step, also limit the capability to progressively refine predictions under varying noise levels, which is essential for enhancing model robustness. To address these limitations, we first utilize BEV space to unify the semantic-geometry perceptual representation, allowing for a more effective integration of LiDAR and camera data. Then, we propose PrimePSegter, a progressively combined diffusion 3D panoptic segmentation model that is conditioned on BEV maps to iteratively refine predictions by denoising samples generated from a Gaussian distribution. PrimePSegter adopts a conditional encoder-decoder architecture for fine-grained panoptic predictions. Specifically, a multi-modal conditional encoder is equipped with a BEV fusion network to integrate semantic and geometric information from LiDAR and camera streams into a unified BEV space. Additionally, a diffusion transformer decoder operates on multi-modal BEV features with varying noise levels to guide the training of the diffusion model, refining the BEV panoptic representations enriched with semantics and geometry in a progressive way. PrimePSegter achieves state-of-the-art performance on nuScenes and competitive results on SemanticKITTI. Moreover, PrimePSegter demonstrates superior robustness across various scenarios, outperforming leading methods.
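As a reference point for the diffusion component, here is the standard DDPM-style forward noising step that a model of this kind learns to undo; treating per-class BEV panoptic logits as the clean signal x0, and the linear beta schedule, are illustrative assumptions rather than the paper's training setup.

```python
import torch

def q_sample(x0: torch.Tensor, t: torch.Tensor, betas: torch.Tensor):
    """Forward diffusion: corrupt clean panoptic logits x0 with Gaussian noise at step t.
    A conditional decoder is then trained to undo this corruption given BEV features."""
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
    a_bar = alphas_cumprod[t]
    noise = torch.randn_like(x0)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise, noise

betas = torch.linspace(1e-4, 0.02, 1000)        # linear noise schedule (assumption)
x0 = torch.randn(1, 20, 128, 128)               # e.g. per-class BEV panoptic logits
x_t, eps = q_sample(x0, torch.tensor(500), betas)
print(x_t.shape)                                 # torch.Size([1, 20, 128, 128])
```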
IEEE Transactions on Multimedia, vol. 27, pp. 7891–7904.
Citations: 0
Crafting More Transferable Adversarial Examples via Quality-Aware Transformation Combination
IF 9.7 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-09-01 DOI: 10.1109/TMM.2025.3604967
Junlin Liu;Xinchen Lyu;Chenshan Ren;Qimei Cui
Input diversity is an effective technique for crafting transferable adversarial examples that can deceive unknown AI models. Existing input-diversity-based methods typically use a single input transformation, limiting targeted transferability and defense robustness. Combining different transformation types is challenging, as continually adding types would degrade semantic information and targeted transferability. This paper proposes a quality-aware transformation combination attack (TCA) that selects high-quality transformation combinations. The quality-aware selection enables expansion of transformation types, enhances input diversity, and hence improves targeted transferability and defense robustness. We first design a quality-evaluation framework to quantify the effectiveness of transformation combinations, which jointly considers convergence, transferability, and robustness. Only a small group (up to 10) of images is required for computation-efficient quality evaluation. Experiments validate TCA's superiority over state-of-the-art baselines in adversarial transferability and robustness. When defenses are deployed, the average targeted success rate of TCA with four transformation types (i.e., TCA-t4) outperforms the best baseline by 26%–42% on ImageNet.
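A minimal sketch of selecting a transformation combination by a quality score over a small image group, in the spirit of the abstract; the candidate torchvision transforms, the toy `quality_fn`, and the ranking procedure are assumptions for illustration, and the paper's quality-evaluation framework that jointly scores convergence, transferability, and robustness is not reproduced here.

```python
import torch
import torchvision.transforms as T

# candidate input transformations (an illustrative subset, not the paper's set)
CANDIDATES = {
    "crop":   T.RandomResizedCrop(224, scale=(0.7, 1.0)),
    "jitter": T.ColorJitter(brightness=0.2, contrast=0.2),
    "blur":   T.GaussianBlur(kernel_size=3),
    "flip":   T.RandomHorizontalFlip(p=1.0),
}

def apply_combo(images: torch.Tensor, combo) -> torch.Tensor:
    """Apply a named combination of transformations to a batch of images."""
    out = images
    for name in combo:
        out = CANDIDATES[name](out)
    return out

def select_best_combo(images, combos, quality_fn):
    """Rank candidate combinations with a caller-supplied quality score
    (e.g. one mixing convergence, transferability and robustness proxies)."""
    scored = [(quality_fn(apply_combo(images, combo)), combo) for combo in combos]
    return max(scored, key=lambda pair: pair[0])[1]

images = torch.rand(10, 3, 224, 224)             # the small evaluation group (up to 10 images)
combos = [("jitter", "flip"), ("crop", "blur"), ("blur", "jitter", "flip")]
print(select_best_combo(images, combos, quality_fn=lambda x: x.std().item()))  # toy score
```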
IEEE Transactions on Multimedia, vol. 27, pp. 7917–7929.
Citations: 0
Cross-Projection Distilling Knowledge for Omnidirectional Image Quality Assessment
IF 9.7 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-07-28 DOI: 10.1109/TMM.2025.3590920
Huixin Hu;Feng Shao;Hangwei Chen;Xiongli Chai;Qiuping Jiang
Nowadays, virtual reality technology is advancing rapidly and becoming increasingly mature. Omnidirectional images have become part of many people's daily lives. However, these images are susceptible to irreversible distortion during the encoding and transmission processes. Given the unique deformation and distortion characteristics of omnidirectional images, the development of a quality assessment method is crucial. To ensure that our network not only delivers efficient and stable performance but also maintains a minimal parameter count, we have integrated the concept of knowledge distillation into our network. This involves utilizing a full-reference (FR) teacher network to guide the training of a no-reference (NR) student network by cross-projection knowledge distillation. To implement this method, a Dual Projection Format Fusion (DPFF) module is specifically designed to complement and integrate the mutual fusion of the two projection formats of omnidirectional images. In the design of our knowledge distillation process and loss function, we have introduced a review mechanism to enhance the performance and efficiency of response-based knowledge, as well as utilized intermediate fusion features to improve the effectiveness of feature-based knowledge. These components are combined to formulate the final loss function. Experimental results validate the superiority of our proposed model over existing FR and NR methods when evaluated on four omnidirectional image databases. This highlights the effectiveness of our proposed model in elevating the quality assessment of omnidirectional images.
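The loss design described above combines response-based and feature-based distillation; a minimal sketch of such a combined loss is given below, where matching the FR teacher's quality score with MSE, matching intermediate features layer by layer, and the weights `lambda_resp` and `lambda_feat` are illustrative assumptions rather than the paper's exact terms (the review mechanism is omitted).

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_score, teacher_score, student_feats, teacher_feats,
                      lambda_resp: float = 1.0, lambda_feat: float = 0.5) -> torch.Tensor:
    """Combine response-based KD (match the FR teacher's quality score) with
    feature-based KD (match intermediate features layer by layer)."""
    resp = F.mse_loss(student_score, teacher_score.detach())
    feat = sum(F.mse_loss(s, t.detach()) for s, t in zip(student_feats, teacher_feats))
    return lambda_resp * resp + lambda_feat * feat

s_score, t_score = torch.randn(4, 1), torch.randn(4, 1)
s_feats = [torch.randn(4, 64, 32, 32), torch.randn(4, 128, 16, 16)]
t_feats = [torch.randn(4, 64, 32, 32), torch.randn(4, 128, 16, 16)]
print(distillation_loss(s_score, t_score, s_feats, t_feats).item())
```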
IEEE Transactions on Multimedia, vol. 27, pp. 6752–6765.
Citations: 0