
Latest Publications in IEEE Transactions on Multimedia

Boosting Dataset Distillation With the Assistance of Crucial Samples for Visual Learning
IF 9.7 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-10-07 DOI: 10.1109/TMM.2025.3618578
Xiaodan Li;Yao Zhu;Yuefeng Chen;Cen Chen;Jianmei Guo;Shuhui Wang
In recent years, massive datasets have significantly driven the advancement of visual learning, such as multi-modal large models, at the expense of high computational costs and extensive storage requirements. Dataset distillation (DD) aims to address this challenge by learning a small synthetic dataset such that a model trained on it can achieve a test performance comparable to that of the model trained on the original dataset. This task can be formulated as a bi-level learning problem where the outer loop optimizes the learned dataset and the inner loop updates the model parameters based on the distilled data. Different from previous studies that focus primarily on optimizing the inner loop in this bi-level problem, we delve into the task of dataset distillation from the perspective of sample cruciality. We find that discarding easy samples and keeping the hard ones that are difficult to represent with the learned synthetic samples in the outer loop can be beneficial for DD. Motivated by this observation, we further develop an Infinite Semantic Augmentation (ISA) based dataset distillation algorithm, which discards some easier samples and implicitly enriches harder ones in the semantic space through continuous interpolation between two target feature vectors. Through detailed mathematical derivation, the joint contribution of all interpolated feature points to the training loss is expressed as a closed-form solution of an integral, which can be optimized at almost no extra computational cost. Experimental results on several benchmark datasets demonstrate the effectiveness of our approach in reducing the dataset size while preserving the accuracy of the model. Furthermore, we show that high-quality distilled data can also benefit downstream applications, such as continual learning and membership inference defense.
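The outer-loop idea lends itself to a short illustration. The sketch below is not the authors' implementation; it assumes a classifier trained on the synthetic set, a linear `classifier_head` over features, batches of real features `feat_a`/`feat_b` of the same classes, and approximates the paper's closed-form integral over interpolation points with Monte Carlo sampling.

```python
# A minimal sketch (not the authors' code) of the two ideas in the abstract:
# (1) keep only the "hard" real samples that a model trained on the synthetic
#     set represents poorly, and (2) enrich pairs of target features by
#     interpolating between them. The paper derives a closed-form integral over
#     all interpolation points; here we approximate it by sampling lambdas.
import torch
import torch.nn.functional as F

def select_hard_samples(model, real_x, real_y, keep_ratio=0.5):
    """Rank real samples by loss under the synthetic-data-trained model
    and keep the hardest fraction for the outer-loop objective."""
    with torch.no_grad():
        losses = F.cross_entropy(model(real_x), real_y, reduction="none")
    k = max(1, int(keep_ratio * len(losses)))
    idx = losses.topk(k).indices
    return real_x[idx], real_y[idx]

def interpolated_feature_loss(classifier_head, feat_a, feat_b, labels, n_points=16):
    """Average classification loss over features interpolated between two
    target feature batches (Monte Carlo stand-in for the closed-form integral).
    feat_a, feat_b: (B, D) features of the same classes; labels: (B,)."""
    lam = torch.rand(n_points, 1, 1, device=feat_a.device)          # (n, 1, 1)
    mixed = lam * feat_a.unsqueeze(0) + (1 - lam) * feat_b.unsqueeze(0)  # (n, B, D)
    logits = classifier_head(mixed.flatten(0, 1))                    # (n*B, C)
    return F.cross_entropy(logits, labels.repeat(n_points))
```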
{"title":"Boosting Dataset Distillation With the Assistance of Crucial Samples for Visual Learning","authors":"Xiaodan Li;Yao Zhu;Yuefeng Chen;Cen Chen;Jianmei Guo;Shuhui Wang","doi":"10.1109/TMM.2025.3618578","DOIUrl":"https://doi.org/10.1109/TMM.2025.3618578","url":null,"abstract":"In recent years, massive datasets have significantly driven the advancement of visual learning such as multi-modal large model at the expense of high computational costs and extensive storage requirements. Dataset distillation (DD) aims to address this challenge by learning a small synthetic dataset such that a model trained on it can achieve a test performance comparable to that of the model trained on the original dataset. This task can be formulated as a bi-level learning problem where the outer loop optimizes the learned dataset and the inner loop updates the model parameters based on the distilled data. Different from previous studies that focus primarily on optimizing the inner loop in this bi-level problem, we delve into the task of dataset distillation from the perspective of sample cruciality. We find that discarding easy samples and keeping the hard ones that are difficult to be represented by the learned synthetic samples in the outer loop can be beneficial for DD. Motivated by this observation, we further develop an Infinite Semantic Augmentation (ISA) based dataset distillation algorithm, which discards some easier samples and implicitly enriches harder ones in the semantic space through continuous interpolation between two target feature vectors. Through detailed mathematical derivation, the joint contribution to the training loss of all interpolated feature points is formed into an analytical closed-form solution of an integral that can be optimized with almost no extra computational cost. Experimental results on several benchmark datasets demonstrate the effectiveness of our approach in reducing the dataset size while preserving the accuracy of the model. Furthermore, we show that high-quality distilled data can also benefit downstream applications, such as continual learning and membership inference defense.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"9873-9886"},"PeriodicalIF":9.7,"publicationDate":"2025-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145886604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Unleashing the Potential of Hierarchical Region Clues for Open-Vocabulary Multi-Label Classification
IF 9.7 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-10-06 DOI: 10.1109/TMM.2025.3618542
Peirong Ma;Wu Ran;Zhiquan He;Jian Pu;Hong Lu
Open-vocabulary multi-label classification (OV-MLC) aims to leverage the rich multi-modal knowledge from vision-language pre-training (VLP) models to further improve the recognition ability for unseen (novel) classes beyond the training set in multi-label scenarios. Existing OV-MLC methods only perform predictions on single hierarchical regions, and aggregate the prediction scores of these regions through simple top-k mean pooling. This fails to unleash the potential of rich hierarchical region clues in multi-label images and does not fully exploit the discriminative information from all regions in the image, resulting in sub-optimal performance. In this work, we propose a novel OV-MLC framework to fully harness the power of multiple hierarchical region clues. Specifically, we first design a hierarchical clue gathering (HCG) module to gather different hierarchical clues, enabling more precise recognition of multiple object categories with different sizes in a multi-label image. Then, by viewing multi-label classification as single-label classification of each region within the image, we present a novel hierarchical score aggregation (HSA) approach, thereby better utilizing the predictions of each image region for each class. We also utilize a well-designed region selection strategy (RSS) to eliminate noise or background regions in an image that are irrelevant to classification, achieving higher multi-label classification accuracy. In addition, we propose a hybrid prompt learning (HPL) strategy to enhance visual-semantic consistency while preserving the generalization capability of label embeddings for unseen classes. Extensive experiments on public benchmark datasets demonstrate that our method significantly outperforms the current state-of-the-art.
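For context, the aggregation step the abstract criticizes can be written in a few lines. The snippet below is our own illustration (shapes assumed: `R` regions scored against `C` class embeddings); it shows plain top-k mean pooling and, for contrast, a softmax-weighted variant that lets all regions contribute; the paper's HSA module is more involved.

```python
# Illustrative only: the baseline aggregation of per-region class scores,
# plus a soft alternative that weights every region instead of a hard top-k cut.
import torch

def topk_mean_pooling(region_scores: torch.Tensor, k: int = 5) -> torch.Tensor:
    """region_scores: (R, C) similarities of R regions to C class embeddings.
    Returns one score per class by averaging the k highest-scoring regions."""
    topk = region_scores.topk(min(k, region_scores.shape[0]), dim=0).values
    return topk.mean(dim=0)

def softmax_weighted_pooling(region_scores: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Weight every region by its temperature-scaled confidence for each class,
    so all regions contribute to the image-level score."""
    weights = torch.softmax(region_scores / tau, dim=0)   # (R, C), sums to 1 over regions
    return (weights * region_scores).sum(dim=0)
```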
{"title":"Unleashing the Potential of Hierarchical Region Clues for Open-Vocabulary Multi-Label Classification","authors":"Peirong Ma;Wu Ran;Zhiquan He;Jian Pu;Hong Lu","doi":"10.1109/TMM.2025.3618542","DOIUrl":"https://doi.org/10.1109/TMM.2025.3618542","url":null,"abstract":"Open-vocabulary multi-label classification (OV- MLC) aims to leverage the rich multi-modal knowledge from Vision-language pre-training (VLP) models to further improve the recognition ability for unseen (novel) classes beyond the training set in multi-label scenarios. Existing OV-MLC methods only perform predictions on single hierarchical regions, and aggregate the prediction scores of these regions through simple <italic>top-k</i> mean pooling. This fails to unleash the potential of rich hierarchical region clues in multi-label images and does not fully exploit the discriminative information from all regions in the image, resulting in sub-optimal performance. In this work, we propose a novel OV-MLC framework to fully harness the power of multiple hierarchical region clues. Specifically, we first design a hierarchical clue gathering (HCG) module to gather different hierarchical clues, enabling more precise recognition of multiple object categories with different sizes in a multi-label image. Then, by viewing multi-label classification as single-label classification of each region within the image, we present a novel hierarchical score aggregation (HSA) approach, thereby better utilizing the predictions of each image region for each class. We also utilize a well-designed region selection strategy (RSS) to eliminate noise or background regions in an image that are irrelevant to classification, achieving higher multi-label classification accuracy. In addition, we propose a hybrid prompt learning (HPL) strategy to enhance visual-semantic consistency while preserving the generalization capability of label embeddings for unseen classes. Extensive experiments on public benchmark datasets demonstrate that our method significantly outperforms the current state-of-the-art.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"9832-9846"},"PeriodicalIF":9.7,"publicationDate":"2025-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145885034","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Cross-Modal Spherical Aggregation for Weakly Supervised Remote Sensing Shadow Removal
IF 9.7 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-10-06 DOI: 10.1109/TMM.2025.3618537
Kaichen Chi;Wei Jing;Junjie Li;Qiang Li;Qi Wang
Shadows are dark areas, typically rendering low illumination intensity. Admittedly, the infrared image can provide robust illumination cues that the visible image lacks, but existing methods ignore the collaboration between heterogeneous modalities. To fill this gap, we propose a weakly supervised shadow removal network with a spherical feature space, dubbed S2-ShadowNet, to explore the best of both worlds for visible and infrared modalities. Specifically, we employ a modal translation (visible-to-infrared) model to learn the cross-domain mapping, thus generating realistic infrared samples. Then, Swin Transformer is utilized to extract strong representational visible/infrared features. Simultaneously, the extracted features are mapped to the smooth spherical manifold, which alleviates the domain shift through regularization. Well-designed similarity loss and orthogonality loss are embedded into the spherical space, prompting the separation of private visible/infrared features and the alignment of shared visible/infrared features through constraints on both representation content and orientation. This design encourages implicit reciprocity between modalities, thus providing novel insight into shadow removal. Notably, ground truth is not available in practice, thus S2-ShadowNet is trained by cropping shadow and shadow-free patches from the shadow image itself, avoiding stereotypical and strict paired data acquisition. More importantly, we contribute a large-scale weakly supervised shadow removal benchmark that makes shadow removal independent of specific scenario constraints. Extensive experiments demonstrate that S2-ShadowNet outperforms state-of-the-art methods in both qualitative and quantitative comparisons.
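One plausible instantiation of the spherical-space constraints, written under our own assumptions about feature shapes and loss forms (the exact S2-ShadowNet losses may differ), is sketched below: features are L2-normalized onto the unit sphere, shared visible/infrared features are pulled together, and private features are pushed toward orthogonality.

```python
# A minimal sketch, assuming (B, D) feature batches; not the paper's implementation.
import torch
import torch.nn.functional as F

def to_sphere(x: torch.Tensor) -> torch.Tensor:
    """L2-normalize feature vectors so they lie on the unit hypersphere."""
    return F.normalize(x, dim=-1)

def similarity_loss(shared_vis: torch.Tensor, shared_ir: torch.Tensor) -> torch.Tensor:
    """Encourage shared visible/infrared features to align on the sphere."""
    cos = F.cosine_similarity(to_sphere(shared_vis), to_sphere(shared_ir), dim=-1)
    return (1 - cos).mean()

def orthogonality_loss(private_vis: torch.Tensor, private_ir: torch.Tensor) -> torch.Tensor:
    """Encourage modality-private features to be mutually orthogonal (separated)."""
    cos = F.cosine_similarity(to_sphere(private_vis), to_sphere(private_ir), dim=-1)
    return (cos ** 2).mean()
```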
{"title":"Cross-Modal Spherical Aggregation for Weakly Supervised Remote Sensing Shadow Removal","authors":"Kaichen Chi;Wei Jing;Junjie Li;Qiang Li;Qi Wang","doi":"10.1109/TMM.2025.3618537","DOIUrl":"https://doi.org/10.1109/TMM.2025.3618537","url":null,"abstract":"Shadows are dark areas, typically rendering low illumination intensity. Admittedly, the infrared image can provide robust illumination cues that the visible image lacks, but existing methods ignore the collaboration between heterogeneous modalities. To fill this gap, we propose a weakly supervised shadow removal network with a spherical feature space, dubbed S2-ShadowNet, to explore the best of both worlds for visible and infrared modalities. Specifically, we employ a modal translation (visible-to-infrared) model to learn the cross-domain mapping, thus generating realistic infrared samples. Then, Swin Transformer is utilized to extract strong representational visible/infrared features. Simultaneously, the extracted features are mapped to the smooth spherical manifold, which alleviates the domain shift through regularization. Well-designed similarity loss and orthogonality loss are embedded into the spherical space, prompting the separation of private visible/infrared features and the alignment of shared visible/infrared features through constraints on both representation content and orientation. Such a manner encourages implicit reciprocity between modalities, thus providing a novel insight into shadow removal. Notably, ground truth is not available in practice, thus S2-ShadowNet is trained by cropping shadow and shadow-free patches from the shadow image itself, avoiding stereotypical and strict pair data acquisition. More importantly, we contribute a large-scale weakly supervised shadow removal benchmark that makes shadow removal independent of specific scenario constraints possible. Extensive experiments demonstrate that S2-ShadowNet outperforms state-of-the-art methods in both qualitative and quantitative comparisons.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"813-824"},"PeriodicalIF":9.7,"publicationDate":"2025-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145929529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Towards Invisible Decision-Based Adversarial Attacks Against Visual Object Tracking
IF 9.7 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-10-06 DOI: 10.1109/TMM.2025.3618533
Ziyi Liu;Caiyun Xie;Wenbing Ding;Dengpan Ye;Long Tang;Qian Wang
Adversarial attacks have become a critical focus in visual object tracking (VOT) research. Small, carefully crafted adversarial perturbations to video frames can easily disrupt the visual object tracker, leading to tracking failure. Therefore, studying adversarial attacks contributes to the development of more robust and reliable trackers. Considering that trackers are agnostic in real-world scenarios, research on decision-based black-box attacks is straightforward and practical. However, existing decision-based black-box attacks neither comprehensively analyze the unique characteristics of object tracking nor sufficiently consider the imperceptibility of adversarial perturbations. In this paper, we propose invisible local attack (ILA), a novel decision-based adversarial attack specifically for VOT with imperceptible perturbations. We assume that a significant number of pixels in a frame, irrelevant to the tracked object, do not substantially contribute to the functioning mechanism of a deep tracker. Based on this consideration, we propose a search algorithm to identify the pixel set focused on by the tracker during object tracking. The adversarial noise is then confined to these pixels and iteratively optimized through a heuristic algorithm of ILA. By perturbing only the key pixels, ILA significantly enhances both the attack performance and imperceptibility when it is applied to visual object trackers. Extensive experiments demonstrate that our ILA method achieves a 121% increase in the robustness metric and a 137% improvement in the structural similarity index measure (SSIM) across multiple datasets for various trackers compared with the state-of-the-art (SOTA) method.
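As a rough illustration of the "perturb only the key pixels" step (ILA's pixel-set search and heuristic optimization are described in the paper, not here), the sketch below confines a noise tensor to a hypothetical key-pixel mask and clips it to an L-infinity budget.

```python
# Illustrative sketch only: once a set of "key" pixels has been identified,
# the adversarial noise is restricted to those pixels via a binary mask and
# kept within a small per-pixel budget before being added to the frame.
import numpy as np

def apply_masked_perturbation(frame: np.ndarray, noise: np.ndarray,
                              key_mask: np.ndarray, eps: float = 8.0) -> np.ndarray:
    """frame: HxWx3 uint8 video frame; noise: HxWx3 float perturbation;
    key_mask: HxW boolean mask of pixels the tracker is assumed to focus on."""
    noise = np.clip(noise, -eps, eps)            # L-infinity budget
    noise = noise * key_mask[..., None]          # zero the noise outside key pixels
    adv = np.clip(frame.astype(np.float32) + noise, 0, 255)
    return adv.astype(np.uint8)
```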
{"title":"Towards Invisible Decision-Based Adversarial Attacks Against Visual Object Tracking","authors":"Ziyi Liu;Caiyun Xie;Wenbing Ding;Dengpan Ye;Long Tang;Qian Wang","doi":"10.1109/TMM.2025.3618533","DOIUrl":"https://doi.org/10.1109/TMM.2025.3618533","url":null,"abstract":"Adversarial attacks have become a critical focus in visual object tracking (VOT) research. Small, carefully crafted adversarial perturbations to video frames can easily disrupt the visual object tracker, leading to tracking failure. Therefore, studying adversarial attacks contributes to the development of more robust and reliable trackers. Considering that trackers are agnostic in real-world scenarios, research on decision-based black-box attacks is straightforward and practical. However, existing decision-based black-box attacks neither comprehensively analyze the unique characteristics of object tracking nor sufficiently consider the imperceptibility of adversarial perturbations. In this paper, we propose invisible local attack (ILA), a novel decision-based adversarial attack specifically for VOT with imperceptible perturbations. We assume that a significant number of pixels in a frame, irrelevant to the tracked object, do not substantially contribute to the functioning mechanism of a deep tracker. Based on this consideration, we propose a search algorithm to identify the pixel set focused on by the tracker during object tracking. The adversarial noise is then confined to these pixels and iteratively optimized through a heuristic algorithm of ILA. By perturbing only the key pixels, ILA significantly enhances both the attack performance and imperceptibility when it is applied to visual object trackers. Extensive experiments demonstrate that our ILA method achieves a 121% increase in the robustness metric and a 137% improvement in the structural similarity index measure (SSIM) across multiple datasets for various trackers compared with the state-of-the-art (SOTA) method.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"9861-9872"},"PeriodicalIF":9.7,"publicationDate":"2025-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145886608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
CAD-Mesher: A Convenient, Accurate, Dense Mesh-Based Mapping Module in SLAM for Dynamic Environments
IF 9.7 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-10-06 DOI: 10.1109/TMM.2025.3618573
Yanpeng Jia;Fengkui Cao;Ting Wang;Yandong Tang;Shiliang Shao;Lianqing Liu
Most LiDAR odometry and SLAM systems construct maps as point clouds, which are discrete and sparse when zoomed in, making them not directly suitable for navigation. Mesh maps represent a dense and continuous map format with low memory consumption, which can approximate complex structures with simple elements, attracting significant attention from researchers in recent years. However, most existing methods operate under a static environment assumption. In effect, moving objects cause ghosting, degrading the quality of meshing. To address these issues, we propose a plug-and-play meshing module adapted to dynamic environments, which can be easily integrated with various LiDAR odometry systems to improve their pose estimation accuracy. In our meshing module, a novel two-stage coarse-to-fine dynamic removal method is designed to effectively filter dynamic objects, generating consistent, accurate, and dense mesh maps. To the best of our knowledge, this is the first mesh construction method with explicit dynamic removal. Additionally, sliding window-based keyframe aggregation and adaptive downsampling strategies are used to ensure the uniformity of the point cloud, benefiting the Gaussian process used in mesh construction. We evaluate the localization and mapping accuracy on six publicly available datasets. Extensive experiments demonstrate the superiority of our method compared with state-of-the-art algorithms. The code and introduction video are publicly available at https://yaepiii.github.io/CAD-Mesher/.
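As a simple illustration of why downsampling helps, the sketch below implements plain fixed-size voxel-grid downsampling with NumPy; the module's actual strategy is adaptive, so treat this only as a stand-in for the idea of evening out point density before meshing.

```python
# A plain voxel-grid downsampling sketch: keep one centroid per occupied voxel
# so the point density becomes roughly uniform across the map.
import numpy as np

def voxel_downsample(points: np.ndarray, voxel_size: float = 0.2) -> np.ndarray:
    """points: (N, 3) array of 3D points. Returns one centroid per voxel."""
    keys = np.floor(points / voxel_size).astype(np.int64)
    _, inverse = np.unique(keys, axis=0, return_inverse=True)  # voxel id per point
    n_voxels = inverse.max() + 1
    sums = np.zeros((n_voxels, 3))
    counts = np.zeros(n_voxels)
    np.add.at(sums, inverse, points)   # accumulate point coordinates per voxel
    np.add.at(counts, inverse, 1)      # count points per voxel
    return sums / counts[:, None]
```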
{"title":"CAD-Mesher: A Convenient, Accurate, Dense Mesh-Based Mapping Module in SLAM for Dynamic Environments","authors":"Yanpeng Jia;Fengkui Cao;Ting Wang;Yandong Tang;Shiliang Shao;Lianqing Liu","doi":"10.1109/TMM.2025.3618573","DOIUrl":"https://doi.org/10.1109/TMM.2025.3618573","url":null,"abstract":"Most LiDAR odometry and SLAM systems construct maps in point clouds, which are discrete and sparse when zoomed in, making them not directly suitable for navigation. Mesh maps represent a dense and continuous map format with low memory consumption, which can approximate complex structures with simple elements, attracting significant attention of researchers in recent years. However, most existing methods operate under a static environment assumption. In effect, moving objects cause ghosting, degrading the quality of meshing. To address these issues, we propose a plug-and-play meshing module adapting to dynamic environments, which can easily integrate with various LiDAR odometry to generally improve the pose estimation accuracy of odometry. In our meshing module, a novel two-stage coarse-to-fine dynamic removal method is designed to effectively filter dynamic objects, generating consistent, accurate, and dense mesh maps. To the best of our knowledge, this is the first mesh construction method with explicit dynamic removal. Additionally, sliding window-based keyframe aggregation and adaptive downsampling strategies are used to ensure the uniformity of point cloud, benefiting for Gaussian process in mesh construction. We evaluate the localization and mapping accuracy on six publicly available datasets. Extensive experiments demonstrate the superiority of our method compared with the state-of-the-art algorithms. The code and introduction video are publicly available at <uri>https://yaepiii.github.io/CAD-Mesher/</uri>.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"1025-1036"},"PeriodicalIF":9.7,"publicationDate":"2025-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146175626","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
HEVC Video Steganalysis Based on Centralized Error and Attention Mechanism
IF 9.7 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-09-22 DOI: 10.1109/TMM.2025.3613171
Haojun Dai;Dawen Xu;Lin Yang;Rangding Wang
With high embedding capacity and security, transform coefficient-based video steganography has become an important branch of video steganography. However, existing steganalysis methods against transform coefficient-based steganography give insufficient consideration to the prediction process of HEVC compression, which makes the steganalysis less direct and fails to effectively detect adaptive steganography methods in low embedding rate scenarios. In this paper, an HEVC video steganalysis method based on a centralized error and attention mechanism against transform coefficient-based steganography is proposed. Firstly, the centralized error phenomenon brought by distortion compensation-based steganography is analyzed, and prediction error maps are constructed for steganalysis to achieve a higher SNR (signal-to-noise ratio). Secondly, a video steganalysis network called CESNet (Centralized Error Steganalysis Network) is proposed. The network takes the prediction error maps as input, and four types of convolutional modules are designed to adapt to different stages of feature extraction. To address the intra-frame sparsity of adaptive steganography, CEA (Centralized Error Attention) modules based on spatial and channel attention mechanisms are proposed to adaptively enhance the steganographic region. Finally, after extracting the feature vectors of each frame, the detection of steganographic video is completed using the self-attention mechanism. Experimental results show that compared with existing transform coefficient-based video steganalysis methods, the proposed method can effectively detect multiple transform coefficient-based steganography algorithms and achieves higher detection performance in low payload scenarios.
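The notion of a prediction error map can be made concrete with a short sketch; in the code below (our assumption of the setup, not CESNet code) the prediction would come from the HEVC decoder's intra/inter prediction, for which a placeholder argument stands in.

```python
# A minimal sketch: the prediction error map is the per-pixel difference between
# the decoded frame and its prediction; coefficient-domain embedding distortion
# concentrates in this residual, which is why it offers a higher-SNR input for
# the steganalysis network than raw pixels.
import numpy as np

def prediction_error_map(decoded_luma: np.ndarray, predicted_luma: np.ndarray) -> np.ndarray:
    """decoded_luma, predicted_luma: HxW luma planes of one frame."""
    return decoded_luma.astype(np.float32) - predicted_luma.astype(np.float32)

def stack_frame_errors(error_maps: list[np.ndarray]) -> np.ndarray:
    """Stack per-frame error maps into a (T, H, W) clip-level input tensor."""
    return np.stack(error_maps, axis=0)
```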
{"title":"HEVC Video Steganalysis Based on Centralized Error and Attention Mechanism","authors":"Haojun Dai;Dawen Xu;Lin Yang;Rangding Wang","doi":"10.1109/TMM.2025.3613171","DOIUrl":"https://doi.org/10.1109/TMM.2025.3613171","url":null,"abstract":"With high embedding capacity and security, transform coefficient-based video steganography has become an important branch of video steganography. However, existing steganalysis methods against transform coefficient-based steganography provide insufficient consideration to the prediction process of HEVC compression, which results in steganalysis that is not straightforward and fail to effectively detect adaptive steganography methods in low embedding rate scenarios. In this paper, an HEVC video steganalysis method based on centralized error and attention mechanism against transform coefficient-based steganography is proposed. Firstly, the centralized error phenomenon brought by distortion compensation-based steganography is analyzed, and prediction error maps is constructed for steganalysis to achieve higher SNR(signal-to-noise ratio). Secondly, a video steganalysis network called CESNet (Centralized Error Steganalysis Network) is proposed. The network takes the prediction error maps as input and four types of convolutional modules are designed to adapt to different stages of feature extraction. To address the intra-frame sparsity of adaptive steganography, CEA (Centralized Error Attention) modules based on spatial and channel attention mechanisms are proposed to adaptively enhance the steganographic region. Finally, after extracting the feature vectors of each frame, the detection of steganographic video is completed using the self-attention mechanism. Experimental results show that compared with the existing transform coefficient-based video steganalysis methods, the proposed method can effectively detect multiple transform coefficient-based steganography algorithms and achieve higher detection performance in low payload scenarios.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"8914-8925"},"PeriodicalIF":9.7,"publicationDate":"2025-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145510152","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
SPDQ: Synergetic Prompts as Disentanglement Queries for Compositional Zero-Shot Learning
IF 9.7 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-09-09 DOI: 10.1109/TMM.2025.3607726
Han Jiang;Xiaoshan Yang;Chaofan Chen;Changsheng Xu
Compositional zero-shot learning (CZSL) aims to identify novel compositions formed by known primitives (attributes and objects). Motivated by recent advancements in pre-trained vision-language models such as CLIP, many methods attempt to fine-tune CLIP for CZSL and achieve remarkable performance. However, the existing CLIP-based CZSL methods focus mainly on text prompt tuning, which lacks the flexibility to dynamically adapt both modalities. To solve this issue, an intuitive solution is to additionally introduce visual prompt tuning. This insight is not trivial to achieve because effectively learning prompts for CZSL involves the challenge of entanglement between visual primitives as well as appearance shifts in different compositions. In this paper, we propose a novel Synergetic Prompts as Disentanglement Queries (SPDQ) framework for CZSL. It can disentangle primitive features based on synergetic prompts to jointly alleviate these challenges. Specifically, we first design a low-rank primitive modulator to produce synergetic adaptive attribute and object prompts based on prior knowledge of each instance for model adaptation. Then, we additionally utilize text prefix prompts to construct synergetic prompt queries, which are used to resample corresponding visual features from local visual patches. Comprehensive experiments conducted on three benchmarks demonstrate that our SPDQ approach achieves state-of-the-art results.
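A minimal sketch of what a low-rank primitive modulator could look like, with shapes and layer names we assume for illustration (the SPDQ design may differ): an instance feature passes through a low-rank bottleneck and produces additive offsets for the attribute and object prompt embeddings.

```python
# Sketched with assumed shapes; not the SPDQ implementation.
import torch
import torch.nn as nn

class LowRankPromptModulator(nn.Module):
    def __init__(self, feat_dim: int, prompt_dim: int, rank: int = 16):
        super().__init__()
        self.down = nn.Linear(feat_dim, rank, bias=False)      # low-rank bottleneck
        self.up_attr = nn.Linear(rank, prompt_dim, bias=False)  # attribute branch
        self.up_obj = nn.Linear(rank, prompt_dim, bias=False)   # object branch

    def forward(self, image_feat, attr_prompt, obj_prompt):
        """image_feat: (B, feat_dim); attr_prompt/obj_prompt: (prompt_dim,) base embeddings.
        Returns instance-adaptive attribute and object prompts, shape (B, prompt_dim)."""
        z = self.down(image_feat)                                # (B, rank)
        return attr_prompt + self.up_attr(z), obj_prompt + self.up_obj(z)
```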
{"title":"SPDQ: Synergetic Prompts as Disentanglement Queries for Compositional Zero-Shot Learning","authors":"Han Jiang;Xiaoshan Yang;Chaofan Chen;Changsheng Xu","doi":"10.1109/TMM.2025.3607726","DOIUrl":"https://doi.org/10.1109/TMM.2025.3607726","url":null,"abstract":"Compositional zero-shot learning (CZSL) aims to identify novel compositions formed by known primitives (attributes and objects). Motivated by recent advancements in pre-trained vision-language models such as CLIP, many methods attempt to fine-tune CLIP for CZSL and achieve remarkable performance. However, the existing CLIP-based CZSL methods focus mainly on text prompt tuning, which lacks the flexibility to dynamically adapt both modalities. To solve this issue, an intuitive solution is to additionally introduce visual prompt tuning. This insight is not trivial to achieve because effectively learning prompts for CZSL involves the challenge of entanglement between visual primitives as well as appearance shifts in different compositions. In this paper, we propose a novel Synergetic Prompts as Disentanglement Queries (SPDQ) framework for CZSL. It can disentangle primitive features based on synergetic prompts to jointly alleviate these challenges. Specifically, we first design a low-rank primitive modulator to produce synergetic adaptive attribute and object prompts based on prior knowledge of each instance for model adaptation. Then, we additionally utilize text prefix prompts to construct synergetic prompt queries, which are used to resample corresponding visual features from local visual patches. Comprehensive experiments conducted on three benchmarks demonstrate that our SPDQ approach achieves state-of-the-art results.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"8888-8899"},"PeriodicalIF":9.7,"publicationDate":"2025-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145510101","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Multi-Layer Transfer Learning for Cross-Domain Recommendation Based on Graph Node Representation Enhancement
IF 9.7 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-09-09 DOI: 10.1109/TMM.2025.3607706
Xin Ni;Jie Nie;Niantai Jing;Jianliang Xu;Xiaodong Wang;Xuesong Gao;MingXing Jiang;Chi-Hung Chi;Zhiqiang Wei
Effectively representing and transferring user preferences across various domains presents a significant challenge in cross-domain recommendation (CDR). Some approaches utilize graph neural networks that use interaction behavior to establish relationships between entities, providing a comprehensive understanding of user interests. However, the impact on user preferences of consistent semantics across various types, fields, and perspectives of social media information, i.e., the multi-dimensional consistency of user preferences, is overlooked. This oversight results in graph node representations that inadequately reflect user preferences. To address these limitations, we propose a multi-layer transfer learning network (MTLG) for CDR based on graph node representation enhancement via multi-dimensional consistent user preferences. Firstly, the model introduces a set of globally shared semantic units to perform different-grained semantic alignment of multiple media information without clear alignment boundaries, thereby modeling multi-dimensional consistent user preference features. These features are then seamlessly integrated with the initial high-order graph structure embedding features, thus significantly improving the quality of graph node representation. Secondly, the model innovatively designs a multi-layer transfer learning network that hierarchically aligns the domain distribution differences. It calculates the similarity between domains to derive layer weights for more precise transfer learning, thereby mitigating the possibility of information error accumulation resulting from inaccurate feature aggregation processes. We conducted extensive experiments on 3 scenarios, using 7,954,943 ratings from the Amazon dataset. The results indicate that MTLG's recommendation accuracy surpasses that of state-of-the-art methods.
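One reasonable way to turn domain similarity into layer weights, sketched under our own assumptions (MTLG's exact formulation is in the paper): compare per-layer mean embeddings of the source and target domains and softmax the similarities into transfer weights.

```python
# A rough sketch of layer-wise transfer weighting driven by domain similarity;
# one plausible instantiation, not the MTLG formula.
import torch
import torch.nn.functional as F

def layer_transfer_weights(src_feats: list[torch.Tensor],
                           tgt_feats: list[torch.Tensor],
                           tau: float = 0.5) -> torch.Tensor:
    """src_feats/tgt_feats: per-layer feature batches, each of shape (N, D_l).
    Layers whose source and target mean embeddings are more similar get larger weights."""
    sims = torch.stack([
        F.cosine_similarity(s.mean(0), t.mean(0), dim=0)
        for s, t in zip(src_feats, tgt_feats)
    ])
    return torch.softmax(sims / tau, dim=0)   # weights over layers, sum to 1
```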
{"title":"Multi-Layer Transfer Learning for Cross-Domain Recommendation Based on Graph Node Representation Enhancement","authors":"Xin Ni;Jie Nie;Niantai Jing;Jianliang Xu;Xiaodong Wang;Xuesong Gao;MingXing Jiang;Chi-Hung Chi;Zhiqiang Wei","doi":"10.1109/TMM.2025.3607706","DOIUrl":"https://doi.org/10.1109/TMM.2025.3607706","url":null,"abstract":"Effectively representing and transferring user preferences across various domains presents a significant challenge in cross-domain recommendation (CDR). Some approaches utilize graph neural networks that use interaction behavior to establish relationships between entities, providing a comprehensive understanding of user interests. However, the impact of consistent semantics across various types, fields, and perspectives of social media information on user preferences is overlooked, i.e. the multidimensional consistency of user preferences. This oversight results in graph node representations that inadequately reflect user preferences. To address these limitations, we propose a multi-layer transfer learning network (MTLG) for CDR based on graph node representation enhancement via multi-dimensional consistent user preferences. Firstly, the model introduces a set of globally shared semantic units to perform different-grained semantic alignment of multiple media information without clear alignment boundaries, thereby modeling multi-dimensional consistent user preference features. These features are then seamlessly integrated with the initial high-order graph structure embedding features, thus significantly improving the quality of graph node representation. Secondly, the model innovatively designs a multi-layer transfer learning network that hierarchically aligns the domain distribution differences. It calculates the similarity between domains to derive layer weights for more precise transfer learning, thereby mitigating the possibility of information error accumulation resulting from inaccurate feature aggregation processes. We conducted numerous experiments on 3 scenarios, including 7,954,943 rating information from the Amazon dataset. The results indicate that MTLG’s recommendation accuracy surpasses those of state-of-the-art methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"8940-8953"},"PeriodicalIF":9.7,"publicationDate":"2025-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145510159","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Like Humans to Few-Shot Learning Through Knowledge Permeation of Visual and Language
IF 9.7 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-09-08 DOI: 10.1109/TMM.2025.3604977
Yuyu Jia;Qing Zhou;Junyu Gao;Qiang Li;Qi Wang
Few-shot learning aims to generalize the recognizer from seen categories to an entirely novel scenario. With only a few support samples, several advanced methods initially introduce class names as prior knowledge for identifying novel classes. However, obstacles still impede achieving a comprehensive understanding of how to harness the mutual advantages of visual and textual knowledge. In this paper, we set out to fill this gap via a coherent Bidirectional Knowledge Permeation strategy called BiKop, which is grounded in human intuition: a class name description offers a more general representation, whereas an image captures the specificity of individuals. BiKop primarily establishes a hierarchical joint general-specific representation through bidirectional knowledge permeation. On the other hand, considering the bias of joint representation towards the base set, we disentangle base-class-relevant semantics during training, thereby alleviating the suppression of potential novel-class-relevant information. Experiments on four challenging benchmarks demonstrate the remarkable superiority of BiKop, particularly outperforming previous methods by a substantial margin in the 1-shot setting (improving the accuracy by 7.58% on miniImageNet).
{"title":"Like Humans to Few-Shot Learning Through Knowledge Permeation of Visual and Language","authors":"Yuyu Jia;Qing Zhou;Junyu Gao;Qiang Li;Qi Wang","doi":"10.1109/TMM.2025.3604977","DOIUrl":"https://doi.org/10.1109/TMM.2025.3604977","url":null,"abstract":"Few-shot learning aims to generalize the recognizer from seen categories to an entirely novel scenario. With only a few support samples, several advanced methods initially introduce class names as prior knowledge for identifying novel classes. However, obstacles still impede achieving a comprehensive understanding of how to harness the mutual advantages of visual and textual knowledge. In this paper, we set out to fill this gap via a coherent Bidirectional Knowledge Permeation strategy called BiKop, which is grounded in human intuition: a class name description offers a more <italic>general</i> representation, whereas an image captures the <italic>specificity</i> of individuals. BiKop primarily establishes a hierarchical joint general-specific representation through bidirectional knowledge permeation. On the other hand, considering the bias of joint representation towards the base set, we disentangle base-class-relevant semantics during training, thereby alleviating the suppression of potential novel-class-relevant information. Experiments on four challenging benchmarks demonstrate the remarkable superiority of BiKop, particularly outperforming previous methods by a substantial margin in the 1-shot setting (improving the accuracy by 7.58% on <italic>mini</i>ImageNet).","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"7905-7916"},"PeriodicalIF":9.7,"publicationDate":"2025-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145351925","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
PrimePSegter: Progressively Combined Diffusion for 3D Panoptic Segmentation With Multi-Modal BEV Refinement
IF 9.7 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-09-02 DOI: 10.1109/TMM.2025.3604903
Hongqi Yu;Sixian Chan;Xiaolong Zhou;Xiaoqin Zhang
Effective and robust 3D panoptic segmentation is crucial for scene perception in autonomous driving. Modern methods widely adopt simple feature concatenation for multi-modal fusion to enhance 3D scene understanding, so the generated multi-modal representations typically lack comprehensive semantic and geometric information. These methods, which produce panoptic predictions in a single step, also lack the capability to progressively refine predictions under varying noise levels, which is essential for enhancing model robustness. To address these limitations, we first utilize BEV space to unify the semantic-geometric perceptual representation, allowing for a more effective integration of LiDAR and camera data. Then, we propose PrimePSegter, a progressively combined diffusion 3D panoptic segmentation model that is conditioned on BEV maps to iteratively refine predictions by denoising samples drawn from a Gaussian distribution. PrimePSegter adopts a conditional encoder-decoder architecture for fine-grained panoptic predictions. Specifically, a multi-modal conditional encoder is equipped with a BEV fusion network to integrate semantic and geometric information from LiDAR and camera streams into a unified BEV space. Additionally, a diffusion transformer decoder operates on multi-modal BEV features with varying noise levels to guide the training of the diffusion model, refining the BEV panoptic representations enriched with semantics and geometry in a progressive way. PrimePSegter achieves state-of-the-art performance on nuScenes and competitive results on SemanticKITTI. Moreover, PrimePSegter demonstrates superior robustness in various scenarios, outperforming leading methods.
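The denoising training loop follows the standard diffusion recipe; the sketch below is a generic DDPM-style step conditioned on BEV features (names such as `denoiser` and the tensor shapes are our assumptions, not the PrimePSegter implementation).

```python
# A generic DDPM-style training step: Gaussian noise is added to the panoptic
# target, and a denoiser conditioned on the fused multi-modal BEV features is
# trained to predict that noise.
import torch
import torch.nn.functional as F

def diffusion_training_step(denoiser, panoptic_target, bev_features,
                            alphas_cumprod, num_timesteps=1000):
    """panoptic_target: (B, C, H, W) encoding of the panoptic map;
    bev_features: (B, F, H, W) conditioning; alphas_cumprod: (T,) noise schedule."""
    b = panoptic_target.shape[0]
    t = torch.randint(0, num_timesteps, (b,), device=panoptic_target.device)
    noise = torch.randn_like(panoptic_target)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    noisy = a_bar.sqrt() * panoptic_target + (1 - a_bar).sqrt() * noise
    pred_noise = denoiser(noisy, t, bev_features)   # conditioned on BEV maps
    return F.mse_loss(pred_noise, noise)
```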
{"title":"PrimePSegter: Progressively Combined Diffusion for 3D Panoptic Segmentation With Multi-Modal BEV Refinement","authors":"Hongqi Yu;Sixian Chan;Xiaolong Zhou;Xiaoqin Zhang","doi":"10.1109/TMM.2025.3604903","DOIUrl":"https://doi.org/10.1109/TMM.2025.3604903","url":null,"abstract":"Effective and robust 3D panoptic segmentation is crucial for scene perception in autonomous driving. Modern methods widely adopt multi-modal fusion based simple feature concatenation to enhance 3D scene understanding, resulting in generated multi-modal representations typically lack comprehensive semantic and geometry information. These methods focused on panoptic prediction in a single step also limit the capability to progressively refine panoptic predictions under varying noise levels, which is essential for enhancing model robustness. To address these limitations, we first utilize BEV space to unify semantic-geometry perceptual representation, allowing for a more effective integration of LiDAR and camera data. Then, we propose PrimePSegter, a progressively combined diffusion 3D panoptic segmentation model that is conditioned on BEV maps to iteratively refine predictions by denoising samples generated from Gaussian distribution. PrimePSegter adopts a conditional encoder-decoder architecture for fine-grained panoptic predictions. Specifically, a multi-modal conditional encoder is equipped with BEV fusion network to integrate semantic and geometric information from LiDAR and camera streams into unified BEV space. Additionally, a diffusion transformer decoder operates on multi-modal BEV features with varying noise levels to guide the training of diffusion model, refining the BEV panoptic representations enriched with semantics and geometry in a progressive way. PrimePSegter achieves state-of-the-art performance on the nuScenes and competitive results on the SemanticKITTI, respectively. Moreover, PrimePSegter demonstrates superior robustness towards various scenarios, outperforming leading methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"7891-7904"},"PeriodicalIF":9.7,"publicationDate":"2025-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145351954","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0