
Latest publications: IEEE Transactions on Pattern Analysis and Machine Intelligence

MSFA Image Denoising Using Physics-Based Noise Model and Noise-Decoupled Network
IF 18.6 | Pub Date: 2025-09-16 | DOI: 10.1109/TPAMI.2025.3610243
Yuqi Jiang;Ying Fu;Qiankun Liu;Jun Zhang
Multispectral filter array (MSFA) cameras are increasingly used due to their compact size and fast capture speed. However, because of their narrow-band property, they often suffer from light deficiency, and the captured images are easily overwhelmed by noise. Neural networks, as a commonly used class of denoising methods, have shown their power to achieve satisfactory denoising results. However, their performance highly depends on high-quality noisy-clean image pairs. For the task of MSFA image denoising, there is currently neither a paired real dataset nor an accurate noise model capable of generating realistic noisy images. To this end, we present a physics-based noise model that is able to match the real noise distribution and synthesize realistic noisy images. In our noise model, the different types of noise are divided into a SimpleDist component and a ComplexDist component. The former contains all types of noise that can be described by a simple probability distribution such as a Gaussian or Poisson distribution, while the latter contains the complicated color bias noise that cannot be modeled by a simple probability distribution. In addition, we design a noise-decoupled network consisting of a SimpleDist noise removal network (SNRNet) and a ComplexDist noise removal network (CNRNet) to sequentially remove each component. Moreover, according to the non-uniformity of the color bias noise in our noise model, we introduce a learnable position embedding in CNRNet to encode position information. To verify the effectiveness of our physics-based noise model and noise-decoupled network, we collect a real MSFA denoising dataset with paired long-exposure clean images and short-exposure noisy images. Experiments show that a network trained on synthetic data generated by our noise model performs as well as one trained on paired real data, and that our noise-decoupled network outperforms other state-of-the-art denoising methods.
The project page is available at https://github.com/ying-fu/msfa.
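The abstract describes but does not fully specify the physics-based noise synthesis, so below is a minimal sketch under stated assumptions: Poisson shot noise and Gaussian read noise play the role of the SimpleDist component, and a smooth, channel-dependent bias field stands in for the ComplexDist color-bias term. The function name `synthesize_noisy` and all parameter values (`gain`, `read_sigma`, `bias_amplitude`) are illustrative, not taken from the paper.

```python
import numpy as np

def synthesize_noisy(clean, gain=0.01, read_sigma=2.0, bias_amplitude=1.5, seed=0):
    """clean: float32 array (H, W, C) of linear intensities in [0, 255]."""
    rng = np.random.default_rng(seed)
    # SimpleDist part 1: photon shot noise, Poisson in the photon-count domain
    shot = rng.poisson(clean / gain) * gain
    # SimpleDist part 2: signal-independent Gaussian read noise
    read = rng.normal(0.0, read_sigma, size=clean.shape)
    # ComplexDist stand-in: a smooth, channel-dependent bias field, mimicking the
    # spatially non-uniform color-bias noise the paper models separately
    h, w, c = clean.shape
    yy = np.linspace(0.0, np.pi, h)[:, None, None]
    xx = np.linspace(0.0, np.pi, w)[None, :, None]
    bias = bias_amplitude * np.sin(yy) * np.cos(xx) * rng.normal(1.0, 0.2, size=(1, 1, c))
    return np.clip(shot + read + bias, 0.0, 255.0).astype(np.float32)

noisy = synthesize_noisy(np.full((64, 64, 4), 30.0, dtype=np.float32))
```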
{"title":"MSFA Image Denoising Using Physics-Based Noise Model and Noise-Decoupled Network","authors":"Yuqi Jiang;Ying Fu;Qiankun Liu;Jun Zhang","doi":"10.1109/TPAMI.2025.3610243","DOIUrl":"10.1109/TPAMI.2025.3610243","url":null,"abstract":"Multispectral filter array (MSFA) camera is increasingly used due to its compact size and fast capturing speed. However, because of its narrow-band property, it often suffers from the light-deficient problem, and images captured are easily overwhelmed by noise. As a type of commonly used denoising method, neural networks have shown their power to achieve satisfactory denoising results. However, their performance highly depends on high-quality noisy-clean image pairs. For the task of MSFA image denoising, there is currently neither a paired real dataset nor an accurate noise model capable of generating realistic noisy images. To this end, we present a physics-based noise model that is capable to match the real noise distribution and synthesize realistic noisy images. In our noise model, those different types of noise can be divided into <italic>SimpleDist</i> component and <italic>ComplexDist</i> component. The former contains all the types of noise that can be described using a simple probability distribution like Gaussian or Poisson distribution, and the latter contains the complicated color bias noise that cannot be modeled using a simple probability distribution. Besides, we design a noise-decoupled network consisting of a SimpleDist noise removal network (SNRNet) and a ComplexDist noise removal network (CNRNet) to sequentially remove each component. Moreover, according to the non-uniformity of color bias noise in our noise model, we introduce a learnable position embedding in CNRNet to indicate the position information. To verify the effectiveness of our physics-based noise model and noise-decoupled network, we collect a real MSFA denoising dataset with paired long-exposure clean images and short-exposure noisy images. Experiments are conducted to prove that the network trained using synthetic data generated by our noise model performs as well as trained using paired real data, and our noise-decoupled network outperforms other state-of-the-art denoising methods.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"48 1","pages":"859-875"},"PeriodicalIF":18.6,"publicationDate":"2025-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145071715","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
PlaneRecTR++: Unified Query Learning for Joint 3D Planar Reconstruction and Pose Estimation
IF 18.6 | Pub Date: 2025-09-16 | DOI: 10.1109/TPAMI.2025.3610500
Jingjia Shi;Shuaifeng Zhi;Kai Xu
The challenging task of 3D planar reconstruction from images involves several sub-tasks, including frame-wise plane detection, segmentation, parameter regression and possibly depth prediction, along with cross-frame plane correspondence and relative camera pose estimation. Previous works adopt a divide-and-conquer strategy, addressing the above sub-tasks with distinct network modules in a two-stage paradigm. Specifically, given an initial camera pose and per-frame plane predictions from the first stage, further exclusively designed modules relying on external plane correspondence labeling are applied to merge multi-view plane entities and produce a refined camera pose. Notably, existing work fails to integrate these closely related sub-tasks into a unified framework and instead addresses them separately and sequentially, which we identify as a primary source of performance limitations. Motivated by this finding and the success of query-based learning in enriching reasoning among semantic entities, in this paper we propose PlaneRecTR++, a Transformer-based architecture that, for the first time, unifies all tasks of multi-view planar reconstruction and pose estimation within a compact single-stage framework, eliminating the need for initial pose estimation and supervision of plane correspondence. Extensive quantitative and qualitative experiments demonstrate that our proposed unified learning achieves mutual benefits across sub-tasks, reaching new state-of-the-art performance on the public ScanNetv1, ScanNetv2, NYUv2-Plane, and MatterPort3D datasets.
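As a rough illustration of the unified-query idea (not the authors' implementation), the sketch below decodes a single set of learnable queries against image features and predicts every plane-level quantity, plus a pose token, from the same decoded queries; the dimensions, the 4-parameter plane encoding, and the quaternion-plus-translation pose head are all assumptions.

```python
import torch
import torch.nn as nn

class UnifiedPlaneQueryHead(nn.Module):
    def __init__(self, d_model=256, num_queries=20, num_classes=2):
        super().__init__()
        self.queries = nn.Embedding(num_queries + 1, d_model)   # last query acts as a pose token
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.cls_head = nn.Linear(d_model, num_classes)    # plane / no-plane
        self.param_head = nn.Linear(d_model, 4)            # plane normal + offset (assumed encoding)
        self.mask_embed = nn.Linear(d_model, d_model)      # dotted with pixel features for masks
        self.pose_head = nn.Linear(d_model, 7)             # quaternion + translation (assumed)

    def forward(self, pixel_feats):                        # pixel_feats: (B, HW, d_model)
        b = pixel_feats.shape[0]
        q = self.queries.weight.unsqueeze(0).expand(b, -1, -1)
        q = self.decoder(q, pixel_feats)                   # all sub-tasks share these decoded queries
        plane_q, pose_q = q[:, :-1], q[:, -1]
        masks = torch.einsum("bqd,bpd->bqp", self.mask_embed(plane_q), pixel_feats)
        return self.cls_head(plane_q), self.param_head(plane_q), masks, self.pose_head(pose_q)

logits, params, masks, pose = UnifiedPlaneQueryHead()(torch.randn(2, 48 * 64, 256))
```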
{"title":"PlaneRecTR++: Unified Query Learning for Joint 3D Planar Reconstruction and Pose Estimation","authors":"Jingjia Shi;Shuaifeng Zhi;Kai Xu","doi":"10.1109/TPAMI.2025.3610500","DOIUrl":"10.1109/TPAMI.2025.3610500","url":null,"abstract":"The challenging task of 3D planar reconstruction from images involves several sub-tasks including frame-wise plane detection, segmentation, parameter regression and possibly depth prediction, along with cross-frame plane correspondence and relative camera pose estimation. Previous works adopt a divide and conquer strategy, addressing above sub-tasks with distinct network modules in a two-stage paradigm. Specifically, given an initial camera pose and per-frame plane predictions from the first stage, further exclusively designed modules relying on external plane correspondence labeling are applied to merge multi-view plane entities and produce refined camera pose. Notably, existing work fails to integrate these closely related sub-tasks into a unified framework, and instead addresses them separately and sequentially, which we identify as a primary source of performance limitations. Motivated by this finding and the success of query-based learning in enriching reasoning among semantic entities, in this paper, we propose PlaneRecTR++, a Transformer-based architecture, which for the first time unifies all tasks of multi-view planar reconstruction and pose estimation within a compact single-stage framework, eliminating the need for the initial pose estimation and supervision of plane correspondence. Extensive quantitative and qualitative experiments demonstrate that our proposed unified learning achieves mutual benefits across sub-tasks, achieving a new state-of-the-art performance on the public ScanNetv1, ScanNetv2, NYUv2-Plane, and MatterPort3D datasets.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"48 1","pages":"962-981"},"PeriodicalIF":18.6,"publicationDate":"2025-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145071895","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Open-CRB: Toward Open World Active Learning for 3D Object Detection
IF 18.6 | Pub Date: 2025-09-12 | DOI: 10.1109/TPAMI.2025.3575756
Zhuoxiao Chen;Yadan Luo;Zixin Wang;Zijian Wang;Zi Huang
LiDAR-based 3D object detection has recently seen significant advancements through active learning (AL), attaining satisfactory performance by training on a small fraction of strategically selected point clouds. However, in real-world deployments where streaming point clouds may include unknown or novel objects, the ability of current AL methods to capture such objects remains unexplored. This paper investigates a more practical and challenging research task: Open World Active Learning for 3D Object Detection (OWAL-3D), aimed at acquiring informative point clouds containing new concepts. To tackle this challenge, we propose a simple yet effective strategy called Open Label Conciseness (OLC), which mines novel 3D objects with minimal annotation cost. Our empirical results show that OLC successfully adapts the 3D detection model to the open-world scenario with just a single round of selection. Any generic AL policy can then be integrated with the proposed OLC to efficiently address the OWAL-3D problem. Based on this, we introduce the Open-CRB framework, which seamlessly integrates OLC with our preliminary AL method, CRB, designed specifically for 3D object detection. We develop a comprehensive codebase for easy reproduction and future research, supporting 15 baseline methods (i.e., active learning, out-of-distribution detection and open world detection), 2 types of modern 3D detectors (i.e., the one-stage SECOND and the two-stage PV-RCNN) and 3 benchmark 3D datasets (i.e., KITTI, nuScenes and Waymo). Extensive experiments show that the proposed Open-CRB demonstrates superiority and flexibility in recognizing both novel and known classes at very limited labeling cost, compared to state-of-the-art baselines.
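For context, one acquisition round in this setting looks roughly like the sketch below. The scoring function here is a plain predictive-entropy placeholder, not the paper's OLC criterion, and `detector` is an assumed callable returning per-box class probabilities for a point cloud.

```python
import numpy as np

def acquisition_score(box_class_probs):
    """box_class_probs: (num_boxes, num_classes) softmax scores for one point cloud."""
    if len(box_class_probs) == 0:
        return 0.0
    entropy = -(box_class_probs * np.log(box_class_probs + 1e-12)).sum(axis=1)
    return float(entropy.mean())   # placeholder for the OLC novelty criterion

def select_for_annotation(unlabeled_pool, detector, budget=100):
    """unlabeled_pool: dict mapping point-cloud id -> raw point cloud."""
    scores = {pc_id: acquisition_score(detector(pc)) for pc_id, pc in unlabeled_pool.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:budget]         # these point clouds go to the annotator before retraining
```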
{"title":"Open-CRB: Toward Open World Active Learning for 3D Object Detection","authors":"Zhuoxiao Chen;Yadan Luo;Zixin Wang;Zijian Wang;Zi Huang","doi":"10.1109/TPAMI.2025.3575756","DOIUrl":"https://doi.org/10.1109/TPAMI.2025.3575756","url":null,"abstract":"LiDAR-based 3D object detection has recently seen significant advancements through active learning (AL), attaining satisfactory performance by training on a small fraction of strategically selected point clouds. However, in real-world deployments where streaming point clouds may include unknown or novel objects, the ability of current AL methods to capture such objects remains unexplored. This paper investigates a more practical and challenging research task: Open World Active Learning for 3D Object Detection (OWAL-3D), aimed at acquiring informative point clouds with new concepts. To tackle this challenge, we propose a simple yet effective strategy called Open Label Conciseness (OLC), which mines novel 3D objects with minimal annotation costs. Our empirical results show that OLC successfully adapts the 3D detection model to the open world scenario with just a single round of selection. Any generic AL policy can then be integrated with the proposed OLC to efficiently address the OWAL-3D problem. Based on this, we introduce the Open-CRB framework, which seamlessly integrates OLC with our preliminary AL method, CRB, designed specifically for 3D object detection. We develop a comprehensive codebase for easy reproducing and future research, supporting 15 baseline methods (i.e., active learning, out-of-distribution detection and open world detection), 2 types of modern 3D detectors (i.e., one-stage SECOND and two-stage PV-RCNN) and 3 benchmark 3D datasets (i.e., KITTI, nuScenes and Waymo). Extensive experiments evidence that the proposed Open-CRB demonstrates superiority and flexibility in recognizing both novel and known classes with very limited labeling costs, compared to state-of-the-art baselines.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 10","pages":"8336-8350"},"PeriodicalIF":18.6,"publicationDate":"2025-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145036795","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
M$^{3}$D: A Multimodal, Multilingual and Multitask Dataset for Grounded Document-Level Information Extraction
IF 18.6 | Pub Date: 2025-09-11 | DOI: 10.1109/TPAMI.2025.3609288
Jiang Liu;Bobo Li;Xinran Yang;Na Yang;Hao Fei;Mingyao Zhang;Fei Li;Donghong Ji
Multimodal information extraction (IE) tasks have attracted increasing attention because many studies have shown that multimodal information benefits text information extraction. However, existing multimodal IE datasets mainly focus on sentence-level image-facilitated IE in English text, and pay little attention to video-based multimodal IE and fine-grained visual grounding. Therefore, to promote the development of multimodal IE, we construct a multimodal, multilingual and multitask dataset, named M$^{3}$D, which has the following features: (1) it contains paired document-level text and video to enrich multimodal information; (2) it supports two widely used languages, namely English and Chinese; and (3) it includes more multimodal IE tasks such as entity recognition, entity chain extraction, relation extraction and visual grounding. In addition, our dataset introduces an unexplored theme, i.e., biography, enriching the domains of multimodal IE resources. To establish a benchmark for our dataset, we propose an innovative hierarchical multimodal IE model. This model effectively leverages and integrates multimodal information through a Denoised Feature Fusion Module (DFFM). Furthermore, in non-ideal scenarios, modal information is often incomplete; we therefore design a Missing Modality Construction Module (MMCM) to alleviate the issues caused by missing modalities. Our model achieves average performance of 53.80% and 53.77% on four tasks in the English and Chinese datasets, respectively, setting a reasonable standard for subsequent research. In addition, we conduct further analytical experiments to verify the effectiveness of the proposed modules. We believe that our work can promote the development of the field of multimodal IE.
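The abstract does not detail the DFFM architecture, so the sketch below is only a generic gated text-video fusion standing in for it: a learned gate decides, per feature dimension, how much of the (potentially noisy) visual stream to add to the text representation. All layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, d_text=768, d_video=512, d_out=768):
        super().__init__()
        self.proj_t = nn.Linear(d_text, d_out)
        self.proj_v = nn.Linear(d_video, d_out)
        self.gate = nn.Sequential(nn.Linear(2 * d_out, d_out), nn.Sigmoid())

    def forward(self, text_feats, video_feats):
        t, v = self.proj_t(text_feats), self.proj_v(video_feats)
        g = self.gate(torch.cat([t, v], dim=-1))   # per-dimension weight on the visual stream
        return t + g * v                           # down-weight (denoise) irrelevant visual signal

fused = GatedFusion()(torch.randn(4, 10, 768), torch.randn(4, 10, 512))
```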
{"title":"M$^{3}$3D: A Multimodal, Multilingual and Multitask Dataset for Grounded Document-Level Information Extraction","authors":"Jiang Liu;Bobo Li;Xinran Yang;Na Yang;Hao Fei;Mingyao Zhang;Fei Li;Donghong Ji","doi":"10.1109/TPAMI.2025.3609288","DOIUrl":"10.1109/TPAMI.2025.3609288","url":null,"abstract":"Multimodal information extraction (IE) tasks have attracted increasing attention because many studies have shown that multimodal information benefits text information extraction. However, existing multimodal IE datasets mainly focus on sentence-level image-facilitated IE in English text, and pay little attention to video-based multimodal IE and fine-grained visual grounding. Therefore, in order to promote the development of multimodal IE, we constructed a multimodal multilingual multitask dataset, named M<inline-formula><tex-math>$^{3}$</tex-math></inline-formula>D, which has the following features: (1) It contains paired document-level text and video to enrich multimodal information; (2) It supports two widely-used languages, namely English and Chinese; (3) It includes more multimodal IE tasks such as entity recognition, entity chain extraction, relation extraction and visual grounding. In addition, our dataset introduces an unexplored theme, i.e., biography, enriching the domains of multimodal IE resources. To establish a benchmark for our dataset, we propose an innovative hierarchical multimodal IE model. This model effectively leverages and integrates multimodal information through a Denoised Feature Fusion Module (DFFM). Furthermore, in non-ideal scenarios, modal information is often incomplete. Thus, we designed a Missing Modality Construction Module (MMCM) to alleviate the issues caused by missing modalities. Our model achieved an average performance of 53.80% and 53.77% on four tasks in English and Chinese datasets, respectively, which set a reasonable standard for subsequent research. In addition, we conducted more analytical experiments to verify the effectiveness of our proposed module. We believe that our work can promote the development of the field of multimodal IE.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"48 1","pages":"807-823"},"PeriodicalIF":18.6,"publicationDate":"2025-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145035494","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Towards Real Zero-Shot Camouflaged Object Segmentation Without Camouflaged Annotations
IF 18.6 | Pub Date: 2025-09-10 | DOI: 10.1109/TPAMI.2025.3600461
Cheng Lei;Jie Fan;Xinran Li;Tian-Zhu Xiang;Ao Li;Ce Zhu;Le Zhang
Camouflaged Object Segmentation (COS) faces significant challenges due to the scarcity of annotated data, where meticulous pixel-level annotation is both labor-intensive and costly, primarily due to the intricate object-background boundaries. Addressing the core question, “Can COS be effectively achieved in a zero-shot manner without manual annotations for any camouflaged object?”, we propose an affirmative solution. We examine the learned attention patterns for camouflaged objects and introduce a robust zero-shot COS framework. Our findings reveal that while transformer models for salient object segmentation (SOS) prioritize global features in their attention mechanisms, camouflaged object segmentation exhibits both global and local attention biases. Based on these findings, we design a framework that adapts to the inherent local pattern bias of COS while incorporating global attention patterns and a broad semantic feature space derived from SOS. This enables efficient zero-shot transfer for COS. Specifically, we incorporate a Masked Image Modeling (MIM) based image encoder optimized for Parameter-Efficient Fine-Tuning (PEFT), a Multimodal Large Language Model (M-LLM), and a Multi-scale Fine-grained Alignment (MFA) mechanism. The MIM encoder captures essential local features, while the PEFT module learns global and semantic representations from SOS datasets. To further enhance semantic granularity, we leverage the M-LLM to generate caption embeddings conditioned on visual cues, which are meticulously aligned with multi-scale visual features via MFA. This alignment enables precise interpretation of complex semantic contexts. Moreover, we introduce a learnable codebook to represent the M-LLM during inference, significantly reducing computational demands while maintaining performance. Our framework demonstrates its versatility and efficacy through rigorous experimentation, achieving state-of-the-art performance in zero-shot COS with $F_{\beta }^{w}$ scores of 72.9% on CAMO and 71.7% on COD10K. By removing the M-LLM during inference, we achieve an inference speed comparable to that of traditional end-to-end models, reaching 18.1 FPS. Additionally, our method excels in polyp segmentation and underwater scene segmentation, outperforming challenging baselines in both zero-shot and supervised settings, indicating its potential across various segmentation tasks.
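The learnable codebook that replaces the M-LLM at inference could look roughly like the sketch below, which is an assumption rather than the paper's design: visual features attend over a small set of learned code vectors to produce a pseudo caption embedding for the alignment stage. The class name, code count, and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class CaptionCodebook(nn.Module):
    def __init__(self, num_codes=64, d_model=512):
        super().__init__()
        self.codes = nn.Parameter(torch.randn(num_codes, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)

    def forward(self, visual_feats):                     # (B, N, d_model) patch features
        query = visual_feats.mean(dim=1, keepdim=True)   # pooled visual query
        codes = self.codes.unsqueeze(0).expand(visual_feats.shape[0], -1, -1)
        pseudo_caption, _ = self.attn(query, codes, codes)
        return pseudo_caption.squeeze(1)                 # stands in for the M-LLM caption embedding

emb = CaptionCodebook()(torch.randn(2, 196, 512))
```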
{"title":"Towards Real Zero-Shot Camouflaged Object Segmentation Without Camouflaged Annotations","authors":"Cheng Lei;Jie Fan;Xinran Li;Tian-Zhu Xiang;Ao Li;Ce Zhu;Le Zhang","doi":"10.1109/TPAMI.2025.3600461","DOIUrl":"10.1109/TPAMI.2025.3600461","url":null,"abstract":"Camouflaged Object Segmentation (COS) faces significant challenges due to the scarcity of annotated data, where meticulous pixel-level annotation is both labor-intensive and costly, primarily due to the intricate object-background boundaries. Addressing the core question, “Can COS be effectively achieved in a zero-shot manner without manual annotations for any camouflaged object?”, we propose an affirmative solution. We examine the learned attention patterns for camouflaged objects and introduce a robust zero-shot COS framework. Our findings reveal that while transformer models for salient object segmentation (SOS) prioritize global features in their attention mechanisms, camouflaged object segmentation exhibits both global and local attention biases. Based on these findings, we design a framework that adapts with the inherent local pattern bias of COS while incorporating global attention patterns and a broad semantic feature space derived from SOS. This enables efficient zero-shot transfer for COS. Specifically, We incorporate a Masked Image Modeling (MIM) based image encoder optimized for Parameter-Efficient Fine-Tuning (PEFT), a Multimodal Large Language Model (M-LLM), and a Multi-scale Fine-grained Alignment (MFA) mechanism. The MIM encoder captures essential local features, while the PEFT module learns global and semantic representations from SOS datasets. To further enhance semantic granularity, we leverage the M-LLM to generate caption embeddings conditioned on visual cues, which are meticulously aligned with multi-scale visual features via MFA. This alignment enables precise interpretation of complex semantic contexts. Moreover, we introduce a learnable codebook to represent the M-LLM during inference, significantly reducing computational demands while maintaining performance. Our framework demonstrates its versatility and efficacy through rigorous experimentation, achieving state-of-the-art performance in zero-shot COS with <inline-formula><tex-math>$F_{beta }^{w}$</tex-math></inline-formula> scores of 72.9% on CAMO and 71.7% on COD10K. By removing the M-LLM during inference, we achieve an inference speed comparable to that of traditional end-to-end models, reaching 18.1 FPS. Additionally, our method excels in polyp segmentation, and underwater scene segmentation, outperforming challenging baselines in both zero-shot and supervised settings, thereby implying its potentiality in various segmentation tasks.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 12","pages":"11990-12004"},"PeriodicalIF":18.6,"publicationDate":"2025-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145034509","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
StylizedGS: Controllable Stylization for 3D Gaussian Splatting
IF 18.6 | Pub Date: 2025-08-28 | DOI: 10.1109/TPAMI.2025.3604010
Dingxi Zhang;Yu-Jie Yuan;Zhuoxun Chen;Fang-Lue Zhang;Zhenliang He;Shiguang Shan;Lin Gao
As XR technology continues to advance rapidly, 3D generation and editing are increasingly crucial. Among these, stylization plays a key role in enhancing the appearance of 3D models. By utilizing stylization, users can achieve consistent artistic effects in 3D editing from a single reference style image, making it a user-friendly editing method. However, recent NeRF-based 3D stylization methods encounter efficiency issues that impact the user experience, and their implicit nature limits their ability to accurately transfer geometric pattern styles. Additionally, the ability for artists to apply flexible control over stylized scenes is considered highly desirable to foster an environment conducive to creative exploration. To address the above issues, we introduce StylizedGS, an efficient 3D neural style transfer framework with adaptable control over perceptual factors based on the 3D Gaussian Splatting representation. We propose a filter-based refinement to eliminate floaters that affect the stylization results in the scene reconstruction process. A nearest-neighbor-based style loss is introduced to achieve stylization by fine-tuning the geometry and color parameters of 3DGS, while a depth preservation loss with other regularizations is proposed to prevent the geometry content from being tampered with. Moreover, aided by specially designed losses, StylizedGS enables users to control the color, stylization scale, and stylized regions during stylization, offering customization capabilities. Our method achieves high-quality stylization results characterized by faithful brushstrokes and geometric consistency with flexible controls. Extensive experiments across various scenes and styles demonstrate the effectiveness and efficiency of our method in terms of both stylization quality and inference speed.
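A nearest-neighbor style loss of the kind used to fine-tune the 3DGS color and geometry parameters can be sketched as below: each feature of a rendered view is matched to its closest style-image feature under cosine distance. Feature extraction (e.g., from a pretrained VGG) and the loss weighting are omitted and assumed.

```python
import torch
import torch.nn.functional as F

def nn_style_loss(render_feats, style_feats):
    """render_feats: (N, D) features of the rendered view; style_feats: (M, D) style features."""
    r = F.normalize(render_feats, dim=1)
    s = F.normalize(style_feats, dim=1)
    cos_dist = 1.0 - r @ s.t()            # (N, M) pairwise cosine distances
    nearest, _ = cos_dist.min(dim=1)      # best-matching style feature for each rendered feature
    return nearest.mean()

loss = nn_style_loss(torch.randn(1024, 256), torch.randn(4096, 256))
```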
{"title":"StylizedGS: Controllable Stylization for 3D Gaussian Splatting","authors":"Dingxi Zhang;Yu-Jie Yuan;Zhuoxun Chen;Fang-Lue Zhang;Zhenliang He;Shiguang Shan;Lin Gao","doi":"10.1109/TPAMI.2025.3604010","DOIUrl":"10.1109/TPAMI.2025.3604010","url":null,"abstract":"As XR technology continues to advance rapidly, 3D generation and editing are increasingly crucial. Among these, stylization plays a key role in enhancing the appearance of 3D models. By utilizing stylization, users can achieve consistent artistic effects in 3D editing using a single reference style image, making it a user-friendly editing method. However, recent NeRF-based 3D stylization methods encounter efficiency issues that impact the user experience, and their implicit nature limits their ability to accurately transfer geometric pattern styles. Additionally, the ability for artists to apply flexible control over stylized scenes is considered highly desirable to foster an environment conducive to creative exploration. To address the above issues, we introduce StylizedGS, an efficient 3D neural style transfer framework with adaptable control over perceptual factors based on 3D Gaussian Splatting representation. We propose a filter-based refinement to eliminate floaters that affect the stylization effects in the scene reconstruction process. The nearest neighbor-based style loss is introduced to achieve stylization by fine-tuning the geometry and color parameters of 3DGS, while a depth preservation loss with other regularizations is proposed to prevent the tampering of geometry content. Moreover, facilitated by specially designed losses, StylizedGS enables users to control color, stylized scale, and regions during the stylization to possess customization capabilities. Our method achieves high-quality stylization results characterized by faithful brushstrokes and geometric consistency with flexible controls. Extensive experiments across various scenes and styles demonstrate the effectiveness and efficiency of our method concerning both stylization quality and inference speed.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 12","pages":"11961-11973"},"PeriodicalIF":18.6,"publicationDate":"2025-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144915471","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Aligning Logits Generatively for Principled Black-Box Knowledge Distillation in the Wild
IF 18.6 | Pub Date: 2025-08-25 | DOI: 10.1109/TPAMI.2025.3602663
Xiang Xiang;Jing Ma;Dongrui Wu;Zhigang Zeng;Xilin Chen
Black-Box Knowledge Distillation (B2KD) is a conservative task in cloud-to-edge model compression, emphasizing the protection of data privacy and model copyrights on both the cloud and the edge. With invisible data and models hosted on the server, B2KD aims to utilize only API queries of the teacher model’s inference results in the cloud to effectively distill a lightweight student model deployed on edge devices. B2KD faces challenges such as limited Internet exchange and edge-cloud disparity in data distribution. To address these issues, we theoretically provide a new optimization direction from logits to cell boundary, different from direct logits alignment, and formalize a workflow comprising deprivatization, distillation, and adaptation at test time. Guided by this, we propose a method, Mapping-Emulation KD (MEKD), to enhance the robust prediction and anti-interference capabilities of the student model on edge devices for any unknown data distribution in real-world scenarios. Our method does not differentiate between treating soft or hard responses and consists of: 1) deprivatization: emulating the inverse mapping of the teacher function with a generator; 2) distillation: aligning the low-dimensional logits of the teacher and student models by reducing the distance of high-dimensional image points; and 3) adaptation: correcting the student’s online prediction bias through a graph propagation-based, forward-only test-time adaptation algorithm. Our method demonstrates strong performance for edge model distillation and adaptation across different teacher-student pairs. We validate the effectiveness of our method on multiple image recognition benchmarks and various deep neural network models, achieving state-of-the-art performance and showcasing its practical value in remote sensing image recognition applications.
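For concreteness, the black-box constraint means the edge side only ever sees the teacher's returned logits. The sketch below shows one generic distillation step under that constraint; `query_teacher_api` is a placeholder for the remote call, and the generator-based deprivatization and graph-propagation test-time adaptation stages of MEKD are not reproduced here.

```python
import torch
import torch.nn.functional as F

def distill_step(student, images, query_teacher_api, optimizer, temperature=4.0):
    teacher_logits = query_teacher_api(images)        # remote API call; no gradients flow back
    student_logits = student(images)
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * temperature ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# toy usage with stand-ins for the real cloud API and edge model
student = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
opt = torch.optim.SGD(student.parameters(), lr=0.01)
fake_api = lambda x: torch.randn(x.shape[0], 10)      # placeholder teacher
distill_step(student, torch.randn(8, 3, 32, 32), fake_api, opt)
```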
{"title":"Aligning Logits Generatively for Principled Black-Box Knowledge Distillation in the Wild","authors":"Xiang Xiang;Jing Ma;Dongrui Wu;Zhigang Zeng;Xilin Chen","doi":"10.1109/TPAMI.2025.3602663","DOIUrl":"10.1109/TPAMI.2025.3602663","url":null,"abstract":"Black-Box Knowledge Distillation (B2KD) is a conservative task in cloud-to-edge model compression, emphasizing the protection of data privacy and model copyrights on both the cloud and edge. With invisible data and models hosted on the server, B2KD aims to utilize only the API queries of the teacher model’s inference results in the cloud to effectively distill a lightweight student model deployed on edge devices. B2KD faces challenges such as limited Internet exchange and edge-cloud disparity in data distribution. To address these issues, we theoretically provide a new optimization direction from logits to cell boundary, different from direct logits alignment, and formalize a workflow comprising deprivatization, distillation, and adaptation at test time. Guided by this, we propose a method, Mapping-Emulation KD (MEKD), to enhance the robust prediction and anti-interference capabilities of the student model on edge devices for any unknown data distribution in real-world scenarios. Our method does not differentiate between treating soft or hard responses and consists of: 1) deprivatization: emulating the inverse mapping of the teacher function with a generator, 2) distillation: aligning low-dimensional logits of the teacher and student models by reducing the distance of high-dimensional image points, and 3) adaptation: correcting the student’s online prediction bias through a graph propagation-based only-forward test-time adaptation algorithm. Our method demonstrates inspiring performance for edge model distillation and adaptation across different teacher-student pairs. We validate the effectiveness of our method on multiple image recognition benchmarks and various Deep Neural Network models, achieving state-of-the-art performance and showcasing its practical value in remote sensing image recognition applications.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 12","pages":"11929-11945"},"PeriodicalIF":18.6,"publicationDate":"2025-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144900423","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
VQ-FedDiff: Federated Learning Algorithm of Diffusion Models With Client-Specific Vector-Quantized Conditioning
IF 18.6 | Pub Date: 2025-08-22 | DOI: 10.1109/TPAMI.2025.3602282
Tehrim Yoon;Minyoung Hwang;Eunho Yang
Modern generative models, particularly denoising diffusion probabilistic models (DDPMs), produce high-quality synthetic images, enabling users to generate diverse, realistic images and videos. However, in a number of situations, edge devices or individual institutions may possess locally collected data that is highly sensitive and must remain private, such as in healthcare and finance. Under such federated learning (FL) settings, various methods for training generative models have been studied, but most of them assume generative adversarial networks (GANs), and the algorithms are specific to GANs rather than other forms of generative models such as DDPMs. This paper proposes a new algorithm for training DDPMs under federated learning settings, VQ-FedDiff, which provides a personalized algorithm for training diffusion models that generates higher-quality images (in terms of FID) while keeping the risk of breaching sensitive information as low as that of locally trained secure models. We demonstrate that VQ-FedDiff shows state-of-the-art performance compared with existing federated learning of diffusion models in both IID and non-IID settings, and on benchmark photorealistic and medical image datasets. Our results show that diffusion models can efficiently learn from decentralized, sensitive data, generating high-quality images while preserving data privacy.
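A minimal sketch of one federated round under this setting is given below, assuming a FedAvg-style aggregation: the shared diffusion backbone is weight-averaged across clients, while each client's vector-quantized conditioning codebook is kept outside the aggregated state and therefore stays local. Function and variable names are illustrative.

```python
import copy
import torch
import torch.nn as nn

def fedavg_round(global_state, client_models, client_sizes, local_train_fn):
    """client_models share the backbone architecture; each keeps its own
    (never-aggregated) VQ conditioning codebook outside this backbone."""
    states = []
    for model in client_models:
        model.load_state_dict(copy.deepcopy(global_state))   # start from the shared backbone
        local_train_fn(model)                                 # standard local DDPM training
        states.append(model.state_dict())
    total = float(sum(client_sizes))
    return {k: sum(s[k].float() * (n / total) for s, n in zip(states, client_sizes))
            for k in global_state}

# toy usage: two clients, a trivial "backbone", and a no-op local training step
make_backbone = lambda: nn.Linear(4, 4)
clients = [make_backbone(), make_backbone()]
new_state = fedavg_round(make_backbone().state_dict(), clients, [100, 300], lambda m: None)
```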
{"title":"VQ-FedDiff: Federated Learning Algorithm of Diffusion Models With Client-Specific Vector-Quantized Conditioning","authors":"Tehrim Yoon;Minyoung Hwang;Eunho Yang","doi":"10.1109/TPAMI.2025.3602282","DOIUrl":"10.1109/TPAMI.2025.3602282","url":null,"abstract":"Modern generative models, particularly denoising diffusion probabilistic models (DDPMs), provide high-quality synthetic images, enabling users to generate diverse images and videos that are realistic. However, in a number of situations, edge devices or individual institutions may possess locally collected data that is highly sensitive and should ensure data privacy, such as in the field of healthcare and finance. Under such federated learning (FL) settings, various methods on training generative models have been studied, but most of them assume generative adversarial networks (GANs), and the algorithms are specific to GANs and not other forms of generative models such as DDPM. This paper proposes a new algorithm for training DDPMs under federated learning settings, VQ-FedDiff, which provides a personalized algorithm for training diffusion models that can generate higher-quality images FID while still keeping risk of breaching sensitive information as low as locally-trained secure models. We demonstrate that VQ-FedDiff shows state-of-the-art performance on existing federated learning of diffusion models in both IID and non-IID settings, and in benchmark photorealistic and medical image datasets. Our results show that diffusion models can efficiently learn with decentralized, sensitive data, generating high-quality images while preserving data privacy.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 12","pages":"11863-11873"},"PeriodicalIF":18.6,"publicationDate":"2025-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144900424","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Unified Modality Separation: A Vision-Language Framework for Unsupervised Domain Adaptation
IF 18.6 | Pub Date: 2025-08-22 | DOI: 10.1109/TPAMI.2025.3597436
Xinyao Li;Jingjing Li;Zhekai Du;Lei Zhu;Heng Tao Shen
Unsupervised domain adaptation (UDA) enables models trained on a labeled source domain to handle new unlabeled domains. Recently, pre-trained vision-language models (VLMs) have demonstrated promising zero-shot performance by leveraging semantic information to facilitate target tasks. By aligning vision and text embeddings, VLMs have shown notable success in bridging domain gaps. However, inherent differences naturally exist between modalities, known as the modality gap. Our findings reveal that direct UDA in the presence of the modality gap transfers only modality-invariant knowledge, leading to suboptimal target performance. To address this limitation, we propose a unified modality separation framework that accommodates both modality-specific and modality-invariant components. During training, different modality components are disentangled from VLM features and then handled separately in a unified manner. At test time, modality-adaptive ensemble weights are automatically determined to maximize the synergy of the different components. To evaluate instance-level modality characteristics, we design a modality discrepancy metric to categorize samples into modality-invariant, modality-specific, and uncertain ones. The modality-invariant samples are exploited to facilitate cross-modal alignment, while uncertain ones are annotated to enhance model capabilities. Building upon prompt tuning techniques, our methods achieve up to a 9% performance gain while being 9 times more computationally efficient. Extensive experiments and analysis across various backbones, baselines, datasets and adaptation settings demonstrate the efficacy of our design.
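An instance-level modality discrepancy score of the kind described above can be sketched as follows, assuming CLIP-like embeddings: compare each image embedding with the text embedding of its pseudo-label and bucket samples by threshold. The metric, thresholds, and category encoding here are illustrative assumptions, not the paper's exact definitions.

```python
import torch
import torch.nn.functional as F

def modality_discrepancy(image_emb, text_emb_of_pseudo_label):
    """Both inputs: (B, D) embeddings from a CLIP-like vision-language model."""
    img = F.normalize(image_emb, dim=1)
    txt = F.normalize(text_emb_of_pseudo_label, dim=1)
    return 1.0 - (img * txt).sum(dim=1)        # low value -> modality-invariant sample

def categorize(disc, low=0.3, high=0.6):
    # 0 = modality-invariant, 1 = modality-specific, 2 = uncertain (thresholds illustrative)
    labels = torch.full_like(disc, 2, dtype=torch.long)
    labels[disc < high] = 1
    labels[disc < low] = 0
    return labels

labels = categorize(modality_discrepancy(torch.randn(5, 512), torch.randn(5, 512)))
```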
{"title":"Unified Modality Separation: A Vision-Language Framework for Unsupervised Domain Adaptation","authors":"Xinyao Li;Jingjing Li;Zhekai Du;Lei Zhu;Heng Tao Shen","doi":"10.1109/TPAMI.2025.3597436","DOIUrl":"10.1109/TPAMI.2025.3597436","url":null,"abstract":"Unsupervised domain adaptation (UDA) enables models trained on a labeled source domain to handle new unlabeled domains. Recently, pre-trained vision-language models (VLMs) have demonstrated promising zero-shot performance by leveraging semantic information to facilitate target tasks. By aligning vision and text embeddings, VLMs have shown notable success in bridging domain gaps. However, inherent differences naturally exist between modalities, which is known as <italic>modality gap</i>. Our findings reveal that direct UDA with the presence of modality gap only transfers modality-invariant knowledge, leading to suboptimal target performance. To address this limitation, we propose a unified modality separation framework that accommodates both modality-specific and modality-invariant components. During training, different modality components are disentangled from VLM features then handled separately in a unified manner. At test time, modality-adaptive ensemble weights are automatically determined to maximize the synergy of different components. To evaluate instance-level modality characteristics, we design a modality discrepancy metric to categorize samples into modality-invariant, modality-specific, and uncertain ones. The modality-invariant samples are exploited to facilitate cross-modal alignment, while uncertain ones are annotated to enhance model capabilities. Building upon prompt tuning techniques, our methods achieve up to 9% performance gain with 9 times of computational efficiencies. Extensive experiments and analysis across various backbones, baselines, datasets and adaptation settings demonstrate the efficacy of our design.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 11","pages":"10604-10618"},"PeriodicalIF":18.6,"publicationDate":"2025-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144900428","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Consistent and Optimal Solution to Camera Motion Estimation
IF 18.6 | Pub Date: 2025-08-21 | DOI: 10.1109/TPAMI.2025.3601430
Guangyang Zeng;Qingcheng Zeng;Xinghan Li;Biqiang Mu;Jiming Chen;Ling Shi;Junfeng Wu
Given 2D point correspondences between an image pair, inferring the camera motion is a fundamental issue in the computer vision community. Existing works generally set out from the epipolar constraint and estimate the essential matrix, which is not optimal in the maximum likelihood (ML) sense. In this paper, we dive into the original measurement model with respect to the rotation matrix and the normalized translation vector and formulate the ML problem. We then propose an optimal two-step algorithm to solve it: in the first step, we estimate the variance of the measurement noise and devise a consistent estimator based on bias elimination; in the second step, we execute a one-step Gauss-Newton iteration on the manifold to refine the consistent estimator. We prove that the proposed estimator achieves the same asymptotic statistical properties as the ML estimator: the first is consistency, i.e., the estimator converges to the ground truth as the number of points increases; the second is asymptotic efficiency, i.e., the mean squared error of the estimator converges to the theoretical lower bound, the Cramér-Rao bound. In addition, we show that our algorithm has linear time complexity. These appealing characteristics give our estimator a great advantage in the case of dense point correspondences. Experiments on both synthetic data and real images demonstrate that when the number of points reaches the order of hundreds, our estimator outperforms state-of-the-art methods in terms of estimation accuracy and CPU time.
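For reference, the essential-matrix baseline that the paper improves upon can be sketched as the classical linear (8-point) solution of the epipolar constraint $x_2^\top E x_1 = 0$, followed by projection onto the essential manifold. The paper's consistent estimator and the one-step Gauss-Newton refinement on the manifold are not reproduced here.

```python
import numpy as np

def estimate_essential(x1, x2):
    """x1, x2: (N, 2) correspondences in normalized camera coordinates, N >= 8."""
    u1, v1 = x1[:, 0], x1[:, 1]
    u2, v2 = x2[:, 0], x2[:, 1]
    # each row encodes x2^T E x1 = 0 for one correspondence (E read row-major)
    A = np.stack([u2 * u1, u2 * v1, u2, v2 * u1, v2 * v1, v2, u1, v1, np.ones_like(u1)], axis=1)
    _, _, Vt = np.linalg.svd(A)
    E = Vt[-1].reshape(3, 3)                 # least-squares null vector of A
    # project onto the essential manifold: two equal singular values, one zero
    U, S, Vt = np.linalg.svd(E)
    s = (S[0] + S[1]) / 2.0
    return U @ np.diag([s, s, 0.0]) @ Vt

rng = np.random.default_rng(0)
E = estimate_essential(rng.normal(size=(20, 2)), rng.normal(size=(20, 2)))  # shape/API check only
```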
{"title":"Consistent and Optimal Solution to Camera Motion Estimation","authors":"Guangyang Zeng;Qingcheng Zeng;Xinghan Li;Biqiang Mu;Jiming Chen;Ling Shi;Junfeng Wu","doi":"10.1109/TPAMI.2025.3601430","DOIUrl":"10.1109/TPAMI.2025.3601430","url":null,"abstract":"Given 2D point correspondences between an image pair, inferring the camera motion is a fundamental issue in the computer vision community. The existing works generally set out from the epipolar constraint and estimate the essential matrix, which is not optimal in the maximum likelihood (ML) sense. In this paper, we dive into the original measurement model with respect to the rotation matrix and normalized translation vector and formulate the ML problem. We then propose an optimal two-step algorithm to solve it: In the first step, we estimate the variance of measurement noises and devise a consistent estimator based on bias elimination; In the second step, we execute a one-step Gauss-Newton iteration on manifold to refine the consistent estimator. We prove that the proposed estimator achieves the same asymptotic statistical properties as the ML estimator: The first is consistency, i.e., the estimator converges to the ground truth as the point number increases; The second is asymptotic efficiency, i.e., the mean squared error of the estimator converges to the theoretical lower bound — Cramer-Rao bound. In addition, we show that our algorithm has linear time complexity. These appealing characteristics endow our estimator with a great advantage in the case of dense point correspondences. Experiments on both synthetic data and real images demonstrate that when the point number reaches the order of hundreds, our estimator outperforms the state-of-the-art ones in terms of estimation accuracy and CPU time.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 12","pages":"12005-12020"},"PeriodicalIF":18.6,"publicationDate":"2025-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144898490","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0