Efficient Nearest Neighbor Search Using Dynamic Programming
Pub Date: 2025-09-16 | DOI: 10.1109/TPAMI.2025.3610211 | IEEE TPAMI, vol. 48, no. 1, pp. 999-1014
Pengfei Wang;Jiantao Song;Shiqing Xin;Shuangmin Chen;Changhe Tu;Wenping Wang;Jiaye Wang
Given a collection of points in $\mathbb{R}^{3}$, KD-Tree and R-Tree are well-known nearest neighbor search (NNS) structures that rely on spatial partitioning and indexing techniques. However, when the query point is far from the data points, or when the data points inherently represent a 2-manifold surface, their query performance may degrade. To address this, we propose a novel dynamic programming technique that precomputes a Directed Acyclic Graph (DAG) to encode the proximity structure between data points. More specifically, the DAG captures how the proximity structure evolves during the incremental construction of the Voronoi diagram of the data points. Experimental results demonstrate that our method achieves a 1-10x speedup. Furthermore, our algorithm offers significant practical value in diverse applications. We validated its effectiveness through extensive testing in four key applications: Point-to-Mesh Distance Queries, Iterative Closest Point (ICP) Registration, Density Peak Clustering, and Point-to-Segments Distance Queries. A particularly notable feature of our approach is its ability to efficiently identify the nearest neighbor among the first $k$ points of the point cloud, a capability that enables substantial acceleration in low-dimensional applications such as Density Peak Clustering. As a natural extension of our incremental construction process, our method can also be readily adapted to farthest-point sampling tasks. These experimental results across multiple domains underscore the broad applicability and practical importance of our approach.
{"title":"Efficient Nearest Neighbor Search Using Dynamic Programming","authors":"Pengfei Wang;Jiantao Song;Shiqing Xin;Shuangmin Chen;Changhe Tu;Wenping Wang;Jiaye Wang","doi":"10.1109/TPAMI.2025.3610211","DOIUrl":"10.1109/TPAMI.2025.3610211","url":null,"abstract":"Given a collection of points in <inline-formula><tex-math>$mathbb {R}^{3}$</tex-math></inline-formula>, KD-Tree and R-Tree are well-known nearest neighbor search (NNS) algorithms that rely on spatial partitioning and indexing techniques. However, when the query point is far from the data points or the data points inherently represent a 2-manifold surface, their query performance may degrade. To address this, we propose a novel dynamic programming technique that precomputes a Directed Acyclic Graph (DAG) to encode the proximity structure between data points. More specifically, the DAG captures how the proximity structure evolves during the incremental construction of the Voronoi diagram of the data points. Experimental results demonstrate that our method achieves a speed increase of 1-10x. Furthermore, our algorithm demonstrates significant practical value in diverse applications. We validated its effectiveness through extensive testing in four key applications: Point-to-Mesh Distance Queries, Iterative Closest Point (ICP) Registration, Density Peak Clustering, and Point-to-Segments Distance Queries. A particularly notable feature of our approach is its unique ability to efficiently identify the nearest neighbor among the first <inline-formula><tex-math>$k$</tex-math></inline-formula> points in the point cloud, a capability that enables substantial acceleration in low-dimensional applications like Density Peak Clustering. As a natural extension of our incremental construction process, our method can also be readily adapted for farthest-point sampling tasks. These experimental results across multiple domains underscore the broad applicability and practical importance of our approach.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"48 1","pages":"999-1014"},"PeriodicalIF":18.6,"publicationDate":"2025-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145072881","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Generative Causality-Driven Network for Graph Multi-Task Learning
Pub Date: 2025-09-16 | DOI: 10.1109/TPAMI.2025.3610096 | IEEE TPAMI, vol. 48, no. 1, pp. 1029-1044
Xixun Lin;Qing Yu;Yanan Cao;Lixin Zou;Chuan Zhou;Jia Wu;Chenliang Li;Peng Zhang;Shirui Pan
Multi-task learning (MTL) is a standard learning paradigm in machine learning. The central idea of MTL is to capture the knowledge shared among multiple tasks to mitigate data sparsity, where the annotated samples for each task are quite limited. Recent studies indicate that graph multi-task learning (GMTL) yields promising improvements over previous MTL methods. GMTL represents tasks on a task relation graph and further leverages graph neural networks (GNNs) to learn complex task relationships. Although GMTL achieves better performance, the construction of the task relation graph relies heavily on simple heuristics, which leads to spurious task correlations and to missing edges between strongly connected tasks. This problem largely limits the effectiveness of GMTL. To this end, we propose the Generative Causality-driven Network (GCNet), a novel framework that progressively learns the causal structure between tasks to discover which tasks benefit from being jointly trained, improving generalization ability and model robustness. Specifically, in the feature space, GCNet first introduces a feature-level generator that produces a structure prior to reduce learning difficulty. Afterwards, GCNet develops an output-level generator, parameterized as a new causal energy-based model (EBM), to refine the learned structure prior in the output space, driven by causality. Benefiting from our proposed causal framework, we theoretically derive an intervention contrastive estimation for training this causal EBM efficiently. Experiments are conducted on multiple synthetic and real-world datasets. Extensive empirical results and model analyses demonstrate the superior performance of GCNet over several competitive MTL baselines.
{"title":"Generative Causality-Driven Network for Graph Multi-Task Learning","authors":"Xixun Lin;Qing Yu;Yanan Cao;Lixin Zou;Chuan Zhou;Jia Wu;Chenliang Li;Peng Zhang;Shirui Pan","doi":"10.1109/TPAMI.2025.3610096","DOIUrl":"10.1109/TPAMI.2025.3610096","url":null,"abstract":"Multi-task learning (MTL) is a standard learning paradigm in machine learning. The central idea of MTL is to capture the shared knowledge among multiple tasks for mitigating the problem of data sparsity where the annotated samples for each task are quite limited. Recent studies indicate that graph multi-task learning (GMTL) yields the promising improvement over previous MTL methods. GMTL represents tasks on a task relation graph, and further leverages graph neural networks (GNNs) to learn complex task relationships. Although GMTL achieves the better performance, the construction of task relation graph heavily depends on simple heuristic tricks, which results in the existence of spurious task correlations and the absence of true edges between tasks with strong connections. This problem largely limits the effectiveness of GMTL. To this end, we propose the Generative Causality-driven Network (GCNet), a novel framework that progressively learns the causal structure between tasks to discover which tasks are beneficial to be jointly trained for improving generalization ability and model robustness. To be specific, in the feature space, GCNet first introduces a feature-level generator to generate the structure prior for reducing learning difficulty. Afterwards, GCNet develops a output-level generator which is parameterized as a new causal energy-based model (EBM) to refine the learned structure prior in the output space driven by causality. Benefiting from our proposed causal framework, we theoretically derive an intervention contrastive estimation for training this causal EBM efficiently. Experiments are conducted on multiple synthetic and real-world datasets. Extensive empirical results and model analyses demonstrate the superior performance of GCNet over several competitive MTL baselines.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"48 1","pages":"1029-1044"},"PeriodicalIF":18.6,"publicationDate":"2025-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145071894","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MSFA Image Denoising Using Physics-Based Noise Model and Noise-Decoupled Network
Pub Date: 2025-09-16 | DOI: 10.1109/TPAMI.2025.3610243 | IEEE TPAMI, vol. 48, no. 1, pp. 859-875
Yuqi Jiang;Ying Fu;Qiankun Liu;Jun Zhang
Multispectral filter array (MSFA) cameras are increasingly used due to their compact size and fast capture speed. However, because of their narrow-band property, they often suffer from light deficiency, and the captured images are easily overwhelmed by noise. Neural networks, a commonly used class of denoising methods, have shown their ability to achieve satisfactory denoising results. However, their performance highly depends on high-quality noisy-clean image pairs. For the task of MSFA image denoising, there is currently neither a paired real dataset nor an accurate noise model capable of generating realistic noisy images. To this end, we present a physics-based noise model that matches the real noise distribution and synthesizes realistic noisy images. In our noise model, the different types of noise are divided into a SimpleDist component and a ComplexDist component. The former contains all types of noise that can be described by a simple probability distribution, such as a Gaussian or Poisson distribution, while the latter contains the complicated color bias noise that cannot be modeled by a simple probability distribution. In addition, we design a noise-decoupled network consisting of a SimpleDist noise removal network (SNRNet) and a ComplexDist noise removal network (CNRNet) to sequentially remove each component. Moreover, to account for the non-uniformity of the color bias noise in our noise model, we introduce a learnable position embedding in CNRNet to encode position information. To verify the effectiveness of our physics-based noise model and noise-decoupled network, we collect a real MSFA denoising dataset with paired long-exposure clean images and short-exposure noisy images. Experiments show that a network trained on synthetic data generated by our noise model performs as well as one trained on paired real data, and that our noise-decoupled network outperforms other state-of-the-art denoising methods.
{"title":"MSFA Image Denoising Using Physics-Based Noise Model and Noise-Decoupled Network","authors":"Yuqi Jiang;Ying Fu;Qiankun Liu;Jun Zhang","doi":"10.1109/TPAMI.2025.3610243","DOIUrl":"10.1109/TPAMI.2025.3610243","url":null,"abstract":"Multispectral filter array (MSFA) camera is increasingly used due to its compact size and fast capturing speed. However, because of its narrow-band property, it often suffers from the light-deficient problem, and images captured are easily overwhelmed by noise. As a type of commonly used denoising method, neural networks have shown their power to achieve satisfactory denoising results. However, their performance highly depends on high-quality noisy-clean image pairs. For the task of MSFA image denoising, there is currently neither a paired real dataset nor an accurate noise model capable of generating realistic noisy images. To this end, we present a physics-based noise model that is capable to match the real noise distribution and synthesize realistic noisy images. In our noise model, those different types of noise can be divided into <italic>SimpleDist</i> component and <italic>ComplexDist</i> component. The former contains all the types of noise that can be described using a simple probability distribution like Gaussian or Poisson distribution, and the latter contains the complicated color bias noise that cannot be modeled using a simple probability distribution. Besides, we design a noise-decoupled network consisting of a SimpleDist noise removal network (SNRNet) and a ComplexDist noise removal network (CNRNet) to sequentially remove each component. Moreover, according to the non-uniformity of color bias noise in our noise model, we introduce a learnable position embedding in CNRNet to indicate the position information. To verify the effectiveness of our physics-based noise model and noise-decoupled network, we collect a real MSFA denoising dataset with paired long-exposure clean images and short-exposure noisy images. Experiments are conducted to prove that the network trained using synthetic data generated by our noise model performs as well as trained using paired real data, and our noise-decoupled network outperforms other state-of-the-art denoising methods.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"48 1","pages":"859-875"},"PeriodicalIF":18.6,"publicationDate":"2025-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145071715","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PlaneRecTR++: Unified Query Learning for Joint 3D Planar Reconstruction and Pose Estimation
Pub Date: 2025-09-16 | DOI: 10.1109/TPAMI.2025.3610500 | IEEE TPAMI, vol. 48, no. 1, pp. 962-981
Jingjia Shi;Shuaifeng Zhi;Kai Xu
The challenging task of 3D planar reconstruction from images involves several sub-tasks, including frame-wise plane detection, segmentation, parameter regression and possibly depth prediction, along with cross-frame plane correspondence and relative camera pose estimation. Previous works adopt a divide-and-conquer strategy, addressing the above sub-tasks with distinct network modules in a two-stage paradigm. Specifically, given an initial camera pose and per-frame plane predictions from the first stage, exclusively designed modules that rely on external plane correspondence labeling are applied to merge multi-view plane entities and produce a refined camera pose. Notably, existing work fails to integrate these closely related sub-tasks into a unified framework and instead addresses them separately and sequentially, which we identify as a primary source of performance limitations. Motivated by this finding and the success of query-based learning in enriching reasoning among semantic entities, in this paper we propose PlaneRecTR++, a Transformer-based architecture that, for the first time, unifies all tasks of multi-view planar reconstruction and pose estimation within a compact single-stage framework, eliminating the need for initial pose estimation and supervision of plane correspondence. Extensive quantitative and qualitative experiments demonstrate that the proposed unified learning achieves mutual benefits across sub-tasks, setting a new state of the art on the public ScanNetv1, ScanNetv2, NYUv2-Plane, and MatterPort3D datasets.
{"title":"PlaneRecTR++: Unified Query Learning for Joint 3D Planar Reconstruction and Pose Estimation","authors":"Jingjia Shi;Shuaifeng Zhi;Kai Xu","doi":"10.1109/TPAMI.2025.3610500","DOIUrl":"10.1109/TPAMI.2025.3610500","url":null,"abstract":"The challenging task of 3D planar reconstruction from images involves several sub-tasks including frame-wise plane detection, segmentation, parameter regression and possibly depth prediction, along with cross-frame plane correspondence and relative camera pose estimation. Previous works adopt a divide and conquer strategy, addressing above sub-tasks with distinct network modules in a two-stage paradigm. Specifically, given an initial camera pose and per-frame plane predictions from the first stage, further exclusively designed modules relying on external plane correspondence labeling are applied to merge multi-view plane entities and produce refined camera pose. Notably, existing work fails to integrate these closely related sub-tasks into a unified framework, and instead addresses them separately and sequentially, which we identify as a primary source of performance limitations. Motivated by this finding and the success of query-based learning in enriching reasoning among semantic entities, in this paper, we propose PlaneRecTR++, a Transformer-based architecture, which for the first time unifies all tasks of multi-view planar reconstruction and pose estimation within a compact single-stage framework, eliminating the need for the initial pose estimation and supervision of plane correspondence. Extensive quantitative and qualitative experiments demonstrate that our proposed unified learning achieves mutual benefits across sub-tasks, achieving a new state-of-the-art performance on the public ScanNetv1, ScanNetv2, NYUv2-Plane, and MatterPort3D datasets.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"48 1","pages":"962-981"},"PeriodicalIF":18.6,"publicationDate":"2025-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145071895","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Open-CRB: Toward Open World Active Learning for 3D Object Detection
Zhuoxiao Chen;Yadan Luo;Zixin Wang;Zijian Wang;Zi Huang
Pub Date: 2025-09-12 | DOI: 10.1109/TPAMI.2025.3575756 | IEEE TPAMI, vol. 47, no. 10, pp. 8336-8350
LiDAR-based 3D object detection has recently seen significant advancements through active learning (AL), attaining satisfactory performance by training on a small fraction of strategically selected point clouds. However, in real-world deployments where streaming point clouds may include unknown or novel objects, the ability of current AL methods to capture such objects remains unexplored. This paper investigates a more practical and challenging research task: Open World Active Learning for 3D Object Detection (OWAL-3D), aimed at acquiring informative point clouds that contain new concepts. To tackle this challenge, we propose a simple yet effective strategy called Open Label Conciseness (OLC), which mines novel 3D objects with minimal annotation costs. Our empirical results show that OLC successfully adapts the 3D detection model to the open-world scenario with just a single round of selection. Any generic AL policy can then be integrated with the proposed OLC to efficiently address the OWAL-3D problem. Based on this, we introduce the Open-CRB framework, which seamlessly integrates OLC with our preliminary AL method, CRB, designed specifically for 3D object detection. We develop a comprehensive codebase for easy reproduction and future research, supporting 15 baseline methods (i.e., active learning, out-of-distribution detection and open world detection), 2 types of modern 3D detectors (i.e., one-stage SECOND and two-stage PV-RCNN) and 3 benchmark 3D datasets (i.e., KITTI, nuScenes and Waymo). Extensive experiments show that the proposed Open-CRB offers superiority and flexibility in recognizing both novel and known classes with very limited labeling costs, compared to state-of-the-art baselines.
{"title":"Open-CRB: Toward Open World Active Learning for 3D Object Detection","authors":"Zhuoxiao Chen;Yadan Luo;Zixin Wang;Zijian Wang;Zi Huang","doi":"10.1109/TPAMI.2025.3575756","DOIUrl":"https://doi.org/10.1109/TPAMI.2025.3575756","url":null,"abstract":"LiDAR-based 3D object detection has recently seen significant advancements through active learning (AL), attaining satisfactory performance by training on a small fraction of strategically selected point clouds. However, in real-world deployments where streaming point clouds may include unknown or novel objects, the ability of current AL methods to capture such objects remains unexplored. This paper investigates a more practical and challenging research task: Open World Active Learning for 3D Object Detection (OWAL-3D), aimed at acquiring informative point clouds with new concepts. To tackle this challenge, we propose a simple yet effective strategy called Open Label Conciseness (OLC), which mines novel 3D objects with minimal annotation costs. Our empirical results show that OLC successfully adapts the 3D detection model to the open world scenario with just a single round of selection. Any generic AL policy can then be integrated with the proposed OLC to efficiently address the OWAL-3D problem. Based on this, we introduce the Open-CRB framework, which seamlessly integrates OLC with our preliminary AL method, CRB, designed specifically for 3D object detection. We develop a comprehensive codebase for easy reproducing and future research, supporting 15 baseline methods (i.e., active learning, out-of-distribution detection and open world detection), 2 types of modern 3D detectors (i.e., one-stage SECOND and two-stage PV-RCNN) and 3 benchmark 3D datasets (i.e., KITTI, nuScenes and Waymo). Extensive experiments evidence that the proposed Open-CRB demonstrates superiority and flexibility in recognizing both novel and known classes with very limited labeling costs, compared to state-of-the-art baselines.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 10","pages":"8336-8350"},"PeriodicalIF":18.6,"publicationDate":"2025-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145036795","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M$^{3}$D: A Multimodal, Multilingual and Multitask Dataset for Grounded Document-Level Information Extraction
Pub Date: 2025-09-11 | DOI: 10.1109/TPAMI.2025.3609288 | IEEE TPAMI, vol. 48, no. 1, pp. 807-823
Jiang Liu;Bobo Li;Xinran Yang;Na Yang;Hao Fei;Mingyao Zhang;Fei Li;Donghong Ji
Multimodal information extraction (IE) tasks have attracted increasing attention because many studies have shown that multimodal information benefits text information extraction. However, existing multimodal IE datasets mainly focus on sentence-level, image-facilitated IE in English text, and pay little attention to video-based multimodal IE and fine-grained visual grounding. Therefore, to promote the development of multimodal IE, we constructed a multimodal, multilingual and multitask dataset, named M$^{3}$D, which has the following features: (1) it contains paired document-level text and video to enrich multimodal information; (2) it supports two widely used languages, namely English and Chinese; (3) it includes more multimodal IE tasks, such as entity recognition, entity chain extraction, relation extraction and visual grounding. In addition, our dataset introduces an unexplored theme, i.e., biography, enriching the domains of multimodal IE resources. To establish a benchmark for our dataset, we propose an innovative hierarchical multimodal IE model. This model effectively leverages and integrates multimodal information through a Denoised Feature Fusion Module (DFFM). Furthermore, in non-ideal scenarios, modal information is often incomplete; thus, we designed a Missing Modality Construction Module (MMCM) to alleviate the issues caused by missing modalities. Our model achieved an average performance of 53.80% and 53.77% on four tasks in the English and Chinese datasets, respectively, which sets a reasonable standard for subsequent research. In addition, we conducted further analytical experiments to verify the effectiveness of the proposed modules. We believe that our work can promote the development of the field of multimodal IE.
{"title":"M$^{3}$3D: A Multimodal, Multilingual and Multitask Dataset for Grounded Document-Level Information Extraction","authors":"Jiang Liu;Bobo Li;Xinran Yang;Na Yang;Hao Fei;Mingyao Zhang;Fei Li;Donghong Ji","doi":"10.1109/TPAMI.2025.3609288","DOIUrl":"10.1109/TPAMI.2025.3609288","url":null,"abstract":"Multimodal information extraction (IE) tasks have attracted increasing attention because many studies have shown that multimodal information benefits text information extraction. However, existing multimodal IE datasets mainly focus on sentence-level image-facilitated IE in English text, and pay little attention to video-based multimodal IE and fine-grained visual grounding. Therefore, in order to promote the development of multimodal IE, we constructed a multimodal multilingual multitask dataset, named M<inline-formula><tex-math>$^{3}$</tex-math></inline-formula>D, which has the following features: (1) It contains paired document-level text and video to enrich multimodal information; (2) It supports two widely-used languages, namely English and Chinese; (3) It includes more multimodal IE tasks such as entity recognition, entity chain extraction, relation extraction and visual grounding. In addition, our dataset introduces an unexplored theme, i.e., biography, enriching the domains of multimodal IE resources. To establish a benchmark for our dataset, we propose an innovative hierarchical multimodal IE model. This model effectively leverages and integrates multimodal information through a Denoised Feature Fusion Module (DFFM). Furthermore, in non-ideal scenarios, modal information is often incomplete. Thus, we designed a Missing Modality Construction Module (MMCM) to alleviate the issues caused by missing modalities. Our model achieved an average performance of 53.80% and 53.77% on four tasks in English and Chinese datasets, respectively, which set a reasonable standard for subsequent research. In addition, we conducted more analytical experiments to verify the effectiveness of our proposed module. We believe that our work can promote the development of the field of multimodal IE.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"48 1","pages":"807-823"},"PeriodicalIF":18.6,"publicationDate":"2025-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145035494","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Towards Real Zero-Shot Camouflaged Object Segmentation Without Camouflaged Annotations
Cheng Lei;Jie Fan;Xinran Li;Tian-Zhu Xiang;Ao Li;Ce Zhu;Le Zhang
Pub Date: 2025-09-10 | DOI: 10.1109/TPAMI.2025.3600461 | IEEE TPAMI, vol. 47, no. 12, pp. 11990-12004
Camouflaged Object Segmentation (COS) faces significant challenges due to the scarcity of annotated data, where meticulous pixel-level annotation is both labor-intensive and costly, primarily due to the intricate object-background boundaries. Addressing the core question, “Can COS be effectively achieved in a zero-shot manner without manual annotations for any camouflaged object?”, we propose an affirmative solution. We examine the learned attention patterns for camouflaged objects and introduce a robust zero-shot COS framework. Our findings reveal that while transformer models for salient object segmentation (SOS) prioritize global features in their attention mechanisms, camouflaged object segmentation exhibits both global and local attention biases. Based on these findings, we design a framework that adapts to the inherent local pattern bias of COS while incorporating global attention patterns and a broad semantic feature space derived from SOS, enabling efficient zero-shot transfer. Specifically, we incorporate a Masked Image Modeling (MIM) based image encoder optimized for Parameter-Efficient Fine-Tuning (PEFT), a Multimodal Large Language Model (M-LLM), and a Multi-scale Fine-grained Alignment (MFA) mechanism. The MIM encoder captures essential local features, while the PEFT module learns global and semantic representations from SOS datasets. To further enhance semantic granularity, we leverage the M-LLM to generate caption embeddings conditioned on visual cues, which are meticulously aligned with multi-scale visual features via MFA. This alignment enables precise interpretation of complex semantic contexts. Moreover, we introduce a learnable codebook to represent the M-LLM during inference, significantly reducing computational demands while maintaining performance. Our framework demonstrates its versatility and efficacy through rigorous experimentation, achieving state-of-the-art performance in zero-shot COS with $F_{\beta}^{w}$ scores of 72.9% on CAMO and 71.7% on COD10K. By removing the M-LLM during inference, we achieve an inference speed comparable to that of traditional end-to-end models, reaching 18.1 FPS. Additionally, our method excels in polyp segmentation and underwater scene segmentation, outperforming challenging baselines in both zero-shot and supervised settings, demonstrating its potential for various segmentation tasks.
{"title":"Towards Real Zero-Shot Camouflaged Object Segmentation Without Camouflaged Annotations","authors":"Cheng Lei;Jie Fan;Xinran Li;Tian-Zhu Xiang;Ao Li;Ce Zhu;Le Zhang","doi":"10.1109/TPAMI.2025.3600461","DOIUrl":"10.1109/TPAMI.2025.3600461","url":null,"abstract":"Camouflaged Object Segmentation (COS) faces significant challenges due to the scarcity of annotated data, where meticulous pixel-level annotation is both labor-intensive and costly, primarily due to the intricate object-background boundaries. Addressing the core question, “Can COS be effectively achieved in a zero-shot manner without manual annotations for any camouflaged object?”, we propose an affirmative solution. We examine the learned attention patterns for camouflaged objects and introduce a robust zero-shot COS framework. Our findings reveal that while transformer models for salient object segmentation (SOS) prioritize global features in their attention mechanisms, camouflaged object segmentation exhibits both global and local attention biases. Based on these findings, we design a framework that adapts with the inherent local pattern bias of COS while incorporating global attention patterns and a broad semantic feature space derived from SOS. This enables efficient zero-shot transfer for COS. Specifically, We incorporate a Masked Image Modeling (MIM) based image encoder optimized for Parameter-Efficient Fine-Tuning (PEFT), a Multimodal Large Language Model (M-LLM), and a Multi-scale Fine-grained Alignment (MFA) mechanism. The MIM encoder captures essential local features, while the PEFT module learns global and semantic representations from SOS datasets. To further enhance semantic granularity, we leverage the M-LLM to generate caption embeddings conditioned on visual cues, which are meticulously aligned with multi-scale visual features via MFA. This alignment enables precise interpretation of complex semantic contexts. Moreover, we introduce a learnable codebook to represent the M-LLM during inference, significantly reducing computational demands while maintaining performance. Our framework demonstrates its versatility and efficacy through rigorous experimentation, achieving state-of-the-art performance in zero-shot COS with <inline-formula><tex-math>$F_{beta }^{w}$</tex-math></inline-formula> scores of 72.9% on CAMO and 71.7% on COD10K. By removing the M-LLM during inference, we achieve an inference speed comparable to that of traditional end-to-end models, reaching 18.1 FPS. Additionally, our method excels in polyp segmentation, and underwater scene segmentation, outperforming challenging baselines in both zero-shot and supervised settings, thereby implying its potentiality in various segmentation tasks.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 12","pages":"11990-12004"},"PeriodicalIF":18.6,"publicationDate":"2025-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145034509","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
StylizedGS: Controllable Stylization for 3D Gaussian Splatting
Dingxi Zhang;Yu-Jie Yuan;Zhuoxun Chen;Fang-Lue Zhang;Zhenliang He;Shiguang Shan;Lin Gao
Pub Date: 2025-08-28 | DOI: 10.1109/TPAMI.2025.3604010 | IEEE TPAMI, vol. 47, no. 12, pp. 11961-11973
As XR technology continues to advance rapidly, 3D generation and editing are increasingly crucial. Among these, stylization plays a key role in enhancing the appearance of 3D models. With stylization, users can achieve consistent artistic effects in 3D editing from a single reference style image, making it a user-friendly editing method. However, recent NeRF-based 3D stylization methods suffer from efficiency issues that hurt the user experience, and their implicit nature limits their ability to accurately transfer geometric pattern styles. In addition, giving artists flexible control over stylized scenes is highly desirable to foster an environment conducive to creative exploration. To address these issues, we introduce StylizedGS, an efficient 3D neural style transfer framework with adaptable control over perceptual factors, based on the 3D Gaussian Splatting (3DGS) representation. We propose a filter-based refinement to eliminate floaters that degrade stylization of the reconstructed scene. A nearest-neighbor-based style loss is introduced to achieve stylization by fine-tuning the geometry and color parameters of 3DGS, while a depth preservation loss with other regularizations is proposed to prevent tampering with the geometric content. Moreover, facilitated by specially designed losses, StylizedGS enables users to control color, stylization scale, and regions during stylization, providing customization capabilities. Our method achieves high-quality stylization results characterized by faithful brushstrokes and geometric consistency with flexible controls. Extensive experiments across various scenes and styles demonstrate the effectiveness and efficiency of our method in terms of both stylization quality and inference speed.
{"title":"StylizedGS: Controllable Stylization for 3D Gaussian Splatting","authors":"Dingxi Zhang;Yu-Jie Yuan;Zhuoxun Chen;Fang-Lue Zhang;Zhenliang He;Shiguang Shan;Lin Gao","doi":"10.1109/TPAMI.2025.3604010","DOIUrl":"10.1109/TPAMI.2025.3604010","url":null,"abstract":"As XR technology continues to advance rapidly, 3D generation and editing are increasingly crucial. Among these, stylization plays a key role in enhancing the appearance of 3D models. By utilizing stylization, users can achieve consistent artistic effects in 3D editing using a single reference style image, making it a user-friendly editing method. However, recent NeRF-based 3D stylization methods encounter efficiency issues that impact the user experience, and their implicit nature limits their ability to accurately transfer geometric pattern styles. Additionally, the ability for artists to apply flexible control over stylized scenes is considered highly desirable to foster an environment conducive to creative exploration. To address the above issues, we introduce StylizedGS, an efficient 3D neural style transfer framework with adaptable control over perceptual factors based on 3D Gaussian Splatting representation. We propose a filter-based refinement to eliminate floaters that affect the stylization effects in the scene reconstruction process. The nearest neighbor-based style loss is introduced to achieve stylization by fine-tuning the geometry and color parameters of 3DGS, while a depth preservation loss with other regularizations is proposed to prevent the tampering of geometry content. Moreover, facilitated by specially designed losses, StylizedGS enables users to control color, stylized scale, and regions during the stylization to possess customization capabilities. Our method achieves high-quality stylization results characterized by faithful brushstrokes and geometric consistency with flexible controls. Extensive experiments across various scenes and styles demonstrate the effectiveness and efficiency of our method concerning both stylization quality and inference speed.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 12","pages":"11961-11973"},"PeriodicalIF":18.6,"publicationDate":"2025-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144915471","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Aligning Logits Generatively for Principled Black-Box Knowledge Distillation in the Wild
Xiang Xiang;Jing Ma;Dongrui Wu;Zhigang Zeng;Xilin Chen
Pub Date: 2025-08-25 | DOI: 10.1109/TPAMI.2025.3602663 | IEEE TPAMI, vol. 47, no. 12, pp. 11929-11945
Black-Box Knowledge Distillation (B2KD) is a conservative task in cloud-to-edge model compression, emphasizing the protection of data privacy and model copyrights on both the cloud and the edge. With data and models hosted invisibly on the server, B2KD aims to use only API queries of the cloud teacher model’s inference results to effectively distill a lightweight student model deployed on edge devices. B2KD faces challenges such as limited Internet exchange and an edge-cloud disparity in data distribution. To address these issues, we theoretically provide a new optimization direction, from logits to cell boundaries, that differs from direct logits alignment, and we formalize a workflow comprising deprivatization, distillation, and adaptation at test time. Guided by this, we propose Mapping-Emulation KD (MEKD) to enhance the robust prediction and anti-interference capabilities of the student model on edge devices under any unknown data distribution in real-world scenarios. Our method does not distinguish between soft and hard responses and consists of: 1) deprivatization: emulating the inverse mapping of the teacher function with a generator; 2) distillation: aligning the low-dimensional logits of the teacher and student models by reducing the distance between high-dimensional image points; and 3) adaptation: correcting the student’s online prediction bias through a graph-propagation-based, forward-only test-time adaptation algorithm. Our method demonstrates strong performance for edge model distillation and adaptation across different teacher-student pairs. We validate its effectiveness on multiple image recognition benchmarks and various deep neural network models, achieving state-of-the-art performance and showcasing its practical value in remote sensing image recognition applications.
{"title":"Aligning Logits Generatively for Principled Black-Box Knowledge Distillation in the Wild","authors":"Xiang Xiang;Jing Ma;Dongrui Wu;Zhigang Zeng;Xilin Chen","doi":"10.1109/TPAMI.2025.3602663","DOIUrl":"10.1109/TPAMI.2025.3602663","url":null,"abstract":"Black-Box Knowledge Distillation (B2KD) is a conservative task in cloud-to-edge model compression, emphasizing the protection of data privacy and model copyrights on both the cloud and edge. With invisible data and models hosted on the server, B2KD aims to utilize only the API queries of the teacher model’s inference results in the cloud to effectively distill a lightweight student model deployed on edge devices. B2KD faces challenges such as limited Internet exchange and edge-cloud disparity in data distribution. To address these issues, we theoretically provide a new optimization direction from logits to cell boundary, different from direct logits alignment, and formalize a workflow comprising deprivatization, distillation, and adaptation at test time. Guided by this, we propose a method, Mapping-Emulation KD (MEKD), to enhance the robust prediction and anti-interference capabilities of the student model on edge devices for any unknown data distribution in real-world scenarios. Our method does not differentiate between treating soft or hard responses and consists of: 1) deprivatization: emulating the inverse mapping of the teacher function with a generator, 2) distillation: aligning low-dimensional logits of the teacher and student models by reducing the distance of high-dimensional image points, and 3) adaptation: correcting the student’s online prediction bias through a graph propagation-based only-forward test-time adaptation algorithm. Our method demonstrates inspiring performance for edge model distillation and adaptation across different teacher-student pairs. We validate the effectiveness of our method on multiple image recognition benchmarks and various Deep Neural Network models, achieving state-of-the-art performance and showcasing its practical value in remote sensing image recognition applications.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 12","pages":"11929-11945"},"PeriodicalIF":18.6,"publicationDate":"2025-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144900423","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
VQ-FedDiff: Federated Learning Algorithm of Diffusion Models With Client-Specific Vector-Quantized Conditioning
Pub Date: 2025-08-22 | DOI: 10.1109/TPAMI.2025.3602282 | IEEE TPAMI, vol. 47, no. 12, pp. 11863-11873
Tehrim Yoon;Minyoung Hwang;Eunho Yang
Modern generative models, particularly denoising diffusion probabilistic models (DDPMs), provide high-quality synthetic images, enabling users to generate diverse, realistic images and videos. However, in many situations, edge devices or individual institutions possess locally collected data that is highly sensitive and whose privacy must be preserved, such as in healthcare and finance. Under such federated learning (FL) settings, various methods for training generative models have been studied, but most of them assume generative adversarial networks (GANs), and their algorithms are specific to GANs rather than other forms of generative models such as DDPMs. This paper proposes a new algorithm for training DDPMs under federated learning settings, VQ-FedDiff, a personalized algorithm for training diffusion models that generates higher-quality images (in terms of FID) while keeping the risk of breaching sensitive information as low as that of locally trained secure models. We demonstrate that VQ-FedDiff achieves state-of-the-art performance compared with existing federated learning approaches for diffusion models in both IID and non-IID settings, on benchmark photorealistic and medical image datasets. Our results show that diffusion models can learn efficiently from decentralized, sensitive data, generating high-quality images while preserving data privacy.
{"title":"VQ-FedDiff: Federated Learning Algorithm of Diffusion Models With Client-Specific Vector-Quantized Conditioning","authors":"Tehrim Yoon;Minyoung Hwang;Eunho Yang","doi":"10.1109/TPAMI.2025.3602282","DOIUrl":"10.1109/TPAMI.2025.3602282","url":null,"abstract":"Modern generative models, particularly denoising diffusion probabilistic models (DDPMs), provide high-quality synthetic images, enabling users to generate diverse images and videos that are realistic. However, in a number of situations, edge devices or individual institutions may possess locally collected data that is highly sensitive and should ensure data privacy, such as in the field of healthcare and finance. Under such federated learning (FL) settings, various methods on training generative models have been studied, but most of them assume generative adversarial networks (GANs), and the algorithms are specific to GANs and not other forms of generative models such as DDPM. This paper proposes a new algorithm for training DDPMs under federated learning settings, VQ-FedDiff, which provides a personalized algorithm for training diffusion models that can generate higher-quality images FID while still keeping risk of breaching sensitive information as low as locally-trained secure models. We demonstrate that VQ-FedDiff shows state-of-the-art performance on existing federated learning of diffusion models in both IID and non-IID settings, and in benchmark photorealistic and medical image datasets. Our results show that diffusion models can efficiently learn with decentralized, sensitive data, generating high-quality images while preserving data privacy.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 12","pages":"11863-11873"},"PeriodicalIF":18.6,"publicationDate":"2025-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144900424","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}