Pub Date : 2025-09-16 DOI: 10.1109/TPAMI.2025.3610243
Yuqi Jiang;Ying Fu;Qiankun Liu;Jun Zhang
Multispectral filter array (MSFA) cameras are increasingly used due to their compact size and fast capture speed. However, because of their narrow-band property, they often suffer from light deficiency, and the captured images are easily overwhelmed by noise. Neural networks, a commonly used class of denoising methods, have shown their power to achieve satisfactory denoising results. However, their performance depends heavily on high-quality noisy-clean image pairs. For the task of MSFA image denoising, there is currently neither a paired real dataset nor an accurate noise model capable of generating realistic noisy images. To this end, we present a physics-based noise model that is capable of matching the real noise distribution and synthesizing realistic noisy images. In our noise model, the different types of noise are divided into a SimpleDist component and a ComplexDist component. The former contains all types of noise that can be described by a simple probability distribution, such as a Gaussian or Poisson distribution, while the latter contains the complicated color bias noise that cannot be modeled by a simple probability distribution. Besides, we design a noise-decoupled network consisting of a SimpleDist noise removal network (SNRNet) and a ComplexDist noise removal network (CNRNet) to remove each component sequentially. Moreover, motivated by the non-uniformity of color bias noise in our noise model, we introduce a learnable position embedding in CNRNet to encode position information. To verify the effectiveness of our physics-based noise model and noise-decoupled network, we collect a real MSFA denoising dataset with paired long-exposure clean images and short-exposure noisy images. Experiments show that a network trained on synthetic data generated by our noise model performs as well as one trained on paired real data, and that our noise-decoupled network outperforms other state-of-the-art denoising methods.
MSFA Image Denoising Using Physics-Based Noise Model and Noise-Decoupled Network. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 48, no. 1, pp. 859-875.
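The SimpleDist component described above covers noise sources with simple closed-form distributions. As a rough illustration of that kind of physics-based synthesis, the sketch below adds Poisson shot noise and Gaussian read noise, a standard low-light raw-image model; the function name and parameter values are illustrative assumptions, not the paper's calibrated noise model.

```python
import numpy as np

def synthesize_simpledist_noise(clean, k=0.01, sigma_read=2.0, rng=None):
    """Add shot (Poisson) and read (Gaussian) noise to a clean raw MSFA mosaic.

    clean      : clean raw image in digital numbers (H x W), float array
    k          : system gain (DN per photo-electron), assumed value
    sigma_read : std of Gaussian read noise in DN, assumed value
    """
    rng = np.random.default_rng() if rng is None else rng
    electrons = np.clip(clean / k, 0, None)               # convert DN to photo-electrons
    shot = rng.poisson(electrons).astype(np.float64) * k  # Poisson shot noise, back to DN
    read = rng.normal(0.0, sigma_read, size=clean.shape)  # Gaussian read noise
    return shot + read

# Example: corrupt a synthetic flat-field mosaic
clean = np.full((256, 256), 40.0)
noisy = synthesize_simpledist_noise(clean)
```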
Pub Date : 2025-09-16 DOI: 10.1109/TPAMI.2025.3610500
Jingjia Shi;Shuaifeng Zhi;Kai Xu
The challenging task of 3D planar reconstruction from images involves several sub-tasks, including frame-wise plane detection, segmentation, parameter regression and possibly depth prediction, along with cross-frame plane correspondence and relative camera pose estimation. Previous works adopt a divide-and-conquer strategy, addressing the above sub-tasks with distinct network modules in a two-stage paradigm. Specifically, given an initial camera pose and per-frame plane predictions from the first stage, additional, exclusively designed modules relying on external plane correspondence labels are applied to merge multi-view plane entities and produce a refined camera pose. Notably, existing work fails to integrate these closely related sub-tasks into a unified framework and instead addresses them separately and sequentially, which we identify as a primary source of performance limitations. Motivated by this finding and the success of query-based learning in enriching reasoning among semantic entities, in this paper we propose PlaneRecTR++, a Transformer-based architecture which, for the first time, unifies all tasks of multi-view planar reconstruction and pose estimation within a compact single-stage framework, eliminating the need for initial pose estimation and supervision of plane correspondence. Extensive quantitative and qualitative experiments demonstrate that our proposed unified learning achieves mutual benefits across sub-tasks, reaching new state-of-the-art performance on the public ScanNetv1, ScanNetv2, NYUv2-Plane, and MatterPort3D datasets.
PlaneRecTR++: Unified Query Learning for Joint 3D Planar Reconstruction and Pose Estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 48, no. 1, pp. 962-981.
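As a rough illustration of what query-based prediction of planes can look like, the hedged sketch below decodes a fixed set of learnable queries into per-plane class logits, plane parameters, and dot-product masks. It is a generic Mask2Former-style head written under assumptions, not the authors' architecture; every module name, dimension, and the 3-parameter plane encoding are illustrative choices.

```python
import torch
import torch.nn as nn

class PlaneQueryHead(nn.Module):
    """Generic query-based head: each query predicts one candidate plane."""
    def __init__(self, num_queries=30, dim=256, num_classes=1):
        super().__init__()
        self.queries = nn.Embedding(num_queries, dim)          # learnable plane queries
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=6)
        self.cls_head = nn.Linear(dim, num_classes + 1)        # plane vs. no-plane
        self.param_head = nn.Linear(dim, 3)                    # e.g., normal scaled by offset
        self.mask_embed = nn.Linear(dim, dim)                  # for dot-product mask prediction

    def forward(self, img_feats, pix_feats):
        # img_feats: (B, HW, dim) flattened encoder features
        # pix_feats: (B, dim, H, W) per-pixel embeddings for mask prediction
        q = self.queries.weight.unsqueeze(0).expand(img_feats.size(0), -1, -1)
        q = self.decoder(q, img_feats)                          # (B, N, dim)
        masks = torch.einsum("bnd,bdhw->bnhw", self.mask_embed(q), pix_feats)
        return self.cls_head(q), self.param_head(q), masks
```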
Pub Date : 2025-09-12 DOI: 10.1109/TPAMI.2025.3575756
Zhuoxiao Chen;Yadan Luo;Zixin Wang;Zijian Wang;Zi Huang
LiDAR-based 3D object detection has recently seen significant advancements through active learning (AL), attaining satisfactory performance by training on a small fraction of strategically selected point clouds. However, in real-world deployments where streaming point clouds may include unknown or novel objects, the ability of current AL methods to capture such objects remains unexplored. This paper investigates a more practical and challenging research task: Open World Active Learning for 3D Object Detection (OWAL-3D), aimed at acquiring informative point clouds containing new concepts. To tackle this challenge, we propose a simple yet effective strategy called Open Label Conciseness (OLC), which mines novel 3D objects with minimal annotation cost. Our empirical results show that OLC successfully adapts the 3D detection model to the open-world scenario with just a single round of selection. Any generic AL policy can then be integrated with the proposed OLC to efficiently address the OWAL-3D problem. Based on this, we introduce the Open-CRB framework, which seamlessly integrates OLC with our preliminary AL method, CRB, designed specifically for 3D object detection. We develop a comprehensive codebase for easy reproduction and future research, supporting 15 baseline methods (i.e., active learning, out-of-distribution detection and open world detection), 2 types of modern 3D detectors (i.e., one-stage SECOND and two-stage PV-RCNN) and 3 benchmark 3D datasets (i.e., KITTI, nuScenes and Waymo). Extensive experiments show that the proposed Open-CRB demonstrates superiority and flexibility in recognizing both novel and known classes with very limited labeling cost, compared to state-of-the-art baselines.
Open-CRB: Toward Open World Active Learning for 3D Object Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 10, pp. 8336-8350.
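To show how a label-aware acquisition criterion of this kind could plug into a generic AL loop, the sketch below scores each unlabeled point cloud by how many predicted object classes are not yet present in the labeled pool. This is only a plausible stand-in for the OLC criterion defined in the paper, and the detector.predict interface is a hypothetical helper.

```python
def select_for_annotation(unlabeled_clouds, detector, known_labels, budget):
    """Generic open-world AL selection round (illustrative, not the paper's OLC).

    unlabeled_clouds : iterable of (cloud_id, point_cloud) pairs
    detector         : trained 3D detector with a hypothetical
                       .predict(point_cloud) -> list of class names
    known_labels     : set of class names already in the labeled pool
    budget           : number of point clouds to send for annotation
    """
    scored = []
    for cloud_id, cloud in unlabeled_clouds:
        predicted = set(detector.predict(cloud))
        novelty = len(predicted - known_labels)   # prefer clouds with unseen concepts
        scored.append((novelty, cloud_id))
    scored.sort(reverse=True)
    return [cloud_id for _, cloud_id in scored[:budget]]
```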
Pub Date : 2025-09-11 DOI: 10.1109/TPAMI.2025.3609288
Jiang Liu;Bobo Li;Xinran Yang;Na Yang;Hao Fei;Mingyao Zhang;Fei Li;Donghong Ji
Multimodal information extraction (IE) tasks have attracted increasing attention because many studies have shown that multimodal information benefits text information extraction. However, existing multimodal IE datasets mainly focus on sentence-level, image-facilitated IE in English text, and pay little attention to video-based multimodal IE and fine-grained visual grounding. Therefore, to promote the development of multimodal IE, we construct a multimodal, multilingual and multitask dataset, named M$^{3}$D, which has the following features: (1) it contains paired document-level text and video to enrich multimodal information; (2) it supports two widely used languages, namely English and Chinese; (3) it includes more multimodal IE tasks such as entity recognition, entity chain extraction, relation extraction and visual grounding. In addition, our dataset introduces an unexplored theme, i.e., biography, enriching the domains of multimodal IE resources. To establish a benchmark for our dataset, we propose an innovative hierarchical multimodal IE model. This model effectively leverages and integrates multimodal information through a Denoised Feature Fusion Module (DFFM). Furthermore, in non-ideal scenarios, modal information is often incomplete; thus, we design a Missing Modality Construction Module (MMCM) to alleviate the issues caused by missing modalities. Our model achieves an average performance of 53.80% and 53.77% on four tasks in the English and Chinese datasets, respectively, which sets a reasonable standard for subsequent research. In addition, we conduct further analytical experiments to verify the effectiveness of our proposed modules. We believe that our work can promote the development of the field of multimodal IE.
M$^{3}$D: A Multimodal, Multilingual and Multitask Dataset for Grounded Document-Level Information Extraction. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 48, no. 1, pp. 807-823.
Pub Date : 2025-09-10 DOI: 10.1109/TPAMI.2025.3600461
Cheng Lei;Jie Fan;Xinran Li;Tian-Zhu Xiang;Ao Li;Ce Zhu;Le Zhang
Camouflaged Object Segmentation (COS) faces significant challenges due to the scarcity of annotated data, where meticulous pixel-level annotation is both labor-intensive and costly, primarily due to the intricate object-background boundaries. Addressing the core question, “Can COS be effectively achieved in a zero-shot manner without manual annotations for any camouflaged object?”, we propose an affirmative solution. We examine the learned attention patterns for camouflaged objects and introduce a robust zero-shot COS framework. Our findings reveal that while transformer models for salient object segmentation (SOS) prioritize global features in their attention mechanisms, camouflaged object segmentation exhibits both global and local attention biases. Based on these findings, we design a framework that adapts to the inherent local pattern bias of COS while incorporating global attention patterns and a broad semantic feature space derived from SOS. This enables efficient zero-shot transfer for COS. Specifically, we incorporate a Masked Image Modeling (MIM) based image encoder optimized for Parameter-Efficient Fine-Tuning (PEFT), a Multimodal Large Language Model (M-LLM), and a Multi-scale Fine-grained Alignment (MFA) mechanism. The MIM encoder captures essential local features, while the PEFT module learns global and semantic representations from SOS datasets. To further enhance semantic granularity, we leverage the M-LLM to generate caption embeddings conditioned on visual cues, which are meticulously aligned with multi-scale visual features via MFA. This alignment enables precise interpretation of complex semantic contexts. Moreover, we introduce a learnable codebook to represent the M-LLM during inference, significantly reducing computational demands while maintaining performance. Our framework demonstrates its versatility and efficacy through rigorous experimentation, achieving state-of-the-art performance in zero-shot COS with $F_{\beta}^{w}$ scores of 72.9% on CAMO and 71.7% on COD10K. By removing the M-LLM during inference, we achieve an inference speed comparable to that of traditional end-to-end models, reaching 18.1 FPS. Additionally, our method excels in polyp segmentation and underwater scene segmentation, outperforming challenging baselines in both zero-shot and supervised settings, thereby suggesting its potential in various segmentation tasks.
Towards Real Zero-Shot Camouflaged Object Segmentation Without Camouflaged Annotations. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 12, pp. 11990-12004.
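For reference, the weighted F-measure quoted above combines weighted precision and recall in the standard F-beta form; this is the generic definition of the metric, not something specific to this paper.

```latex
F_{\beta}^{w} \;=\; \frac{(1+\beta^{2})\,\mathrm{Precision}^{w}\cdot \mathrm{Recall}^{w}}
                         {\beta^{2}\,\mathrm{Precision}^{w} + \mathrm{Recall}^{w}}
```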
Pub Date : 2025-08-28 DOI: 10.1109/TPAMI.2025.3604010
Dingxi Zhang;Yu-Jie Yuan;Zhuoxun Chen;Fang-Lue Zhang;Zhenliang He;Shiguang Shan;Lin Gao
As XR technology continues to advance rapidly, 3D generation and editing are increasingly crucial. Among these, stylization plays a key role in enhancing the appearance of 3D models. By utilizing stylization, users can achieve consistent artistic effects in 3D editing using a single reference style image, making it a user-friendly editing method. However, recent NeRF-based 3D stylization methods encounter efficiency issues that impact the user experience, and their implicit nature limits their ability to accurately transfer geometric pattern styles. Additionally, flexible artistic control over stylized scenes is highly desirable to foster an environment conducive to creative exploration. To address the above issues, we introduce StylizedGS, an efficient 3D neural style transfer framework with adaptable control over perceptual factors, based on the 3D Gaussian Splatting representation. We propose a filter-based refinement to eliminate floaters that affect the stylization effects in the scene reconstruction process. A nearest-neighbor-based style loss is introduced to achieve stylization by fine-tuning the geometry and color parameters of 3DGS, while a depth preservation loss with other regularizations is proposed to prevent the tampering of geometry content. Moreover, facilitated by specially designed losses, StylizedGS enables users to control color, stylization scale, and regions during stylization, providing customization capabilities. Our method achieves high-quality stylization results characterized by faithful brushstrokes and geometric consistency with flexible controls. Extensive experiments across various scenes and styles demonstrate the effectiveness and efficiency of our method in terms of both stylization quality and inference speed.
StylizedGS: Controllable Stylization for 3D Gaussian Splatting. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 12, pp. 11961-11973.
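The nearest-neighbor-based style loss mentioned above is, in its generic form, a cosine-distance match between rendered features and their closest style features. The sketch below shows that generic form, assuming pre-extracted VGG-like feature vectors; it is not the exact loss or weighting used in StylizedGS.

```python
import torch
import torch.nn.functional as F

def nn_style_loss(render_feats, style_feats):
    """Nearest-neighbor feature style loss (generic sketch).

    render_feats : (N, C) features of the rendered view
    style_feats  : (M, C) features of the reference style image
    Each rendered feature is pulled toward its most similar style feature.
    """
    r = F.normalize(render_feats, dim=1)
    s = F.normalize(style_feats, dim=1)
    cos_sim = r @ s.t()                      # (N, M) cosine similarities
    nearest = cos_sim.max(dim=1).values      # best match per rendered feature
    return (1.0 - nearest).mean()            # mean cosine distance to nearest neighbor
```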
Pub Date : 2025-08-25 DOI: 10.1109/TPAMI.2025.3602663
Xiang Xiang;Jing Ma;Dongrui Wu;Zhigang Zeng;Xilin Chen
Black-Box Knowledge Distillation (B2KD) is a conservative task in cloud-to-edge model compression, emphasizing the protection of data privacy and model copyrights on both the cloud and the edge. With invisible data and models hosted on the server, B2KD aims to use only API queries of the teacher model’s inference results in the cloud to effectively distill a lightweight student model deployed on edge devices. B2KD faces challenges such as limited Internet exchange and edge-cloud disparity in data distribution. To address these issues, we theoretically provide a new optimization direction from logits to cell boundary, different from direct logits alignment, and formalize a workflow comprising deprivatization, distillation, and adaptation at test time. Guided by this, we propose a method, Mapping-Emulation KD (MEKD), to enhance the robust prediction and anti-interference capabilities of the student model on edge devices for any unknown data distribution in real-world scenarios. Our method does not distinguish between soft and hard responses and consists of: 1) deprivatization: emulating the inverse mapping of the teacher function with a generator; 2) distillation: aligning the low-dimensional logits of the teacher and student models by reducing the distance of high-dimensional image points; and 3) adaptation: correcting the student’s online prediction bias through a graph propagation-based, forward-only test-time adaptation algorithm. Our method demonstrates inspiring performance for edge model distillation and adaptation across different teacher-student pairs. We validate the effectiveness of our method on multiple image recognition benchmarks and various deep neural network models, achieving state-of-the-art performance and showcasing its practical value in remote sensing image recognition applications.
Aligning Logits Generatively for Principled Black-Box Knowledge Distillation in the Wild. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 12, pp. 11929-11945.
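To make the distillation step above concrete, the hedged sketch below follows one reading of the abstract: teacher logits are obtained through the black-box API, and teacher and student logits are aligned indirectly by pushing their images under the learned generator (the emulated inverse mapping) toward each other. The loop, the loss, and the teacher_api/generator interfaces are assumptions for illustration, not the authors' implementation.

```python
import torch

def mapping_emulation_distill_step(student, generator, images, teacher_api, opt):
    """One schematic distillation step aligning logits through image space.

    teacher_api(images) -> teacher logits (black-box cloud query; hypothetical callable)
    generator(logits)   -> reconstructed images (emulates the teacher's inverse
                           mapping, assumed trained in the deprivatization step)
    """
    with torch.no_grad():
        z_t = teacher_api(images)          # query the cloud teacher
        target = generator(z_t)            # teacher logits mapped back to image space
    z_s = student(images)                  # edge student logits
    # Instead of matching logits directly, reduce the distance of their image points.
    loss = (generator(z_s) - target).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```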
Pub Date : 2025-08-22 DOI: 10.1109/TPAMI.2025.3602282
Tehrim Yoon;Minyoung Hwang;Eunho Yang
Modern generative models, particularly denoising diffusion probabilistic models (DDPMs), provide high-quality synthetic images, enabling users to generate diverse, realistic images and videos. However, in a number of situations, edge devices or individual institutions may possess locally collected data that is highly sensitive and whose privacy must be ensured, such as in healthcare and finance. Under such federated learning (FL) settings, various methods for training generative models have been studied, but most of them assume generative adversarial networks (GANs), and the algorithms are specific to GANs rather than other forms of generative models such as DDPMs. This paper proposes a new algorithm for training DDPMs under federated learning settings, VQ-FedDiff, a personalized algorithm for training diffusion models that generates higher-quality images (in terms of FID) while keeping the risk of leaking sensitive information as low as that of locally trained secure models. We demonstrate that VQ-FedDiff achieves state-of-the-art performance compared with existing federated learning methods for diffusion models in both IID and non-IID settings, on benchmark photorealistic and medical image datasets. Our results show that diffusion models can efficiently learn from decentralized, sensitive data, generating high-quality images while preserving data privacy.
VQ-FedDiff: Federated Learning Algorithm of Diffusion Models With Client-Specific Vector-Quantized Conditioning. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 12, pp. 11863-11873.
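As background for the setting above, the sketch below shows the plain FedAvg baseline applied to a diffusion denoiser: each client trains the standard DDPM objective on its private data and the server averages the weights. This is a generic baseline for orientation only, not VQ-FedDiff's client-specific vector-quantized conditioning, and the client.train_local interface is a hypothetical helper.

```python
import copy
import torch

def federated_ddpm_round(global_denoiser, clients, local_steps=100):
    """One generic FedAvg round over a diffusion denoiser (illustrative sketch)."""
    client_states = []
    for client in clients:
        local = copy.deepcopy(global_denoiser)
        # Hypothetical helper: local DDPM training on the client's private data,
        # minimizing the usual noise-prediction loss ||eps - eps_theta(x_t, t)||^2.
        client.train_local(local, steps=local_steps)
        client_states.append(local.state_dict())
    # Server: average floating-point parameters across clients.
    avg = copy.deepcopy(client_states[0])
    for key in avg:
        if avg[key].is_floating_point():
            avg[key] = torch.stack([s[key] for s in client_states]).mean(dim=0)
    global_denoiser.load_state_dict(avg)
    return global_denoiser
```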
Pub Date : 2025-08-22 DOI: 10.1109/TPAMI.2025.3597436
Xinyao Li;Jingjing Li;Zhekai Du;Lei Zhu;Heng Tao Shen
Unsupervised domain adaptation (UDA) enables models trained on a labeled source domain to handle new unlabeled domains. Recently, pre-trained vision-language models (VLMs) have demonstrated promising zero-shot performance by leveraging semantic information to facilitate target tasks. By aligning vision and text embeddings, VLMs have shown notable success in bridging domain gaps. However, inherent differences naturally exist between modalities, known as the modality gap. Our findings reveal that direct UDA in the presence of the modality gap transfers only modality-invariant knowledge, leading to suboptimal target performance. To address this limitation, we propose a unified modality separation framework that accommodates both modality-specific and modality-invariant components. During training, the different modality components are disentangled from VLM features and then handled separately in a unified manner. At test time, modality-adaptive ensemble weights are automatically determined to maximize the synergy of the different components. To evaluate instance-level modality characteristics, we design a modality discrepancy metric to categorize samples into modality-invariant, modality-specific, and uncertain ones. The modality-invariant samples are exploited to facilitate cross-modal alignment, while uncertain ones are annotated to enhance model capabilities. Building upon prompt tuning techniques, our method achieves up to a 9% performance gain with 9 times the computational efficiency. Extensive experiments and analyses across various backbones, baselines, datasets and adaptation settings demonstrate the efficacy of our design.
Unified Modality Separation: A Vision-Language Framework for Unsupervised Domain Adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 11, pp. 10604-10618.
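For intuition about the modality gap discussed above, a common way to quantify it in CLIP-like models is the distance between the centroids of normalized image and text embeddings; the sketch below computes that quantity. It is only a generic illustration of the concept, not the instance-level modality discrepancy metric proposed in the paper.

```python
import torch
import torch.nn.functional as F

def modality_gap(image_embeds, text_embeds):
    """Distance between centroids of L2-normalized image and text embeddings.

    image_embeds : (N, D) image features from a CLIP-like vision encoder
    text_embeds  : (M, D) text features from the paired text encoder
    """
    img_centroid = F.normalize(image_embeds, dim=-1).mean(dim=0)
    txt_centroid = F.normalize(text_embeds, dim=-1).mean(dim=0)
    return (img_centroid - txt_centroid).norm().item()
```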
Pub Date : 2025-08-21 DOI: 10.1109/TPAMI.2025.3601430
Guangyang Zeng;Qingcheng Zeng;Xinghan Li;Biqiang Mu;Jiming Chen;Ling Shi;Junfeng Wu
Given 2D point correspondences between an image pair, inferring the camera motion is a fundamental issue in the computer vision community. Existing works generally set out from the epipolar constraint and estimate the essential matrix, which is not optimal in the maximum likelihood (ML) sense. In this paper, we dive into the original measurement model with respect to the rotation matrix and normalized translation vector and formulate the ML problem. We then propose an optimal two-step algorithm to solve it: in the first step, we estimate the variance of the measurement noise and devise a consistent estimator based on bias elimination; in the second step, we execute a one-step Gauss-Newton iteration on the manifold to refine the consistent estimator. We prove that the proposed estimator achieves the same asymptotic statistical properties as the ML estimator: the first is consistency, i.e., the estimator converges to the ground truth as the number of points increases; the second is asymptotic efficiency, i.e., the mean squared error of the estimator converges to the theoretical lower bound, the Cramér-Rao bound. In addition, we show that our algorithm has linear time complexity. These appealing characteristics give our estimator a great advantage in the case of dense point correspondences. Experiments on both synthetic data and real images demonstrate that when the number of points reaches the order of hundreds, our estimator outperforms the state-of-the-art ones in terms of estimation accuracy and CPU time.
Consistent and Optimal Solution to Camera Motion Estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 12, pp. 12005-12020.
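For context, the epipolar-constraint formulation that the paper departs from relates calibrated, normalized homogeneous correspondences through the essential matrix; the relation below is a standard textbook identity stated for orientation, not a result taken from the paper.

```latex
\hat{\mathbf{x}}_{2}^{\top} E \, \hat{\mathbf{x}}_{1} = 0, \qquad E = [\mathbf{t}]_{\times} R
```

Here $R$ is the rotation matrix, $\mathbf{t}$ the normalized translation vector, and $[\mathbf{t}]_{\times}$ its skew-symmetric matrix; the paper instead works directly with the measurement model in $(R, \mathbf{t})$ and solves the resulting ML problem with the two-step estimator described above.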