Single-Photon Imaging in Complex Scenarios via Physics-Informed Deep Neural Networks
Pub Date: 2026-01-15, DOI: 10.1109/tpami.2026.3654264
Siao Cai, Zhicheng Yu, Shaobing Gao, Zeyu Chen, Yiguang Liu
Single-photon imaging uses single-photon-sensitive, picosecond-resolution sensors to capture 3D structure and supports diverse applications, but success remains mostly limited to simple scenes. In complex scenarios, traditional methods degrade and deep learning methods lack flexibility and generalization. Here, we propose a physics-informed deep neural network (PIDNN) framework that addresses both aspects, adapting to complex and variable sensing environments by embedding the imaging physics into a deep neural network trained in an unsupervised manner. Within this framework, by tailoring the number of U-Net skip connections, we impose multi-scale spatiotemporal priors that improve photon-utilization efficiency, laying the foundation for handling the inherently low signal-to-background ratio (SBR) of the complex scenarios that follow. Additionally, we introduce volume rendering into the PIDNN framework and design a dual-branch structure, further extending its applicability to scenes with multiple depths and fog occlusion. We validated the method in various complex environments through numerical simulations and real-world experiments. Photon-efficient imaging with multiple returns shows robust performance under low SBR and large fields of view: the method attains lower root-mean-squared error than traditional methods and exhibits stronger generalization than supervised approaches. Further experiments with multiple depths and fog interference confirm that its reconstruction quality surpasses existing techniques, demonstrating the framework's flexibility and scalability.
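To make the physics-informed, unsupervised training idea concrete, here is a minimal sketch of how a single-photon imaging forward model (a Gaussian-pulse return plus a uniform background, observed through Poisson counting statistics) can serve as an unsupervised loss. The function names, pulse shape, and histogram parameterization are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: physics-informed unsupervised loss for single-photon depth
# reconstruction. Forward model and names are illustrative, not the PIDNN code.
import torch

def expected_histogram(depth, reflectivity, background, n_bins=256, pulse_sigma=1.5):
    """Map predicted depth/reflectivity/background maps (B, H, W) to an expected
    photon-count histogram of shape (B, H, W, n_bins); depth is in [0, 1]."""
    bins = torch.arange(n_bins, device=depth.device, dtype=depth.dtype)
    center = depth.unsqueeze(-1) * (n_bins - 1)                       # (B, H, W, 1)
    pulse = torch.exp(-0.5 * ((bins - center) / pulse_sigma) ** 2)    # Gaussian return
    pulse = pulse / pulse.sum(dim=-1, keepdim=True).clamp_min(1e-8)
    signal = reflectivity.unsqueeze(-1) * pulse                       # signal photons
    return signal + background.unsqueeze(-1) / n_bins                 # + ambient/dark counts

def poisson_nll(pred_rate, measured_counts):
    """Negative log-likelihood of Poisson-distributed photon counts."""
    pred_rate = pred_rate.clamp_min(1e-8)
    return (pred_rate - measured_counts * torch.log(pred_rate)).mean()

# Training step, with `net` any U-Net-style model predicting the three maps:
#   rate = expected_histogram(*net(measured_counts))
#   loss = poisson_nll(rate, measured_counts); loss.backward()
```

Because the loss compares the predicted photon-rate histogram against the raw measurements, no ground-truth depth is needed, which is what allows the network to adapt per scene.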
{"title":"Single-Photon Imaging in Complex Scenarios via Physics-Informed Deep Neural Networks.","authors":"Siao Cai,Zhicheng Yu,Shaobing Gao,Zeyu Chen,Yiguang Liu","doi":"10.1109/tpami.2026.3654264","DOIUrl":"https://doi.org/10.1109/tpami.2026.3654264","url":null,"abstract":"Single-photon imaging uses single-photon-sensitive picosecond-resolution sensors to capture 3D structure and supports diverse applications, but success remains mostly limited to simple scenes. In complex scenarios, traditional methods degrade and deep learning methods lack flexibility and generalization. Here, we propose a physics-informed deep neural network (PIDNN) framework that effectively addresses both aspects, adapting to complex and variable sensing environments by embedding imaging physics into the deep neural network for unsupervised learning. Within this framework, by tailoring the number of U-Net skip connections, we impose multi-scale spatiotemporal priors that improve photon-utilization efficiency, laying the foundation for addressing the inherent low-signal-to-background ratio (SBR) problem in subsequent complex scenarios. Additionally, we introduce volume rendering into the PIDNN framework and design a dual-branch structure, further extending its applicability to multiple-depth and fog occlusion. We validated the performance of this method in various complex environments through numerical simulations and real-world experiments. The results of photon-efficient imaging with multiple returns show robust performance under low SBR and large fields of view. The method attains lower root mean-squared error than traditional methods and exhibits stronger generalization than supervised approaches. Further multiple depths and fog interference experiments confirm that its reconstruction quality surpasses existing techniques, demonstrating its flexibility and scalability. Both simulation and experimental results validate its exceptional reconstruction performance and flexibility.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"48 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2026-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145971968","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Semantic Contrast for Domain-Robust Underwater Image Quality Assessment
Pub Date: 2026-01-15, DOI: 10.1109/tpami.2026.3654426
Jingchun Zhou, Chunjiang Liu, Qiuping Jiang, Xianping Fu, Junhui Hou, Xuelong Li
Underwater image quality assessment (UIQA) is hindered by complex degradation and domain shifts across aquatic environments. Existing no-reference IQA methods rely on costly and subjective mean opinion scores (MOS), which limit their generalization to unseen domains. To overcome these challenges, we propose SCUIA, an unsupervised UIQA framework leveraging semantic contrastive learning for quality prediction without human annotations. Specifically, we introduce a vision-language contrastive learning strategy that aligns image features with textual embeddings in a unified semantic space, capturing implicit degradation-quality correlations. We further enhance quality discrimination with a hierarchical contrastive learning mechanism that combines image-specific statistical priors and semantic prompts. A triplet-based inter-group contrastive loss explicitly models relative quality relationships. To tackle cross-domain variations, we develop an unsupervised domain adaptation module that uses local statistical features to guide CLIP fine-tuning to disentangle domain-invariant quality representations from domain-specific noise. This enables zero-shot cross-domain quality prediction without labeled data. Extensive experiments on public UIQA benchmarks demonstrate significant improvements over existing methods, highlighting superior generalization and domain adaptability.
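As an illustration of scoring image quality from vision-language similarity (the general mechanism SCUIA builds on, not its training pipeline), the sketch below compares an image against antonym quality prompts with an off-the-shelf CLIP model; the prompt wording and the chosen checkpoint are assumptions.

```python
# Hypothetical sketch of CLIP-prompt-based quality scoring: the score is the
# softmax similarity between the image and a pair of antonym quality prompts.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
prompts = ["a high-quality underwater photo", "a low-quality underwater photo"]

def quality_score(image: Image.Image) -> float:
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image        # (1, 2) image-text similarities
    probs = logits.softmax(dim=-1)
    return probs[0, 0].item()                            # probability of "high-quality"
```

The contrastive and domain-adaptation components described in the abstract would fine-tune such a backbone so that the text-image similarity tracks perceived quality across unseen aquatic domains.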
{"title":"Semantic Contrast for Domain-Robust Underwater Image Quality Assessment.","authors":"Jingchun Zhou,Chunjiang Liu,Qiuping Jiang,Xianping Fu,Junhui Hou,Xuelong Li","doi":"10.1109/tpami.2026.3654426","DOIUrl":"https://doi.org/10.1109/tpami.2026.3654426","url":null,"abstract":"Underwater image quality assessment (UIQA) is hindered by complex degradation and domain shifts across aquatic environments. Existing no-reference IQA methods rely on costly and subjective mean opinion scores (MOS), which limit their generalization to unseen domains. To overcome these challenges, we propose SCUIA, an unsupervised UIQA framework leveraging semantic contrastive learning for quality prediction without human annotations. Specifically, we introduce a vision-language contrastive learning strategy that aligns image features with textual embeddings in a unified semantic space, capturing implicit degradation-quality correlations. We further enhance quality discrimination with a hierarchical contrastive learning mechanism that combines image-specific statistical priors and semantic prompts. A triplet-based inter-group contrastive loss explicitly models relative quality relationships. To tackle cross-domain variations, we develop an unsupervised domain adaptation module that uses local statistical features to guide CLIP fine-tuning to disentangle domain-invariant quality representations from domain-specific noise. This enables zero-shot cross-domain quality prediction without labeled data. Extensive experiments on public UIQA benchmarks demonstrate significant improvements over existing methods, highlighting superior generalization and domain adaptability.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"29 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2026-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145971890","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Towards Enhanced Representation Learning for Single-Source Domain Generalization in LiDAR Semantic Segmentation
Pub Date: 2026-01-15, DOI: 10.1109/tpami.2026.3654352
Hyeonseong Kim, Yoonsu Kang, Changgyoon Oh, Kuk-Jin Yoon
With the success of 3D deep learning models, various perception technologies for autonomous driving have been developed in the LiDAR domain. While these models perform well in the source domain they were trained on, they struggle in unseen domains separated by a domain gap. In this paper, we propose a representation learning approach for domain generalization in LiDAR semantic segmentation, termed DGLSS++, which is designed to ensure robust performance in both the source domain and unseen domains despite training exclusively on the source domain. Our approach focuses on generalizing from a single source domain, addressing the domain shift caused by variations in LiDAR sensor configurations and scene distributions. To tackle both sparse-to-dense and dense-to-sparse generalization scenarios, we simulate unseen domains by generating sparsely and densely augmented domains. With the augmented domains, we introduce two constraints for generalizable representation learning: generalized masked sparsity-invariant feature consistency (GMSIFC) and localized semantic correlation consistency (LSCC). GMSIFC aligns the internal sparse features of the source domain with those of the augmented domain at different sparsity levels, introducing a novel masking strategy to exclude voxel features associated with multiple inconsistent classes. For LSCC, class prototypes from spatially local regions are constrained to maintain similar correlations across all local regions, regardless of the scene or domain. In addition, we establish standardized training and evaluation protocols using four real-world datasets and implement several baseline methods. Extensive experiments demonstrate that our approach outperforms both UDA and DG baselines. The code is available at https://github.com/gzgzys9887/DGLSS.
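A minimal sketch of the underlying idea of sparsity-augmented consistency training follows: subsample a scan to mimic an unseen, sparser sensor and pull the two views' features together. The dropout-style augmentation and the plain MSE consistency term are simplifying assumptions; the paper's GMSIFC operates on voxel features with class-based masking.

```python
# Hypothetical sketch of sparsity-augmented feature consistency for LiDAR scans.
import torch

def sparsify(points: torch.Tensor, keep_ratio: float = 0.5):
    """Randomly keep a subset of points to mimic a lower-resolution sensor.
    points: (N, C) tensor. Returns (sparse_points, kept_indices)."""
    n = points.shape[0]
    idx = torch.randperm(n)[: int(n * keep_ratio)]
    return points[idx], idx

def consistency_loss(feat_dense: torch.Tensor, feat_sparse: torch.Tensor, idx: torch.Tensor):
    """Align sparse-view features with the corresponding dense-view features."""
    return torch.nn.functional.mse_loss(feat_sparse, feat_dense[idx].detach())

# Usage, with `backbone` any point/voxel feature extractor returning per-point features:
#   sparse_pts, idx = sparsify(points)
#   loss = seg_loss + lambda_c * consistency_loss(backbone(points), backbone(sparse_pts), idx)
```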
{"title":"Towards Enhanced Representation Learning for Single-Source Domain Generalization in LiDAR Semantic Segmentation.","authors":"Hyeonseong Kim,Yoonsu Kang,Changgyoon Oh,Kuk-Jin Yoon","doi":"10.1109/tpami.2026.3654352","DOIUrl":"https://doi.org/10.1109/tpami.2026.3654352","url":null,"abstract":"With the success of the 3D deep learning models, various perception technologies for autonomous driving have been developed in the LiDAR domain. While these models perform well in the trained source domain, they struggle in unseen domains with a domain gap. In this paper, we propose a representation learning approach for domain generalization in LiDAR semantic segmentation, termed DGLSS++, which is designed to ensure robust performance in both the source domain and unseen domains despite training exclusively on the source domain. Our approach focuses on generalizing from a single source domain, addressing the domain shift caused by variations in LiDAR sensor configurations and scene distributions. To tackle both sparse-to-dense and dense-to-sparse generalization scenarios, we simulate unseen domains by generating sparsely and densely augmented domains. With the augmented domain, we introduce two constraints for generalizable representation learning: generalized masked sparsity invariant feature consistency (GMSIFC) and localized semantic correlation consistency (LSCC). GMSIFC aligns the internal sparse features of the source domain with those of the augmented domain at different sparsity, introducing a novel masking strategy to exclude voxel features associated with multiple inconsistent classes. For LSCC, class prototypes from spatially local regions are constrained to maintain similar correlations across all local regions, regardless of the scene or domain. In addition, we establish standardized training and evaluation protocols utilizing four real-world datasets and implement several baseline methods. Extensive experiments demonstrate our approach outperforms both UDA and DG baselines. The code is available at https://github.com/gzgzys9887/DGLSS.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"100 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2026-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145971891","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DyDiT++: Diffusion Transformers with Timestep and Spatial Dynamics for Efficient Visual Generation
Pub Date: 2026-01-15, DOI: 10.1109/tpami.2026.3654201
Wangbo Zhao, Yizeng Han, Jiasheng Tang, Kai Wang, Hao Luo, Yibing Song, Gao Huang, Fan Wang, Yang You
Diffusion Transformer (DiT), an emerging diffusion model for visual generation, has demonstrated superior performance but suffers from substantial computational costs. Our investigations reveal that these costs primarily stem from the static inference paradigm, which inevitably introduces redundant computation in certain diffusion timesteps and spatial regions. To overcome this inefficiency, we propose Dynamic Diffusion Transformer (DyDiT), an architecture that dynamically adjusts its computation along both timestep and spatial dimensions. Specifically, we introduce a Timestep-wise Dynamic Width (TDW) approach that adapts model width conditioned on the generation timestep. In addition, we design a Spatial-wise Dynamic Token (SDT) strategy to avoid redundant computation at unnecessary spatial locations. TDW and SDT can be seamlessly integrated into DiT and significantly accelerate the generation process. Building on these designs, we present an extended version, DyDiT++, with improvements in three key aspects. First, it extends the generation mechanism of DyDiT beyond diffusion to flow matching, demonstrating that our method can also accelerate flow-matching-based generation and enhancing its versatility. Furthermore, we enhance DyDiT to tackle more complex visual generation tasks, including video generation and text-to-image generation, thereby broadening its real-world applications. Finally, to address the high cost of full fine-tuning and democratize access to the technology, we investigate the feasibility of training DyDiT in a parameter-efficient manner and introduce timestep-based dynamic LoRA (TD-LoRA). Extensive experiments on diverse visual generation models, including DiT, SiT, Latte, and FLUX, demonstrate the effectiveness of DyDiT++. Remarkably, with fewer than 3% additional fine-tuning iterations, our approach reduces the FLOPs of DiT-XL by 51%, yielding a realistic 1.73× speedup on hardware, and achieves a competitive FID score of 2.07 on ImageNet. The code is available at https://github.com/alibaba-damo-academy/DyDiT.
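The timestep-wise width adaptation can be pictured as a small router, conditioned on the timestep embedding, that decides which attention heads to keep at each denoising step, as in the hypothetical sketch below; the module name, gating network, and hard top-k rule are illustrative assumptions, not the released DyDiT code.

```python
# Hypothetical sketch of timestep-conditioned width gating (TDW-style).
import torch
import torch.nn as nn

class TimestepWidthRouter(nn.Module):
    def __init__(self, t_dim: int, n_heads: int):
        super().__init__()
        # Small MLP mapping the timestep embedding to one score per attention head.
        self.gate = nn.Sequential(nn.Linear(t_dim, t_dim), nn.SiLU(), nn.Linear(t_dim, n_heads))

    def forward(self, t_emb: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
        """Return a (B, n_heads) binary mask keeping the top-scoring heads."""
        scores = self.gate(t_emb)                              # (B, n_heads)
        k = max(1, int(scores.shape[-1] * keep_ratio))
        topk = scores.topk(k, dim=-1).indices
        mask = torch.zeros_like(scores)
        return mask.scatter(-1, topk, 1.0)

# In an attention block, each head's output is multiplied by its mask entry, so
# pruned heads contribute nothing and can be skipped entirely at inference.
```

Because the mask depends only on the timestep, it can be precomputed once per step and reused across the whole batch, which is where the realized hardware speedup comes from.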
{"title":"DyDiT++: Diffusion Transformers with Timestep and Spatial Dynamics for Efficient Visual Generation.","authors":"Wangbo Zhao,Yizeng Han,Jiasheng Tang,Kai Wang,Hao Luo,Yibing Song,Gao Huang,Fan Wang,Yang You","doi":"10.1109/tpami.2026.3654201","DOIUrl":"https://doi.org/10.1109/tpami.2026.3654201","url":null,"abstract":"Diffusion Transformer (DiT), an emerging diffusion model for visual generation, has demonstrated superior perfor mance but suffers from substantial computational costs. Our investigations reveal that these costs primarily stem from the static inference paradigm, which inevitably introduces redundant computation in certain diffusion timesteps and spatial regions. To overcome this inefficiency, we propose Dynamic Diffusion Transformer (DyDiT), an architecture that dynamically adjusts its computation along both timestep and spatial dimensions. Specifically, we introduce a Timestep-wise Dynamic Width (TDW) approach that adapts model width conditioned on the generation timesteps. In addition, we design a Spatial-wise Dynamic Token (SDT) strategy to avoid redundant computation at unnecessary spatial locations. TDW and SDT can be seamlessly integrated into DiT and significantly accelerate the generation process. Building on these designs, we present an extended version, DyDiT++, with improvements in three key aspects. First, it extends the generation mechanism of DyDiT beyond diffusion to flow matching, demon strating that our method can also accelerate flow-matching based generation, enhancing its versatility. Furthermore, we enhance DyDiT to tackle more complex visual generation tasks, including video generation and text-to-image generation, thereby broadening its real-world applications. Finally, to address the high cost of full fine-tuning and democratize technology access, we investigate the feasibility of training DyDiT in a parameter efficient manner and introduce timestep-based dynamic LoRA (TD-LoRA). Extensive experiments on diverse visual generation models, including DiT, SiT, Latte, and FLUX, demonstrate the effectiveness of DyDiT++. Remarkably, with <3% additional f ine-tuning iterations, our approach reduces the FLOPs of DiT XL by 51%, yielding 1.73× realistic speedup on hardware, and achieves a competitive FID score of 2.07 on ImageNet. The code is available at https://github.com/alibaba-damo-academy/DyDiT.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"47 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2026-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145971970","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}