Semantic Contrast for Domain-Robust Underwater Image Quality Assessment
Pub Date : 2026-01-15 DOI: 10.1109/tpami.2026.3654426
Jingchun Zhou,Chunjiang Liu,Qiuping Jiang,Xianping Fu,Junhui Hou,Xuelong Li
Underwater image quality assessment (UIQA) is hindered by complex degradation and domain shifts across aquatic environments. Existing no-reference IQA methods rely on costly and subjective mean opinion scores (MOS), which limit their generalization to unseen domains. To overcome these challenges, we propose SCUIA, an unsupervised UIQA framework leveraging semantic contrastive learning for quality prediction without human annotations. Specifically, we introduce a vision-language contrastive learning strategy that aligns image features with textual embeddings in a unified semantic space, capturing implicit degradation-quality correlations. We further enhance quality discrimination with a hierarchical contrastive learning mechanism that combines image-specific statistical priors and semantic prompts. A triplet-based inter-group contrastive loss explicitly models relative quality relationships. To tackle cross-domain variations, we develop an unsupervised domain adaptation module that uses local statistical features to guide CLIP fine-tuning to disentangle domain-invariant quality representations from domain-specific noise. This enables zero-shot cross-domain quality prediction without labeled data. Extensive experiments on public UIQA benchmarks demonstrate significant improvements over existing methods, highlighting superior generalization and domain adaptability.
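The abstract describes two ingredients that lend themselves to a small illustration: CLIP-style alignment of image features with quality-related text prompts, and a triplet-based inter-group loss over relative quality. The sketch below is a minimal toy version of those two ideas, not the authors' SCUIA code: random tensors stand in for real CLIP embeddings, and the antonym-prompt scoring, embedding width, temperature, and margin are all assumptions made only to keep the example self-contained.

```python
# Toy sketch: prompt-based quality scoring + inter-group triplet loss.
# Not the SCUIA implementation; embeddings are random stand-ins for CLIP features.
import torch
import torch.nn.functional as F

def prompt_quality_score(img_emb, pos_txt_emb, neg_txt_emb, tau=0.07):
    """Softmax over cosine similarities to antonym prompts -> score in [0, 1]."""
    img = F.normalize(img_emb, dim=-1)
    pos = F.normalize(pos_txt_emb, dim=-1)
    neg = F.normalize(neg_txt_emb, dim=-1)
    logits = torch.stack([img @ pos, img @ neg], dim=-1) / tau   # (B, 2)
    return logits.softmax(dim=-1)[..., 0]                        # P("good quality")

def inter_group_triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet margin loss: anchors should lie closer to the higher-quality group."""
    d_pos = 1.0 - F.cosine_similarity(anchor, positive, dim=-1)
    d_neg = 1.0 - F.cosine_similarity(anchor, negative, dim=-1)
    return F.relu(d_pos - d_neg + margin).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    d = 512                                               # assumed embedding width
    img = torch.randn(4, d)                               # stand-in image embeddings
    good_txt, bad_txt = torch.randn(d), torch.randn(d)    # stand-in prompt embeddings
    print("quality scores:", prompt_quality_score(img, good_txt, bad_txt))
    print("triplet loss:", inter_group_triplet_loss(
        img, img + 0.1 * torch.randn(4, d), torch.randn(4, d)).item())
```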
{"title":"Semantic Contrast for Domain-Robust Underwater Image Quality Assessment.","authors":"Jingchun Zhou,Chunjiang Liu,Qiuping Jiang,Xianping Fu,Junhui Hou,Xuelong Li","doi":"10.1109/tpami.2026.3654426","DOIUrl":"https://doi.org/10.1109/tpami.2026.3654426","url":null,"abstract":"Underwater image quality assessment (UIQA) is hindered by complex degradation and domain shifts across aquatic environments. Existing no-reference IQA methods rely on costly and subjective mean opinion scores (MOS), which limit their generalization to unseen domains. To overcome these challenges, we propose SCUIA, an unsupervised UIQA framework leveraging semantic contrastive learning for quality prediction without human annotations. Specifically, we introduce a vision-language contrastive learning strategy that aligns image features with textual embeddings in a unified semantic space, capturing implicit degradation-quality correlations. We further enhance quality discrimination with a hierarchical contrastive learning mechanism that combines image-specific statistical priors and semantic prompts. A triplet-based inter-group contrastive loss explicitly models relative quality relationships. To tackle cross-domain variations, we develop an unsupervised domain adaptation module that uses local statistical features to guide CLIP fine-tuning to disentangle domain-invariant quality representations from domain-specific noise. This enables zero-shot cross-domain quality prediction without labeled data. Extensive experiments on public UIQA benchmarks demonstrate significant improvements over existing methods, highlighting superior generalization and domain adaptability.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"29 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2026-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145971890","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Towards Enhanced Representation Learning for Single-Source Domain Generalization in LiDAR Semantic Segmentation
Pub Date : 2026-01-15 DOI: 10.1109/tpami.2026.3654352
Hyeonseong Kim,Yoonsu Kang,Changgyoon Oh,Kuk-Jin Yoon
With the success of 3D deep learning models, various perception technologies for autonomous driving have been developed in the LiDAR domain. While these models perform well in the trained source domain, they struggle in unseen domains with a domain gap. In this paper, we propose a representation learning approach for domain generalization in LiDAR semantic segmentation, termed DGLSS++, which is designed to ensure robust performance in both the source domain and unseen domains despite training exclusively on the source domain. Our approach focuses on generalizing from a single source domain, addressing the domain shift caused by variations in LiDAR sensor configurations and scene distributions. To tackle both sparse-to-dense and dense-to-sparse generalization scenarios, we simulate unseen domains by generating sparsely and densely augmented domains. With these augmented domains, we introduce two constraints for generalizable representation learning: generalized masked sparsity invariant feature consistency (GMSIFC) and localized semantic correlation consistency (LSCC). GMSIFC aligns the internal sparse features of the source domain with those of the augmented domain at different sparsity levels, introducing a novel masking strategy to exclude voxel features associated with multiple inconsistent classes. For LSCC, class prototypes from spatially local regions are constrained to maintain similar correlations across all local regions, regardless of the scene or domain. In addition, we establish standardized training and evaluation protocols utilizing four real-world datasets and implement several baseline methods. Extensive experiments demonstrate that our approach outperforms both UDA and DG baselines. The code is available at https://github.com/gzgzys9887/DGLSS.
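To make the sparsity-invariant consistency idea concrete, here is a minimal toy sketch, not the released DGLSS++ code: a feature-consistency loss that skips voxels whose classes disagree between the source view and an augmented view, together with a trivial random-drop sparsification. The tensor shapes, the per-voxel dominant-label proxy, and the squared-L2 penalty are assumptions for illustration only.

```python
# Toy sketch: class-masked sparsity-invariant feature consistency on per-voxel tensors.
# Not the DGLSS++ implementation; all shapes and the L2 penalty are assumptions.
import torch

def masked_consistency_loss(src_feat, aug_feat, src_labels, aug_labels):
    """Align source/augmented voxel features, skipping voxels whose labels disagree.

    src_feat, aug_feat: (V, C) features of the same V voxels at two sparsity levels.
    src_labels, aug_labels: (V,) dominant semantic label inside each voxel.
    """
    consistent = (src_labels == aug_labels).float()        # 1 where classes agree
    diff = (src_feat - aug_feat).pow(2).sum(dim=-1)        # per-voxel squared L2 distance
    return (diff * consistent).sum() / consistent.sum().clamp(min=1.0)

def sparsify(points, keep_ratio=0.5):
    """Toy sparse augmentation: randomly drop points to mimic a lower-resolution sensor."""
    keep = torch.rand(points.shape[0]) < keep_ratio
    return points[keep]

if __name__ == "__main__":
    torch.manual_seed(0)
    V, C = 64, 16
    src = torch.randn(V, C)
    aug = src + 0.05 * torch.randn(V, C)                   # features after re-voxelizing
    labels_src = torch.randint(0, 5, (V,))
    labels_aug = labels_src.clone()
    labels_aug[:8] = (labels_aug[:8] + 1) % 5              # a few inconsistent voxels
    print("consistency loss:", masked_consistency_loss(src, aug, labels_src, labels_aug).item())
    print("kept points:", sparsify(torch.randn(1000, 3)).shape[0])
```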
{"title":"Towards Enhanced Representation Learning for Single-Source Domain Generalization in LiDAR Semantic Segmentation.","authors":"Hyeonseong Kim,Yoonsu Kang,Changgyoon Oh,Kuk-Jin Yoon","doi":"10.1109/tpami.2026.3654352","DOIUrl":"https://doi.org/10.1109/tpami.2026.3654352","url":null,"abstract":"With the success of the 3D deep learning models, various perception technologies for autonomous driving have been developed in the LiDAR domain. While these models perform well in the trained source domain, they struggle in unseen domains with a domain gap. In this paper, we propose a representation learning approach for domain generalization in LiDAR semantic segmentation, termed DGLSS++, which is designed to ensure robust performance in both the source domain and unseen domains despite training exclusively on the source domain. Our approach focuses on generalizing from a single source domain, addressing the domain shift caused by variations in LiDAR sensor configurations and scene distributions. To tackle both sparse-to-dense and dense-to-sparse generalization scenarios, we simulate unseen domains by generating sparsely and densely augmented domains. With the augmented domain, we introduce two constraints for generalizable representation learning: generalized masked sparsity invariant feature consistency (GMSIFC) and localized semantic correlation consistency (LSCC). GMSIFC aligns the internal sparse features of the source domain with those of the augmented domain at different sparsity, introducing a novel masking strategy to exclude voxel features associated with multiple inconsistent classes. For LSCC, class prototypes from spatially local regions are constrained to maintain similar correlations across all local regions, regardless of the scene or domain. In addition, we establish standardized training and evaluation protocols utilizing four real-world datasets and implement several baseline methods. Extensive experiments demonstrate our approach outperforms both UDA and DG baselines. The code is available at https://github.com/gzgzys9887/DGLSS.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"100 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2026-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145971891","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DyDiT++: Diffusion Transformers with Timestep and Spatial Dynamics for Efficient Visual Generation
Pub Date : 2026-01-15 DOI: 10.1109/tpami.2026.3654201
Wangbo Zhao,Yizeng Han,Jiasheng Tang,Kai Wang,Hao Luo,Yibing Song,Gao Huang,Fan Wang,Yang You
Diffusion Transformer (DiT), an emerging diffusion model for visual generation, has demonstrated superior performance but suffers from substantial computational costs. Our investigations reveal that these costs primarily stem from the static inference paradigm, which inevitably introduces redundant computation in certain diffusion timesteps and spatial regions. To overcome this inefficiency, we propose Dynamic Diffusion Transformer (DyDiT), an architecture that dynamically adjusts its computation along both timestep and spatial dimensions. Specifically, we introduce a Timestep-wise Dynamic Width (TDW) approach that adapts model width conditioned on the generation timesteps. In addition, we design a Spatial-wise Dynamic Token (SDT) strategy to avoid redundant computation at unnecessary spatial locations. TDW and SDT can be seamlessly integrated into DiT and significantly accelerate the generation process. Building on these designs, we present an extended version, DyDiT++, with improvements in three key aspects. First, it extends the generation mechanism of DyDiT beyond diffusion to flow matching, demonstrating that our method can also accelerate flow-matching-based generation, enhancing its versatility. Furthermore, we enhance DyDiT to tackle more complex visual generation tasks, including video generation and text-to-image generation, thereby broadening its real-world applications. Finally, to address the high cost of full fine-tuning and democratize technology access, we investigate the feasibility of training DyDiT in a parameter-efficient manner and introduce timestep-based dynamic LoRA (TD-LoRA). Extensive experiments on diverse visual generation models, including DiT, SiT, Latte, and FLUX, demonstrate the effectiveness of DyDiT++. Remarkably, with <3% additional fine-tuning iterations, our approach reduces the FLOPs of DiT-XL by 51%, yielding 1.73× realistic speedup on hardware, and achieves a competitive FID score of 2.07 on ImageNet. The code is available at https://github.com/alibaba-damo-academy/DyDiT.
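As a loose illustration of the timestep-conditioned width gating and spatial token routing described above (not the official DyDiT implementation), the toy block below gates MLP channels with a timestep embedding and routes only the top-scoring tokens through the MLP while the remaining tokens pass through unchanged. The router layers, the top-k keep ratio, and all dimensions are assumptions chosen purely for the example.

```python
# Toy sketch: a block with timestep-gated width (TDW-like) and token routing (SDT-like).
# Not the DyDiT code; routers, keep ratio, and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class DynamicBlock(nn.Module):
    def __init__(self, dim=64, hidden=256, keep_tokens=0.5):
        super().__init__()
        self.fc1, self.fc2 = nn.Linear(dim, hidden), nn.Linear(hidden, dim)
        self.width_router = nn.Linear(dim, hidden)   # timestep embedding -> channel gate
        self.token_router = nn.Linear(dim, 1)        # per-token keep score
        self.keep_tokens = keep_tokens

    def forward(self, x, t_emb):
        # x: (B, N, dim) tokens; t_emb: (B, dim) timestep embedding
        gate = torch.sigmoid(self.width_router(t_emb)).unsqueeze(1)     # (B, 1, hidden)
        score = self.token_router(x).squeeze(-1)                        # (B, N)
        k = max(1, int(self.keep_tokens * x.shape[1]))
        idx = score.topk(k, dim=1).indices                              # tokens to process
        picked = torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, x.shape[-1]))
        out = self.fc2(torch.relu(self.fc1(picked)) * gate)             # width-gated MLP
        # Add the MLP output back only to the processed tokens; others pass through.
        return x.scatter_add(1, idx.unsqueeze(-1).expand_as(out), out)

if __name__ == "__main__":
    blk = DynamicBlock()
    x, t = torch.randn(2, 16, 64), torch.randn(2, 64)
    print(blk(x, t).shape)   # torch.Size([2, 16, 64])
```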
{"title":"DyDiT++: Diffusion Transformers with Timestep and Spatial Dynamics for Efficient Visual Generation.","authors":"Wangbo Zhao,Yizeng Han,Jiasheng Tang,Kai Wang,Hao Luo,Yibing Song,Gao Huang,Fan Wang,Yang You","doi":"10.1109/tpami.2026.3654201","DOIUrl":"https://doi.org/10.1109/tpami.2026.3654201","url":null,"abstract":"Diffusion Transformer (DiT), an emerging diffusion model for visual generation, has demonstrated superior perfor mance but suffers from substantial computational costs. Our investigations reveal that these costs primarily stem from the static inference paradigm, which inevitably introduces redundant computation in certain diffusion timesteps and spatial regions. To overcome this inefficiency, we propose Dynamic Diffusion Transformer (DyDiT), an architecture that dynamically adjusts its computation along both timestep and spatial dimensions. Specifically, we introduce a Timestep-wise Dynamic Width (TDW) approach that adapts model width conditioned on the generation timesteps. In addition, we design a Spatial-wise Dynamic Token (SDT) strategy to avoid redundant computation at unnecessary spatial locations. TDW and SDT can be seamlessly integrated into DiT and significantly accelerate the generation process. Building on these designs, we present an extended version, DyDiT++, with improvements in three key aspects. First, it extends the generation mechanism of DyDiT beyond diffusion to flow matching, demon strating that our method can also accelerate flow-matching based generation, enhancing its versatility. Furthermore, we enhance DyDiT to tackle more complex visual generation tasks, including video generation and text-to-image generation, thereby broadening its real-world applications. Finally, to address the high cost of full fine-tuning and democratize technology access, we investigate the feasibility of training DyDiT in a parameter efficient manner and introduce timestep-based dynamic LoRA (TD-LoRA). Extensive experiments on diverse visual generation models, including DiT, SiT, Latte, and FLUX, demonstrate the effectiveness of DyDiT++. Remarkably, with <3% additional f ine-tuning iterations, our approach reduces the FLOPs of DiT XL by 51%, yielding 1.73× realistic speedup on hardware, and achieves a competitive FID score of 2.07 on ImageNet. The code is available at https://github.com/alibaba-damo-academy/DyDiT.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"47 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2026-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145971970","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Gift from the Integration of Discriminative and Diffusion-based Generative Learning: Boundary Refinement Remote Sensing Semantic Segmentation
Pub Date : 2026-01-14 DOI: 10.1109/tpami.2026.3654243
Hao Wang, Keyan Hu, Xin Guo, Haifeng Li, Chao Tao
{"title":"A Gift from the Integration of Discriminative and Diffusion-based Generative Learning: Boundary Refinement Remote Sensing Semantic Segmentation","authors":"Hao Wang, Keyan Hu, Xin Guo, Haifeng Li, Chao Tao","doi":"10.1109/tpami.2026.3654243","DOIUrl":"https://doi.org/10.1109/tpami.2026.3654243","url":null,"abstract":"","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"5 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2026-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145972414","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Consistency-Aware Spot-Guided Transformer for Accurate and Versatile Point Cloud Registration
Pub Date : 2026-01-14 DOI: 10.1109/tpami.2026.3653989
Renlang Huang,Li Chai,Yufan Tang,Zhoujian Li,Jiming Chen,Liang Li
Deep learning-based feature matching has demonstrated clear superiority for point cloud registration. While coarse-to-fine matching architectures are prevalent, they typically perform sparse and geometrically inconsistent coarse matching. This forces the subsequent fine matching to rely on computationally expensive optimal transport and hypothesis-and-selection procedures to resolve inconsistencies, leading to inefficiency and poor scalability for large-scale real-time applications. In this paper, we design a consistency-aware spot-guided Transformer (CAST) to enhance the coarse matching by explicitly utilizing geometric consistency via two key sparse attention mechanisms. First, our consistency-aware self-attention selectively computes intra-point-cloud attention to a sparse subset of points with globally consistent correspondences, enabling other points to derive discriminative features through their relationships with these anchors while propagating global consistency for robust correspondence reasoning. Second, our spot-guided cross-attention restricts cross-point-cloud attention to dynamically defined "spots": the union of the correspondence neighborhoods of a query's neighbors in the other point cloud, which local consistency suggests are most likely to cover the query's true correspondence. This eliminates interference from similar but irrelevant regions. Furthermore, we design a lightweight local attention-based fine matching module to precisely predict dense correspondences and estimate the transformation. Extensive experiments on both outdoor LiDAR datasets and indoor RGB-D camera datasets demonstrate that our method achieves state-of-the-art accuracy, efficiency, and robustness. Moreover, our method shows superior generalization on our newly constructed, challenging relocalization and loop-closing benchmarks in unseen domains. Our code and models are available at https://github.com/RenlangHuang/CASTv2.
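To illustrate the restricted-attention pattern behind spot-guided cross-attention (not the CASTv2 code), the sketch below lets each query point attend only to a precomputed set of "spot" key indices in the other point cloud. How the spots are built here (feature-space nearest neighbours) and all sizes are assumptions, chosen only to keep the example self-contained and runnable.

```python
# Toy sketch: cross-attention restricted to per-query "spot" indices.
# Not the CAST implementation; spot construction and sizes are assumptions.
import torch

def spot_guided_cross_attention(q, k, v, spot_idx):
    """q: (N, C) queries; k, v: (M, C) keys/values; spot_idx: (N, S) key indices per query."""
    k_spot = k[spot_idx]                                   # (N, S, C) keys inside each spot
    v_spot = v[spot_idx]                                   # (N, S, C)
    attn = torch.einsum('nc,nsc->ns', q, k_spot) / q.shape[-1] ** 0.5
    w = attn.softmax(dim=-1)                               # attention restricted to the spot
    return torch.einsum('ns,nsc->nc', w, v_spot)

if __name__ == "__main__":
    torch.manual_seed(0)
    N, M, C, S = 32, 48, 64, 8
    q, k, v = torch.randn(N, C), torch.randn(M, C), torch.randn(M, C)
    # Toy spots: for each query, the S keys with the smallest feature distance.
    spot_idx = torch.cdist(q, k).topk(S, largest=False).indices   # (N, S)
    out = spot_guided_cross_attention(q, k, v, spot_idx)
    print(out.shape)                                       # torch.Size([32, 64])
```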
{"title":"Consistency-Aware Spot-Guided Transformer for Accurate and Versatile Point Cloud Registration.","authors":"Renlang Huang,Li Chai,Yufan Tang,Zhoujian Li,Jiming Chen,Liang Li","doi":"10.1109/tpami.2026.3653989","DOIUrl":"https://doi.org/10.1109/tpami.2026.3653989","url":null,"abstract":"Deep learning-based feature matching has showcased great superiority for point cloud registration. While coarse-to-fine matching architectures are prevalent, they typically perform sparse and geometrically inconsistent coarse matching. This forces the subsequent fine matching to rely on computationally expensive optimal transport and hypothesis-and-selection procedures to resolve inconsistencies, leading to inefficiency and poor scalability for large-scale real-time applications. In this paper, we design a consistency-aware spot-guided Transformer (CAST) to enhance the coarse matching by explicitly utilizing geometric consistency via two key sparse attention mechanisms. First, our consistency-aware self-attention selectively computes intra-point-cloud attention to a sparse subset of points with globally consistent correspondences, enabling other points to derive discriminative features through their relationships with these anchors while propagating global consistency for robust correspondence reasoning. Second, our spot-guided cross-attention restricts cross-point-cloud attention to dynamically defined \"spots\"-the union of correspondence neighborhoods of a query's neighbors in the other point cloud, which are most likely to cover the true correspondence of the query ensured by local consistency, eliminating interference from similar but irrelevant regions. Furthermore, we design a lightweight local attention-based fine matching module to precisely predict dense correspondences and estimate the transformation. Extensive experiments on both outdoor LiDAR datasets and indoor RGB-D camera datasets demonstrate that our method achieves state-of-the-art accuracy, efficiency, and robustness. Besides, our method showcases superior generalization ability on our newly constructed challenging relocalization and loop closing benchmarks in unseen domains. Our code and models are available at https://github.com/RenlangHuang/CASTv2.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"50 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2026-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145971896","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}