
Latest Publications: IEEE Transactions on Pattern Analysis and Machine Intelligence

Single-Photon Imaging in Complex Scenarios via Physics-Informed Deep Neural Networks.
IF 23.6 | Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-15 | DOI: 10.1109/tpami.2026.3654264
Siao Cai, Zhicheng Yu, Shaobing Gao, Zeyu Chen, Yiguang Liu
Single-photon imaging uses single-photon-sensitive picosecond-resolution sensors to capture 3D structure and supports diverse applications, but success remains mostly limited to simple scenes. In complex scenarios, traditional methods degrade and deep learning methods lack flexibility and generalization. Here, we propose a physics-informed deep neural network (PIDNN) framework that effectively addresses both aspects, adapting to complex and variable sensing environments by embedding imaging physics into the deep neural network for unsupervised learning. Within this framework, by tailoring the number of U-Net skip connections, we impose multi-scale spatiotemporal priors that improve photon-utilization efficiency, laying the foundation for addressing the inherent low-signal-to-background ratio (SBR) problem in subsequent complex scenarios. Additionally, we introduce volume rendering into the PIDNN framework and design a dual-branch structure, further extending its applicability to multiple-depth and fog occlusion. We validated the performance of this method in various complex environments through numerical simulations and real-world experiments. The results of photon-efficient imaging with multiple returns show robust performance under low SBR and large fields of view. The method attains lower root mean-squared error than traditional methods and exhibits stronger generalization than supervised approaches. Further multiple depths and fog interference experiments confirm that its reconstruction quality surpasses existing techniques, demonstrating its flexibility and scalability. Both simulation and experimental results validate its exceptional reconstruction performance and flexibility.
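The abstract describes embedding the imaging physics into the network so it can be trained in an unsupervised way against the measured photon data rather than ground-truth depth. As a rough, hedged illustration of that general idea only (this is not the authors' PIDNN code), the sketch below renders a predicted depth/reflectivity map into an expected photon histogram through a simple Gaussian-pulse-plus-uniform-background forward model and scores it with a Poisson negative log-likelihood; every name here (`forward_model`, `physics_informed_loss`, `n_bins`, `sbr`) and the forward model itself are assumptions made for this example.

```python
# Minimal sketch, assuming a simplified forward model; NOT the authors' PIDNN code.
import torch

def forward_model(depth, reflectivity, n_bins=128, pulse_sigma=1.5, sbr=0.5):
    """Map predicted depth/reflectivity to an expected photon histogram.

    depth:        (B, H, W), values in [0, 1), scaled to histogram bins
    reflectivity: (B, H, W), non-negative signal strength
    Returns expected counts per time bin, shape (B, n_bins, H, W).
    """
    bins = torch.arange(n_bins, device=depth.device).view(1, n_bins, 1, 1)
    center = depth.unsqueeze(1) * n_bins                       # pulse arrival bin
    pulse = torch.exp(-0.5 * ((bins - center) / pulse_sigma) ** 2)
    pulse = pulse / pulse.sum(dim=1, keepdim=True).clamp_min(1e-8)
    signal = reflectivity.unsqueeze(1) * pulse
    background = signal.sum(dim=1, keepdim=True) / (sbr * n_bins)  # uniform noise floor
    return signal + background

def physics_informed_loss(pred_depth, pred_reflectivity, measured_hist):
    """Poisson negative log-likelihood between rendered and measured histograms."""
    lam = forward_model(pred_depth, pred_reflectivity,
                        n_bins=measured_hist.shape[1]).clamp_min(1e-8)
    return (lam - measured_hist * torch.log(lam)).mean()

# Toy usage: random "measurement" and differentiable depth/reflectivity maps.
B, H, W, T = 2, 16, 16, 128
measured = torch.poisson(torch.rand(B, T, H, W) * 0.1)
depth = torch.rand(B, H, W, requires_grad=True)
refl = torch.rand(B, H, W, requires_grad=True)
loss = physics_informed_loss(depth, refl, measured)
loss.backward()
print(float(loss))
```

In such a setup the loss depends only on the raw measurement, which is what makes the training unsupervised; the network producing `depth` and `refl` (a U-Net in the paper) is omitted here.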
Citations: 0
Improving Subgraph Extraction for Graph Invariant Learning via Graph Sinkhorn Attention
IF 23.6 | Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-15 | DOI: 10.1109/tpami.2026.3654544
Junchi Yan, Fangyu Ding, Jiawei Sun, Zhaoping Hu, Yunyi Zhou, Lei Zhu
{"title":"Improving Subgraph Extraction for Graph Invariant Learning via Graph Sinkhorn Attention","authors":"Junchi Yan, Fangyu Ding, Jiawei Sun, Zhaoping Hu, Yunyi Zhou, Lei Zhu","doi":"10.1109/tpami.2026.3654544","DOIUrl":"https://doi.org/10.1109/tpami.2026.3654544","url":null,"abstract":"","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"68 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2026-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145972436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Interpretable Subspace Clustering
IF 23.6 | Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-15 | DOI: 10.1109/tpami.2026.3653776
Zheng Zhang, Peng Zhou, Aiting Yao, Liang Du, Xinwang Liu
{"title":"Interpretable Subspace Clustering","authors":"Zheng Zhang, Peng Zhou, Aiting Yao, Liang Du, Xinwang Liu","doi":"10.1109/tpami.2026.3653776","DOIUrl":"https://doi.org/10.1109/tpami.2026.3653776","url":null,"abstract":"","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"39 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2026-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145972438","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Semantic Contrast for Domain-Robust Underwater Image Quality Assessment.
IF 23.6 | Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-15 | DOI: 10.1109/tpami.2026.3654426
Jingchun Zhou, Chunjiang Liu, Qiuping Jiang, Xianping Fu, Junhui Hou, Xuelong Li
Underwater image quality assessment (UIQA) is hindered by complex degradation and domain shifts across aquatic environments. Existing no-reference IQA methods rely on costly and subjective mean opinion scores (MOS), which limit their generalization to unseen domains. To overcome these challenges, we propose SCUIA, an unsupervised UIQA framework leveraging semantic contrastive learning for quality prediction without human annotations. Specifically, we introduce a vision-language contrastive learning strategy that aligns image features with textual embeddings in a unified semantic space, capturing implicit degradation-quality correlations. We further enhance quality discrimination with a hierarchical contrastive learning mechanism that combines image-specific statistical priors and semantic prompts. A triplet-based inter-group contrastive loss explicitly models relative quality relationships. To tackle cross-domain variations, we develop an unsupervised domain adaptation module that uses local statistical features to guide CLIP fine-tuning to disentangle domain-invariant quality representations from domain-specific noise. This enables zero-shot cross-domain quality prediction without labeled data. Extensive experiments on public UIQA benchmarks demonstrate significant improvements over existing methods, highlighting superior generalization and domain adaptability.
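For intuition only, the sketch below shows one way a prompt-based, triplet-style quality comparison could look. It borrows the vision-language alignment and inter-group triplet ideas from the abstract but is not the SCUIA implementation; the random embeddings stand in for what a CLIP-style encoder would produce, and the prompt pair, the 100x logit scale, and the margin are all assumptions.

```python
# Minimal sketch, assuming CLIP-style embeddings; NOT the SCUIA implementation.
import torch
import torch.nn.functional as F

dim = 512
# Stand-ins for text embeddings of a hypothetical quality prompt pair,
# e.g. "a good underwater photo" vs. "a badly degraded underwater photo".
prompt_embs = F.normalize(torch.randn(2, dim), dim=-1)

def quality_score(image_emb):
    """Softmax similarity against the two prompts; returns P('good') per image."""
    sims = F.cosine_similarity(image_emb.unsqueeze(1), prompt_embs.unsqueeze(0), dim=-1)
    return torch.softmax(100.0 * sims, dim=-1)[:, 0]

def inter_group_triplet_loss(anchor, positive, negative, margin=0.1):
    """Triplet loss on predicted quality: the anchor's score should stay closer to
    its same-quality group (positive) than to a different-quality group (negative)."""
    s_a, s_p, s_n = quality_score(anchor), quality_score(positive), quality_score(negative)
    return F.relu((s_a - s_p).abs() - (s_a - s_n).abs() + margin).mean()

# Toy usage with random image embeddings standing in for encoder outputs.
anchor = F.normalize(torch.randn(4, dim), dim=-1)
positive = F.normalize(torch.randn(4, dim), dim=-1)
negative = F.normalize(torch.randn(4, dim), dim=-1)
print(float(inter_group_triplet_loss(anchor, positive, negative)))
```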
Citations: 0
Towards Enhanced Representation Learning for Single-Source Domain Generalization in LiDAR Semantic Segmentation.
IF 23.6 | Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-15 | DOI: 10.1109/tpami.2026.3654352
Hyeonseong Kim, Yoonsu Kang, Changgyoon Oh, Kuk-Jin Yoon
With the success of the 3D deep learning models, various perception technologies for autonomous driving have been developed in the LiDAR domain. While these models perform well in the trained source domain, they struggle in unseen domains with a domain gap. In this paper, we propose a representation learning approach for domain generalization in LiDAR semantic segmentation, termed DGLSS++, which is designed to ensure robust performance in both the source domain and unseen domains despite training exclusively on the source domain. Our approach focuses on generalizing from a single source domain, addressing the domain shift caused by variations in LiDAR sensor configurations and scene distributions. To tackle both sparse-to-dense and dense-to-sparse generalization scenarios, we simulate unseen domains by generating sparsely and densely augmented domains. With the augmented domain, we introduce two constraints for generalizable representation learning: generalized masked sparsity invariant feature consistency (GMSIFC) and localized semantic correlation consistency (LSCC). GMSIFC aligns the internal sparse features of the source domain with those of the augmented domain at different sparsity, introducing a novel masking strategy to exclude voxel features associated with multiple inconsistent classes. For LSCC, class prototypes from spatially local regions are constrained to maintain similar correlations across all local regions, regardless of the scene or domain. In addition, we establish standardized training and evaluation protocols utilizing four real-world datasets and implement several baseline methods. Extensive experiments demonstrate our approach outperforms both UDA and DG baselines. The code is available at https://github.com/gzgzys9887/DGLSS.
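As a loose illustration of the sparsity-augmented-domain idea in the abstract (not the DGLSS++ code; the beam-dropping rule and the mean-pooling "encoder" below are placeholders), the sketch drops every other LiDAR beam to simulate a lower-resolution sensor and penalizes the feature gap between the original scan and its sparsified copy.

```python
# Minimal sketch, assuming a toy beam-dropping augmentation; NOT the DGLSS++ code.
import torch

def sparsify_by_beam(points, beam_ids, keep_every=2):
    """Keep every k-th beam to simulate a lower-resolution LiDAR sensor.

    points:   (N, 3) xyz coordinates
    beam_ids: (N,) integer ring/beam index per point
    """
    mask = (beam_ids % keep_every) == 0
    return points[mask]

def consistency_loss(feat_full, feat_sparse):
    """L2 consistency between features of the full scan and its sparsified copy
    (both reduced to the same shape by the encoder, a stand-in here)."""
    return torch.mean((feat_full - feat_sparse) ** 2)

# Toy usage: random scan, random 64-beam assignment, mean-pooled "features".
pts = torch.randn(10000, 3)
beams = torch.randint(0, 64, (10000,))
sparse_pts = sparsify_by_beam(pts, beams, keep_every=2)
encoder = lambda p: p.mean(dim=0)          # stand-in for a sparse 3D encoder
print(float(consistency_loss(encoder(pts), encoder(sparse_pts))))
```

The paper's GMSIFC additionally masks out voxels whose class predictions disagree between the two views; that masking step is omitted from this toy version.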
Citations: 0
BlindU: Blind Machine Unlearning without Revealing Erasing Data
IF 23.6 | Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-15 | DOI: 10.1109/tpami.2026.3654093
Weiqi Wang, Zhiyi Tian, Chenhan Zhang, Shui Yu
{"title":"BlindU: Blind Machine Unlearning without Revealing Erasing Data","authors":"Weiqi Wang, Zhiyi Tian, Chenhan Zhang, Shui Yu","doi":"10.1109/tpami.2026.3654093","DOIUrl":"https://doi.org/10.1109/tpami.2026.3654093","url":null,"abstract":"","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"38 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2026-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145972435","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Practical Continual Forgetting for Pre-trained Vision Models
IF 23.6 | Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-15 | DOI: 10.1109/tpami.2026.3654115
Hongbo Zhao, Fei Zhu, Bolin Ni, Feng Zhu, Gaofeng Meng, Zhaoxiang Zhang
{"title":"Practical Continual Forgetting for Pre-trained Vision Models","authors":"Hongbo Zhao, Fei Zhu, Bolin Ni, Feng Zhu, Gaofeng Meng, Zhaoxiang Zhang","doi":"10.1109/tpami.2026.3654115","DOIUrl":"https://doi.org/10.1109/tpami.2026.3654115","url":null,"abstract":"","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"26 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2026-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145972439","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
DyDiT++: Diffusion Transformers with Timestep and Spatial Dynamics for Efficient Visual Generation.
IF 23.6 | Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-15 | DOI: 10.1109/tpami.2026.3654201
Wangbo Zhao, Yizeng Han, Jiasheng Tang, Kai Wang, Hao Luo, Yibing Song, Gao Huang, Fan Wang, Yang You
Diffusion Transformer (DiT), an emerging diffusion model for visual generation, has demonstrated superior performance but suffers from substantial computational costs. Our investigations reveal that these costs primarily stem from the static inference paradigm, which inevitably introduces redundant computation in certain diffusion timesteps and spatial regions. To overcome this inefficiency, we propose Dynamic Diffusion Transformer (DyDiT), an architecture that dynamically adjusts its computation along both timestep and spatial dimensions. Specifically, we introduce a Timestep-wise Dynamic Width (TDW) approach that adapts model width conditioned on the generation timesteps. In addition, we design a Spatial-wise Dynamic Token (SDT) strategy to avoid redundant computation at unnecessary spatial locations. TDW and SDT can be seamlessly integrated into DiT and significantly accelerate the generation process. Building on these designs, we present an extended version, DyDiT++, with improvements in three key aspects. First, it extends the generation mechanism of DyDiT beyond diffusion to flow matching, demonstrating that our method can also accelerate flow-matching based generation, enhancing its versatility. Furthermore, we enhance DyDiT to tackle more complex visual generation tasks, including video generation and text-to-image generation, thereby broadening its real-world applications. Finally, to address the high cost of full fine-tuning and democratize technology access, we investigate the feasibility of training DyDiT in a parameter-efficient manner and introduce timestep-based dynamic LoRA (TD-LoRA). Extensive experiments on diverse visual generation models, including DiT, SiT, Latte, and FLUX, demonstrate the effectiveness of DyDiT++. Remarkably, with <3% additional fine-tuning iterations, our approach reduces the FLOPs of DiT-XL by 51%, yielding 1.73× realistic speedup on hardware, and achieves a competitive FID score of 2.07 on ImageNet. The code is available at https://github.com/alibaba-damo-academy/DyDiT.
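To make the two mechanisms named in the abstract concrete, the toy block below masks hidden channels based on the diffusion timestep (a stand-in for timestep-wise dynamic width) and gates whole tokens so skipped positions pass through unchanged (a stand-in for spatial-wise dynamic tokens). It is a hedged sketch, not the DyDiT architecture; the gating rules, shapes, and hard 0/1 thresholds are assumptions, and a trainable version would need a differentiable relaxation of the masks.

```python
# Minimal sketch of timestep- and token-level gating; NOT the DyDiT code.
import torch
import torch.nn as nn

class ToyDynamicBlock(nn.Module):
    def __init__(self, dim=64, hidden=256):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.fc2 = nn.Linear(hidden, dim)
        self.width_gate = nn.Linear(1, hidden)     # timestep -> per-channel gate
        self.token_gate = nn.Linear(dim, 1)        # token -> keep score

    def forward(self, x, t):
        # x: (B, N, dim) tokens; t: (B,) normalized diffusion timestep in [0, 1].
        # Hard 0/1 masks here for readability; training would need a soft relaxation.
        ch_mask = (torch.sigmoid(self.width_gate(t.view(-1, 1, 1))) > 0.5).float()
        keep = (torch.sigmoid(self.token_gate(x)) > 0.5).float()      # (B, N, 1)
        h = torch.relu(self.fc1(x)) * ch_mask                         # dynamic width
        out = self.fc2(h)
        return x + keep * out                      # skipped tokens pass through

block = ToyDynamicBlock()
x, t = torch.randn(2, 16, 64), torch.rand(2)
print(block(x, t).shape)   # torch.Size([2, 16, 64])
```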
Citations: 0
Self-Supervised AI-Generated Image Detection: A Camera Metadata Perspective
IF 23.6 | Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-14 | DOI: 10.1109/tpami.2026.3654274
Nan Zhong, Mian Zou, Yiran Xu, Zhenxing Qian, Xinpeng Zhang, Baoyuan Wu, Kede Ma
{"title":"Self-Supervised AI-Generated Image Detection: A Camera Metadata Perspective","authors":"Nan Zhong, Mian Zou, Yiran Xu, Zhenxing Qian, Xinpeng Zhang, Baoyuan Wu, Kede Ma","doi":"10.1109/tpami.2026.3654274","DOIUrl":"https://doi.org/10.1109/tpami.2026.3654274","url":null,"abstract":"","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"34 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2026-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145971972","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Reinforced Refinement with Self-Aware Expansion for End-to-End Autonomous Driving
IF 23.6 | Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-14 | DOI: 10.1109/tpami.2026.3653866
Haochen Liu, Tianyu Li, Haohan Yang, Li Chen, Caojun Wang, Ke Guo, Haochen Tian, Hongchen Li, Hongyang Li, Chen Lv
{"title":"Reinforced Refinement with Self-Aware Expansion for End-to-End Autonomous Driving","authors":"Haochen Liu, Tianyu Li, Haohan Yang, Li Chen, Caojun Wang, Ke Guo, Haochen Tian, Hongchen Li, Hongyang Li, Chen Lv","doi":"10.1109/tpami.2026.3653866","DOIUrl":"https://doi.org/10.1109/tpami.2026.3653866","url":null,"abstract":"","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"30 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2026-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145972413","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0