
Latest publications in IEEE Transactions on Pattern Analysis and Machine Intelligence

DyDiT++: Diffusion Transformers with Timestep and Spatial Dynamics for Efficient Visual Generation.
IF 23.6 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-15 | DOI: 10.1109/tpami.2026.3654201
Wangbo Zhao,Yizeng Han,Jiasheng Tang,Kai Wang,Hao Luo,Yibing Song,Gao Huang,Fan Wang,Yang You
Diffusion Transformer (DiT), an emerging diffusion model for visual generation, has demonstrated superior performance but suffers from substantial computational costs. Our investigations reveal that these costs primarily stem from the static inference paradigm, which inevitably introduces redundant computation in certain diffusion timesteps and spatial regions. To overcome this inefficiency, we propose Dynamic Diffusion Transformer (DyDiT), an architecture that dynamically adjusts its computation along both the timestep and spatial dimensions. Specifically, we introduce a Timestep-wise Dynamic Width (TDW) approach that adapts model width conditioned on the generation timestep. In addition, we design a Spatial-wise Dynamic Token (SDT) strategy to avoid redundant computation at unnecessary spatial locations. TDW and SDT can be seamlessly integrated into DiT and significantly accelerate the generation process. Building on these designs, we present an extended version, DyDiT++, with improvements in three key aspects. First, it extends the generation mechanism of DyDiT beyond diffusion to flow matching, demonstrating that our method can also accelerate flow-matching-based generation, which enhances its versatility. Furthermore, we enhance DyDiT to tackle more complex visual generation tasks, including video generation and text-to-image generation, thereby broadening its real-world applications. Finally, to address the high cost of full fine-tuning and democratize access to the technology, we investigate the feasibility of training DyDiT in a parameter-efficient manner and introduce timestep-based dynamic LoRA (TD-LoRA). Extensive experiments on diverse visual generation models, including DiT, SiT, Latte, and FLUX, demonstrate the effectiveness of DyDiT++. Remarkably, with <3% additional fine-tuning iterations, our approach reduces the FLOPs of DiT-XL by 51%, yields a realistic 1.73× speedup on hardware, and achieves a competitive FID score of 2.07 on ImageNet.
The code is available at https://github.com/alibaba-damo-academy/DyDiT.
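The abstract's Timestep-wise Dynamic Width idea can be sketched in a few lines: a small router maps the diffusion timestep to per-head keep/skip gates, and gated-off attention heads are skipped entirely, saving their FLOPs at that timestep. The shapes, the toy timestep embedding, and the hard thresholding below are illustrative assumptions, not the paper's actual design.

```python
import numpy as np

rng = np.random.default_rng(0)

def tdw_gates(t, num_heads, w_r, b_r):
    """Binary keep-mask over attention heads for diffusion timestep t."""
    t_emb = np.array([np.sin(t / 1000.0), np.cos(t / 1000.0)])  # toy timestep embedding
    # Hard gates for illustration; in practice such routers are trained softly.
    return (t_emb @ w_r + b_r > 0.0).astype(float)

def dynamic_width_attention(x, heads, gates):
    """Multi-head self-attention that skips all computation for gated-off heads."""
    n, d = x.shape
    dh = d // len(heads)
    out = np.zeros_like(x)
    for i, (wq, wk, wv) in enumerate(heads):
        if gates[i] == 0.0:
            continue  # this head costs zero FLOPs at this timestep
        q, k, v = x @ wq, x @ wk, x @ wv
        a = q @ k.T / np.sqrt(dh)
        a = np.exp(a - a.max(axis=-1, keepdims=True))
        a /= a.sum(axis=-1, keepdims=True)
        out[:, i * dh:(i + 1) * dh] = a @ v
    return out

num_heads, d, n = 4, 16, 8
x = rng.standard_normal((n, d))
heads = [tuple(rng.standard_normal((d, d // num_heads)) for _ in range(3))
         for _ in range(num_heads)]
w_r = rng.standard_normal((2, num_heads))
b_r = rng.standard_normal(num_heads)
gates = tdw_gates(250, num_heads, w_r, b_r)
y = dynamic_width_attention(x, heads, gates)
```

Running the same router across all timesteps would recover the width schedule the abstract describes: each timestep gets its own subset of active heads.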
Citations: 0
Self-Supervised AI-Generated Image Detection: A Camera Metadata Perspective
IF 23.6 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-14 | DOI: 10.1109/tpami.2026.3654274
Nan Zhong, Mian Zou, Yiran Xu, Zhenxing Qian, Xinpeng Zhang, Baoyuan Wu, Kede Ma
Citations: 0
Reinforced Refinement with Self-Aware Expansion for End-to-End Autonomous Driving
IF 23.6 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-14 | DOI: 10.1109/tpami.2026.3653866
Haochen Liu, Tianyu Li, Haohan Yang, Li Chen, Caojun Wang, Ke Guo, Haochen Tian, Hongchen Li, Hongyang Li, Chen Lv
Citations: 0
A Gift from the Integration of Discriminative and Diffusion-based Generative Learning: Boundary Refinement Remote Sensing Semantic Segmentation
IF 23.6 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-14 | DOI: 10.1109/tpami.2026.3654243
Hao Wang, Keyan Hu, Xin Guo, Haifeng Li, Chao Tao
Citations: 0
Consistency-Aware Spot-Guided Transformer for Accurate and Versatile Point Cloud Registration.
IF 23.6 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-14 | DOI: 10.1109/tpami.2026.3653989
Renlang Huang,Li Chai,Yufan Tang,Zhoujian Li,Jiming Chen,Liang Li
Deep learning-based feature matching has showcased great superiority for point cloud registration. While coarse-to-fine matching architectures are prevalent, they typically perform sparse and geometrically inconsistent coarse matching. This forces the subsequent fine matching to rely on computationally expensive optimal transport and hypothesis-and-selection procedures to resolve inconsistencies, leading to inefficiency and poor scalability for large-scale real-time applications. In this paper, we design a consistency-aware spot-guided Transformer (CAST) to enhance the coarse matching by explicitly exploiting geometric consistency via two key sparse attention mechanisms. First, our consistency-aware self-attention selectively computes intra-point-cloud attention to a sparse subset of points with globally consistent correspondences, enabling other points to derive discriminative features through their relationships with these anchors while propagating global consistency for robust correspondence reasoning. Second, our spot-guided cross-attention restricts cross-point-cloud attention to dynamically defined "spots": the union of the correspondence neighborhoods of a query's neighbors in the other point cloud, which, as ensured by local consistency, are most likely to cover the query's true correspondence, eliminating interference from similar but irrelevant regions. Furthermore, we design a lightweight local attention-based fine matching module to precisely predict dense correspondences and estimate the transformation. Extensive experiments on both outdoor LiDAR datasets and indoor RGB-D camera datasets demonstrate that our method achieves state-of-the-art accuracy, efficiency, and robustness. Besides, our method showcases superior generalization ability on our newly constructed, challenging relocalization and loop-closing benchmarks in unseen domains. Our code and models are available at https://github.com/RenlangHuang/CASTv2.
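The "spot" construction described in the abstract can be illustrated as a sparse attention mask: each query in cloud A may attend only to the union of the coarse-correspondence neighborhoods of its nearest neighbors in A. The brute-force kNN and the `k_nbr`/`k_spot` values below are assumptions for a small toy example, not CAST's implementation.

```python
import numpy as np

def knn(points, k):
    """Indices of the k nearest neighbours of each point (self excluded)."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    return np.argsort(d2, axis=1)[:, 1:k + 1]

def spot_mask(pts_a, pts_b, coarse_corr, k_nbr=3, k_spot=2):
    """Boolean (|A|, |B|) mask, True where cross-attention is allowed."""
    mask = np.zeros((len(pts_a), len(pts_b)), dtype=bool)
    nbr_a = knn(pts_a, k_nbr)    # neighbours within cloud A
    nbr_b = knn(pts_b, k_spot)   # local neighbourhoods within cloud B
    for i in range(len(pts_a)):
        for j in nbr_a[i]:              # each neighbour of query i...
            c = coarse_corr[j]          # ...contributes its coarse match in B
            mask[i, c] = True
            mask[i, nbr_b[c]] = True    # and that match's neighbourhood
    return mask

rng = np.random.default_rng(0)
pts_a = rng.standard_normal((32, 3))
pts_b = pts_a + 0.01 * rng.standard_normal((32, 3))   # near-identical clouds
corr = np.arange(32)                                  # toy coarse matches: identity
mask = spot_mask(pts_a, pts_b, corr)
```

Each query can attend to at most `k_nbr * (1 + k_spot)` points of the other cloud, which is the source of the sparsity (and efficiency) the abstract claims.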
Citations: 0
SLeak: Multi-Target Privacy Stealing Attack against Split Learning.
IF 23.6 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-14 | DOI: 10.1109/tpami.2026.3654092
Xiaoyang Xu,Wenzhe Yi,Juan Wang,Hongxin Hu,Mengda Yang,Ziang Li,Yong Zhuang,Yaxin Liu,Mang Ye
Split Learning (SL) is a distributed learning framework that has gained popularity for its privacy-preserving nature and low computational demands. However, recent studies have shown that a server adversary can carry out inference attacks, compromising the privacy of victim clients. Nevertheless, upon re-evaluating prior studies, we found that existing methods rely on overly strong assumptions to enhance their performance, resulting in a significant decline in effectiveness under more realistic scenarios. In this work, we provide new insights into the inherent vulnerabilities of SL. Specifically, we discover that both the smashed data and the server model contain the client's representation preference, which the server adversary can exploit to build a substitute client that approximates the target client's unique feature extraction behavior. With a well-trained substitute client, the server can perfectly steal the target client's functionality, training data, and labels. Building on this observation, we introduce Split Leakage (SLeak), a new threat that pursues multiple privacy stealing objectives against SL. Notably, SLeak does not depend on strong privacy priors and only requires partial same-domain auxiliary public data to conduct the attacks. Experimental results on diverse datasets and target models show that SLeak surpasses the state-of-the-art method across multiple metrics. Moreover, ablation studies further confirm its robustness and applicability under various scenarios and assumptions.
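The substitute-client idea behind the attack can be shown with an all-linear toy: the adversary knows the frozen server head and same-domain auxiliary data, and fits a substitute client so the end-to-end mapping reproduces the observed labels. The dimensions, the linear models, and the least-squares fit are illustrative assumptions; the actual attack trains neural networks.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_cut, d_out, n = 8, 4, 3, 200

W = rng.standard_normal((d_in, d_cut))   # victim client (unknown to the adversary)
V = rng.standard_normal((d_cut, d_out))  # server head (known to the adversary)

x_aux = rng.standard_normal((n, d_in))   # same-domain auxiliary public data
y_aux = x_aux @ W @ V                    # end-to-end outputs the server observes

# Step 1: fit the end-to-end map M ≈ W @ V by least squares.
M, *_ = np.linalg.lstsq(x_aux, y_aux, rcond=None)
# Step 2: back a substitute client out of M through the known head
# (pinv(V) @ V = I here because V has full column rank).
G = M @ np.linalg.pinv(V)

# The substitute replicates the victim's end-to-end functionality.
x_test = rng.standard_normal((20, d_in))
err = np.abs(x_test @ G @ V - x_test @ W @ V).max()
```

In this linear setting the composed map `G @ V` matches `W @ V` up to numerical error, which is the "functionality stealing" outcome in miniature; the cut-layer representation itself is recovered only up to the ambiguity left by the head.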
Citations: 0
VRP-UDF: Towards Unbiased Learning of Unsigned Distance Functions from Multi-view Images with Volume Rendering Priors.
IF 23.6 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-14 | DOI: 10.1109/tpami.2026.3653901
Wenyuan Zhang,Chunsheng Wang,Kanle Shi,Yu-Shen Liu,Zhizhong Han
Unsigned distance functions (UDFs) have been a vital representation for open surfaces. With different differentiable renderers, current methods can train neural networks to infer a UDF by minimizing rendering errors against the multi-view ground truth. However, these differentiable renderers are mainly handcrafted, which makes them biased on ray-surface intersections, sensitive to unsigned distance outliers, or not scalable to large scenes. To resolve these issues, we present a novel differentiable renderer to infer UDFs more accurately. Instead of using handcrafted equations, our differentiable renderer is a neural network pre-trained in a data-driven manner. It learns how to render unsigned distances into depth images, yielding prior knowledge that we dub volume rendering priors. To infer a UDF for an unseen scene from multiple RGB images, we generalize the learned volume rendering priors to map inferred unsigned distances to alpha-blending weights for RGB image rendering. To reduce sampling bias in UDF inference, we utilize an auxiliary point sampling prior as an indicator of ray-surface intersection, and propose novel schemes for more accurate and uniform sampling near the zero-level set. We also propose a new strategy that leverages our pretrained volume rendering prior as a general surface refiner, which can be integrated with various Gaussian reconstruction methods to optimize the Gaussian distributions and refine geometric details. Our results show that the learned volume rendering prior is unbiased, robust, scalable, 3D aware, and, more importantly, easy to learn. Further experiments show that the volume rendering prior is also a general strategy to enhance other neural implicit representations such as signed distance functions and occupancy. We evaluate our method on both widely used benchmarks and real scenes, and report superior performance over the state-of-the-art methods.
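The rendering pipeline the abstract describes can be sketched numerically for one ray: a mapping (learned in the paper, hand-written here) turns UDF samples into alpha values, which are alpha-composited into a depth estimate. The exponential mapping and every constant below are illustrative assumptions, not the learned prior.

```python
import numpy as np

def render_depth(ts, udf_vals, beta=0.02):
    """Alpha-composite sample depths ts using alphas derived from UDF values."""
    alpha = np.exp(-udf_vals / beta)    # alpha -> 1 as udf -> 0 (near the surface)
    # Transmittance: probability the ray survives all earlier samples.
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alpha[:-1])))
    w = trans * alpha                   # per-sample blending weights
    return float((w * ts).sum() / max(w.sum(), 1e-8))

ts = np.linspace(0.0, 2.0, 256)         # sample depths along the ray
udf = np.abs(ts - 1.3)                  # toy scene: one surface at depth 1.3
depth = render_depth(ts, udf)           # lands near, slightly before, 1.3
```

The small systematic offset before the true surface in this hand-written mapping is exactly the kind of bias the paper's learned, data-driven renderer is meant to remove.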
Citations: 0
Deep Orientational Representation Learning for Ordinal Regression.
IF 23.6 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-14 | DOI: 10.1109/tpami.2026.3654260
Gengyun Jia,Xin Ma,Bing-Kun Bao
Ordinal regression aims to predict ordered classes. Existing methods mainly focus on label distribution shapes and feature distance relationships, while the directional characteristics of the representation space remain underexplored. In this paper, we propose deep orientational representation learning (ORL), which aims to ensure that the trajectory of features, sequentially connected by ordinal categories, approximates a geodesic. We treat the output-layer weights as ordinal prototypes and introduce two constraints, the co-directional constraint and the counter-directional constraint, which operate by constraining the angles between pairs of vectors. The former minimizes the angle between vectors with matching start and end categories, while the latter maximizes the angle between vectors whose start categories are the same but whose end categories lie on opposite sides. The two constraints optimize the representation from different ordinal directions. ORL is extended to a multi-prototype setting (MORL) to mitigate misalignment between features and oriented prototypes caused by large intra-class variations. Theoretical analysis links ORL to distribution unimodality and distance orderliness, highlighting its advantages. The effectiveness of ORL (MORL) is demonstrated on various tasks including facial age estimation, historical image dating, and aesthetic quality assessment.
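One possible reading of the two angular constraints can be sketched on ordinal prototypes w_1..w_K (output-layer weight rows), with v(a, b) = w_b - w_a the direction from rank a to rank b. The pairing rules below (shared start rank, end ranks on the same vs. opposite sides) are our illustrative interpretation of the abstract, not the paper's exact formulation.

```python
import numpy as np

def cos_sim(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def ordinal_direction_losses(protos):
    """Return (co_directional, counter_directional) losses over all rank triples."""
    k = len(protos)
    co, counter = 0.0, 0.0
    for a in range(k):
        for b in range(k):
            for c in range(b + 1, k):
                if b == a or c == a:
                    continue
                v1 = protos[b] - protos[a]
                v2 = protos[c] - protos[a]
                if (b - a) * (c - a) > 0:            # end ranks on the same side of a
                    co += 1.0 - cos_sim(v1, v2)      # pull toward angle 0
                else:                                # end ranks on opposite sides of a
                    counter += 1.0 + cos_sim(v1, v2) # push toward angle 180 degrees
    return co, counter

# Collinear, correctly ordered prototypes form a geodesic in the flat feature
# space, so both losses vanish; a shuffled ordering does not.
line = np.arange(5.0)[:, None] * np.array([[1.0, 2.0]])
co0, ct0 = ordinal_direction_losses(line)
co1, ct1 = ordinal_direction_losses(line[[0, 3, 1, 4, 2]])
```

Both terms are zero only when inter-prototype directions agree with the rank order, which is the geodesic-trajectory property the abstract aims for.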
Citations: 0
AtomThink: Multimodal Slow Thinking With Atomic Step Reasoning.
IF 23.6 | CAS Tier 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-13 | DOI: 10.1109/tpami.2026.3653573
Kun Xiang,Zhili Liu,Terry Jingchen Zhang,Yinya Huang,Yunshuang Nie,Kaixin Cai,Yiyang Yin,Runhui Huang,Hanhui Li,Yihan Zeng,Yu-Jie Yuan,Jianhua Han,Lanqing Hong,Hang Xu,Xiaodan Liang
In this paper, we address the challenging task of multimodal reasoning by incorporating the notion of "slow thinking" into multimodal large language models (MLLMs). Our core idea is that models can learn to adaptively use different levels of reasoning to tackle questions of varying complexity. We propose a novel paradigm of Self-structured Chain of Thought (SCoT), which consists of minimal semantic atomic steps. Unlike existing methods that rely on structured templates or free-form paradigms, our method not only generates flexible CoT structures for various complex tasks but also mitigates the phenomenon of overthinking for easier tasks. To introduce structured reasoning into visual cognition, we design a novel AtomThink framework with four key modules: (i) a data engine to generate high-quality multimodal reasoning paths; (ii) a supervised fine-tuning (SFT) process with serialized inference data; (iii) a policy-guided multi-turn inference method; and (iv) an atomic capability metric to evaluate the single-step utilization rate. Extensive experiments demonstrate that the proposed AtomThink significantly improves the performance of baseline MLLMs, achieving more than 10% average accuracy gains on MathVista and MathVerse. Compared to state-of-the-art structured CoT approaches, our method not only achieves higher accuracy but also improves data utilization by 5 × and boosts inference efficiency by 85.3%. Our code is publicly available at https://github.com/Kun-Xiang/AtomThink.
Citations: 0
An Efficient Multi-Estimation-Based Parameter Centroid Decision Via Linear Regression Approach. 基于多元估计的参数质心线性回归决策。
IF 23.6 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-01-13 DOI: 10.1109/tpami.2026.3653765
Yeongyu Choi,Fabien Moutarde,Ju H Park,Ho-Youl Jung
We propose a novel post-processing approach for the local optimization of Locally Optimized RANdom SAmple Consensus (LO-RANSAC), called the Multi-Estimation-based Parameter Centroid (MEPC) decision. We observe that the optimal thresholds for hypothesis generation and evaluation differ in local optimization with the inner RANSAC. Instead of binary labeling into inliers and outliers, a new ternary labeling into inliers, midliers, and outliers is introduced, using two thresholds. Our experimental results show that the highest-scoring model measured by the ternary method is closer to the real model than that measured by the existing binary method. However, the highest score still does not correspond to the best model, because data noise makes the evaluation inaccurate. We therefore introduce a new linear model centroid decision method to compensate for the highest-scoring model being distorted by noise. In this process, an efficient method for measuring the similarity between two hypotheses is introduced, and candidates close to the real model are found by comparing their similarity with the highest-scoring model. Our approach determines a representative model of the multiple candidate hypotheses, defined as the geometric centroid of hyperplanes. We test on various datasets for homography, fundamental-matrix, and essential-matrix estimation, demonstrating that applying MEPC to existing RANSAC algorithms achieves more accurate and stable model estimation. Moreover, additional experiments on vanishing point detection show the potential of our approach for various model estimation applications.
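The two core ideas above, two-threshold ternary labeling of residuals and a parameter centroid over candidate hypotheses, can be sketched as follows. This is an illustrative sketch under simplifying assumptions (scalar residuals, candidate models as flat parameter vectors); the function names are hypothetical and not the authors' implementation.

```python
# Hedged sketch of ternary residual labeling (inlier / midlier / outlier via
# two thresholds) and a simple parameter centroid over candidate hypotheses.
import numpy as np

def ternary_labels(residuals, t_in, t_mid):
    """Label each residual: 0 = inlier (|r| < t_in), 1 = midlier
    (t_in <= |r| < t_mid), 2 = outlier (|r| >= t_mid)."""
    r = np.abs(np.asarray(residuals, dtype=float))
    labels = np.full(r.shape, 2, dtype=int)   # default: outlier
    labels[r < t_mid] = 1                     # midlier band
    labels[r < t_in] = 0                      # tight inlier band
    return labels

def parameter_centroid(hypotheses, weights=None):
    """(Weighted) centroid of candidate model parameter vectors."""
    H = np.asarray(hypotheses, dtype=float)
    return np.average(H, axis=0, weights=weights)

residuals = [0.02, 0.4, 1.5, 0.09, 3.0]
print(ternary_labels(residuals, t_in=0.1, t_mid=1.0))   # [0 1 2 0 2]

# Average three candidate line models (slope, intercept) into one.
candidates = [[1.0, 0.5], [1.2, 0.4], [0.8, 0.6]]
print(parameter_centroid(candidates))
```

In an actual MEPC-style pipeline, the weights could favor candidates most similar to the highest-scoring model, so the centroid pulls the representative model toward the consensus of near-best hypotheses rather than a single noise-distorted winner.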
Citations: 0