Toward Accurate Image Generation via Dynamic Generative Image Transformer
Zhendong Mao, Mengqi Huang, Yijing Lin, Quan Wang, Lei Zhang, Yongdong Zhang
Pub Date: 2026-01-19 | DOI: 10.1109/tpami.2026.3653620
Defying Distractions in Multimodal Tasks: A Novel Benchmark for Large Vision-Language Models
Jinhui Yang, Ming Jiang, Qi Zhao
Pub Date: 2026-01-19 | DOI: 10.1109/tpami.2026.3655641
Like Human Rethinking: Contour Transformer AutoRegression for Referring Remote Sensing Interpretation
Jinming Chai, Licheng Jiao, Xiaoqiang Lu, Lingling Li, Fang Liu, Long Sun, Xu Liu, Wenping Ma, Weibin Li
Pub Date: 2026-01-16 | DOI: 10.1109/tpami.2026.3654392
Referring remote sensing interpretation holds significant application value in scenarios such as ecological protection, resource exploration, and emergency management. However, referring remote sensing expression comprehension and segmentation (RRSECS) faces critical challenges, including micro-target localization drift caused by insufficient extraction of boundary features in existing paradigms. Moreover, when transferred to remote sensing domains, polygon-based methods encounter contour-boundary misalignment and multi-task co-optimization conflicts. In this paper, we propose SeeFormer, a novel contour autoregressive paradigm specifically designed for RRSECS, which accurately locates and segments micro-scale, irregular targets in remote sensing imagery. We first introduce a brain-inspired feature refocus learning (BIFRL) module that progressively attends to effective object features via a coarse-to-fine scheme, significantly boosting small-object localization and segmentation. Next, we present a language-contour enhancer (LCE) that injects shape-aware contour priors, and a corner-based contour sampler (CBCS) that improves mask-polygon reconstruction fidelity. Finally, we develop an autoregressive dual-decoder paradigm (ARDDP) that preserves sequence consistency while alleviating multi-task optimization conflicts. Extensive experiments on the RefDIOR, RRSIS-D, and OPT-RSVG datasets, covering varying scenarios, scales, and task paradigms, demonstrate substantial performance gains: compared to the PolyFormer baseline, SeeFormer improves oIoU and mIoU by 27.58% and 39.37% for referring image segmentation and by 18.94% and 28.90% for visual grounding on the RefDIOR dataset. The code will be publicly available at https://github.com/IPIU-XDU/RSFM.
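For readers unfamiliar with polygon-sequence targets, the sketch below shows a generic corner-based mask-to-polygon conversion: a Douglas-Peucker corner approximation followed by uniform resampling, which is the kind of preprocessing a contour autoregressive model consumes. It is an illustrative stand-in, not the paper's CBCS module; the function name `mask_to_polygon` and its parameters are assumptions.

```python
# Minimal sketch, assuming OpenCV and NumPy are available; not the paper's CBCS.
import numpy as np
import cv2

def mask_to_polygon(mask, num_points=16, eps_ratio=0.01):
    """Approximate the largest contour of a binary mask with corner points,
    then resample the closed polygon to a fixed number of (x, y) vertices."""
    # [-2] keeps compatibility with both OpenCV 3 (3 return values) and 4 (2 values).
    contours = cv2.findContours(mask.astype(np.uint8), cv2.RETR_EXTERNAL,
                                cv2.CHAIN_APPROX_SIMPLE)[-2]
    contour = max(contours, key=cv2.contourArea)               # keep the dominant object
    eps = eps_ratio * cv2.arcLength(contour, True)             # corner tolerance
    corners = cv2.approxPolyDP(contour, eps, True).reshape(-1, 2).astype(np.float32)
    closed = np.vstack([corners, corners[:1]])                 # close the polygon
    seg = np.linalg.norm(np.diff(closed, axis=0), axis=1)      # edge lengths
    cum = np.concatenate([[0.0], np.cumsum(seg)])
    targets = np.linspace(0.0, cum[-1], num_points, endpoint=False)
    poly = np.empty((num_points, 2), dtype=np.float32)
    for i, t in enumerate(targets):
        k = int(np.searchsorted(cum, t, side="right")) - 1
        w = (t - cum[k]) / max(seg[k], 1e-8)
        poly[i] = (1.0 - w) * closed[k] + w * closed[k + 1]
    return poly

if __name__ == "__main__":
    demo = np.zeros((64, 64), np.uint8)
    cv2.circle(demo, (32, 32), 20, 1, -1)                      # synthetic round "target"
    print(mask_to_polygon(demo, num_points=8))
```

A fixed-length vertex sequence like this is what an autoregressive decoder can predict token by token, and reversing the conversion (filling the polygon) recovers an approximate mask.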
Non-Gradient Hash Factor Learning for High-Dimensional and Incomplete Data Representation Learning
Di Wu, Shihui Li, Yi He, Xin Luo, Xinbo Gao
Pub Date: 2026-01-16 | DOI: 10.1109/tpami.2026.3653780
High-dimensional and incomplete (HDI) data are ubiquitous in Big Data-related industrial applications such as drug innovation and recommender systems. Hash learning is the most efficient representation learning approach for extracting hidden information from HDI data owing to its fast inference and low storage cost. However, existing hash learning approaches commonly employ gradient-based optimization to address the discrete objective caused by the binary nature of hash factors, where quantization loss (i.e., from quantizing real values to binary codes) is inevitable, resulting in accuracy loss when representing HDI data. Motivated by these issues, this paper proposes a non-gradient hash factor (NGHF) model with three key ideas: a) a discrete differential evolution (DDE) algorithm that simulates continuous optimization by disabling bits of the binary codes based on the projected Hamming dissimilarity, yielding an effective discrete optimizer; b) applying the proposed DDE algorithm to directly optimize the discrete learning objective of NGHF defined on HDI data, thereby enabling efficient and precise training without any quantization loss; and c) a theoretical proof of the convergence of NGHF. As such, NGHF possesses representation learning ability comparable to that of a real-valued model, achieving precise binary representation of HDI data. Extensive experimental results on nine real-world datasets demonstrate that NGHF significantly outperforms eight state-of-the-art hash learning models. Moreover, its accuracy is comparable to that of a real-valued model for HDI data representation learning. These results point toward hash learning models with both high accuracy and fast inference on HDI data, which is critical for industrial applications. Our source code is available at https://github.com/wudi1989/NGHF.
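To make the idea of optimizing binary hash factors without gradients concrete, the sketch below runs a toy bit-flip search over {-1, +1} codes that fits only the observed entries of an incomplete matrix, accepting a flip only when the discrete objective improves. This is a simple hill-climbing loop under assumed names (`evolve_codes`, `observed_loss`) and synthetic data, not the paper's DDE algorithm or its projected-Hamming-dissimilarity update.

```python
# Toy sketch of gradient-free binary code search on incomplete data; not NGHF.
import numpy as np

rng = np.random.default_rng(0)

def observed_loss(U, V, R, mask):
    """Squared error between R and scaled code inner products, on observed cells only."""
    pred = (U @ V.T) / U.shape[1]
    return float(((pred - R) ** 2)[mask].mean())

def evolve_codes(R, mask, dim=16, iters=3000, flips=2):
    """Flip a few bits of one code row at a time; keep the flip only if it helps."""
    m, n = R.shape
    U = rng.choice([-1.0, 1.0], size=(m, dim))
    V = rng.choice([-1.0, 1.0], size=(n, dim))
    best = observed_loss(U, V, R, mask)
    for _ in range(iters):
        side = U if rng.random() < 0.5 else V          # mutate a row of U or V
        i = rng.integers(side.shape[0])
        bits = rng.choice(dim, size=flips, replace=False)
        side[i, bits] *= -1.0                          # candidate: flip selected bits
        cur = observed_loss(U, V, R, mask)
        if cur <= best:
            best = cur                                 # accept improving flip
        else:
            side[i, bits] *= -1.0                      # revert otherwise
    return U, V, best

if __name__ == "__main__":
    true = rng.choice([-1.0, 1.0], size=(30, 8)) @ rng.choice([-1.0, 1.0], size=(20, 8)).T / 8.0
    mask = rng.random(true.shape) < 0.3                # ~30% of entries observed (HDI-style)
    _, _, loss = evolve_codes(true, mask)
    print(f"observed-entry MSE after search: {loss:.4f}")
```

Because the codes stay binary throughout, there is no real-valued relaxation to quantize afterwards, which is the property the abstract emphasizes.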
Isolating Interference Factors for Robust Cloth-Changing Person Re-Identification
De Cheng, Yubo Li, Chaowei Fang, Shizhou Zhang, Nannan Wang, Xinbo Gao
Pub Date: 2026-01-16 | DOI: 10.1109/tpami.2026.3655110
Cloth-Changing Person Re-Identification (CC-ReID) aims to recognize individuals across camera views despite clothing variations, a crucial task for surveillance and security systems. Existing methods typically frame it as a cross-modal alignment problem but often overlook explicit modeling of interference factors such as clothing, viewpoints, and pedestrian actions. This oversight can distort their impact, compromising the extraction of robust identity features. To address these challenges, we propose a novel framework that systematically disentangles interference factors from identity features while preserving the robustness and discriminative power of identity representations. Our approach consists of two key components. First, a dual-stream identity feature learning framework leverages a raw image stream and a cloth-isolated stream to extract identity representations independent of clothing textures. An adaptive cloth-irrelevant contrastive objective is introduced to mitigate identity feature variations caused by clothing differences. Second, we propose a Text-Driven Conditional Generative Adversarial Interference Disentanglement Network (T-CGAIDN) to further suppress interference factors beyond clothing textures, such as finer clothing patterns, viewpoint, background, and lighting conditions. This network incorporates a multi-granularity interference recognition branch to learn interference-related features, a conditional adversarial module for bidirectional transformation between the identity and interference feature spaces, and an interference decoupling objective to eliminate interference dependencies in identity learning. Extensive experiments on public benchmarks demonstrate that our method significantly outperforms state-of-the-art approaches, highlighting its effectiveness for CC-ReID. Our code is available at https://github.com/yblTech/IIFR-CCReID.
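To illustrate what a cloth-irrelevant identity objective looks like in code, the sketch below implements a generic identity-level supervised contrastive loss in PyTorch: embeddings of the same person (e.g., captured in different outfits) are treated as positives and all other identities as negatives. This is a standard formulation, not the paper's adaptive objective or its dual-stream network; the function name and temperature value are assumptions.

```python
# Generic identity-level contrastive loss sketch; not the paper's adaptive objective.
import torch
import torch.nn.functional as F

def identity_contrastive_loss(feats, pids, temperature=0.1):
    """feats: (B, D) embeddings; pids: (B,) person-identity labels."""
    z = F.normalize(feats, dim=1)
    sim = z @ z.t() / temperature                               # pairwise cosine similarities
    mask_pos = (pids.unsqueeze(0) == pids.unsqueeze(1)).float()
    mask_pos.fill_diagonal_(0)                                  # exclude self-pairs as positives
    logits = sim - torch.eye(len(pids), device=sim.device) * 1e9  # mask self in the denominator
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_count = mask_pos.sum(dim=1).clamp(min=1)
    loss = -(mask_pos * log_prob).sum(dim=1) / pos_count        # average over each sample's positives
    return loss.mean()

if __name__ == "__main__":
    feats = torch.randn(8, 128, requires_grad=True)
    pids = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])               # e.g., two outfits per identity
    print(identity_contrastive_loss(feats, pids).item())
```

Minimizing such a loss pulls same-identity embeddings together regardless of clothing, which is the effect the abstract attributes to its cloth-irrelevant contrastive term.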
Single-Photon Imaging in Complex Scenarios via Physics-Informed Deep Neural Networks
Siao Cai, Zhicheng Yu, Shaobing Gao, Zeyu Chen, Yiguang Liu
Pub Date: 2026-01-15 | DOI: 10.1109/tpami.2026.3654264
Single-photon imaging uses single-photon-sensitive, picosecond-resolution sensors to capture 3D structure and supports diverse applications, but success has so far been mostly limited to simple scenes. In complex scenarios, traditional methods degrade and deep learning methods lack flexibility and generalization. Here, we propose a physics-informed deep neural network (PIDNN) framework that addresses both aspects, adapting to complex and variable sensing environments by embedding imaging physics into the deep neural network for unsupervised learning. Within this framework, by tailoring the number of U-Net skip connections, we impose multi-scale spatiotemporal priors that improve photon-utilization efficiency, laying the foundation for handling the inherently low signal-to-background ratio (SBR) of complex scenarios. Additionally, we introduce volume rendering into the PIDNN framework and design a dual-branch structure, further extending its applicability to multiple-depth scenes and fog occlusion. We validated the method in various complex environments through numerical simulations and real-world experiments. Photon-efficient imaging with multiple returns shows robust performance under low SBR and large fields of view; the method attains lower root-mean-squared error than traditional methods and exhibits stronger generalization than supervised approaches. Further experiments with multiple depths and fog interference confirm that its reconstruction quality surpasses existing techniques, demonstrating its flexibility and scalability.
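As a rough illustration of embedding imaging physics in an unsupervised objective, the sketch below scores a predicted depth/reflectivity pair against a measured photon-count histogram with a Poisson negative log-likelihood, using a simple Gaussian-pulse-plus-constant-background forward model that is common in single-photon depth imaging. It is not the paper's PIDNN loss or network; all names, constants, and the synthetic data are assumptions.

```python
# Minimal physics-based unsupervised loss sketch for single-photon histograms; not PIDNN.
import torch

def poisson_physics_loss(depth, albedo, histogram, bin_size=80e-12, pulse_sigma=2.0, background=0.05):
    """depth, albedo: (H, W) predictions; histogram: (T, H, W) measured photon counts per time bin."""
    T = histogram.shape[0]
    c = 3e8                                                     # speed of light, m/s
    tof_bin = 2.0 * depth / (c * bin_size)                      # round-trip time of flight in bin units
    bins = torch.arange(T, dtype=depth.dtype, device=depth.device).view(T, 1, 1)
    pulse = torch.exp(-0.5 * ((bins - tof_bin) / pulse_sigma) ** 2)
    rate = albedo * pulse + background                          # expected counts per bin per pixel
    # Poisson negative log-likelihood (constant log(k!) term dropped).
    return (rate - histogram * torch.log(rate + 1e-8)).mean()

if __name__ == "__main__":
    T, H, W = 64, 8, 8
    depth = torch.full((H, W), 0.5, requires_grad=True)         # metres; peak falls inside the T bins
    albedo = torch.full((H, W), 0.5, requires_grad=True)
    hist = torch.poisson(torch.rand(T, H, W))                   # stand-in for a measured histogram
    loss = poisson_physics_loss(depth, albedo, hist)
    loss.backward()                                             # gradients flow to depth and albedo
    print(loss.item())
```

Because the loss compares the network's prediction to the raw photon data through a forward model, it needs no ground-truth depth, which is the sense in which such training is unsupervised.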