An Efficient Multi-Estimation-Based Parameter Centroid Decision Via Linear Regression Approach
Yeongyu Choi, Fabien Moutarde, Ju H Park, Ho-Youl Jung
IEEE Transactions on Pattern Analysis and Machine Intelligence
Pub Date: 2026-01-13 | DOI: 10.1109/tpami.2026.3653765
We propose a novel post-processing approach for the local optimization of Locally Optimized RANdom SAmple Consensus (LO-RANSAC), called the Multi-Estimation-based Parameter Centroid (MEPC) decision. We observe that the optimal thresholds for hypothesis generation and evaluation differ in local optimization with the inner RANSAC. Instead of binary labeling into inliers and outliers, we introduce a ternary labeling into inliers, midliers, and outliers, using two thresholds. Our experimental results show that the highest-scoring model under the ternary scheme is closer to the true model than that under the existing binary scheme. However, the highest-scoring model still need not be the best model, because data noise makes the evaluation inaccurate. We therefore introduce a linear model-centroid decision method to compensate for the noise-induced distortion of the highest-scoring model. In this process, we introduce an efficient measure of the similarity between two hypotheses and find candidates close to the true model by comparing their similarity with the highest-scoring model. Our approach determines a representative model of the multiple candidate hypotheses, defined as the geometric centroid of their hyperplanes. We test on various datasets for homography, fundamental-matrix, and essential-matrix estimation, demonstrating that applying MEPC to existing RANSAC algorithms yields more accurate and stable model estimation. Moreover, additional experiments on vanishing-point detection show the potential of our approach for a broad range of model estimation applications.
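As a rough illustration of the ternary-labeling idea, the sketch below scores a hypothesis with two residual thresholds, counting inliers fully and midliers with a partial weight. The threshold values, the 0.5 midlier weight, and the function names are illustrative assumptions, not the scoring actually used by MEPC.

import numpy as np

def ternary_score(residuals, t_inlier, t_midlier, w_midlier=0.5):
    """Score a hypothesis with ternary labels instead of a binary inlier count.

    Residuals below t_inlier count as inliers (weight 1.0), residuals between
    t_inlier and t_midlier count as midliers (partial weight), and the rest
    are outliers (weight 0.0). Thresholds and weight are illustrative only.
    """
    residuals = np.abs(np.asarray(residuals, dtype=float))
    inliers = residuals < t_inlier
    midliers = (residuals >= t_inlier) & (residuals < t_midlier)
    return inliers.sum() + w_midlier * midliers.sum()

# Toy usage: rank two candidate hypotheses by their ternary scores.
rng = np.random.default_rng(0)
res_a = rng.normal(0.0, 1.5, size=200)   # residuals of hypothesis A (pixels)
res_b = rng.normal(0.0, 2.5, size=200)   # residuals of hypothesis B (pixels)
best = max([("A", res_a), ("B", res_b)],
           key=lambda item: ternary_score(item[1], t_inlier=1.0, t_midlier=3.0))
print("higher-scoring hypothesis:", best[0])

The paper's centroid decision over multiple such candidates is not reproduced here; the sketch only shows how two thresholds replace a single binary cutoff in the scoring step.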
{"title":"An Efficient Multi-Estimation-Based Parameter Centroid Decision Via Linear Regression Approach.","authors":"Yeongyu Choi,Fabien Moutarde,Ju H Park,Ho-Youl Jung","doi":"10.1109/tpami.2026.3653765","DOIUrl":"https://doi.org/10.1109/tpami.2026.3653765","url":null,"abstract":"We propose a novel post-processing approach for the local optimization of Locally Optimized RANdom SAmple Consensus (LO-RANSAC), called the Multi-Estimation-based Parameter Centroid (MEPC) decision. It is observed that the optimal thresholds for hypothesis generation and evaluation differ in local optimization with the inner RANSAC. Instead of binary labeling for inliers and outliers, a new ternary labeling for inliers, midliers, and outliers is introduced, using two thresholds. Our experimental results show that the highest-scoring model measured by the ternary method is closer to the real model than that measured by the existing binary method. However, it should be noted that the highest score still does not correspond to the best model due to inaccurate evaluation by data noise. We introduce a new linear model centroid decision method to compensate for the highest-scoring model distorted by noise. In this process, an efficient method for measuring the similarity between two hypotheses is introduced, and candidates close to the real model are found by comparing their similarity with the highest-scoring model. Our approach determines a representative model of the multiple candidate hypotheses, which is defined as the geometric centroid of hyperplanes. We test on various datasets for homography, fundamental, and essential matrices, demonstrating that applying MEPC to existing RANSAC algorithms achieves more accurate and stable model estimation. Moreover, additional experiments on vanishing point detection show the potential of our approach for various model estimation applications.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"52 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2026-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145961429","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Learn to Enhance Sparse Spike Streams
Liwen Hu, Yijia Guo, Mianzhi Liu, Yiming Fan, Rui Ma, Shengbo Chen, Lei Ma, Tiejun Huang
IEEE Transactions on Pattern Analysis and Machine Intelligence
Pub Date: 2026-01-13 | DOI: 10.1109/tpami.2026.3653768
High-speed vision tasks have long been a challenge in computer vision. Recently, the spike camera has shown great potential for these tasks due to its high temporal resolution. Unlike traditional cameras, it emits asynchronous spike signals to capture visual information. However, under low-light conditions, spike signals become highly sparse, and the sparse spike stream severely hinders the effectiveness of existing spike-based methods in high-speed scenarios. To address this challenge, we introduce SS2DS, the first deep learning framework that enhances sparse spike streams into dense spike streams. SS2DS first estimates the spike firing frequency within the sparse stream. The spike firing frequency is then enhanced by a neural network. Finally, SS2DS decodes the enhanced spike stream from the enhanced spike-firing-frequency sequence. SS2DS can adjust the temporal distribution of sparse spike streams and mitigate the performance degradation of existing methods in low-light, high-speed scenarios. To evaluate sparse spike stream enhancement, we construct both synthetic and real sparse spike stream datasets. The real dataset is collected in dynamic scenarios using the third-generation spike camera. Comparing the reconstruction results, enhanced spike streams achieve an average improvement of +0.78 MA, -18.42 BRISQUE, and -1.42 NIQE over sparse spike streams. Moreover, the enhanced spike streams also benefit other spike-based vision tasks, such as 3D reconstruction (+1.325 dB PSNR, +0.005 SSIM, and -0.01 LPIPS) and super-resolution (+0.63 MA, -13.67 BRISQUE, and -1.28 NIQE). Code and datasets will be released after publication.
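To make the first stage concrete, here is a minimal sketch of estimating a per-pixel spike firing frequency from a binary spike stream with a moving temporal window. The window size, the (T, H, W) array layout, and the function name are assumptions for illustration; the learned enhancement network and the spike decoding stage of SS2DS are not shown.

import numpy as np

def firing_rate(spikes, win=32):
    """Estimate per-pixel firing frequency from a binary spike stream.

    spikes: (T, H, W) array of 0/1 spike events.
    Returns a (T, H, W) array where each entry is the fraction of frames
    that fired inside a temporal window of length `win` around that frame.
    """
    T = spikes.shape[0]
    pad = win // 2
    padded = np.pad(spikes.astype(float), ((pad, pad), (0, 0), (0, 0)), mode="edge")
    csum = np.cumsum(padded, axis=0)
    rate = (csum[win:] - csum[:-win]) / win   # moving average over time
    return rate[:T]

# A sparse low-light stream fires rarely; the rate map exposes the latent
# signal that an enhancement network could amplify before re-encoding to spikes.
stream = (np.random.default_rng(1).random((256, 8, 8)) < 0.05).astype(np.uint8)
rate = firing_rate(stream, win=64)
print(rate.shape, float(rate.mean()))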
{"title":"Learn to Enhance Sparse Spike Streams.","authors":"Liwen Hu,Yijia Guo,Mianzhi Liu,Yiming Fan,Rui Ma,Shengbo Chen,Lei Ma,Tiejun Huang","doi":"10.1109/tpami.2026.3653768","DOIUrl":"https://doi.org/10.1109/tpami.2026.3653768","url":null,"abstract":"High-speed vision tasks have long been a challenge in computer vision. Recently, the spike camera has shown great potential in these tasks due to its high temporal resolution. Unlike traditional cameras, it emits asynchronous spike signals to capture visual information. However, under low-light conditions, spike signals becomehighly sparse, and the sparse spike streamseverely hinders theeffectiveness of existing spike-based methods in high-speed scenarios. To address this challenge,we introduce SS2DS, the first deep learning framework that enhances sparse spike streams into dense spike streams. SS2DS first estimates the spike firing frequency within sparse streams. Subsequently, the spike firing frequency is enhanced by a neural network. Finally, SS2DS decodes the enhanced spike stream from the enhanced spike firing frequency sequence. SS2DS can adjust the temporal distribution of sparse spike streams and improve the performance degradation of existing methods in low-light and high-speed scenarios. In order to evaluate sparse spikestream enhancement,we construct both synthetic and real sparse spike stream datasets. The real dataset iscollected in dynamic scenarios using the third-generation spike camera.By comparing the reconstruction results, enhanced spike streams achieve an average improvement of +0.78 MA, -18.42 BRISQUE, and -1.42 NIQE over sparse spike streams. Moreover, the enhanced spike streams also benefit other spike-based vision tasks, such as 3D reconstruction (+1.325 dB PSNR, +0.005 SSIM, and -0.01 LPIPS) and super-resolution (+0.63 MA, -13.67 BRISQUE, and -1.28 NIQE). Code and datasets will be released after publication.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"29 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2026-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145961426","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Revisiting 360 Depth Estimation With PanoGabor: A New Fusion Perspective
Zhijie Shen, Chunyu Lin, Lang Nie, Kang Liao, Weisi Lin, Yao Zhao
IEEE Transactions on Pattern Analysis and Machine Intelligence
Pub Date: 2026-01-13 | DOI: 10.1109/tpami.2026.3653796
Depth estimation from a monocular 360 image is important for perceiving the entire 3D environment. However, the inherent distortion and large field of view (FoV) of 360 images pose great challenges for this task. To this end, existing mainstream solutions typically introduce additional perspective-based 360 representations (e.g., Cubemap) to achieve effective feature extraction. Nevertheless, regardless of the introduced representations, they eventually need to be unified into the equirectangular projection (ERP) format for the subsequent depth estimation, which inevitably reintroduces additional distortions. In this work, we propose an oriented-distortion-aware Gabor Fusion framework (PGFuse) to address the above challenges. First, we introduce Gabor filters that analyze texture in the frequency domain, extending the receptive fields and enhancing depth cues. To address the reintroduced distortions, we design a latitude-aware distortion representation to generate customized, distortion-aware Gabor filters (PanoGabor filters). Furthermore, we design a channel-wise and spatial-wise unidirectional fusion module (CS-UFM) that integrates the proposed PanoGabor filters to unify other representations into the ERP format, delivering effective and distortion-aware features. Considering the orientation sensitivity of the Gabor transform, we further introduce a spherical gradient constraint to stabilize this sensitivity. Experimental results on three popular indoor 360 benchmarks demonstrate the superiority of the proposed PGFuse over existing state-of-the-art solutions. Code and models will be available at https://github.com/zhijieshen-bjtu/PGFuse.
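As a sketch of what a latitude-aware Gabor filter might look like, the code below widens a standard 2D Gabor kernel according to the roughly 1/cos(latitude) horizontal stretch of equirectangular projection. The kernel parameterization, the clipping constant, and all names are assumptions for illustration and do not reproduce the paper's PanoGabor construction.

import numpy as np

def gabor_kernel(size, sigma, theta, lam, gamma=0.5, psi=0.0):
    """Standard real 2D Gabor kernel (cosine-modulated anisotropic Gaussian)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    return np.exp(-(xr**2 + (gamma * yr)**2) / (2 * sigma**2)) * np.cos(2 * np.pi * xr / lam + psi)

def latitude_aware_gabor(size, base_sigma, theta, lam, v, height):
    """Widen the kernel according to the ERP stretch at image row v.

    A row at latitude phi is horizontally stretched by about 1/cos(phi), so the
    filter support and wavelength are scaled by the same factor (clipped to
    avoid blow-up near the poles). Clipping constants are illustrative.
    """
    phi = (v / (height - 1) - 0.5) * np.pi            # latitude in [-pi/2, pi/2]
    stretch = min(1.0 / max(np.cos(phi), 1e-2), 8.0)
    return gabor_kernel(size, base_sigma * stretch, theta, lam * stretch)

# Example: a kernel for a row near the top of a 256-row ERP image.
k = latitude_aware_gabor(size=15, base_sigma=2.0, theta=0.0, lam=6.0, v=20, height=256)
print(k.shape)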
{"title":"Revisiting 360 Depth Estimation With PanoGabor: A New Fusion Perspective.","authors":"Zhijie Shen,Chunyu Lin,Lang Nie,Kang Liao,Weisi Lin,Yao Zhao","doi":"10.1109/tpami.2026.3653796","DOIUrl":"https://doi.org/10.1109/tpami.2026.3653796","url":null,"abstract":"Depth estimation from a monocular 360 image is important to the perception of the entire 3D environment. However, the inherent distortion and large field of view (FoV) in 360 images pose great challenges for this task. To this end, existing mainstream solutions typically introduce additional perspective-based 360 representations (e.g., Cubemap) to achieve effective feature extraction. Nevertheless, regardless of the introduced representations, they eventually need to be unified into the equirectangular projection (ERP) format for the subsequent depth estimation, which inevitably reintroduces additional distortions. In this work, we propose an oriented-distortion-aware Gabor Fusion framework (PGFuse) to address the above challenges. First, we introduce Gabor filters that analyze texture in the frequency domain, extending the receptive fields and enhancing depth cues. To address the reintroduced distortions, we design a latitude-aware distortion representation to generate customized, distortion-aware Gabor filters (PanoGabor filters). Furthermore, we design a channel- wise and spatial- wise unidirectional fusion module (CS-UFM) that integrates the proposed PanoGabor filters to unify other representations into the ERP format, delivering effective and distortion-aware features. Considering the orientation sensitivity of the Gabor transform, we further introduce a spherical gradient constraint to stabilize this sensitivity. Experimental results on three popular indoor 360 benchmarks demonstrate the superiority of the proposed PGFuse to existing state-of-the-art solutions. Code and models will be available at https://github.com/zhijieshen-bjtu/PGFuse.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"120 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2026-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145961425","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Goal-guided Prompting with Adaptive Modality Selection for Efficient Assembly Activity Anticipation in Egocentric Videos
Tianshan Liu, Bing-Kun Bao
IEEE Transactions on Pattern Analysis and Machine Intelligence
Pub Date: 2026-01-13 | DOI: 10.1109/tpami.2026.3653482
With the egocentric observation and multimodal perception capabilities of augmented reality (AR) devices, the next generation of smart assistants has the potential to reduce human labor and enhance execution efficiency in assembly tasks. Among the diverse assembly activity understanding tasks, anticipating near-future activities is crucial yet challenging, as it can help humans or agents actively plan and engage in interactions with the environment. However, existing egocentric activity anticipation methods still struggle to achieve a decent trade-off between accuracy and computational efficiency, hindering their deployment in practical applications. To address this dilemma, in this paper we propose a goal-guided prompting framework with adaptive modality selection (GP-AMS) for assembly activity anticipation in egocentric videos. To bridge the semantic gap between the historical observations and the unobserved future activities, we inject the inferred high-level goal clues into the constructed prompts, which are further used to guide a pre-trained vision-language (V-L) model to compensate for the relevant semantics of the unseen future. Moreover, a mask-and-predict strategy is adopted with two imposed constraints, i.e., causal masking and probabilistic token-dropping, to mine the intrinsic associations between the assembly activities within a specific procedure. To retain the benefits of exploiting multimodal information while avoiding a large increase in computational burden, an adaptive modality selection strategy is designed to train a policy network, which learns to dynamically decide which modalities should be sampled for processing by the anticipation model at each observation time step. By allocating the major computation to the selected indicative modalities on-the-fly, the efficiency of the overall model is improved, paving the way for feasibility on real-world devices. Extensive experimental results on two public datasets validate that the proposed method yields not only consistent improvements in anticipation accuracy but also significant savings in computation budget.
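To illustrate the adaptive modality selection idea, here is a toy PyTorch policy head that emits per-modality 0/1 gates at each observation step, using a straight-through estimator so the discrete selection remains trainable. The feature dimension, the three example modalities, and the 0.5 test-time threshold are assumptions for illustration; the paper's actual policy network and its training objective are not reproduced.

import torch
import torch.nn as nn

class ModalitySelector(nn.Module):
    """Tiny policy head that decides, per observation step, which modalities
    to process. At train time it samples hard 0/1 gates with a straight-through
    estimator; at test time it thresholds the selection probabilities."""

    def __init__(self, feat_dim, num_modalities):
        super().__init__()
        self.head = nn.Linear(feat_dim, num_modalities)

    def forward(self, cheap_feat):
        probs = torch.sigmoid(self.head(cheap_feat))      # (B, M) selection probabilities
        if self.training:
            hard = torch.bernoulli(probs)                 # sampled 0/1 gates
            gates = hard + probs - probs.detach()         # straight-through gradient
        else:
            gates = (probs > 0.5).float()
        return gates, probs

# Usage sketch: gate the expensive per-modality branches before fusion.
selector = ModalitySelector(feat_dim=64, num_modalities=3)  # e.g. RGB, flow, audio (hypothetical)
cheap = torch.randn(2, 64)                                   # lightweight summary of the observation
gates, probs = selector(cheap)
print(gates.shape, probs.mean().item())

Only the modalities whose gate is 1 would then be run through their (costly) encoders, which is where the computational savings described above would come from.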
{"title":"Goal-guided Prompting with Adaptive Modality Selection for Efficient Assembly Activity Anticipation in Egocentric Videos.","authors":"Tianshan Liu,Bing-Kun Bao","doi":"10.1109/tpami.2026.3653482","DOIUrl":"https://doi.org/10.1109/tpami.2026.3653482","url":null,"abstract":"With the functions of egocentric observation and multimodal perception equipped in augmented reality (AR) devices, the next generation of smart assistants has the potential to reduce human labor and enhance execution efficiency in assembly tasks. Among diverse assembly activity understanding tasks, anticipating the near future activities is crucial yet challenging, which can assist humans or agents to actively plan and engage in interactions with the environment. However, the existing egocentric activity anticipation methods still struggle to achieve a decent trade-off between accuracy and computational efficiency, hindering them to be deployed in practical applications. To address this dilemma, in this paper, we propose a goal-guided prompting framework with adaptive modality selection (GP-AMS), for assembly activity anticipation in egocentric videos. For bridging the semantic gap between the historical observations and unobserved future activities, we inject the inferred high-level goal clues into the constructed prompts, which are further utilized to guide a pre-trained vision-language (V-L) model to compensate relevant semantics of unseen future. Moreover, a mask-and-predict strategy is adopted with two imposed constraints, i.e., casual masking and probabilistic token-dropping, to mine the intrinsic associations between the assembly activities within a specific procedure. For maintaining the benefits of exploiting multimodal information while avoiding extensively increasing the computational burdens, an adaptive modality selection strategy is designed to train a policy network, which learns to dynamically decide which modalities should be sampled for processing by the anticipation model on a per observation time-step basis. By allocating major computation to the selected indicative modalities on-the-fly, the efficiency of the overall model can be improved, thus paving the way for feasibility on real-world devices. Extensive experimental results on two public data sets validate that the proposed method yields not only consistent improvements in anticipation accuracy, but also significant savings in computation budgets.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"54 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2026-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145961428","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Advances in Multimodal Adaptation and Generalization: From Traditional Approaches to Foundation Models
Hao Dong, Moru Liu, Kaiyang Zhou, Eleni Chatzi, Juho Kannala, Cyrill Stachniss, Olga Fink
IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1-20
Pub Date: 2026-01-12 | DOI: 10.1109/tpami.2026.3651319
{"title":"Advances in Multimodal Adaptation and Generalization: From Traditional Approaches to Foundation Models","authors":"Hao Dong, Moru Liu, Kaiyang Zhou, Eleni Chatzi, Juho Kannala, Cyrill Stachniss, Olga Fink","doi":"10.1109/tpami.2026.3651319","DOIUrl":"https://doi.org/10.1109/tpami.2026.3651319","url":null,"abstract":"","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"27 1","pages":"1-20"},"PeriodicalIF":23.6,"publicationDate":"2026-01-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145955304","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning Physics-Informed Noise Models from Dark Frames for Low-Light Raw Image Denoising","authors":"Hansen Feng, Lizhi Wang, Yiqi Huang, Yuzhi Wang, Lin Zhu, Hua Huang","doi":"10.1109/tpami.2026.3651447","DOIUrl":"https://doi.org/10.1109/tpami.2026.3651447","url":null,"abstract":"","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"243 1","pages":"1-18"},"PeriodicalIF":23.6,"publicationDate":"2026-01-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145955308","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}