Pub Date: 2026-01-14 | DOI: 10.1109/tpami.2026.3654092
Xiaoyang Xu, Wenzhe Yi, Juan Wang, Hongxin Hu, Mengda Yang, Ziang Li, Yong Zhuang, Yaxin Liu, Mang Ye
Split Learning (SL) is a distributed learning framework that has gained popularity for its privacy-preserving nature and low computational demands. However, recent studies have shown that a server adversary can carry out inference attacks, compromising the privacy of victim clients. Nevertheless, upon re-evaluating prior studies, we found that existing methods rely on overly strong assumptions to enhance their performance, resulting in a significant decline in effectiveness under more realistic scenarios. In this work, we provide new insights into the inherent vulnerabilities of SL. Specifically, we discover that both the smashed data and the server model contain the client's representation preference, which the server adversary can exploit to build a substitute client that approximates the target client's unique feature extraction behavior. With a well-trained substitute client, the server can perfectly steal the target client's functionality, training data, and labels. Building on this observation, we introduce Split Leakage (SLeak), a new threat that pursues multiple privacy-stealing objectives against SL. Notably, SLeak does not depend on strong privacy priors and only requires partial same-domain auxiliary public data to conduct the attacks. Experimental results on diverse datasets and target models show that SLeak surpasses the state-of-the-art method across multiple metrics. Moreover, ablation studies further confirm its robustness and applicability under various scenarios and assumptions.
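The substitute-client idea can be sketched in miniature: if the server observes the smashed data that a client emits for same-domain public inputs, it can fit its own encoder to mimic that mapping. The linear "client" and least-squares fit below are illustrative stand-ins for the paper's neural substitute and its actual training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical target client: a linear feature extractor W_true (unknown to the server).
d_in, d_smash, n_aux = 8, 4, 200
W_true = rng.normal(size=(d_smash, d_in))

# Same-domain auxiliary public data, together with the smashed data the
# (simulated) target client would emit for those inputs at the split layer.
X_aux = rng.normal(size=(n_aux, d_in))
smashed = X_aux @ W_true.T

# Substitute client: fit W_sub so that W_sub(x) approximates the smashed data
# (least squares here; the paper would train a neural network with SGD).
W_sub, *_ = np.linalg.lstsq(X_aux, smashed, rcond=None)
W_sub = W_sub.T

# The substitute now mimics the target client's feature-extraction behaviour.
X_test = rng.normal(size=(50, d_in))
err = np.max(np.abs(X_test @ W_sub.T - X_test @ W_true.T))
print(f"max smashed-data mismatch: {err:.2e}")
```

Once the substitute reproduces the split-layer mapping, downstream objectives (functionality, data, and label stealing) reduce to attacks on a model the adversary fully controls.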
Title: SLeak: Multi-Target Privacy Stealing Attack against Split Learning
Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence
Pub Date: 2026-01-14 | DOI: 10.1109/tpami.2026.3653901
Wenyuan Zhang, Chunsheng Wang, Kanle Shi, Yu-Shen Liu, Zhizhong Han
Unsigned distance functions (UDFs) have become a vital representation for open surfaces. With different differentiable renderers, current methods can train neural networks to infer a UDF by minimizing the errors between renderings of the UDF and the multi-view ground truth. However, these differentiable renderers are mainly handcrafted, which makes them either biased at ray-surface intersections, sensitive to unsigned-distance outliers, or not scalable to large scenes. To resolve these issues, we present a novel differentiable renderer that infers UDFs more accurately. Instead of using handcrafted equations, our differentiable renderer is a neural network pre-trained in a data-driven manner. It learns how to render unsigned distances into depth images, yielding prior knowledge dubbed volume rendering priors. To infer a UDF for an unseen scene from multiple RGB images, we generalize the learned volume rendering priors to map inferred unsigned distances into alpha blending for RGB image rendering. To reduce sampling bias in UDF inference, we utilize an auxiliary point sampling prior as an indicator of ray-surface intersection, and propose novel schemes for more accurate and uniform sampling near the zero-level set. We also propose a new strategy that leverages our pre-trained volume rendering prior as a general surface refiner, which can be integrated with various Gaussian reconstruction methods to optimize the Gaussian distributions and refine geometric details. Our results show that the learned volume rendering prior is unbiased, robust, scalable, 3D-aware, and, more importantly, easy to learn. Further experiments show that the volume rendering prior is also a general strategy for enhancing other neural implicit representations, such as signed distance functions and occupancy. We evaluate our method on both widely used benchmarks and real scenes, and report superior performance over state-of-the-art methods.
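The rendering step the paper replaces can be sketched as follows: map each ray sample's unsigned distance to a density, then alpha-composite along the ray into a depth value. The exponential distance-to-density mapping below is exactly the kind of handcrafted assumption the learned volume rendering prior is meant to supersede.

```python
import numpy as np

# Samples along a camera ray; a surface sits at depth t = 2.0.
t = np.linspace(0.0, 4.0, 400)
udf = np.abs(t - 2.0)  # unsigned distance of each sample to the surface

# Handcrafted (illustrative) mapping from unsigned distance to density;
# density peaks at the zero-level set. The paper learns this step instead.
beta = 0.05
sigma = np.exp(-udf / beta) / beta

# Standard volume rendering: alpha-composite the densities into a depth value.
delta = np.diff(t, append=t[-1] + (t[-1] - t[-2]))   # sample spacings
alpha = 1.0 - np.exp(-sigma * delta)                 # per-sample opacity
trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha]))[:-1]  # transmittance
weights = trans * alpha
depth = np.sum(weights * t) / np.sum(weights)
print(f"rendered depth: {depth:.3f}  (true surface at 2.0)")
```

Note the slight bias of the rendered depth toward the camera: transmittance decays across the density peak, which is one of the handcrafted-renderer biases the abstract refers to.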
Title: VRP-UDF: Towards Unbiased Learning of Unsigned Distance Functions from Multi-view Images with Volume Rendering Priors
Pub Date: 2026-01-14 | DOI: 10.1109/tpami.2026.3654260
Gengyun Jia, Xin Ma, Bing-Kun Bao
Ordinal regression aims to predict ordered classes. Existing methods mainly focus on label distribution shapes and feature distance relationships, while the directional characteristics of the representation space remain underexplored. In this paper, we propose deep orientational representation learning (ORL), which aims to ensure that the trajectory of features, sequentially connected by ordinal categories, approximates a geodesic. We treat the output-layer weights as ordinal prototypes and introduce two constraints: the co-directional constraint and the counter-directional constraint. Both operate by constraining the angles between pairs of vectors. The former minimizes the angle between vectors with matching start and end categories, while the latter maximizes the angle between vectors whose start categories are the same but whose end categories lie on opposite sides. The two constraints optimize the representation from different ordinal directions. ORL is extended to a multi-prototype setting (MORL) to mitigate the misalignment between features and oriented prototypes caused by large intra-class variations. Theoretical analysis links ORL to distribution unimodality and distance orderliness, highlighting its advantages. The effectiveness of ORL (MORL) is demonstrated on various tasks, including facial age estimation, historical image dating, and aesthetic quality assessment.
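The two angle constraints can be illustrated on toy prototypes: treating each ordered class as a prototype vector, a co-directional term pulls forward difference vectors into alignment, while a counter-directional term pushes vectors heading to opposite ordinal sides apart. The pairings and loss forms below are a plausible reading of the abstract, not the paper's exact formulation; both losses vanish when the prototypes lie on a straight line (a Euclidean geodesic).

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

# Toy ordinal prototypes for 4 ordered classes (e.g. output-layer weights).
P = np.array([[0.0, 0.0], [1.0, 0.1], [2.0, -0.1], [3.0, 0.0]])

def codirectional_loss(P):
    """Encourage forward difference vectors (start < end) to align:
    sum of 1 - cos(angle) over pairs, minimised when all cosines are 1."""
    dirs = [unit(P[j] - P[i]) for i in range(len(P)) for j in range(i + 1, len(P))]
    loss = 0.0
    for a in range(len(dirs)):
        for b in range(a + 1, len(dirs)):
            loss += 1.0 - dirs[a] @ dirs[b]
    return loss

def counterdirectional_loss(P):
    """Encourage vectors with the same start but end categories on opposite
    sides to point apart: sum of 1 + cos(angle), minimised when cos = -1."""
    loss, n = 0.0, len(P)
    for i in range(1, n - 1):
        for j in range(i + 1, n):       # ends above the start category
            for k in range(0, i):       # ends below the start category
                loss += 1.0 + unit(P[j] - P[i]) @ unit(P[k] - P[i])
    return loss

print(codirectional_loss(P), counterdirectional_loss(P))
```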
Title: Deep Orientational Representation Learning for Ordinal Regression
In this paper, we address the challenging task of multimodal reasoning by incorporating the notion of "slow thinking" into multimodal large language models (MLLMs). Our core idea is that models can learn to adaptively use different levels of reasoning to tackle questions of varying complexity. We propose a novel paradigm of Self-structured Chain of Thought (SCoT), which consists of minimal semantic atomic steps. Unlike existing methods that rely on structured templates or free-form paradigms, our method not only generates flexible CoT structures for various complex tasks but also mitigates the phenomenon of overthinking for easier tasks. To introduce structured reasoning into visual cognition, we design a novel AtomThink framework with four key modules: (i) a data engine to generate high-quality multimodal reasoning paths; (ii) a supervised fine-tuning (SFT) process with serialized inference data; (iii) a policy-guided multi-turn inference method; and (iv) an atomic capability metric to evaluate the single-step utilization rate. Extensive experiments demonstrate that the proposed AtomThink significantly improves the performance of baseline MLLMs, achieving more than 10% average accuracy gains on MathVista and MathVerse. Compared to state-of-the-art structured CoT approaches, our method not only achieves higher accuracy but also improves data utilization by 5× and boosts inference efficiency by 85.3%. Our code is publicly available at https://github.com/Kun-Xiang/AtomThink.
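The policy-guided multi-turn loop can be caricatured as: propose candidate atomic steps, score them with a policy, commit the best one, and stop as soon as the goal is reached, so easy questions get short traces instead of overthinking. The generator and scorer below are stubs; `propose_steps` and `policy_score` are illustrative names, not the paper's API.

```python
# Minimal sketch of policy-guided multi-turn inference over atomic steps.
# The "reasoning state" is just an integer here; an MLLM and a learned
# policy/reward model would take the place of these stubs.
def propose_steps(state):
    # Stub: candidate next atomic steps given the current reasoning state.
    return [state + 1, state + 2, state + 3]

def policy_score(step, goal):
    # Stub policy: prefer the candidate that moves closest to the goal.
    return -abs(goal - step)

def atom_think(goal, max_turns=10):
    state, trace = 0, []
    for _ in range(max_turns):
        step = max(propose_steps(state), key=lambda s: policy_score(s, goal))
        trace.append(step)
        state = step
        if state == goal:  # stop early instead of overthinking easy cases
            break
    return trace

print(atom_think(7))  # → [3, 6, 7]
```

An "atomic capability metric" in this toy setting would simply be the fraction of emitted steps that actually advanced the state toward the goal.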
Title: AtomThink: Multimodal Slow Thinking With Atomic Step Reasoning
Authors: Kun Xiang, Zhili Liu, Terry Jingchen Zhang, Yinya Huang, Yunshuang Nie, Kaixin Cai, Yiyang Yin, Runhui Huang, Hanhui Li, Yihan Zeng, Yu-Jie Yuan, Jianhua Han, Lanqing Hong, Hang Xu, Xiaodan Liang
Pub Date: 2026-01-13 | DOI: 10.1109/tpami.2026.3653573
Pub Date: 2026-01-13 | DOI: 10.1109/tpami.2026.3653765
Yeongyu Choi, Fabien Moutarde, Ju H Park, Ho-Youl Jung
We propose a novel post-processing approach for the local optimization of Locally Optimized RANdom SAmple Consensus (LO-RANSAC), called the Multi-Estimation-based Parameter Centroid (MEPC) decision. It is observed that the optimal thresholds for hypothesis generation and evaluation differ in local optimization with the inner RANSAC. Instead of binary labeling for inliers and outliers, a new ternary labeling for inliers, midliers, and outliers is introduced, using two thresholds. Our experimental results show that the highest-scoring model measured by the ternary method is closer to the real model than that measured by the existing binary method. However, it should be noted that the highest score still does not correspond to the best model due to inaccurate evaluation by data noise. We introduce a new linear model centroid decision method to compensate for the highest-scoring model distorted by noise. In this process, an efficient method for measuring the similarity between two hypotheses is introduced, and candidates close to the real model are found by comparing their similarity with the highest-scoring model. Our approach determines a representative model of the multiple candidate hypotheses, which is defined as the geometric centroid of hyperplanes. We test on various datasets for homography, fundamental, and essential matrices, demonstrating that applying MEPC to existing RANSAC algorithms achieves more accurate and stable model estimation. Moreover, additional experiments on vanishing point detection show the potential of our approach for various model estimation applications.
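The ternary labeling can be sketched directly: two residual thresholds split points into inliers, midliers, and outliers, and a centroid-style decision averages the parameters of candidate models close to the best one. The threshold values and the plain parameter mean below are illustrative assumptions, not the paper's tuned procedure.

```python
import numpy as np

rng = np.random.default_rng(1)

# Points near the line y = 2x + 1, plus 20 gross outliers.
x = rng.uniform(-5, 5, 120)
y = 2 * x + 1 + rng.normal(0, 0.05, 120)
y[:20] += rng.uniform(3, 8, 20)

def ternary_labels(a, b, x, y, t_in=0.15, t_mid=0.6):
    """Ternary labelling with two thresholds (values are illustrative):
    residual < t_in -> inlier (0), < t_mid -> midlier (1), else outlier (2)."""
    r = np.abs(y - (a * x + b))
    return np.where(r < t_in, 0, np.where(r < t_mid, 1, 2))

labels = ternary_labels(2.0, 1.0, x, y)
counts = np.bincount(labels, minlength=3)
print(dict(zip(["inliers", "midliers", "outliers"], counts)))

# Centroid-style decision (illustrative): average the parameters of
# candidate models that scored close to the highest-scoring one.
candidates = np.array([[2.01, 0.98], [1.99, 1.02], [2.00, 1.00]])
a_c, b_c = candidates.mean(axis=0)
print(f"centroid model: y = {a_c:.3f}x + {b_c:.3f}")
```

The point of the midlier band is that samples too noisy for strict inlier status still carry signal for evaluation, which is what makes the ternary score a better guide than binary counting.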
Title: An Efficient Multi-Estimation-Based Parameter Centroid Decision Via Linear Regression Approach
High-speed vision tasks have long been a challenge in computer vision. Recently, the spike camera has shown great potential in these tasks due to its high temporal resolution. Unlike traditional cameras, it emits asynchronous spike signals to capture visual information. However, under low-light conditions, spike signals become highly sparse, and the sparse spike stream severely hinders the effectiveness of existing spike-based methods in high-speed scenarios. To address this challenge, we introduce SS2DS, the first deep learning framework that enhances sparse spike streams into dense spike streams. SS2DS first estimates the spike firing frequency within sparse streams. Subsequently, the spike firing frequency is enhanced by a neural network. Finally, SS2DS decodes the enhanced spike stream from the enhanced spike firing frequency sequence. SS2DS can adjust the temporal distribution of sparse spike streams and mitigate the performance degradation of existing methods in low-light and high-speed scenarios. To evaluate sparse spike stream enhancement, we construct both synthetic and real sparse spike stream datasets. The real dataset is collected in dynamic scenarios using the third-generation spike camera. Comparing reconstruction results, enhanced spike streams achieve an average improvement of +0.78 MA, -18.42 BRISQUE, and -1.42 NIQE over sparse spike streams. Moreover, the enhanced spike streams also benefit other spike-based vision tasks, such as 3D reconstruction (+1.325 dB PSNR, +0.005 SSIM, and -0.01 LPIPS) and super-resolution (+0.63 MA, -13.67 BRISQUE, and -1.28 NIQE). Code and datasets will be released after publication.
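The three-stage pipeline reads as estimate-enhance-decode, sketched below for a single pixel with a sliding-window rate estimator and a fixed gain standing in for the learned enhancement network. Window size, gain, and the Bernoulli decoder are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# A sparse binary spike stream: under low light a pixel fires far less often
# than the underlying scene intensity would warrant.
T = 1000
true_freq = 0.3                                        # intended firing frequency
sparse = (rng.random(T) < true_freq / 3).astype(int)   # low light: ~3x fewer spikes

# Step 1: estimate the firing frequency with a sliding window
# (the paper uses a learned estimator; the window size is illustrative).
win = 50
est_freq = np.convolve(sparse, np.ones(win) / win, mode="same")

# Step 2: enhance the frequency (stand-in for the neural network: a simple
# gain; SS2DS learns this mapping from data).
enh_freq = np.clip(est_freq * 3.0, 0.0, 1.0)

# Step 3: decode a dense spike stream from the enhanced frequency sequence.
dense = (rng.random(T) < enh_freq).astype(int)
print(f"sparse rate={sparse.mean():.3f}  dense rate={dense.mean():.3f}")
```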
Title: Learn to Enhance Sparse Spike Streams
Authors: Liwen Hu, Yijia Guo, Mianzhi Liu, Yiming Fan, Rui Ma, Shengbo Chen, Lei Ma, Tiejun Huang
Pub Date: 2026-01-13 | DOI: 10.1109/tpami.2026.3653768
Depth estimation from a monocular 360 image is important to the perception of the entire 3D environment. However, the inherent distortion and large field of view (FoV) in 360 images pose great challenges for this task. To this end, existing mainstream solutions typically introduce additional perspective-based 360 representations (e.g., Cubemap) to achieve effective feature extraction. Nevertheless, regardless of the introduced representations, they eventually need to be unified into the equirectangular projection (ERP) format for the subsequent depth estimation, which inevitably reintroduces additional distortions. In this work, we propose an oriented-distortion-aware Gabor Fusion framework (PGFuse) to address the above challenges. First, we introduce Gabor filters that analyze texture in the frequency domain, extending the receptive fields and enhancing depth cues. To address the reintroduced distortions, we design a latitude-aware distortion representation to generate customized, distortion-aware Gabor filters (PanoGabor filters). Furthermore, we design a channel-wise and spatial-wise unidirectional fusion module (CS-UFM) that integrates the proposed PanoGabor filters to unify other representations into the ERP format, delivering effective and distortion-aware features. Considering the orientation sensitivity of the Gabor transform, we further introduce a spherical gradient constraint to stabilize this sensitivity. Experimental results on three popular indoor 360 benchmarks demonstrate the superiority of the proposed PGFuse to existing state-of-the-art solutions.
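A latitude-aware Gabor filter can be sketched by widening a standard Gabor kernel with the ERP stretch factor 1/cos(latitude), mirroring how equirectangular projection stretches content horizontally toward the poles. The parameterisation below is illustrative; the paper generates its PanoGabor filters with a learned distortion-aware module.

```python
import numpy as np

def pano_gabor(size=15, theta=0.0, lam=6.0, sigma=3.0, gamma=0.5, lat=0.0):
    """Sketch of a latitude-aware Gabor kernel: ERP stretches content
    horizontally by 1/cos(latitude), so the kernel is widened the same way.
    All parameters here are illustrative, not the paper's."""
    stretch = 1.0 / max(np.cos(lat), 1e-3)   # horizontal ERP distortion factor
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    x = x / stretch                           # widen the kernel with latitude
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    g = np.exp(-(xr**2 + (gamma * yr) ** 2) / (2 * sigma**2)) * np.cos(2 * np.pi * xr / lam)
    return g - g.mean()  # zero-mean so flat regions give zero response

k_eq = pano_gabor(lat=0.0)                    # filter at the equator
k_pole = pano_gabor(lat=np.deg2rad(70))       # wider filter near the pole
print(k_eq.shape)
```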
Title: Revisiting 360 Depth Estimation With PanoGabor: A New Fusion Perspective
Authors: Zhijie Shen, Chunyu Lin, Lang Nie, Kang Liao, Weisi Lin, Yao Zhao
Pub Date: 2026-01-13 | DOI: 10.1109/tpami.2026.3653796
Pub Date: 2026-01-13 | DOI: 10.1109/tpami.2026.3653482
Tianshan Liu, Bing-Kun Bao
With the functions of egocentric observation and multimodal perception equipped in augmented reality (AR) devices, the next generation of smart assistants has the potential to reduce human labor and enhance execution efficiency in assembly tasks. Among diverse assembly activity understanding tasks, anticipating near-future activities is crucial yet challenging, as it can help humans or agents actively plan and engage in interactions with the environment. However, existing egocentric activity anticipation methods still struggle to achieve a decent trade-off between accuracy and computational efficiency, hindering their deployment in practical applications. To address this dilemma, in this paper we propose a goal-guided prompting framework with adaptive modality selection (GP-AMS) for assembly activity anticipation in egocentric videos. To bridge the semantic gap between historical observations and unobserved future activities, we inject inferred high-level goal clues into the constructed prompts, which are further utilized to guide a pre-trained vision-language (V-L) model to compensate for relevant semantics of the unseen future. Moreover, a mask-and-predict strategy is adopted with two imposed constraints, i.e., causal masking and probabilistic token-dropping, to mine the intrinsic associations between assembly activities within a specific procedure. To maintain the benefits of exploiting multimodal information while avoiding excessive computational burdens, an adaptive modality selection strategy is designed to train a policy network, which learns to dynamically decide which modalities should be sampled for processing by the anticipation model on a per-timestep basis. By allocating the bulk of computation to the selected indicative modalities on the fly, the efficiency of the overall model is improved, paving the way for feasibility on real-world devices.
Extensive experimental results on two public data sets validate that the proposed method yields not only consistent improvements in anticipation accuracy, but also significant savings in computation budgets.
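The adaptive selection can be sketched as a per-timestep gate: a policy maps the model's current state to a subset of modalities, and only those are processed. The confidence rule, modality names, and costs below are invented for illustration; the paper trains a policy network to make this decision.

```python
# Relative per-step processing costs (illustrative values, not the paper's).
MODAL_COST = {"rgb": 1.0, "depth": 0.6, "audio": 0.2}

def select_modalities(confidence):
    # Stub policy: the more confident the anticipation model already is,
    # the fewer (and cheaper) modalities it samples at this timestep.
    if confidence > 0.8:
        return ["audio"]
    if confidence > 0.5:
        return ["audio", "depth"]
    return ["rgb", "depth", "audio"]

def rollout(confidences):
    """Tally computation spent by the gated model vs. always-on processing."""
    full = len(confidences) * sum(MODAL_COST.values())
    spent = sum(sum(MODAL_COST[m] for m in select_modalities(c)) for c in confidences)
    return spent, full

spent, full = rollout([0.2, 0.4, 0.6, 0.85, 0.9])
print(f"compute used: {spent:.1f} / {full:.1f}")
```

As confidence rises over an observed procedure, the gate drops the expensive modalities first, which is where the reported computation savings would come from.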
Title: Goal-guided Prompting with Adaptive Modality Selection for Efficient Assembly Activity Anticipation in Egocentric Videos