All-in-One Transformer for Image Restoration Under Adverse Weather Degradations
Pub Date : 2026-01-28 DOI: 10.1109/tpami.2026.3658598
Jiawei Mao, Yu Yang, Xuesong Yin, Ling Shao, Hao Tang
Severe weather restoration models often face the simultaneous interaction of multiple degradations in real-world scenarios. Existing approaches typically handle single or composite degradations based on scene descriptors derived from text or image embeddings. However, due to the varying proportions of different degradations within an image, these scene descriptors may not accurately differentiate between degradations, leading to suboptimal restoration in practical applications. To address this issue, we propose a novel Transformer-based restoration framework, AllRestorer, for dealing with four physical severe weather impairments: low-light, haze, rain, and snow. In AllRestorer, we enable the model to adaptively consider all weather impairments, thereby avoiding errors from scene descriptor misdirection. Specifically, we introduce the All-in-One Transformer Block (AiOTB), the core innovation of which is the ability to adaptively handle multiple degradations in a single image, beyond the limitation of existing Transformers that can only handle one type of degradation at a time. To accurately address different variations potentially present within the same type of degradation and minimize ambiguity, AiOTB utilizes a Composite Scene Embedding consisting of both image and text embeddings to define the degradation. Moreover, AiOTB includes an adaptive weight for each degradation, allowing for precise control of the restoration intensity. By leveraging AiOTB, AllRestorer avoids misdirection caused by inaccurate scene descriptors, achieving a 5.00 dB increase in PSNR compared to the baseline on the CDD-11 dataset.
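The per-degradation adaptive weighting described above can be illustrated with a minimal sketch: a composite scene embedding (image plus text features) is compared against learnable degradation tokens, and the resulting weights scale one lightweight restoration branch per degradation. This is a hypothetical simplification, not the published AiOTB; the class name, layer sizes, and the sigmoid-gated residual mixing are all assumptions.

```python
import torch
import torch.nn as nn

class AdaptiveDegradationMixer(nn.Module):
    """Illustrative sketch: weight several restoration branches by how strongly each
    degradation appears in the scene, judged from a composite (image + text) embedding.
    Not the published AiOTB design."""

    def __init__(self, channels=64, embed_dim=256,
                 degradations=("low-light", "haze", "rain", "snow")):
        super().__init__()
        # One lightweight residual restoration branch per degradation type (assumed form).
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.GELU(),
                          nn.Conv2d(channels, channels, 3, padding=1))
            for _ in degradations
        ])
        # Learnable token per degradation, compared against the fused scene embedding.
        self.degradation_tokens = nn.Parameter(torch.randn(len(degradations), embed_dim))
        self.scene_proj = nn.Linear(2 * embed_dim, embed_dim)  # fuse image + text embeddings

    def forward(self, feat, image_embed, text_embed):
        # feat: (B, C, H, W); image_embed, text_embed: (B, D)
        scene = self.scene_proj(torch.cat([image_embed, text_embed], dim=-1))  # (B, D)
        weights = torch.sigmoid(scene @ self.degradation_tokens.t())           # (B, K) strengths
        out = feat
        for k, branch in enumerate(self.branches):
            w = weights[:, k].view(-1, 1, 1, 1)
            out = out + w * branch(feat)  # residual correction scaled by estimated degradation strength
        return out, weights
```

In practice, image_embed and text_embed would typically come from a frozen vision-language encoder; returning the weights alongside the features keeps the per-degradation restoration intensity inspectable.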
{"title":"All-in-One Transformer for Image Restoration Under Adverse Weather Degradations.","authors":"Jiawei Mao,Yu Yang,Xuesong Yin,Ling Shao,Hao Tang","doi":"10.1109/tpami.2026.3658598","DOIUrl":"https://doi.org/10.1109/tpami.2026.3658598","url":null,"abstract":"Severe weather restoration models often face the simultaneous interaction of multiple degradations in real-world scenarios. Existing approaches typically handle single or composite degradations based on scene descriptors derived from text or image embeddings. However, due to the varying proportions of different degradations within an image, these scene descriptors may not accurately differentiate between degradations, leading to suboptimal restoration in practical applications. To address this issue, we propose a novel Transformer-based restoration framework, AllRestorer, for dealing with four physical severe weather impairments: low-light, haze, rain, and snow. In AllRestorer, we enable the model to adaptively consider all weather impairments, thereby avoiding errors from scene descriptor misdirection. Specifically, we introduce the All-in-One Transformer Block (AiOTB), the core innovation of which is the ability to adaptively handle multiple degradations in a single image, beyond the limitation of existing Transformers that can only handle one type of degradation at a time. To accurately address different variations potentially present within the same type of degradation and minimize ambiguity, AiOTB utilizes a Composite Scene Embedding consisting of both image and text embeddings to define the degradation. Moreover, AiOTB includes an adaptive weight for each degradation, allowing for precise control of the restoration intensity. By leveraging AiOTB, AllRestorer avoids misdirection caused by inaccurate scene descriptors, achieving a 5.00 dB increase in PSNR compared to the baseline on the CDD-11 dataset.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"44 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2026-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146069939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Positive Data Augmentation Based on Manifold Heuristic Optimization for Image Classification
Pub Date : 2026-01-23 DOI: 10.1109/tpami.2026.3657249
Fangqing Liu, Han Huang, Fujian Feng, Xueming Yan, Zhifeng Hao
Data augmentation is crucial for addressing insufficient training data, especially for augmenting positive samples. However, existing methods mostly rely on neural network-based feedback for data augmentation and often overlook the optimization of feature distribution. In this study, we present a practical, distribution-preserving data augmentation pipeline that augments positive samples by optimizing a feature indicator (e.g., two-dimensional entropy), aiming to maintain alignment with the original data distribution. Inspired by the manifold hypothesis, we propose a Manifold Heuristic Optimization Algorithm (MHOA), which augments positive samples by exploring the low-dimensional Euclidean space around object contour pixels instead of the entire decision space. Guided by a "distribution-preservation-first" perspective, our approach explicitly optimizes fidelity to the original data manifold and only retains augmented samples whose feature statistics (e.g., mean, variance) align with the source class. It significantly improves image classification accuracy across neural networks, outperforming state-of-the-art data augmentation methods, especially when the dataset's feature indicator follows a Gaussian distribution. The algorithm's search space, focused on neighborhoods of key feature pixels, is the core driver of its superior performance.
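The "search near object contours, keep only distribution-preserving samples" idea can be pictured with the rough sketch below: perturb only high-gradient pixel neighbourhoods and accept a candidate when its simple feature statistics stay close to those of its class. The gradient-based contour proxy and the (mean, std) indicator are stand-ins chosen for brevity; they are not the paper's MHOA operators or its two-dimensional entropy indicator.

```python
import numpy as np

def contour_mask(img, top_frac=0.05):
    """Rough contour proxy: pixels with the largest gradient magnitude."""
    gy, gx = np.gradient(img.astype(np.float32))
    mag = np.hypot(gx, gy)
    return mag >= np.quantile(mag, 1.0 - top_frac)

def augment_near_contours(img, noise_std=8.0, rng=None):
    """Perturb only the neighbourhood of contour pixels (a low-dimensional search
    space), leaving the rest of the image untouched."""
    rng = np.random.default_rng() if rng is None else rng
    mask = contour_mask(img)
    # One-pixel dilation so the perturbation covers contour neighbourhoods.
    padded = np.pad(mask, 1)
    dilated = np.zeros_like(mask)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            dilated |= padded[1 + dy:1 + dy + mask.shape[0], 1 + dx:1 + dx + mask.shape[1]]
    noisy = img.astype(np.float32) + rng.normal(0.0, noise_std, size=img.shape) * dilated
    return np.clip(noisy, 0, 255).astype(img.dtype)

def keeps_distribution(candidate, class_images, tol=1.0):
    """Accept a candidate only if its feature statistics stay close to the class.
    The 'feature indicator' here is simply (mean, std) as a stand-in for richer
    indicators such as two-dimensional entropy."""
    stats = np.array([[im.mean(), im.std()] for im in class_images])
    mu, sigma = stats.mean(axis=0), stats.std(axis=0) + 1e-6
    z = np.abs((np.array([candidate.mean(), candidate.std()]) - mu) / sigma)
    return bool(np.all(z <= tol))
```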
{"title":"Positive Data Augmentation Based on Manifold Heuristic Optimization for Image Classification.","authors":"Fangqing Liu,Han Huang,Fujian Feng,Xueming Yan,Zhifeng Hao","doi":"10.1109/tpami.2026.3657249","DOIUrl":"https://doi.org/10.1109/tpami.2026.3657249","url":null,"abstract":"Data augmentation is crucial for addressing insufficient training data, especially for augmenting positive samples. However, existing methods mostly rely on neural network-based feedback for data augmentation and often overlook the optimization of feature distribution. In this study, we present a practical, distribution-preserving data augmentation pipeline that augments positive samples by optimizing a feature indicator (e.g., two-dimensional entropy), aiming to maintain alignment with the original data distribution. Inspired by the manifold hypothesis, we propose a Manifold Heuristic Optimization Algorithm (MHOA), which augments positive samples by exploring the low-dimensional Euclidean space around object contour pixels instead of the entire decision space. Guided by a \"distribution-preservation-first\" perspective, our approach explicitly optimizes fidelity to the original data manifold and only retains augmented samples whose feature statistics (e.g., mean, variance) align with the source class. It significantly improves image classification accuracy across neural networks, outperforming state-of-the-art data augmentation methods-especially when the dataset's feature indicator follows a Gaussian distribution. The algorithm's search space, focused on neighborhoods of key feature pixels, is the core driver of its superior performance.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"56 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2026-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146034078","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DSNeRF: Dynamic View Synthesis for Ultra-Fast Scenes from Continuous Spike Streams
Pub Date : 2026-01-22 DOI: 10.1109/tpami.2026.3656825
Lin Zhu, Kangmin Jia, Yifan Zhao, Yunshan Qi, Lizhi Wang, Hua Huang
Spike cameras generate binary spikes in response to light intensity changes, enabling high-speed visual perception with unprecedented temporal resolution. However, the unique characteristics of spike streams present significant challenges for reconstructing dense 3D scene representations, particularly in dynamic environments and under non-ideal lighting conditions. In this paper, we introduce DSNeRF, the first method to derive a NeRF-based volumetric scene representation from spike camera data. Our approach leverages NeRF's multi-view consistency to establish robust self-supervision, effectively eliminating erroneous measurements and uncovering coherent structures within exceedingly noisy input amidst diverse real-world illumination scenarios. We propose a novel mapping from pixel rays to the spike domain, integrating the spike generation process directly into NeRF training. Specifically, DSNeRF introduces an integrate-and-fire neuron layer that models non-idealities to capture intrinsic camera noise, including both random and fixed-pattern spike noise, thereby enhancing scene fidelity. Additionally, we propose a motion-guided spiking neuron layer and a long-term rendering photometric loss to better align dynamic spike streams, ensuring accurate scene geometry. Our method optimizes neural radiance fields to render photorealistic novel views from continuous spike streams, demonstrating advantages over other vision sensors in certain scenes. Empirical evaluations on both real and simulated sequences validate the effectiveness of our approach. The dataset and source code will be released at https://github.com/BIT-Vision/DSNeRF.
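The integrate-and-fire layer described above can be pictured as a forward spike-generation model applied to rendered intensities. The function below is a generic simulation with subtractive reset plus additive Gaussian random noise and a fixed-pattern offset; the reset rule, noise model, threshold, and function name are assumptions, not the DSNeRF layer itself.

```python
import torch

def integrate_and_fire(intensity, threshold=1.0, random_noise_std=0.02, fixed_pattern_noise=None):
    """Minimal integrate-and-fire simulation: accumulate per-pixel intensity over time
    and emit a binary spike whenever the accumulator crosses the threshold (subtractive
    reset). Optional noise terms mimic random and fixed-pattern spike noise.

    intensity: (T, H, W) non-negative light intensity predicted along camera rays.
    returns:   (T, H, W) binary spike stream.
    """
    T, H, W = intensity.shape
    if fixed_pattern_noise is None:
        fixed_pattern_noise = torch.zeros(H, W)
    acc = torch.zeros(H, W)
    spikes = torch.zeros(T, H, W)
    for t in range(T):
        noisy = intensity[t] + fixed_pattern_noise + random_noise_std * torch.randn(H, W)
        acc = acc + noisy.clamp(min=0.0)       # integrate incoming light
        fired = acc >= threshold
        spikes[t] = fired.float()
        acc = acc - threshold * fired.float()  # subtract-and-reset keeps residual charge
    return spikes
```

Because the whole loop is differentiable apart from the threshold comparison, a surrogate gradient or soft threshold would be needed to backpropagate through it during NeRF training; the paper's layer presumably handles this differently.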
{"title":"DSNeRF: Dynamic View Synthesis for Ultra-Fast Scenes from Continuous Spike Streams.","authors":"Lin Zhu,Kangmin Jia,Yifan Zhao,Yunshan Qi,Lizhi Wang,Hua Huang","doi":"10.1109/tpami.2026.3656825","DOIUrl":"https://doi.org/10.1109/tpami.2026.3656825","url":null,"abstract":"Spike cameras generate binary spikes in response to light intensity changes, enabling high-speed visual perception with unprecedented temporal resolution. However, the unique characteristics of spike stream present significant challenges for reconstructing dense 3D scene representations, particularly in dynamic environments and under non-ideal lighting conditions. In this paper, we introduce DSNeRF, the first method to derive a NeRF-based volumetric scene representation from spike camera data. Our approach leverages NeRF's multi-view consistency to establish robust self-supervision, effectively eliminating erroneous measurements and uncovering coherent structures within exceedingly noisy input amidst diverse real-world illumination scenarios. We propose a novel mapping from pixel rays to the spike domain, integrating the spike generation process directly into NeRF training. Specifically, DSNeRF introduces an integrate-and-fire neuron layer that models non-idealities to capture intrinsic camera noise, including both random and fixed-pattern spike noise, thereby enhancing scene fidelity. Additionally, we propose a motion-guided spiking neuron layer and a long-term rendering photometric loss to better align dynamic spike streams, ensuring accurate scene geometry. Our method optimizes neural radiance fields to render photorealistic novel views from continuous spike streams, demonstrating advantages over other vision sensors in certain scenes. Empirical evaluations on both real and simulated sequences validate the effectiveness of our approach. The dataset and source code will be released at https://github.com/BIT-Vision/DSNeRF.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"3 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2026-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146021642","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Unifying Multi-modal Hair Editing via Proxy Feature Blending
Pub Date : 2026-01-21 DOI: 10.1109/tpami.2026.3656763
Tianyi Wei, Dongdong Chen, Wenbo Zhou, Jing Liao, Can Wang, Weiming Zhang, Gang Hua, Nenghai Yu
Hair editing is a long-standing problem in computer vision that demands both fine-grained local control and intuitive user interactions across diverse modalities. Despite the remarkable progress of GANs and diffusion models, existing methods still lack a unified framework that simultaneously supports arbitrary interaction modes (e.g., text, sketch, mask, and reference image) while ensuring precise editing and faithful preservation of irrelevant attributes. In this work, we introduce a novel paradigm that reformulates hair editing as proxy-based hair transfer. Specifically, we leverage the dense and semantically disentangled latent space of StyleGAN for precise manipulation and exploit its feature space for disentangled attribute preservation, thereby decoupling the objectives of editing and preservation. Our framework unifies different modalities by converting editing conditions into distinct transfer proxies, whose features are seamlessly blended to achieve global or local edits. Beyond 2D, we extend our paradigm to 3D-aware settings by incorporating EG3D and PanoHead, where we propose a multi-view boosted hair feature localization strategy together with 3D-tailored proxy generation methods that exploit the inherent properties of 3D-aware generative models. Extensive experiments demonstrate that our method consistently outperforms prior approaches in editing effects, attribute preservation, visual naturalness, and multi-view consistency, while offering unprecedented support for multimodal and mixed-modal interactions.
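At its core, proxy-based editing with attribute preservation amounts to blending generator features inside a hair region. The snippet below is a minimal, assumed form of such blending (a soft mask interpolated to the feature resolution); the paper's actual StyleGAN-feature blending and proxy construction are more involved.

```python
import torch
import torch.nn.functional as F

def blend_proxy_features(source_feat, proxy_feat, hair_mask):
    """Illustrative proxy feature blending: inside the (resized) hair region, take
    features from the editing proxy; elsewhere keep the source features so unrelated
    attributes are preserved. A simplified stand-in, not the published method.

    source_feat, proxy_feat: (B, C, H, W) generator feature maps.
    hair_mask: (B, 1, h, w) soft mask in [0, 1] marking the hair region.
    """
    mask = F.interpolate(hair_mask, size=source_feat.shape[-2:],
                         mode="bilinear", align_corners=False)
    return mask * proxy_feat + (1.0 - mask) * source_feat
```

The same blend applies whether the proxy came from a text prompt, a sketch, a mask, or a reference image, which is what lets one mechanism unify the different interaction modes described above.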
{"title":"Unifying Multi-modal Hair Editing via Proxy Feature Blending.","authors":"Tianyi Wei,Dongdong Chen,Wenbo Zhou,Jing Liao,Can Wang,Weiming Zhang,Gang Hua,Nenghai Yu","doi":"10.1109/tpami.2026.3656763","DOIUrl":"https://doi.org/10.1109/tpami.2026.3656763","url":null,"abstract":"Hair editing is a long-standing problem in computer vision that demands both fine-grained local control and intuitive user interactions across diverse modalities. Despite the remarkable progress of GANs and diffusion models, existing methods still lack a unified framework that simultaneously supports arbitrary interaction modes (e.g., text, sketch, mask, and reference image) while ensuring precise editing and faithful preservation of irrelevant attributes. In this work, we introduce a novel paradigm that reformulates hair editing as proxy-based hair transfer. Specifically, we leverage the dense and semantically disentangled latent space of StyleGAN for precise manipulation and exploit its feature space for disentangled attribute preservation, thereby decoupling the objectives of editing and preservation. Our framework unifies different modalities by converting editing conditions into distinct transfer proxies, whose features are seamlessly blended to achieve global or local edits. Beyond 2D, we extend our paradigm to 3D-aware settings by incorporating EG3D and PanoHead, where we propose a multi-view boosted hair feature localization strategy together with 3D-tailored proxy generation methods that exploit the inherent properties of 3D-aware generative models. Extensive experiments demonstrate that our method consistently outperforms prior approaches in editing effects, attribute preservation, visual naturalness, and multi-view consistency, while offering unprecedented support for multimodal and mixed-modal interactions.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"95 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2026-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146015371","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CAKGE: Context-aware Adaptive Learning for Dynamic Knowledge Graph Embeddings
Pub Date : 2026-01-21 DOI: 10.1109/tpami.2026.3655896
Zongsheng Cao, Qianqian Xu, Zhiyong Yang, Xiaochun Cao, Qingming Huang
Knowledge graph embeddings (KGE) are effective for representing factual data for numerous applications. However, real-world facts continually evolve, necessitating ongoing updates to knowledge graphs as new information emerges. Under these circumstances, existing KGE models in transductive, inductive, and continual learning settings are prone to catastrophic forgetting or require costly retraining to integrate new information. To address these challenges, we propose a novel model called the Context-aware Adaptive learning model for Knowledge Graph Embeddings (CAKGE). Our model first identifies semantically relevant entities and uncovers latent relational paths to facilitate the acquisition of new knowledge. To ensure the paths are semantically aligned with the query, we employ a context-aware fusion module, which leverages multiple specialized expert networks to assess and integrate the relevance of these relational paths. Building on this, we introduce an adaptive message aggregation module that incorporates a knowledge replay strategy, enabling the model to integrate both new and existing knowledge efficiently, without retraining the knowledge graph. Additionally, to mitigate catastrophic forgetting, we reformulate the challenge of aligning new with existing knowledge as a graph-matching task using the Fused Gromov-Wasserstein distance, enabling the alignment of old and new knowledge from both semantic and topological perspectives. Furthermore, we provide theoretical guarantees for the expressiveness and reasoning ability of CAKGE, showing that it is the first unified framework tackling transductive, inductive, and continual settings. Extensive experiments show that CAKGE achieves state-of-the-art performance, demonstrating its effectiveness in dynamic KGE modeling.
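The context-aware fusion module can be sketched as a small mixture of experts that scores latent relational paths against the query and aggregates them by their gated relevance. Everything below (class name, layer sizes, the gating form, the softmax aggregation) is an assumed simplification rather than the CAKGE implementation.

```python
import torch
import torch.nn as nn

class ContextAwarePathFusion(nn.Module):
    """Sketch: several small expert networks score candidate relational-path embeddings
    against the query; a gate mixes the experts, and the softmax-weighted combination of
    paths forms the fused context. Not the published CAKGE module."""

    def __init__(self, dim=128, num_experts=4):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))
            for _ in range(num_experts)
        ])
        self.gate = nn.Linear(2 * dim, num_experts)

    def forward(self, query, paths):
        # query: (B, D); paths: (B, P, D) embeddings of latent relational paths.
        B, P, D = paths.shape
        pair = torch.cat([query.unsqueeze(1).expand(-1, P, -1), paths], dim=-1)    # (B, P, 2D)
        scores = torch.stack([e(pair).squeeze(-1) for e in self.experts], dim=-1)  # (B, P, E)
        gate = torch.softmax(self.gate(pair), dim=-1)                              # (B, P, E)
        relevance = torch.softmax((scores * gate).sum(-1), dim=1)                  # (B, P)
        return (relevance.unsqueeze(-1) * paths).sum(dim=1)                        # (B, D) fused context
```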
{"title":"CAKGE: Context-aware Adaptive Learning for Dynamic Knowledge Graph Embeddings.","authors":"Zongsheng Cao,Qianqian Xu,Zhiyong Yang,Xiaochun Cao,Qingming Huang","doi":"10.1109/tpami.2026.3655896","DOIUrl":"https://doi.org/10.1109/tpami.2026.3655896","url":null,"abstract":"Knowledge graph embeddings (KGE) are effective for representing factual data for numerous applications. However, real-world facts continually evolve, necessitating ongoing updates to knowledge graphs as new information emerges. Under these circumstances, existing KGE models in transductive, inductive, and continual learning settings are prone to catastrophic forgetting or require costly retraining to integrate new information. To address these challenges, we propose a novel model called the Context-aware Adaptive learning model for Knowledge Graph Embeddings (CAKGE). Our model first identifies semantic-relevant entities and uncovers latent relational paths to facilitate the acquisition of new knowledge. To ensure the paths are semantically aligned with the query, we employ a context-aware fusion module, which leverages multiple specialized expert networks to assess and integrate the relevance of these relational paths. Building on this, we introduce an adaptive message aggregation module that incorporates a knowledge replay strategy, enabling the model to integrate both new and existing knowledge efficiently, without retraining the knowledge graph. Additionally, to mitigate catastrophic forgetting, we reformulate the challenge of aligning new with existing knowledge as a graph-matching task using the Fused Gromov-Wasserstein distance, enabling the alignment of old and new knowledge from both semantic and topological perspectives. Furthermore, we provide theoretical guarantees for the expressiveness and reasoning ability of CAKGE, showing that it is the first unified framework tackling transductive, inductive, and continual settings. Extensive experiments show that CAKGE achieves state-of-the-art performance, demonstrating its effectiveness in dynamic KGE modeling.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"278 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2026-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146015376","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
First-Order Cross-Domain Meta Learning for Few-Shot Remote Sensing Object Classification
Pub Date : 2026-01-21 DOI: 10.1109/tpami.2026.3656494
Wenda Zhao, Yunxiang Li, Haipeng Wang, Huchuan Lu
Remote sensing images exhibit intrinsic domain complexity arising from multi-source sensor variances, a heterogeneity that fundamentally challenges conventional cross-domain few-shot methods that assume simple distribution shifts. To address this, we propose first-order Cross-Domain Meta Learning (CDML) for few-shot remote sensing object classification. CDML implements a dual-stage domain adaptation task as the fundamental meta-learning unit and includes a cross-domain meta-train phase (CDMTrain) and a cross-domain meta-test phase (CDMTest). In CDMTrain, we propose inner-loop multi-domain few-shot task sampling, which enables a teacher model to encapsulate both cross-category discriminative features and authentic inter-domain distributional divergence. This alternating cyclic learning paradigm captures genuine domain shifts, with each update direction progressively guiding the model toward parameters that balance multi-domain performance. In CDMTest, we assess domain diversity enhancement by transferring the teacher parameters to the student model and evaluating its cross-domain capability on a reserved pseudo-unseen domain. The task-level design progressively improves domain generalization through iterative domain-adaptive task learning. Meanwhile, to mitigate the conflicts and inadequacies caused by multi-domain scenarios, we propose a learnable affine transformation model that adaptively learns affine transformation parameters from intermediate-layer features to fine-tune the update direction. Extensive experiments on five remote sensing classification benchmarks demonstrate the superior performance of the proposed method compared with state-of-the-art methods. The code will be released at: https://github.com/lyxdlut/CDML.
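The first-order flavour of such a meta-update can be illustrated with a Reptile-style outer step over per-domain few-shot tasks: adapt a clone of the model on each sampled domain, then move the shared parameters toward the average of the adapted weights. This is a generic first-order meta-learning sketch under assumed interfaces (loss_fn, batch), not the exact CDMTrain/CDMTest procedure.

```python
import copy
import torch

def first_order_meta_step(model, domain_tasks, inner_lr=1e-2, outer_lr=0.1, inner_steps=5):
    """One first-order meta-update in the style of Reptile: adapt a clone on each sampled
    domain task, then nudge the shared parameters toward the mean of the adapted parameters.

    domain_tasks: iterable of (loss_fn, batch) pairs, one few-shot task per domain,
                  where loss_fn(model, batch) returns a scalar loss.
    """
    base = {k: v.detach().clone() for k, v in model.state_dict().items()}
    adapted_states = []
    for loss_fn, batch in domain_tasks:
        clone = copy.deepcopy(model)
        opt = torch.optim.SGD(clone.parameters(), lr=inner_lr)
        for _ in range(inner_steps):            # inner-loop adaptation on one domain
            opt.zero_grad()
            loss_fn(clone, batch).backward()
            opt.step()
        adapted_states.append(clone.state_dict())
    new_state = {}
    for k in base:                              # outer step: move toward the average adapted weights
        if base[k].dtype.is_floating_point:
            mean_adapted = torch.stack([s[k].float() for s in adapted_states]).mean(0)
            new_state[k] = base[k] + outer_lr * (mean_adapted - base[k])
        else:
            new_state[k] = base[k]              # keep integer buffers unchanged
    model.load_state_dict(new_state)
```

The teacher-to-student transfer and the pseudo-unseen-domain evaluation described in the abstract would sit around this update; only the first-order parameter step is shown.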
{"title":"First-Order Cross-Domain Meta Learning for Few-Shot Remote Sensing Object Classification.","authors":"Wenda Zhao,Yunxiang Li,Haipeng Wang,Huchuan Lu","doi":"10.1109/tpami.2026.3656494","DOIUrl":"https://doi.org/10.1109/tpami.2026.3656494","url":null,"abstract":"Remote sensing images exhibit intrinsic domain complexity arising from multi-source sensor variances, which heterogeneity fundamentally challenges conventional cross-domain few-shot methods that assume simple distribution shifts. Addressing this, we propose a first-order Cross-Domain Meta Learning (CDML) for few-shot remote sensing object classification. CDML implements a dual-stage domain adaptation task as the fundamental meta-learning unit, and includes a cross-domain meta-train phase (CDMTrain) and a cross-domain meta-test phase (CDMTest). In CDMTrain, we propose an inner-loop multi-domain few-shot task sampling, which enables a teacher model encapsulate both cross-category discriminative features and authentic inter-domain distributional divergence. This alternating cyclic learning paradigm captures genuine domain shifts, with each update direction progressively guiding the model toward parameters that balance multi-domain performance. In CDMTest, we evaluate a domain diversity enhancement by transferring teacher parameters to the student model for cross-domain capability assessment on the reserved pseudo-unseen domain. The task-level design progressively improves domain generalization through iterative domain adaptive task learning. Meanwhile, to mitigate the conflicts and inadequacies caused by multi-domain scenarios, we propose a learnable affine transformation model. It adaptively learns affine transformation parameters through intermediate layer features to fine-tune the update direction. Extensive experiments on five remote sensing classification benchmarks demonstrate a superior performance of the proposed method compared with the state-of-the-art methods. The code will be released at: https://github.com/lyxdlut/CDML.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"222 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2026-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146015380","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Momentor++: Advancing Video Large Language Models With Fine-Grained Long Video Reasoning
Pub Date : 2026-01-20 DOI: 10.1109/tpami.2026.3656169
Juncheng Li, Minghe Gao, Xiangnan He, Siliang Tang, Weishi Zheng, Jun Xiao, Meng Wang, Tat-Seng Chua, Yueting Zhuang
Large Language Models (LLMs) exhibit remarkable proficiency in understanding and managing text-based tasks. Many works try to transfer these capabilities to the video domain; the resulting models are referred to as Video-LLMs. However, current Video-LLMs can only grasp coarse-grained semantics and are unable to efficiently handle tasks involving the comprehension or localization of specific video segments. To address these challenges, we propose Momentor, a Video-LLM designed to perform fine-grained temporal understanding tasks. To facilitate the training of Momentor, we develop an automatic data generation engine to build Moment-10M, a large-scale video instruction dataset with segment-level instruction data. Building upon the foundation of the previously published Momentor and the Moment-10M dataset, we further extend this work by introducing a Spatio-Temporal Token Consolidation (STTC) method, which merges redundant visual tokens spatio-temporally in a parameter-free manner, thereby significantly improving computational efficiency while preserving fine-grained visual details. We integrate STTC with Momentor to develop Momentor++ and validate its performance on various benchmarks. Momentor demonstrates robust capabilities in fine-grained temporal understanding and localization. Further, Momentor++ excels in efficiently processing and analyzing extended videos with complex events, showcasing marked advancements in handling extensive temporal contexts.
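Parameter-free token consolidation can be approximated by greedy similarity-based merging: repeatedly average the two most similar tokens until a target count is reached. The routine below follows that general idea under an assumed (N, D) token layout; it is not the STTC algorithm, which merges tokens spatio-temporally with its own criteria.

```python
import torch
import torch.nn.functional as F

def consolidate_tokens(tokens, keep_ratio=0.5):
    """Greedy, parameter-free token reduction sketch: at each step, find the most
    similar pair of remaining tokens (cosine similarity) and replace it with its
    average, until only keep_ratio of the tokens remain.

    tokens: (N, D) visual tokens pooled over space and time for one video clip.
    """
    tokens = tokens.clone()
    target = max(1, int(tokens.shape[0] * keep_ratio))
    while tokens.shape[0] > target:
        x = F.normalize(tokens, dim=-1)
        sim = x @ x.t()
        sim.fill_diagonal_(-float("inf"))            # ignore self-similarity
        idx = torch.argmax(sim)
        i, j = divmod(idx.item(), sim.shape[1])      # most redundant pair
        merged = 0.5 * (tokens[i] + tokens[j])
        keep = [k for k in range(tokens.shape[0]) if k not in (i, j)]
        tokens = torch.cat([tokens[keep], merged.unsqueeze(0)], dim=0)
    return tokens
```

A practical implementation would operate per merging stage inside the vision backbone rather than greedily over all tokens at once, but the sketch shows why no extra parameters are needed: redundancy is judged purely from token similarity.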
{"title":"Momentor++: Advancing Video Large Language Models With Fine-Grained Long Video Reasoning.","authors":"Juncheng Li,Minghe Gao,Xiangnan He,Siliang Tang,Weishi Zheng,Jun Xiao,Meng Wang,Tat-Seng Chua,Yueting Zhuang","doi":"10.1109/tpami.2026.3656169","DOIUrl":"https://doi.org/10.1109/tpami.2026.3656169","url":null,"abstract":"Large Language Models (LLMs) exhibit remarkable proficiency in understanding and managing text-based tasks. Many works try to transfer these capabilities to the video domain, which are referred to as Video-LLMs. However, current Video-LLMs can only grasp the coarse-grained semantics and are unable to efficiently handle tasks involving the comprehension or localization of specific video segments. To address these challenges, we propose Momentor, a Video-LLM designed to perform fine-grained temporal understanding tasks. To facilitate the training of Momentor, we develop an automatic data generation engine to build Moment-10 M, a large-scale video instruction dataset with segment-level instruction data. Building upon the foundation of the previously published Momentor and the Moment-10 M dataset, we further extend this work by introducing a Spatio-Temporal Token Consolidation (STTC) method, which can merge redundant visual tokens spatio-temporally in a parameter-free manner, thereby significantly promoting computational efficiency while preserving fine-grained visual details. We integrate STTC with Momentor to develop Momentor++ and validate its performance on various benchmarks. Momentor demonstrates robust capabilities in fine-grained temporal understanding and localization. Further, Momentor++ excels in efficiently processing and analyzing extended videos with complex events, showcasing marked advancements in handling extensive temporal contexts.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"30 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2026-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146005486","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}