
IEEE Transactions on Pattern Analysis and Machine Intelligence: Latest Publications

A Novel and Effective Method to Directly Solve Spectral Clustering
Pub Date : 2024-08-21 DOI: 10.1109/TPAMI.2024.3447287
Feiping Nie;Chaodie Liu;Rong Wang;Xuelong Li
Spectral clustering has been attracting increasing attention due to its well-defined framework and excellent performance. However, most traditional spectral clustering methods consist of two separate steps: 1) solving a relaxed optimization problem to learn the continuous clustering labels, and 2) rounding the continuous clustering labels into discrete ones. This relax-and-discretize strategy inevitably results in information loss and unsatisfactory clustering performance. Moreover, the similarity matrix constructed from the original data may not be optimal for clustering since data usually contain noise and redundancy. To address these problems, we propose a novel and effective algorithm to directly optimize the original spectral clustering model, called Direct Spectral Clustering (DSC). We theoretically prove that the original spectral clustering model can be solved by simultaneously learning a weighted discrete indicator matrix and a structured similarity matrix whose number of connected components equals the number of clusters. Both can be used to directly obtain the final clustering results without any post-processing. Further, an effective iterative optimization algorithm is developed to solve the proposed model. Extensive experiments performed on synthetic and real-world datasets demonstrate the superiority and effectiveness of the proposed method compared to state-of-the-art algorithms.
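To make the criticized two-step baseline concrete, here is a minimal Python sketch of the conventional relax-and-discretize pipeline that DSC is designed to avoid (not the paper's method): eigenvectors of the normalized Laplacian provide continuous labels, and k-means rounds them into discrete clusters, which is exactly the lossy post-processing step described above. The Gaussian similarity and all parameter choices are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def relax_and_discretize_sc(X, n_clusters, sigma=1.0):
    # Step 0: Gaussian similarity from raw data (possibly noisy/redundant).
    W = np.exp(-cdist(X, X, 'sqeuclidean') / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Normalized Laplacian L = I - D^{-1/2} W D^{-1/2}.
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d + 1e-12))
    L = np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt
    # Step 1 (relax): continuous labels are the eigenvectors of L for the
    # n_clusters smallest eigenvalues (np.linalg.eigh sorts ascending).
    _, vecs = np.linalg.eigh(L)
    F = vecs[:, :n_clusters]
    # Step 2 (discretize): round continuous labels with k-means, the
    # post-processing step where information is lost.
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(F)
```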
{"title":"A Novel and Effective Method to Directly Solve Spectral Clustering","authors":"Feiping Nie;Chaodie Liu;Rong Wang;Xuelong Li","doi":"10.1109/TPAMI.2024.3447287","DOIUrl":"10.1109/TPAMI.2024.3447287","url":null,"abstract":"Spectral clustering has been attracting increasing attention due to its well-defined framework and excellent performance. However, most traditional spectral clustering methods consist of two separate steps: 1) Solving a relaxed optimization problem to learn the continuous clustering labels, and 2) Rounding the continuous clustering labels into discrete ones. The clustering results of the relax-and-discretize strategy inevitably result in information loss and unsatisfactory clustering performance. Moreover, the similarity matrix constructed from original data may not be optimal for clustering since data usually have noise and redundancy. To address these problems, we propose a novel and effective algorithm to directly optimize the original spectral clustering model, called Direct Spectral Clustering (DSC). We theoretically prove that the original spectral clustering model can be solved by simultaneously learning a weighted discrete indicator matrix and a structured similarity matrix whose connected components are equal to the number of clusters. Both of them can be used to directly obtain the final clustering results without any post-processing. Further, an effective iterative optimization algorithm is exploited to solve the proposed method. Extensive experiments performed on synthetic and real-world datasets demonstrate the superiority and effectiveness of the proposed method compared to the state-of-the-art algorithms.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"10863-10875"},"PeriodicalIF":0.0,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142019957","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
CO-Net++: A Cohesive Network for Multiple Point Cloud Tasks at Once With Two-Stage Feature Rectification
Pub Date : 2024-08-21 DOI: 10.1109/TPAMI.2024.3447008
Tao Xie;Kun Dai;Qihao Sun;Zhiqiang Jiang;Chuqing Cao;Lijun Zhao;Ke Wang;Ruifeng Li
We present CO-Net++, a cohesive framework that optimizes multiple point cloud tasks collectively across heterogeneous dataset domains with a two-stage feature rectification strategy. The core of CO-Net++ lies in optimizing task-shared parameters to capture universal features across various tasks while discerning task-specific parameters tailored to encapsulate the unique characteristics of each task. Specifically, CO-Net++ develops a two-stage feature rectification strategy (TFRS) that distinctly separates the optimization processes for task-shared and task-specific parameters. In the first stage, TFRS configures all parameters in the backbone as task-shared, which encourages CO-Net++ to thoroughly assimilate universal attributes pertinent to all tasks. In addition, TFRS introduces a sign-based gradient surgery to facilitate the optimization of task-shared parameters, thus alleviating conflicting gradients induced by various dataset domains. In the second stage, TFRS freezes the task-shared parameters and flexibly integrates task-specific parameters into the network for encoding the specific characteristics of each dataset domain. CO-Net++ prominently mitigates the conflicting optimization caused by parameter entanglement, ensuring sufficient identification of universal and specific features. Extensive experiments reveal that CO-Net++ achieves exceptional performance on both 3D object detection and 3D semantic segmentation tasks. Moreover, CO-Net++ delivers an impressive incremental learning capability and prevents catastrophic amnesia when generalizing to new point cloud tasks.
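The sign-based gradient surgery is only named, not specified, in the abstract. The sketch below shows one plausible reading under our own assumptions: per-parameter gradient components whose signs disagree across task or domain gradients are zeroed before updating the task-shared backbone; the paper's exact rule may differ.

```python
import numpy as np

def sign_surgery(task_grads):
    """task_grads: list of 1-D arrays, one flattened gradient per task/domain."""
    G = np.stack(task_grads)                     # (n_tasks, n_params)
    signs = np.sign(G)
    # Keep a component only where every task agrees on its sign (assumed rule).
    agree = np.abs(signs.sum(axis=0)) == len(task_grads)
    return np.where(agree, G.mean(axis=0), 0.0)  # conflicting dims are zeroed

# Example: the second component conflicts across tasks and is dropped.
g = sign_surgery([np.array([0.5, -1.0, 2.0]), np.array([0.3, 0.8, 1.0])])
print(g)  # [0.4 0.  1.5]
```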
{"title":"CO-Net++: A Cohesive Network for Multiple Point Cloud Tasks at Once With Two-Stage Feature Rectification","authors":"Tao Xie;Kun Dai;Qihao Sun;Zhiqiang Jiang;Chuqing Cao;Lijun Zhao;Ke Wang;Ruifeng Li","doi":"10.1109/TPAMI.2024.3447008","DOIUrl":"10.1109/TPAMI.2024.3447008","url":null,"abstract":"We present CO-Net++, a cohesive framework that optimizes multiple point cloud tasks collectively across heterogeneous dataset domains with a two-stage feature rectification strategy. The core of CO-Net++ lies in optimizing task-shared parameters to capture universal features across various tasks while discerning task-specific parameters tailored to encapsulate the unique characteristics of each task. Specifically, CO-Net++ develops a two-stage feature rectification strategy (TFRS) that distinctly separates the optimization processes for task-shared and task-specific parameters. At the first stage, TFRS configures all parameters in backbone as task-shared, which encourages CO-Net++ to thoroughly assimilate universal attributes pertinent to all tasks. In addition, TFRS introduces a sign-based gradient surgery to facilitate the optimization of task-shared parameters, thus alleviating conflicting gradients induced by various dataset domains. In the second stage, TFRS freezes task-shared parameters and flexibly integrates task-specific parameters into the network for encoding specific characteristics of each dataset domain. CO-Net++ prominently mitigates conflicting optimization caused by parameter entanglement, ensuring the sufficient identification of universal and specific features. Extensive experiments reveal that CO-Net++ realizes exceptional performances on both 3D object detection and 3D semantic segmentation tasks. Moreover, CO-Net++ delivers an impressive incremental learning capability and prevents catastrophic amnesia when generalizing to new point cloud tasks.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"10911-10928"},"PeriodicalIF":0.0,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142019960","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Q-Bench+: A Benchmark for Multi-Modal Foundation Models on Low-Level Vision From Single Images to Pairs
Pub Date : 2024-08-21 DOI: 10.1109/TPAMI.2024.3445770
Zicheng Zhang;Haoning Wu;Erli Zhang;Guangtao Zhai;Weisi Lin
The rapid development of Multi-modality Large Language Models (MLLMs) has navigated a paradigm shift in computer vision, moving towards versatile foundational models. However, evaluating MLLMs in low-level visual perception and understanding remains a yet-to-explore domain. To this end, we design benchmark settings to emulate human language responses related to low-level vision: the low-level visual perception (A1) via visual question answering related to low-level attributes (e.g., clarity, lighting); and the low-level visual description (A2), on evaluating MLLMs for low-level text descriptions. Furthermore, given that pairwise comparison can better avoid ambiguity of responses and has been adopted by many human experiments, we further extend the low-level perception-related question-answering and description evaluations of MLLMs from single images to image pairs. Specifically, for perception (A1), we construct the LLVisionQA+ dataset, comprising 2,990 single images and 1,999 image pairs, each accompanied by an open-ended question about its low-level features; for description (A2), we propose the LLDescribe+ dataset, evaluating MLLMs for low-level descriptions on 499 single images and 450 pairs. Additionally, we evaluate MLLMs on assessment (A3) ability, i.e., predicting scores, by employing a softmax-based approach to enable all MLLMs to generate quantifiable quality ratings, tested against human opinions in 7 image quality assessment (IQA) datasets. With 24 MLLMs under evaluation, we demonstrate that several MLLMs have decent low-level visual competencies on single images, but only GPT-4V exhibits higher accuracy on pairwise comparisons than on single-image evaluations (like humans). We hope that our benchmark will motivate further research into uncovering and enhancing these nascent capabilities of MLLMs. The datasets will be released at https://github.com/Q-Future/Q-Bench.
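A minimal sketch of the softmax-based scoring idea for assessment (A3): rather than decoding a word, one reads the logits the MLLM assigns to a pair of antonym answer tokens (e.g., "good" vs. "poor") and converts them into a continuous, quantifiable rating. The two-token setup is an illustrative assumption.

```python
import numpy as np

def softmax_quality_score(logit_good, logit_poor):
    """Map two answer-token logits to a quantifiable rating in [0, 1]."""
    z = np.array([logit_good, logit_poor])
    p = np.exp(z - z.max())  # numerically stable softmax
    p /= p.sum()
    return p[0]              # probability mass on the positive token

print(softmax_quality_score(3.2, 1.1))  # ~0.89
```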
{"title":"Q-Bench$^+$+: A Benchmark for Multi-Modal Foundation Models on Low-Level Vision From Single Images to Pairs","authors":"Zicheng Zhang;Haoning Wu;Erli Zhang;Guangtao Zhai;Weisi Lin","doi":"10.1109/TPAMI.2024.3445770","DOIUrl":"10.1109/TPAMI.2024.3445770","url":null,"abstract":"The rapid development of Multi-modality Large Language Models (MLLMs) has navigated a paradigm shift in computer vision, moving towards versatile foundational models. However, evaluating MLLMs in \u0000<i>low-level visual perception and understanding</i>\u0000 remains a yet-to-explore domain. To this end, we design benchmark settings to \u0000<i>emulate human language responses</i>\u0000 related to low-level vision: the low-level visual \u0000<i>perception</i>\u0000 (\u0000<u>A1</u>\u0000) \u0000<i>via</i>\u0000 visual question answering related to low-level attributes (\u0000<i>e.g. clarity, lighting</i>\u0000); and the low-level visual \u0000<i>description</i>\u0000 (\u0000<u>A2</u>\u0000), on evaluating MLLMs for low-level text descriptions. Furthermore, given that pairwise comparison can better avoid ambiguity of responses and has been adopted by many human experiments, we further extend the low-level perception-related question-answering and description evaluations of MLLMs from single images to \u0000<i>image pairs</i>\u0000. Specifically, for \u0000<i>perception</i>\u0000 (A1), we carry out the LLVisionQA\u0000<inline-formula><tex-math>$^{+}$</tex-math></inline-formula>\u0000 dataset, comprising 2,990 single images and 1,999 image pairs each accompanied by an open-ended question about its low-level features; for \u0000<bold/>\u0000<i>description</i>\u0000<bold/>\u0000 (A2), we propose the LLDescribe\u0000<inline-formula><tex-math>$^{+}$</tex-math></inline-formula>\u0000 dataset, evaluating MLLMs for low-level descriptions on 499 single images and 450 pairs. Additionally, we evaluate MLLMs on \u0000<bold/>\u0000<i>assessment</i>\u0000<bold/>\u0000 (A3) ability, \u0000<i>i.e.</i>\u0000 predicting score, by employing a softmax-based approach to enable all MLLMs to generate \u0000<i>quantifiable</i>\u0000 quality ratings, tested against human opinions in 7 image quality assessment (IQA) datasets. With 24 MLLMs under evaluation, we demonstrate that several MLLMs have decent low-level visual competencies on single images, but only GPT-4V exhibits higher accuracy on pairwise comparisons than single image evaluations (\u0000<i>like humans</i>\u0000). We hope that our benchmark will motivate further research into uncovering and enhancing these nascent capabilities of MLLMs.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"10404-10418"},"PeriodicalIF":0.0,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142019987","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Tensorized and Compressed Multi-View Subspace Clustering via Structured Constraint
Pub Date : 2024-08-20 DOI: 10.1109/TPAMI.2024.3446537
Wei Chang;Huimin Chen;Feiping Nie;Rong Wang;Xuelong Li
Multi-view learning has attracted increasing attention in recent years. However, traditional approaches only focus on the differences among views while ignoring their consistency. This may render views containing abnormal or noisy data ineffective during view learning. Besides, current datasets have gradually become high-dimensional and large-scale. Therefore, this paper proposes a novel multi-view compressed subspace learning method via a low-rank tensor constraint, which incorporates the clustering process and multi-view learning into a unified framework. First, for each view, we take a subset of the samples to build a small dictionary, which greatly reduces both redundant information and computation cost. Then, to find the consistency and differences among views, we impose a low-rank tensor constraint on these representations and further design an auto-weighted mechanism to learn the optimal representation. Last, because the learned representation is non-square, a bipartite graph is introduced, and under the structured constraint, the clustering results can be obtained directly from this graph without any post-processing. Extensive experiments on synthetic and real-world benchmark datasets demonstrate the efficacy and efficiency of our method, especially for views with noise or outliers.
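A hedged sketch of the small-dictionary idea: anchor samples shared across views act as a per-view dictionary, each view is coded against it (ridge coding here is our own stand-in), and the fused n-by-m representation defines a bipartite graph. The k-means step below replaces the paper's structured constraint, which obtains clusters directly from the graph's connected components without post-processing.

```python
import numpy as np
from sklearn.cluster import KMeans

def multiview_anchor_graph(views, m=50, lam=1e-2, seed=0):
    """views: list of (n, d_v) arrays. Returns a fused (n, m) bipartite graph."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(views[0]), size=m, replace=False)      # shared anchors
    Zs = []
    for X in views:
        A = X[idx]                                              # small dictionary
        Z = X @ A.T @ np.linalg.inv(A @ A.T + lam * np.eye(m))  # ridge coding
        Zs.append(np.clip(Z, 0.0, None))                        # nonnegative weights
    return np.mean(Zs, axis=0)                                  # fuse views

def bipartite_cluster(views, k, m=50):
    Z = multiview_anchor_graph(views, m)
    d1 = Z.sum(axis=1, keepdims=True) + 1e-12                   # sample degrees
    d2 = Z.sum(axis=0, keepdims=True) + 1e-12                   # anchor degrees
    U, _, _ = np.linalg.svd(Z / np.sqrt(d1) / np.sqrt(d2), full_matrices=False)
    return KMeans(n_clusters=k, n_init=10).fit_predict(U[:, :k])
```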
{"title":"Tensorized and Compressed Multi-View Subspace Clustering via Structured Constraint","authors":"Wei Chang;Huimin Chen;Feiping Nie;Rong Wang;Xuelong Li","doi":"10.1109/TPAMI.2024.3446537","DOIUrl":"10.1109/TPAMI.2024.3446537","url":null,"abstract":"Multi-view learning has raised more and more attention in recent years. However, traditional approaches only focus on the difference while ignoring the consistency among views. It may make some views, with the situation of data abnormality or noise, ineffective in the progress of view learning. Besides, the current datasets have become high-dimensional and large-scale gradually. Therefore, this paper proposes a novel multi-view compressed subspace learning method via low-rank tensor constraint, which incorporates the clustering progress and multi-view learning into a unified framework. First, for each view, we take the partial samples to build a small-size dictionary, which can reduce the effect of both redundancy information and computation cost greatly. Then, to find the consistency and difference among views, we impose a low-rank tensor constraint on these representations and further design an auto-weighted mechanism to learn the optimal representation. Last, due to the non-square of the learned representation, the bipartite graph has been introduced, and under the structured constraint, the clustering results can be obtained directly from this graph without any post-processing. Extensive experiments on synthetic and real-world benchmark datasets demonstrate the efficacy and efficiency of our method, especially for the views with noise or outliers.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"10434-10451"},"PeriodicalIF":0.0,"publicationDate":"2024-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142010127","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
A Survey on Graph Neural Networks and Graph Transformers in Computer Vision: A Task-Oriented Perspective
Pub Date : 2024-08-19 DOI: 10.1109/TPAMI.2024.3445463
Chaoqi Chen;Yushuang Wu;Qiyuan Dai;Hong-Yu Zhou;Mutian Xu;Sibei Yang;Xiaoguang Han;Yizhou Yu
Graph Neural Networks (GNNs) have gained momentum in graph representation learning and boosted the state of the art in a variety of areas, such as data mining (e.g., social network analysis and recommender systems), computer vision (e.g., object detection and point cloud learning), and natural language processing (e.g., relation extraction and sequence learning), to name a few. With the emergence of Transformers in natural language processing and computer vision, graph Transformers embed a graph structure into the Transformer architecture to overcome the limitations of local neighborhood aggregation while avoiding strict structural inductive biases. In this paper, we present a comprehensive review of GNNs and graph Transformers in computer vision from a task-oriented perspective. Specifically, we divide their applications in computer vision into five categories according to the modality of input data, i.e., 2D natural images, videos, 3D data, vision + language, and medical images. In each category, we further divide the applications according to a set of vision tasks. Such a task-oriented taxonomy allows us to examine how each task is tackled by different GNN-based approaches and how well these approaches perform. Based on the necessary preliminaries, we provide the definitions and challenges of the tasks, in-depth coverage of the representative approaches, as well as discussions regarding insights, limitations, and future directions.
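As a one-glance illustration of the local neighborhood aggregation that graph Transformers relax with global attention, here is a minimal mean-aggregation GNN layer; this is a generic sketch, not tied to any surveyed method.

```python
import numpy as np

def gnn_layer(H, A, W):
    """H: (n, d) node features; A: (n, n) adjacency; W: (d, d_out) weights."""
    A_hat = A + np.eye(len(A))                   # add self-loops
    deg = A_hat.sum(axis=1, keepdims=True)
    return np.maximum((A_hat / deg) @ H @ W, 0)  # mean-aggregate, project, ReLU
```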
{"title":"A Survey on Graph Neural Networks and Graph Transformers in Computer Vision: A Task-Oriented Perspective","authors":"Chaoqi Chen;Yushuang Wu;Qiyuan Dai;Hong-Yu Zhou;Mutian Xu;Sibei Yang;Xiaoguang Han;Yizhou Yu","doi":"10.1109/TPAMI.2024.3445463","DOIUrl":"10.1109/TPAMI.2024.3445463","url":null,"abstract":"Graph Neural Networks (GNNs) have gained momentum in graph representation learning and boosted the state of the art in a variety of areas, such as data mining (e.g., social network analysis and recommender systems), computer vision (e.g., object detection and point cloud learning), and natural language processing (e.g., relation extraction and sequence learning), to name a few. With the emergence of Transformers in natural language processing and computer vision, graph Transformers embed a graph structure into the Transformer architecture to overcome the limitations of local neighborhood aggregation while avoiding strict structural inductive biases. In this paper, we present a comprehensive review of GNNs and graph Transformers in computer vision from a task-oriented perspective. Specifically, we divide their applications in computer vision into five categories according to the modality of input data, i.e., 2D natural images, videos, 3D data, vision + language, and medical images. In each category, we further divide the applications according to a set of vision tasks. Such a task-oriented taxonomy allows us to examine how each task is tackled by different GNN-based approaches and how well these approaches perform. Based on the necessary preliminaries, we provide the definitions and challenges of the tasks, in-depth coverage of the representative approaches, as well as discussions regarding insights, limitations, and future directions.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"10297-10318"},"PeriodicalIF":0.0,"publicationDate":"2024-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142006162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Approaching the Global Nash Equilibrium of Non-Convex Multi-Player Games
Pub Date : 2024-08-19 DOI: 10.1109/TPAMI.2024.3445666
Guanpu Chen;Gehui Xu;Fengxiang He;Yiguang Hong;Leszek Rutkowski;Dacheng Tao
Many machine learning problems can be formulated as non-convex multi-player games. Due to non-convexity, it is challenging to obtain the existence condition of the global Nash equilibrium (NE) and design theoretically guaranteed algorithms. This paper studies a class of non-convex multi-player games, where players' payoff functions consist of canonical functions and quadratic operators. We leverage conjugate properties to transform the complementary problem into a variational inequality (VI) problem using a continuous pseudo-gradient mapping. We prove the existence condition of the global NE: the solution to the VI problem satisfies a duality relation. We then design an ordinary differential equation to approach the global NE with an exponential convergence rate. For practical implementation, we derive a discretized algorithm and apply it to two scenarios: multi-player games with generalized monotonicity and multi-player potential games. In the two settings, step sizes are required to be $\mathcal{O}(1/k)$ and $\mathcal{O}(1/\sqrt{k})$ to yield the convergence rates of $\mathcal{O}(1/k)$ and $\mathcal{O}(1/\sqrt{k})$, respectively. Extensive experiments on robust neural network training and sensor network localization validate our theory. Our code is available at https://github.com/GuanpuChen/Global-NE.
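To illustrate the flavor of the discretized scheme on a toy problem (our own example, not the paper's canonical-plus-quadratic class), the sketch below runs per-player pseudo-gradient descent with the $\mathcal{O}(1/k)$ step size quoted for the generalized monotone setting; this two-player quadratic game has its global NE at the origin.

```python
import numpy as np

def pseudo_grad(x):
    """Stacked partial gradients for a toy two-player quadratic game."""
    x1, x2 = x
    return np.array([2 * x1 + x2,   # d/dx1 of player 1's cost x1^2 + x1*x2
                     2 * x2 + x1])  # d/dx2 of player 2's cost x2^2 + x1*x2

x = np.array([1.0, -2.0])
for k in range(1, 2001):
    x = x - (1.0 / k) * pseudo_grad(x)  # O(1/k) step size
print(x)  # converges to the global NE at (0, 0)
```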
{"title":"Approaching the Global Nash Equilibrium of Non-Convex Multi-Player Games","authors":"Guanpu Chen;Gehui Xu;Fengxiang He;Yiguang Hong;Leszek Rutkowski;Dacheng Tao","doi":"10.1109/TPAMI.2024.3445666","DOIUrl":"10.1109/TPAMI.2024.3445666","url":null,"abstract":"Many machine learning problems can be formulated as non-convex multi-player games. Due to non-convexity, it is challenging to obtain the existence condition of the global Nash equilibrium (NE) and design theoretically guaranteed algorithms. This paper studies a class of non-convex multi-player games, where players’ payoff functions consist of canonical functions and quadratic operators. We leverage conjugate properties to transform the complementary problem into a variational inequality (VI) problem using a continuous pseudo-gradient mapping. We prove the existence condition of the global NE as the solution to the VI problem satisfies a duality relation. We then design an ordinary differential equation to approach the global NE with an exponential convergence rate. For practical implementation, we derive a discretized algorithm and apply it to two scenarios: multi-player games with generalized monotonicity and multi-player potential games. In the two settings, step sizes are required to be \u0000<inline-formula><tex-math>$mathcal {O}(1/k)$</tex-math></inline-formula>\u0000 and \u0000<inline-formula><tex-math>$mathcal {O}(1/sqrt{k})$</tex-math></inline-formula>\u0000 to yield the convergence rates of \u0000<inline-formula><tex-math>$mathcal {O}(1/ k)$</tex-math></inline-formula>\u0000 and \u0000<inline-formula><tex-math>$mathcal {O}(1/sqrt{k})$</tex-math></inline-formula>\u0000, respectively. Extensive experiments on robust neural network training and sensor network localization validate our theory. Our code is available at \u0000<uri>https://github.com/GuanpuChen/Global-NE</uri>\u0000.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"10797-10813"},"PeriodicalIF":0.0,"publicationDate":"2024-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142006163","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Unsupervised Part Discovery via Dual Representation Alignment
Pub Date : 2024-08-19 DOI: 10.1109/TPAMI.2024.3445582
Jiahao Xia;Wenjian Huang;Min Xu;Jianguo Zhang;Haimin Zhang;Ziyu Sheng;Dong Xu
Object parts serve as crucial intermediate representations in various downstream tasks, but part-level representation learning still has not received as much attention as other vision tasks. Previous research has established that Vision Transformers can learn instance-level attention without labels, extracting high-quality instance-level representations that boost downstream tasks. In this paper, we achieve unsupervised part-specific attention learning using a novel paradigm and further employ the part representations to improve part discovery performance. Specifically, paired images are generated from the same image with different geometric transformations, and multiple part representations are extracted from these paired images using a novel module, named PartFormer. These part representations from the paired images are then exchanged to improve geometric transformation invariance. Subsequently, the part representations are aligned with the feature map extracted by a feature map encoder, achieving high similarity with the pixel representations of the corresponding part regions and low similarity in irrelevant regions. Finally, geometric and semantic constraints are applied to the part representations through the intermediate alignment results for part-specific attention learning, encouraging the PartFormer to focus locally and the part representations to explicitly encode the information of the corresponding parts. Moreover, the aligned part representations can further serve as a series of reliable detectors in the testing phase, predicting pixel masks for part discovery. Extensive experiments are carried out on four widely used datasets, and our results demonstrate that the proposed method achieves competitive performance and robustness due to its part-specific attention. The code will be released after the paper is accepted.
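A minimal sketch of the test-time detector usage described above, under assumed shapes: cosine similarity between each aligned part representation and every pixel feature, followed by an argmax over parts, yields pixel masks with no post-processing. The temperature is an illustrative assumption.

```python
import numpy as np

def part_masks(parts, feat_map, tau=0.1):
    """parts: (K, D) aligned part representations; feat_map: (H, W, D) pixels."""
    P = parts / np.linalg.norm(parts, axis=1, keepdims=True)
    F = feat_map / np.linalg.norm(feat_map, axis=2, keepdims=True)
    sim = np.einsum('kd,hwd->khw', P, F) / tau  # part-to-pixel similarity
    return sim.argmax(axis=0)                   # (H, W) hard part assignment
```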
{"title":"Unsupervised Part Discovery via Dual Representation Alignment","authors":"Jiahao Xia;Wenjian Huang;Min Xu;Jianguo Zhang;Haimin Zhang;Ziyu Sheng;Dong Xu","doi":"10.1109/TPAMI.2024.3445582","DOIUrl":"10.1109/TPAMI.2024.3445582","url":null,"abstract":"Object parts serve as crucial intermediate representations in various downstream tasks, but part-level representation learning still has not received as much attention as other vision tasks. Previous research has established that Vision Transformer can learn instance-level attention without labels, extracting high-quality instance-level representations for boosting downstream tasks. In this paper, we achieve unsupervised part-specific attention learning using a novel paradigm and further employ the part representations to improve part discovery performance. Specifically, paired images are generated from the same image with different geometric transformations, and multiple part representations are extracted from these paired images using a novel module, named PartFormer. These part representations from the paired images are then exchanged to improve geometric transformation invariance. Subsequently, the part representations are aligned with the feature map extracted by a feature map encoder, achieving high similarity with the pixel representations of the corresponding part regions and low similarity in irrelevant regions. Finally, the geometric and semantic constraints are applied to the part representations through the intermediate results in alignment for part-specific attention learning, encouraging the PartFormer to focus locally and the part representations to explicitly include the information of the corresponding parts. Moreover, the aligned part representations can further serve as a series of reliable detectors in the testing phase, predicting pixel masks for part discovery. Extensive experiments are carried out on four widely used datasets, and our results demonstrate that the proposed method achieves competitive performance and robustness due to its part-specific attention.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"10597-10613"},"PeriodicalIF":0.0,"publicationDate":"2024-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142006185","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
SEA++: Multi-Graph-Based Higher-Order Sensor Alignment for Multivariate Time-Series Unsupervised Domain Adaptation
Pub Date : 2024-08-16 DOI: 10.1109/TPAMI.2024.3444904
Yucheng Wang;Yuecong Xu;Jianfei Yang;Min Wu;Xiaoli Li;Lihua Xie;Zhenghua Chen
Unsupervised Domain Adaptation (UDA) methods have been successful in reducing label dependency by minimizing the domain discrepancy between labeled source domains and unlabeled target domains. However, these methods face challenges when dealing with Multivariate Time-Series (MTS) data. MTS data typically originates from multiple sensors, each with its unique distribution. This property poses difficulties in adapting existing UDA techniques, which mainly focus on aligning global features while overlooking the distribution discrepancies at the sensor level, thus limiting their effectiveness for MTS data. To address this issue, a practical domain adaptation scenario is formulated as Multivariate Time-Series Unsupervised Domain Adaptation (MTS-UDA). In this paper, we propose SEnsor Alignment (SEA) for MTS-UDA, aiming to address domain discrepancy at both local and global sensor levels. At the local sensor level, we design endo-feature alignment, which aligns sensor features and their correlations across domains. To reduce domain discrepancy at the global sensor level, we design exo-feature alignment that enforces restrictions on global sensor features. We further extend SEA to SEA++ by enhancing the endo-feature alignment. Particularly, we incorporate multi-graph-based higher-order alignment for both sensor features and their correlations. Extensive empirical results have demonstrated the state-of-the-art performance of our SEA and SEA++ on six public MTS datasets for MTS-UDA.
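A hedged sketch of the two alignment levels: a discrepancy penalty applied per sensor (the local, endo-feature level) and on pooled features (the global, exo-feature level). The plain mean-matching loss below is a simplification of the paper's multi-graph-based alignment, not its actual objective.

```python
import numpy as np

def sensor_alignment_loss(src, tgt):
    """src, tgt: (batch, n_sensors, d) features from source/target domains."""
    endo = np.mean((src.mean(axis=0) - tgt.mean(axis=0)) ** 2)           # per-sensor (local)
    exo = np.mean((src.mean(axis=(0, 1)) - tgt.mean(axis=(0, 1))) ** 2)  # pooled (global)
    return endo + exo
```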
{"title":"SEA++: Multi-Graph-Based Higher-Order Sensor Alignment for Multivariate Time-Series Unsupervised Domain Adaptation","authors":"Yucheng Wang;Yuecong Xu;Jianfei Yang;Min Wu;Xiaoli Li;Lihua Xie;Zhenghua Chen","doi":"10.1109/TPAMI.2024.3444904","DOIUrl":"10.1109/TPAMI.2024.3444904","url":null,"abstract":"Unsupervised Domain Adaptation (UDA) methods have been successful in reducing label dependency by minimizing the domain discrepancy between labeled source domains and unlabeled target domains. However, these methods face challenges when dealing with Multivariate Time-Series (MTS) data. MTS data typically originates from multiple sensors, each with its unique distribution. This property poses difficulties in adapting existing UDA techniques, which mainly focus on aligning global features while overlooking the distribution discrepancies at the sensor level, thus limiting their effectiveness for MTS data. To address this issue, a practical domain adaptation scenario is formulated as Multivariate Time-Series Unsupervised Domain Adaptation (MTS-UDA). In this paper, we propose SEnsor Alignment (SEA) for MTS-UDA, aiming to address domain discrepancy at both local and global sensor levels. At the local sensor level, we design endo-feature alignment, which aligns sensor features and their correlations across domains. To reduce domain discrepancy at the global sensor level, we design exo-feature alignment that enforces restrictions on global sensor features. We further extend SEA to SEA++ by enhancing the endo-feature alignment. Particularly, we incorporate multi-graph-based higher-order alignment for both sensor features and their correlations. Extensive empirical results have demonstrated the state-of-the-art performance of our SEA and SEA++ on six public MTS datasets for MTS-UDA.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"10781-10796"},"PeriodicalIF":0.0,"publicationDate":"2024-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141992479","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
T-Net++: Effective Permutation-Equivariance Network for Two-View Correspondence Pruning
Pub Date : 2024-08-16 DOI: 10.1109/TPAMI.2024.3444457
Guobao Xiao;Xin Liu;Zhen Zhong;Xiaoqin Zhang;Jiayi Ma;Haibin Ling
We propose a conceptually novel, flexible, and effective framework (named T-Net++) for the task of two-view correspondence pruning. T-Net++ comprises two unique structures: the "−" structure and the "|" structure. The "−" structure utilizes an iterative learning strategy to process correspondences, while the "|" structure integrates all feature information of the "−" structure and produces inlier weights. Moreover, within the "|" structure, we design a new Local-Global Attention Fusion module to fully exploit valuable information obtained from concatenating features through channel-wise and spatial-wise relationships. Furthermore, we develop a Channel-Spatial Squeeze-and-Excitation module, a modified network backbone that enhances the representation ability of important channels and correspondences through the squeeze-and-excitation operation. T-Net++ not only preserves permutation-equivariance for correspondence pruning, but also gathers rich contextual information, thereby enhancing the effectiveness of the network. Experimental results demonstrate that T-Net++ outperforms other state-of-the-art correspondence pruning methods on various benchmarks and excels in two extended tasks. Our code will be released at https://github.com/guobaoxiao/T-Net.
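Permutation-equivariance is the key structural property here. The sketch below shows the context-normalization block common to correspondence-pruning networks (a generic stand-in under our assumptions, not T-Net++'s exact layers): because the statistics are computed across the whole set of correspondences, reordering the input matches simply reorders the output rows.

```python
import numpy as np

def context_norm(feats, eps=1e-5):
    """feats: (n_correspondences, d). Permuting the rows permutes the output
    rows identically, since statistics are shared across the whole set."""
    mu = feats.mean(axis=0, keepdims=True)
    sigma = feats.std(axis=0, keepdims=True)
    return (feats - mu) / (sigma + eps)
```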
{"title":"T-Net++: Effective Permutation-Equivariance Network for Two-View Correspondence Pruning","authors":"Guobao Xiao;Xin Liu;Zhen Zhong;Xiaoqin Zhang;Jiayi Ma;Haibin Ling","doi":"10.1109/TPAMI.2024.3444457","DOIUrl":"10.1109/TPAMI.2024.3444457","url":null,"abstract":"We propose a conceptually novel, flexible, and effective framework (named T-Net++) for the task of two-view correspondence pruning. T-Net++ comprises two unique structures: the \u0000<inline-formula><tex-math>$hbox{``}-$</tex-math></inline-formula>\u0000'' structure and the \u0000<inline-formula><tex-math>$hbox{``}|$</tex-math></inline-formula>\u0000'' structure. The \u0000<inline-formula><tex-math>$hbox{``}-$</tex-math></inline-formula>\u0000'' structure utilizes an iterative learning strategy to process correspondences, while the \u0000<inline-formula><tex-math>$hbox{``}|$</tex-math></inline-formula>\u0000'' structure integrates all feature information of the \u0000<inline-formula><tex-math>$hbox{``}-$</tex-math></inline-formula>\u0000'' structure and produces inlier weights. Moreover, within the \u0000<inline-formula><tex-math>$hbox{``}|$</tex-math></inline-formula>\u0000'' structure, we design a new Local-Global Attention Fusion module to fully exploit valuable information obtained from concatenating features through channel-wise and spatial-wise relationships. Furthermore, we develop a Channel-Spatial Squeeze-and-Excitation module, a modified network backbone that enhances the representation ability of important channels and correspondences through the squeeze-and-excitation operation. T-Net++ not only preserves the permutation-equivariance manner for correspondence pruning, but also gathers rich contextual information, thereby enhancing the effectiveness of the network. Experimental results demonstrate that T-Net++ outperforms other state-of-the-art correspondence pruning methods on various benchmarks and excels in two extended tasks.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"10629-10644"},"PeriodicalIF":0.0,"publicationDate":"2024-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141992480","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Metric3D v2: A Versatile Monocular Geometric Foundation Model for Zero-Shot Metric Depth and Surface Normal Estimation
Pub Date : 2024-08-16 DOI: 10.1109/TPAMI.2024.3444912
Mu Hu;Wei Yin;Chi Zhang;Zhipeng Cai;Xiaoxiao Long;Hao Chen;Kaixuan Wang;Gang Yu;Chunhua Shen;Shaojie Shen
We introduce Metric3D v2, a geometric foundation model designed for zero-shot metric depth and surface normal estimation from single images, critical for accurate 3D recovery. Depth and normal estimation, though complementary, present distinct challenges. State-of-the-art monocular depth methods achieve zero-shot generalization through affine-invariant depths, but fail to recover real-world metric scale. Conversely, current normal estimation techniques struggle with zero-shot performance due to insufficient labeled data. We propose targeted solutions for both metric depth and normal estimation. For metric depth, we present a canonical camera space transformation module that resolves metric ambiguity across various camera models and large-scale datasets, and which can be easily integrated into existing monocular models. For surface normal estimation, we introduce a joint depth-normal optimization module that leverages diverse data from metric depth, allowing normal estimators to improve beyond traditional labels. Our model, trained on over 16 million images from thousands of camera models with varied annotations, excels in zero-shot generalization to new camera settings. As shown in Fig. 1, it ranks first in multiple zero-shot and standard benchmarks for metric depth and surface normal prediction; notably, it surpasses the recent MarigoldDepth and DepthAnything on various depth benchmarks, including NYUv2 and KITTI. Our method enables the accurate recovery of metric 3D structures on randomly collected internet images, paving the way for plausible single-image metrology. Our model also relieves the scale drift issues of monocular SLAM (Fig. 3), leading to high-quality metric-scale dense mapping. Such applications highlight the versatility of Metric3D v2 models as geometric foundation models. Our project page is at https://JUGGHM.github.io/Metric3Dv2.
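A hedged sketch of the canonical camera space transformation idea: every image is rescaled as if captured at a fixed canonical focal length, the network predicts depth in that canonical space, and the focal ratio maps the prediction back to metric depth. The canonical focal value and the exact transform are illustrative assumptions; OpenCV is used only for resizing.

```python
import cv2  # OpenCV, used only for image resizing
import numpy as np

CANONICAL_FOCAL = 1000.0  # assumed canonical focal length, in pixels

def to_canonical_space(image, focal):
    """Rescale the image as if it were captured at the canonical focal length."""
    s = CANONICAL_FOCAL / focal
    h, w = image.shape[:2]
    return cv2.resize(image, (max(1, round(w * s)), max(1, round(h * s))))

def to_metric_depth(canonical_depth, focal):
    """Map the network's canonical-space depth back to metric depth."""
    return canonical_depth * focal / CANONICAL_FOCAL
```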
{"title":"Metric3D v2: A Versatile Monocular Geometric Foundation Model for Zero-Shot Metric Depth and Surface Normal Estimation","authors":"Mu Hu;Wei Yin;Chi Zhang;Zhipeng Cai;Xiaoxiao Long;Hao Chen;Kaixuan Wang;Gang Yu;Chunhua Shen;Shaojie Shen","doi":"10.1109/TPAMI.2024.3444912","DOIUrl":"10.1109/TPAMI.2024.3444912","url":null,"abstract":"We introduce Metric3D v2, a geometric foundation model designed for zero-shot metric depth and surface normal estimation from single images, critical for accurate 3D recovery. Depth and normal estimation, though complementary, present distinct challenges. State-of-the-art monocular depth methods achieve zero-shot generalization through affine-invariant depths, but fail to recover real-world metric scale. Conversely, current normal estimation techniques struggle with zero-shot performance due to insufficient labeled data. We propose targeted solutions for both metric depth and normal estimation. For metric depth, we present a canonical camera space transformation module that resolves metric ambiguity across various camera models and large-scale datasets, which can be easily integrated into existing monocular models. For surface normal estimation, we introduce a joint depth-normal optimization module that leverages diverse data from metric depth, allowing normal estimators to improve beyond traditional labels. Our model, trained on over 16 million images from thousands of camera models with varied annotations, excels in zero-shot generalization to new camera settings. As shown in Fig. 1, It ranks the 1st in multiple zero-shot and standard benchmarks for metric depth and surface normal prediction. Our method enables the accurate recovery of metric 3D structures on randomly collected internet images, paving the way for plausible single-image metrology. Our model also relieves the scale drift issues of monocular-SLAM (Fig. 3), leading to high-quality metric scale dense mapping. Such applications highlight the versatility of Metric3D v2 models as geometric foundation models.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"10579-10596"},"PeriodicalIF":0.0,"publicationDate":"2024-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141992477","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0