{"title":"Remembering What is Important: A Factorised Multi-Head Retrieval and Auxiliary Memory Stabilisation Scheme for Human Motion Prediction","authors":"Tharindu Fernando;Harshala Gammulle;Sridha Sridharan;Simon Denman;Clinton Fookes","doi":"10.1109/TPAMI.2024.3511393","DOIUrl":null,"url":null,"abstract":"Humans exhibit complex motions that vary depending on the activity they are performing, the interactions they engage in, as well as subject-specific preferences. Therefore, forecasting a human’s future pose based on the history of his or her previous motion is a challenging task. This paper presents an innovative auxiliary-memory-powered deep neural network framework to improve the modelling of historical knowledge. Specifically, we disentangle subject-specific, action-specific, and other auxiliary information from the observed pose sequences and utilise these factorised features to query the memory. A novel Multi-Head knowledge retrieval scheme leverages these factorised feature embeddings to perform multiple querying operations over the historical observations captured within the auxiliary memory. Moreover, we propose a dynamic masking strategy to make this feature disentanglement process adaptive. Two novel loss functions are introduced to encourage diversity within the auxiliary memory, while ensuring the stability of the memory content such that it can locate and store salient information that aids the long-term prediction of future motion, irrespective of any data imbalances or the diversity of the input data distribution. Extensive experiments conducted on two public benchmarks, Human3.6M and CMU-Mocap, demonstrate that these design choices collectively allow the proposed approach to outperform the current state-of-the-art methods by significant margins: <inline-formula><tex-math>$> $</tex-math></inline-formula> 17% on the Human3.6M dataset and <inline-formula><tex-math>$> $</tex-math></inline-formula> 9% on the CMU-Mocap dataset.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 3","pages":"1941-1957"},"PeriodicalIF":0.0000,"publicationDate":"2024-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on pattern analysis and machine intelligence","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10777031/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Humans exhibit complex motions that vary depending on the activity they are performing, the interactions they engage in, as well as subject-specific preferences. Therefore, forecasting a human’s future pose based on the history of his or her previous motion is a challenging task. This paper presents an innovative auxiliary-memory-powered deep neural network framework to improve the modelling of historical knowledge. Specifically, we disentangle subject-specific, action-specific, and other auxiliary information from the observed pose sequences and utilise these factorised features to query the memory. A novel Multi-Head knowledge retrieval scheme leverages these factorised feature embeddings to perform multiple querying operations over the historical observations captured within the auxiliary memory. Moreover, we propose a dynamic masking strategy to make this feature disentanglement process adaptive. Two novel loss functions are introduced to encourage diversity within the auxiliary memory, while ensuring the stability of the memory content such that it can locate and store salient information that aids the long-term prediction of future motion, irrespective of any data imbalances or the diversity of the input data distribution. Extensive experiments conducted on two public benchmarks, Human3.6M and CMU-Mocap, demonstrate that these design choices collectively allow the proposed approach to outperform the current state-of-the-art methods by significant margins: >17% on the Human3.6M dataset and >9% on the CMU-Mocap dataset.
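The abstract does not spell out the retrieval mechanism or the memory losses, so the PyTorch sketch below is purely illustrative: it assumes dot-product attention heads over a learnable slot matrix, with one query head per factorised feature (subject, action, auxiliary), and a simple pairwise-similarity penalty standing in for the paper's diversity loss. All names (`FactorisedMemoryRetrieval`, `memory_diversity_loss`) and hyper-parameters are hypothetical, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FactorisedMemoryRetrieval(nn.Module):
    """Illustrative multi-head retrieval over an auxiliary memory.

    Each factorised feature (e.g. subject-, action-, auxiliary-specific)
    drives its own attention 'head' that reads from a shared slot bank.
    """

    def __init__(self, num_slots=128, slot_dim=256, feat_dim=256, num_factors=3):
        super().__init__()
        # Learnable auxiliary memory: one row per stored slot.
        self.memory = nn.Parameter(torch.randn(num_slots, slot_dim))
        # One query projection per factorised feature embedding.
        self.query_proj = nn.ModuleList(
            [nn.Linear(feat_dim, slot_dim) for _ in range(num_factors)]
        )

    def forward(self, factors):
        # factors: list of (batch, feat_dim) tensors, one per factor.
        retrieved = []
        for proj, feat in zip(self.query_proj, factors):
            q = proj(feat)                                   # (batch, slot_dim)
            scores = q @ self.memory.t() / q.shape[-1] ** 0.5
            attn = F.softmax(scores, dim=-1)                 # (batch, num_slots)
            retrieved.append(attn @ self.memory)             # (batch, slot_dim)
        # Fuse the per-head readouts; concatenation is one simple choice.
        return torch.cat(retrieved, dim=-1)


def memory_diversity_loss(memory):
    """Hypothetical diversity regulariser: penalise pairwise cosine
    similarity between memory slots so their contents do not collapse."""
    m = F.normalize(memory, dim=-1)
    sim = m @ m.t()
    off_diag = sim - torch.eye(m.shape[0], device=m.device)
    return off_diag.pow(2).mean()


# Usage sketch: three factorised embeddings query the memory jointly.
model = FactorisedMemoryRetrieval()
factors = [torch.randn(8, 256) for _ in range(3)]
readout = model(factors)                                     # (8, 3 * 256)
reg = memory_diversity_loss(model.memory)
```

In this reading, keeping one query head per disentangled factor lets each factor attend to different slots of the same memory, which is one plausible way to realise the multiple querying operations the abstract describes; the actual factorisation, masking, and stabilisation losses are detailed only in the full paper.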