{"title":"Massively Multi-Person 3D Human Motion Forecasting with Scene Context","authors":"Felix B Mueller, Julian Tanke, Juergen Gall","doi":"arxiv-2409.12189","DOIUrl":null,"url":null,"abstract":"Forecasting long-term 3D human motion is challenging: the stochasticity of\nhuman behavior makes it hard to generate realistic human motion from the input\nsequence alone. Information on the scene environment and the motion of nearby\npeople can greatly aid the generation process. We propose a scene-aware social\ntransformer model (SAST) to forecast long-term (10s) human motion motion.\nUnlike previous models, our approach can model interactions between both widely\nvarying numbers of people and objects in a scene. We combine a temporal\nconvolutional encoder-decoder architecture with a Transformer-based bottleneck\nthat allows us to efficiently combine motion and scene information. We model\nthe conditional motion distribution using denoising diffusion models. We\nbenchmark our approach on the Humans in Kitchens dataset, which contains 1 to\n16 persons and 29 to 50 objects that are visible simultaneously. Our model\noutperforms other approaches in terms of realism and diversity on different\nmetrics and in a user study. Code is available at\nhttps://github.com/felixbmuller/SAST.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"15 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computer Vision and Pattern Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.12189","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Forecasting long-term 3D human motion is challenging: the stochasticity of human behavior makes it hard to generate realistic human motion from the input sequence alone. Information about the scene environment and the motion of nearby people can greatly aid the generation process. We propose a scene-aware social transformer model (SAST) to forecast long-term (10 s) human motion. Unlike previous models, our approach can model interactions in scenes with widely varying numbers of both people and objects. We combine a temporal convolutional encoder-decoder architecture with a Transformer-based bottleneck that allows us to efficiently fuse motion and scene information. We model the conditional motion distribution using denoising diffusion models. We benchmark our approach on the Humans in Kitchens dataset, which contains 1 to 16 persons and 29 to 50 objects visible simultaneously. Our model outperforms other approaches in terms of realism and diversity, both on standard metrics and in a user study. Code is available at https://github.com/felixbmuller/SAST.
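
To make the described architecture concrete, below is a minimal PyTorch sketch of the pipeline the abstract outlines: a temporal convolutional encoder that compresses each person's pose sequence into latent tokens, a Transformer bottleneck that jointly attends over a variable number of per-person motion tokens and scene-object tokens, and a temporal convolutional decoder. All module names, dimensions, and tensor layouts here are illustrative assumptions rather than the authors' actual implementation (see the linked repository for that), and the denoising-diffusion training loop around the network is omitted.

import torch
import torch.nn as nn


class TemporalConvEncoder(nn.Module):
    """Downsample a pose sequence (P, T, pose_dim) into latent tokens (P, T', H)."""

    def __init__(self, pose_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(pose_dim, hidden, kernel_size=5, stride=2, padding=2),
            nn.GELU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, stride=2, padding=2),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Conv1d expects (batch, channels, time), so transpose in and out.
        return self.net(x.transpose(1, 2)).transpose(1, 2)


class SASTSketch(nn.Module):
    """Conv encoder -> Transformer bottleneck over people + objects -> conv decoder."""

    def __init__(self, pose_dim: int = 66, obj_dim: int = 9, hidden: int = 128):
        super().__init__()
        self.encoder = TemporalConvEncoder(pose_dim, hidden)
        self.obj_proj = nn.Linear(obj_dim, hidden)  # embed each scene object as one token
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.bottleneck = nn.TransformerEncoder(layer, num_layers=2)
        self.decoder = nn.Sequential(  # upsample latents back to the motion horizon
            nn.ConvTranspose1d(hidden, hidden, kernel_size=4, stride=2, padding=1),
            nn.GELU(),
            nn.ConvTranspose1d(hidden, pose_dim, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, poses: torch.Tensor, objects: torch.Tensor) -> torch.Tensor:
        # poses: (persons, T, pose_dim); objects: (num_objects, obj_dim).
        # Varying person/object counts only change the token count, so one
        # attention stack handles 1 person or 16, 29 objects or 50.
        motion_tokens = self.encoder(poses)                   # (P, T', H)
        P, Tp, H = motion_tokens.shape
        obj_tokens = self.obj_proj(objects).unsqueeze(0)      # (1, O, H)
        tokens = torch.cat(
            [motion_tokens.reshape(1, P * Tp, H), obj_tokens], dim=1
        )                                                     # (1, P*T' + O, H)
        fused = self.bottleneck(tokens)[:, : P * Tp].reshape(P, Tp, H)
        return self.decoder(fused.transpose(1, 2)).transpose(1, 2)


# Smoke test: 3 people, 40 input frames, 35 scene objects.
model = SASTSketch()
out = model(torch.randn(3, 40, 66), torch.randn(35, 9))
print(out.shape)  # torch.Size([3, 40, 66])

In a diffusion setup, a network like this would be conditioned on the noise level and trained to denoise future poses given past motion and the scene tokens; the sketch above only shows how motion and scene information can be fused in a single Transformer bottleneck.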