{"title":"Panoptic-Depth Forecasting","authors":"Juana Valeria Hurtado, Riya Mohan, Abhinav Valada","doi":"arxiv-2409.12008","DOIUrl":null,"url":null,"abstract":"Forecasting the semantics and 3D structure of scenes is essential for robots\nto navigate and plan actions safely. Recent methods have explored semantic and\npanoptic scene forecasting; however, they do not consider the geometry of the\nscene. In this work, we propose the panoptic-depth forecasting task for jointly\npredicting the panoptic segmentation and depth maps of unobserved future\nframes, from monocular camera images. To facilitate this work, we extend the\npopular KITTI-360 and Cityscapes benchmarks by computing depth maps from LiDAR\npoint clouds and leveraging sequential labeled data. We also introduce a\nsuitable evaluation metric that quantifies both the panoptic quality and depth\nestimation accuracy of forecasts in a coherent manner. Furthermore, we present\ntwo baselines and propose the novel PDcast architecture that learns rich\nspatio-temporal representations by incorporating a transformer-based encoder, a\nforecasting module, and task-specific decoders to predict future panoptic-depth\noutputs. Extensive evaluations demonstrate the effectiveness of PDcast across\ntwo datasets and three forecasting tasks, consistently addressing the primary\nchallenges. We make the code publicly available at\nhttps://pdcast.cs.uni-freiburg.de.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"188 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computer Vision and Pattern Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.12008","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Forecasting the semantics and 3D structure of scenes is essential for robots to navigate and plan actions safely. Recent methods have explored semantic and panoptic scene forecasting; however, they do not consider the geometry of the scene. In this work, we propose the panoptic-depth forecasting task for jointly predicting the panoptic segmentation and depth maps of unobserved future frames from monocular camera images. To facilitate this work, we extend the popular KITTI-360 and Cityscapes benchmarks by computing depth maps from LiDAR point clouds and leveraging sequential labeled data. We also introduce a suitable evaluation metric that quantifies both the panoptic quality and depth estimation accuracy of forecasts in a coherent manner. Furthermore, we present two baselines and propose the novel PDcast architecture that learns rich spatio-temporal representations by incorporating a transformer-based encoder, a forecasting module, and task-specific decoders to predict future panoptic-depth outputs. Extensive evaluations demonstrate the effectiveness of PDcast across two datasets and three forecasting tasks, consistently addressing the primary challenges. We make the code publicly available at https://pdcast.cs.uni-freiburg.de.
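
The abstract mentions extending KITTI-360 and Cityscapes by computing depth maps from LiDAR point clouds. The paper does not detail the procedure here, but the standard approach is to transform each LiDAR return into the camera frame and project it onto the image plane, keeping the nearest return per pixel. The sketch below illustrates that generic projection; the function name, the 0.1 m near-plane cutoff, and the collision-handling choice are illustrative assumptions, not the authors' pipeline.

```python
import numpy as np

def lidar_to_depth_map(points, T_cam_lidar, K, height, width):
    """Project LiDAR points into a camera to build a sparse depth map.

    points:       (N, 3) LiDAR points in the sensor frame.
    T_cam_lidar:  (4, 4) rigid transform from LiDAR to camera coordinates.
    K:            (3, 3) camera intrinsic matrix.
    Returns a (height, width) float map; pixels without a return are 0.
    """
    # Transform points into the camera coordinate frame (homogeneous coords).
    pts_h = np.hstack([points, np.ones((points.shape[0], 1))])   # (N, 4)
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]

    # Keep only points in front of the camera (assumed 0.1 m cutoff).
    pts_cam = pts_cam[pts_cam[:, 2] > 0.1]

    # Perspective projection onto the image plane.
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)

    valid = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z = u[valid], v[valid], pts_cam[valid, 2]

    # Resolve pixel collisions by keeping the nearest return: write points
    # far-to-near so the closest depth overwrites any farther one.
    depth = np.zeros((height, width), dtype=np.float32)
    order = np.argsort(-z)
    depth[v[order], u[order]] = z[order]
    return depth
```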
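The abstract also promises a metric that scores panoptic quality and depth accuracy "in a coherent manner" without defining it. One plausible shape for such a score, sketched below purely as an assumption, is a panoptic-quality (PQ) variant in which each true-positive segment's IoU is scaled by the fraction of its pixels whose predicted depth passes the standard delta < 1.25 inlier test. The paper's actual metric may be defined differently.

```python
import numpy as np

def depth_aware_pq(matches, num_fp, num_fn, pred_depth, gt_depth, tau=1.25):
    """Hypothetical depth-aware PQ; NOT the paper's definition.

    matches: list of (iou, gt_mask) pairs for matched segments (IoU > 0.5).
    Each true positive contributes IoU weighted by its depth-inlier ratio,
    so a segment forecast at the wrong distance scores lower.
    """
    pred_depth = np.clip(pred_depth, 1e-6, None)  # guard the ratio test
    score = 0.0
    for iou, gt_mask in matches:
        region = gt_mask & (gt_depth > 0)  # evaluate only where GT depth exists
        if region.sum() == 0:
            score += iou  # no depth ground truth; fall back to plain IoU
            continue
        ratio = np.maximum(pred_depth[region] / gt_depth[region],
                           gt_depth[region] / pred_depth[region])
        score += iou * (ratio < tau).mean()
    tp = len(matches)
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    return score / denom if denom > 0 else 0.0
```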
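Finally, the abstract describes PDcast only at the level of its three parts: a transformer-based encoder, a forecasting module, and task-specific decoders. The skeleton below shows how such a pattern typically composes in PyTorch; every layer choice, size, and the GRU-based forecaster are placeholders chosen for illustration, not PDcast's actual design (real decoders would produce dense panoptic and depth maps rather than per-frame vectors).

```python
import torch
import torch.nn as nn

class PanopticDepthForecaster(nn.Module):
    """Illustrative encoder / forecasting-module / decoder skeleton."""

    def __init__(self, dim=256, heads=8, layers=4, num_classes=19):
        super().__init__()
        # Per-frame patch embedding (stand-in for a real image backbone).
        self.backbone = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        # Transformer encoder over each frame's spatial tokens.
        enc_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)
        # Forecasting module: evolves past latents toward the future.
        self.forecast = nn.GRU(dim, dim, batch_first=True)
        # Task-specific heads for panoptic logits and depth.
        self.panoptic_head = nn.Linear(dim, num_classes)
        self.depth_head = nn.Linear(dim, 1)

    def forward(self, frames):
        # frames: (B, T, 3, H, W) past observations.
        b, t, _, _, _ = frames.shape
        feats = self.backbone(frames.flatten(0, 1))       # (B*T, D, h, w)
        tokens = feats.flatten(2).transpose(1, 2)         # (B*T, N, D)
        tokens = self.encoder(tokens).mean(dim=1)         # (B*T, D)
        tokens = tokens.view(b, t, -1)                    # (B, T, D)
        future, _ = self.forecast(tokens)                 # (B, T, D)
        last = future[:, -1]                              # forecast latent
        return self.panoptic_head(last), self.depth_head(last)
```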