Authors: Yunhan Zhao, Haoyu Ma, Shu Kong, Charless Fowlkes
Source: arXiv - CS - Computer Vision and Pattern Recognition
Published: 2023-12-07
DOI: arxiv-2312.04117 (https://doi.org/arxiv-2312.04117)
Citations: 0
Instance Tracking in 3D Scenes from Egocentric Videos
Egocentric sensors such as AR/VR devices capture human-object interactions
and offer the potential to provide task assistance by recalling 3D locations of
objects of interest in the surrounding environment. This capability requires
instance tracking in real-world 3D scenes from egocentric videos (IT3DEgo). We
explore this problem by first introducing a new benchmark dataset, consisting
of RGB and depth videos, per-frame camera pose, and instance-level annotations
in both 2D camera and 3D world coordinates. We present an evaluation protocol
that measures tracking performance in 3D coordinates under two settings for
enrolling instances to track: (1) single-view online enrollment, where an
instance is specified on the fly based on the human wearer's interactions, and
(2) multi-view pre-enrollment, where images of an instance to be tracked are
stored in memory ahead of time. To address IT3DEgo, we first repurpose methods
from relevant areas, e.g., single object tracking (SOT): we run SOT methods
to track instances in 2D frames and lift them to 3D using camera pose and
depth. We also present a simple method that leverages pretrained segmentation
and detection models to generate proposals from RGB frames and match proposals
with enrolled instance images. Perhaps surprisingly, our extensive experiments
show that our method (with no fine-tuning) significantly outperforms SOT-based
approaches. We conclude by arguing that the problem of egocentric instance
tracking is made easier by leveraging camera pose and using a 3D allocentric
(world) coordinate representation.
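
The abstract mentions lifting 2D tracks to 3D using camera pose and depth. A minimal sketch of one way to do that is given below. It is not the authors' implementation. It assumes a pinhole camera with intrinsics (fx, fy, cx, cy), a depth map in metric units aligned with the RGB frame, and a 4x4 camera-to-world pose matrix. The function names, the box format (x_min, y_min, x_max, y_max), and the choice to sample depth at the box center are all hypothetical and chosen for illustration only.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def backproject(u: float, v: float, depth: float,
                fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Back-project pixel (u, v) with a depth value into camera coordinates,
    assuming a pinhole model (an assumption, not stated in the abstract)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def camera_to_world(p_cam: Vec3, pose: Sequence[Sequence[float]]) -> Vec3:
    """Apply a 4x4 camera-to-world pose [R | t; 0 0 0 1] to a 3D point."""
    x, y, z = p_cam
    wx, wy, wz = (
        pose[r][0] * x + pose[r][1] * y + pose[r][2] * z + pose[r][3]
        for r in range(3)
    )
    return (wx, wy, wz)


def lift_box_center(box: Sequence[float],
                    depth_map: Sequence[Sequence[float]],
                    intrinsics: Tuple[float, float, float, float],
                    pose: Sequence[Sequence[float]]) -> Vec3:
    """Lift the center of a 2D box (x_min, y_min, x_max, y_max) to world
    coordinates by reading the depth at the center pixel (hypothetical choice)."""
    u = (box[0] + box[2]) / 2.0
    v = (box[1] + box[3]) / 2.0
    depth = depth_map[int(v)][int(u)]  # row index is v, column index is u
    fx, fy, cx, cy = intrinsics
    return camera_to_world(backproject(u, v, depth, fx, fy, cx, cy), pose)


if __name__ == "__main__":
    # Toy inputs: a flat 4x4 depth map at 2 m and a camera translated in the world.
    depth_map = [[2.0] * 4 for _ in range(4)]
    pose = [[1, 0, 0, 1.0],
            [0, 1, 0, 0.0],
            [0, 0, 1, -0.5],
            [0, 0, 0, 1]]
    print(lift_box_center((2, 1, 4, 3), depth_map, (100.0, 100.0, 2.0, 2.0), pose))
```

Because the result is expressed in world coordinates, the estimate for a static object stays fixed as the wearer moves, which is the intuition behind the allocentric representation argued for above.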