Instance Tracking in 3D Scenes from Egocentric Videos

Yunhan Zhao, Haoyu Ma, Shu Kong, Charless Fowlkes
Venue: arXiv - CS - Computer Vision and Pattern Recognition
Published: 2023-12-07
DOI: arxiv-2312.04117 (https://doi.org/arxiv-2312.04117)
Citations: 0

Abstract

Egocentric sensors such as AR/VR devices capture human-object interactions and offer the potential to provide task assistance by recalling the 3D locations of objects of interest in the surrounding environment. This capability requires instance tracking in real-world 3D scenes from egocentric videos (IT3DEgo). We explore this problem by first introducing a new benchmark dataset consisting of RGB and depth videos, per-frame camera pose, and instance-level annotations in both 2D camera and 3D world coordinates. We present an evaluation protocol that measures tracking performance in 3D coordinates under two settings for enrolling instances to track: (1) single-view online enrollment, where an instance is specified on the fly based on the human wearer's interactions, and (2) multi-view pre-enrollment, where images of an instance to be tracked are stored in memory ahead of time. To address IT3DEgo, we first re-purpose methods from relevant areas, e.g., single object tracking (SOT): we run SOT methods to track instances in 2D frames and lift the results to 3D using camera pose and depth. We also present a simple method that leverages pretrained segmentation and detection models to generate proposals from RGB frames and matches those proposals with enrolled instance images. Perhaps surprisingly, our extensive experiments show that our method (with no finetuning) significantly outperforms SOT-based approaches. We conclude by arguing that the problem of egocentric instance tracking is made easier by leveraging camera pose and using a 3D allocentric (world) coordinate representation.
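The abstract mentions lifting 2D tracking results to 3D using camera pose and depth. The sketch below illustrates one common way this kind of lifting can be done; it is not taken from the paper. It assumes a pinhole camera model, a depth value in meters sampled at the center of the tracked 2D box, and a 4x4 camera-to-world pose matrix. All function and parameter names (`unproject_pixel`, `camera_to_world`, `lift_box_center`, `fx`, `fy`, `cx`, `cy`) are hypothetical.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def unproject_pixel(u: float, v: float, depth: float,
                    fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Back-project pixel (u, v) with metric depth into the camera frame.

    Assumes a pinhole model with focal lengths (fx, fy) and principal
    point (cx, cy); lens distortion is ignored.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def camera_to_world(p_cam: Vec3, pose: Sequence[Sequence[float]]) -> Vec3:
    """Apply a 4x4 camera-to-world pose (assumed convention) to a 3D point."""
    x, y, z = p_cam
    wx, wy, wz = (row[0] * x + row[1] * y + row[2] * z + row[3]
                  for row in pose[:3])
    return (wx, wy, wz)


def lift_box_center(box: Box, depth: float,
                    intrinsics: Tuple[float, float, float, float],
                    pose: Sequence[Sequence[float]]) -> Vec3:
    """Lift the center of a 2D box to a 3D point in world coordinates."""
    x_min, y_min, x_max, y_max = box
    u = 0.5 * (x_min + x_max)
    v = 0.5 * (y_min + y_max)
    fx, fy, cx, cy = intrinsics
    return camera_to_world(unproject_pixel(u, v, depth, fx, fy, cx, cy), pose)


if __name__ == "__main__":
    # Illustrative values only: identity rotation, camera at (1, 0, 0.5).
    pose = [
        [1.0, 0.0, 0.0, 1.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.5],
        [0.0, 0.0, 0.0, 1.0],
    ]
    intrinsics = (500.0, 500.0, 320.0, 240.0)
    print(lift_box_center((300.0, 220.0, 340.0, 260.0), 2.0, intrinsics, pose))
```

Because the lifted point lives in world coordinates, it stays fixed for a static object even as the wearer moves, which is in line with the abstract's argument for an allocentric representation. A real pipeline would also need to handle missing or noisy depth at the box center, which this sketch does not attempt.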