COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark

Koki Maeda, Tosho Hirasawa, Atsushi Hashimoto, Jun Harashima, Leszek Rybicki, Yusuke Fukasawa, Yoshitaka Ushiku
{"title":"COM 厨房:作为视觉语言基准的未编辑俯视视频数据集","authors":"Koki Maeda, Tosho Hirasawa, Atsushi Hashimoto, Jun Harashima, Leszek Rybicki, Yusuke Fukasawa, Yoshitaka Ushiku","doi":"arxiv-2408.02272","DOIUrl":null,"url":null,"abstract":"Procedural video understanding is gaining attention in the vision and\nlanguage community. Deep learning-based video analysis requires extensive data.\nConsequently, existing works often use web videos as training resources, making\nit challenging to query instructional contents from raw video observations. To\naddress this issue, we propose a new dataset, COM Kitchens. The dataset\nconsists of unedited overhead-view videos captured by smartphones, in which\nparticipants performed food preparation based on given recipes. Fixed-viewpoint\nvideo datasets often lack environmental diversity due to high camera setup\ncosts. We used modern wide-angle smartphone lenses to cover cooking counters\nfrom sink to cooktop in an overhead view, capturing activity without in-person\nassistance. With this setup, we collected a diverse dataset by distributing\nsmartphones to participants. With this dataset, we propose the novel\nvideo-to-text retrieval task Online Recipe Retrieval (OnRR) and new video\ncaptioning domain Dense Video Captioning on unedited Overhead-View videos\n(DVC-OV). Our experiments verified the capabilities and limitations of current\nweb-video-based SOTA methods in handling these tasks.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"467 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark\",\"authors\":\"Koki Maeda, Tosho Hirasawa, Atsushi Hashimoto, Jun Harashima, Leszek Rybicki, Yusuke Fukasawa, Yoshitaka Ushiku\",\"doi\":\"arxiv-2408.02272\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Procedural video understanding is gaining attention in the vision and\\nlanguage community. Deep learning-based video analysis requires extensive data.\\nConsequently, existing works often use web videos as training resources, making\\nit challenging to query instructional contents from raw video observations. To\\naddress this issue, we propose a new dataset, COM Kitchens. The dataset\\nconsists of unedited overhead-view videos captured by smartphones, in which\\nparticipants performed food preparation based on given recipes. Fixed-viewpoint\\nvideo datasets often lack environmental diversity due to high camera setup\\ncosts. We used modern wide-angle smartphone lenses to cover cooking counters\\nfrom sink to cooktop in an overhead view, capturing activity without in-person\\nassistance. With this setup, we collected a diverse dataset by distributing\\nsmartphones to participants. With this dataset, we propose the novel\\nvideo-to-text retrieval task Online Recipe Retrieval (OnRR) and new video\\ncaptioning domain Dense Video Captioning on unedited Overhead-View videos\\n(DVC-OV). 
Our experiments verified the capabilities and limitations of current\\nweb-video-based SOTA methods in handling these tasks.\",\"PeriodicalId\":501480,\"journal\":{\"name\":\"arXiv - CS - Multimedia\",\"volume\":\"467 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-08-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Multimedia\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2408.02272\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.02272","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Procedural video understanding is gaining attention in the vision-and-language community. Deep learning-based video analysis requires extensive data, so existing work often relies on web videos as training resources, which makes it challenging to query instructional content from raw video observations. To address this issue, we propose a new dataset, COM Kitchens. The dataset consists of unedited overhead-view videos captured by smartphones, in which participants prepared food according to given recipes. Fixed-viewpoint video datasets often lack environmental diversity because of high camera-setup costs. We used modern wide-angle smartphone lenses to cover the cooking counter from sink to cooktop in an overhead view, capturing activity without in-person assistance; by distributing smartphones to participants under this setup, we collected a diverse dataset. On this dataset we propose a novel video-to-text retrieval task, Online Recipe Retrieval (OnRR), and a new video-captioning domain, Dense Video Captioning on unedited Overhead-View videos (DVC-OV). Our experiments verify the capabilities and limitations of current web-video-based state-of-the-art (SOTA) methods on these tasks.
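To make the OnRR task concrete: given an unedited cooking video, a system must retrieve the matching recipe text from a pool of candidates. The sketch below is a minimal, hypothetical illustration of how such video-to-text retrieval is commonly scored (cosine similarity between precomputed embeddings, summarized as Recall@K). It is not the COM Kitchens evaluation code: the embeddings are random placeholders for real encoder output, and the online aspect of OnRR (retrieving from a partially watched stream) is abstracted away.

```python
# Minimal sketch of video-to-text retrieval scoring (illustrative only;
# not the COM Kitchens evaluation code). Assumes video and recipe-text
# embeddings were already produced by some vision-language encoder.
import numpy as np

def recall_at_k(video_emb: np.ndarray, text_emb: np.ndarray, ks=(1, 5, 10)):
    """video_emb: (N, D) video embeddings; text_emb: (N, D) recipe-text
    embeddings, with row i of each referring to the same recipe."""
    # L2-normalize so the dot product equals cosine similarity.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sim = v @ t.T                                  # (N, N) similarity matrix
    # Rank of the correct recipe for each video query (0 = top hit).
    order = np.argsort(-sim, axis=1)
    ranks = np.argmax(order == np.arange(len(sim))[:, None], axis=1)
    return {f"R@{k}": float(np.mean(ranks < k)) for k in ks}

# Toy usage with random embeddings standing in for real encoder output.
rng = np.random.default_rng(0)
videos, texts = rng.normal(size=(100, 512)), rng.normal(size=(100, 512))
print(recall_at_k(videos, texts))
```

Normalizing both sides first makes the dot product a cosine similarity, the standard choice for retrieval benchmarks of this kind.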
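The companion DVC-OV task asks for dense captions over the same unedited footage: a set of temporally localized segments, each paired with a step description. The sketch below shows generic dense-video-captioning bookkeeping, a (start, end, caption) record plus a temporal-IoU matcher of the kind typically used to pair predictions with ground-truth steps before caption quality is scored. The names and the greedy matching policy are assumptions for illustration, not the DVC-OV benchmark protocol.

```python
# Generic dense-video-captioning bookkeeping (illustrative only; not the
# DVC-OV benchmark code). Output format and a temporal-IoU matcher.
from dataclasses import dataclass

@dataclass
class CaptionedSegment:
    start: float   # seconds from the start of the unedited video
    end: float
    caption: str   # e.g. one recipe step ("dice the onion")

def temporal_iou(a: CaptionedSegment, b: CaptionedSegment) -> float:
    inter = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union if union > 0 else 0.0

def match_segments(preds, gts, tiou_thresh=0.5):
    """Greedily pair each ground-truth step with the best unused
    prediction whose temporal IoU clears the threshold."""
    used, pairs = set(), []
    for gt in gts:
        best, best_iou = None, tiou_thresh
        for i, p in enumerate(preds):
            iou = temporal_iou(p, gt)
            if i not in used and iou >= best_iou:
                best, best_iou = i, iou
        if best is not None:
            used.add(best)
            pairs.append((preds[best], gt))
    return pairs  # caption quality (e.g. CIDEr) is scored over these pairs

# Toy usage: one prediction overlapping one ground-truth step.
pred = [CaptionedSegment(0.0, 12.5, "wash the vegetables")]
gt = [CaptionedSegment(1.0, 13.0, "rinse and drain the vegetables")]
print(len(match_segments(pred, gt)))  # 1: tIoU ~0.88 clears the threshold
```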