COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark

Koki Maeda, Tosho Hirasawa, Atsushi Hashimoto, Jun Harashima, Leszek Rybicki, Yusuke Fukasawa, Yoshitaka Ushiku

arXiv - CS - Multimedia, 2024-08-05. https://doi.org/arxiv-2408.02272

Abstract
Procedural video understanding is gaining attention in the vision-and-language community. Deep-learning-based video analysis requires extensive data, so existing work often relies on web videos as training resources, which makes it challenging to query instructional content from raw video observations. To address this issue, we propose a new dataset, COM Kitchens. The dataset consists of unedited overhead-view videos captured by smartphones, in which participants prepared food following given recipes. Fixed-viewpoint video datasets often lack environmental diversity because of high camera-setup costs. We used modern wide-angle smartphone lenses to cover the cooking counter from sink to cooktop in an overhead view, capturing activity without in-person assistance. With this setup, we collected a diverse dataset by distributing smartphones to participants. On this dataset, we propose a novel video-to-text retrieval task, Online Recipe Retrieval (OnRR), and a new video-captioning domain, Dense Video Captioning on unedited Overhead-View videos (DVC-OV). Our experiments verified the capabilities and limitations of current web-video-based SOTA methods on these tasks.
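
Note: the abstract introduces OnRR as a video-to-text retrieval task but does not describe the retrieval formulation itself. As a rough, assumed illustration of what such retrieval typically involves (not the authors' method), the sketch below ranks candidate recipe texts against a single video by cosine similarity over precomputed embeddings; rank_recipes and both embedding arrays are hypothetical names.

    import numpy as np

    def rank_recipes(video_emb: np.ndarray, recipe_embs: np.ndarray) -> np.ndarray:
        """Return recipe indices sorted from best to worst match for one video."""
        # Assumed setup: embeddings come from some pretrained video/text encoders.
        v = video_emb / np.linalg.norm(video_emb)
        r = recipe_embs / np.linalg.norm(recipe_embs, axis=1, keepdims=True)
        scores = r @ v                      # cosine similarity per candidate recipe
        return np.argsort(-scores)          # highest similarity first

    # Toy usage: one 512-d video embedding against 100 candidate recipe embeddings.
    rng = np.random.default_rng(0)
    print(rank_recipes(rng.normal(size=512), rng.normal(size=(100, 512)))[:5])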