{"title":"HVM-1:使用近 5000 小时类人视频数据预训练的大规模视频模型","authors":"A. Emin Orhan","doi":"arxiv-2407.18067","DOIUrl":null,"url":null,"abstract":"We introduce Human-like Video Models (HVM-1), large-scale video models\npretrained with nearly 5000 hours of curated human-like video data (mostly\negocentric, temporally extended, continuous video recordings), using the\nspatiotemporal masked autoencoder (ST-MAE) algorithm. We release two 633M\nparameter models trained at spatial resolutions of 224x224 and 448x448 pixels.\nWe evaluate the performance of these models in downstream few-shot video and\nimage recognition tasks and compare them against a model pretrained with 1330\nhours of short action-oriented video clips from YouTube (Kinetics-700). HVM-1\nmodels perform competitively against the Kinetics-700 pretrained model in\ndownstream evaluations despite substantial qualitative differences between the\nspatiotemporal characteristics of the corresponding pretraining datasets. HVM-1\nmodels also learn more accurate and more robust object representations compared\nto models pretrained with the image-based MAE algorithm on the same data,\ndemonstrating the potential benefits of learning to predict temporal\nregularities in natural videos for learning better object representations.","PeriodicalId":501517,"journal":{"name":"arXiv - QuanBio - Neurons and Cognition","volume":"53 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"HVM-1: Large-scale video models pretrained with nearly 5000 hours of human-like video data\",\"authors\":\"A. Emin Orhan\",\"doi\":\"arxiv-2407.18067\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We introduce Human-like Video Models (HVM-1), large-scale video models\\npretrained with nearly 5000 hours of curated human-like video data (mostly\\negocentric, temporally extended, continuous video recordings), using the\\nspatiotemporal masked autoencoder (ST-MAE) algorithm. We release two 633M\\nparameter models trained at spatial resolutions of 224x224 and 448x448 pixels.\\nWe evaluate the performance of these models in downstream few-shot video and\\nimage recognition tasks and compare them against a model pretrained with 1330\\nhours of short action-oriented video clips from YouTube (Kinetics-700). HVM-1\\nmodels perform competitively against the Kinetics-700 pretrained model in\\ndownstream evaluations despite substantial qualitative differences between the\\nspatiotemporal characteristics of the corresponding pretraining datasets. 
HVM-1\\nmodels also learn more accurate and more robust object representations compared\\nto models pretrained with the image-based MAE algorithm on the same data,\\ndemonstrating the potential benefits of learning to predict temporal\\nregularities in natural videos for learning better object representations.\",\"PeriodicalId\":501517,\"journal\":{\"name\":\"arXiv - QuanBio - Neurons and Cognition\",\"volume\":\"53 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-07-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - QuanBio - Neurons and Cognition\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2407.18067\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - QuanBio - Neurons and Cognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2407.18067","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
HVM-1: Large-scale video models pretrained with nearly 5000 hours of human-like video data
We introduce Human-like Video Models (HVM-1), large-scale video models pretrained with nearly 5000 hours of curated human-like video data (mostly egocentric, temporally extended, continuous video recordings), using the spatiotemporal masked autoencoder (ST-MAE) algorithm. We release two 633M-parameter models trained at spatial resolutions of 224x224 and 448x448 pixels. We evaluate the performance of these models on downstream few-shot video and image recognition tasks and compare them against a model pretrained with 1330 hours of short action-oriented video clips from YouTube (Kinetics-700). HVM-1 models perform competitively against the Kinetics-700 pretrained model in downstream evaluations despite substantial qualitative differences between the spatiotemporal characteristics of the corresponding pretraining datasets. HVM-1 models also learn more accurate and more robust object representations than models pretrained with the image-based MAE algorithm on the same data, demonstrating the potential benefits of learning to predict temporal regularities in natural videos for learning better object representations.
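
The abstract names the ST-MAE pretraining objective but does not spell out the masking step. Below is a minimal, illustrative PyTorch sketch of spatiotemporal "tube" masking of the kind used by masked video autoencoders; the patch size (2x16x16), the 90% mask ratio, and the helper name `tube_masking` are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch of ST-MAE-style spatiotemporal masking (hypothetical patch size
# and mask ratio; not the authors' implementation). A video clip is split into
# non-overlapping spatiotemporal patches ("tubes"), most are masked out, and an
# encoder/decoder pair is trained to reconstruct the masked patches in pixel space.
import torch

def tube_masking(video, patch=(2, 16, 16), mask_ratio=0.9):
    """video: (B, C, T, H, W) tensor. Returns all patch tokens, the visible subset, and the mask."""
    B, C, T, H, W = video.shape
    pt, ph, pw = patch
    # Split into non-overlapping spatiotemporal patches and flatten each patch into a token.
    tokens = (video
              .unfold(2, pt, pt).unfold(3, ph, ph).unfold(4, pw, pw)  # (B, C, T/pt, H/ph, W/pw, pt, ph, pw)
              .permute(0, 2, 3, 4, 1, 5, 6, 7)
              .reshape(B, -1, C * pt * ph * pw))                      # (B, N, D)
    N = tokens.shape[1]
    n_keep = int(N * (1 - mask_ratio))
    # Random per-sample masking: keep only a small subset of patch tokens.
    noise = torch.rand(B, N)
    ids_keep = noise.argsort(dim=1)[:, :n_keep]
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, tokens.shape[-1]))
    mask = torch.ones(B, N)
    mask.scatter_(1, ids_keep, 0.0)  # 1 = masked, 0 = visible
    return tokens, visible, mask

# Example: a 16-frame 224x224 RGB clip; ~90% of patch tokens are hidden from the encoder.
clip = torch.randn(1, 3, 16, 224, 224)
tokens, visible, mask = tube_masking(clip)
# A ViT-style encoder would process only `visible`, a light decoder would predict all
# tokens, and the reconstruction loss would be computed only where mask == 1.
```

In an ST-MAE-style setup, only the visible tokens pass through the encoder, which is what makes such high mask ratios computationally attractive for long video pretraining.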