HVM-1: Large-scale video models pretrained with nearly 5000 hours of human-like video data

arXiv - QuanBio - Neurons and Cognition | Pub Date: 2024-07-25 | DOI: arxiv-2407.18067

A. Emin Orhan

Abstract

We introduce Human-like Video Models (HVM-1), large-scale video models pretrained with nearly 5000 hours of curated human-like video data (mostly egocentric, temporally extended, continuous video recordings), using the spatiotemporal masked autoencoder (ST-MAE) algorithm. We release two 633M parameter models trained at spatial resolutions of 224x224 and 448x448 pixels. We evaluate the performance of these models in downstream few-shot video and image recognition tasks and compare them against a model pretrained with 1330 hours of short action-oriented video clips from YouTube (Kinetics-700). HVM-1 models perform competitively against the Kinetics-700 pretrained model in downstream evaluations despite substantial qualitative differences between the spatiotemporal characteristics of the corresponding pretraining datasets. HVM-1 models also learn more accurate and more robust object representations compared to models pretrained with the image-based MAE algorithm on the same data, demonstrating the potential benefits of learning to predict temporal regularities in natural videos for learning better object representations.
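The abstract names the ST-MAE algorithm without describing its masking step. As a rough illustration only, the sketch below divides a clip into spacetime patches and hides a random subset of them, which is the general idea behind masked-autoencoder pretraining on video. The function name `random_spacetime_mask`, the 16-frame clip length, the 2x16x16 patch size, and the 90% mask ratio are all assumptions chosen for illustration; none of them is stated in the abstract, and this is not the released training code.

```python
import random


def random_spacetime_mask(num_frames, height, width,
                          patch_t=2, patch_hw=16,
                          mask_ratio=0.9, seed=0):
    """Pick which spacetime patches stay visible to the encoder.

    All default values are illustrative assumptions, not settings
    reported for HVM-1.
    """
    # Number of patches along time, height, and width.
    grid = (num_frames // patch_t, height // patch_hw, width // patch_hw)
    num_tokens = grid[0] * grid[1] * grid[2]

    # Keep a small random fraction of patches; the rest are masked and
    # would have to be reconstructed by the decoder.
    num_keep = round(num_tokens * (1.0 - mask_ratio))
    rng = random.Random(seed)
    keep = sorted(rng.sample(range(num_tokens), num_keep))
    return grid, keep


if __name__ == "__main__":
    # The two spatial resolutions mentioned in the abstract.
    for res in (224, 448):
        grid, keep = random_spacetime_mask(16, res, res)
        total = grid[0] * grid[1] * grid[2]
        print(f"{res}x{res}: grid={grid}, tokens={total}, visible={len(keep)}")
```

Under these assumed settings the 448x448 input yields four times as many patches per clip as the 224x224 input, which gives a sense of why the higher-resolution model would be costlier to train.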
