{"title":"HVM-1: Large-scale video models pretrained with nearly 5000 hours of human-like video data","authors":"A. Emin Orhan","doi":"arxiv-2407.18067","DOIUrl":null,"url":null,"abstract":"We introduce Human-like Video Models (HVM-1), large-scale video models\npretrained with nearly 5000 hours of curated human-like video data (mostly\negocentric, temporally extended, continuous video recordings), using the\nspatiotemporal masked autoencoder (ST-MAE) algorithm. We release two 633M\nparameter models trained at spatial resolutions of 224x224 and 448x448 pixels.\nWe evaluate the performance of these models in downstream few-shot video and\nimage recognition tasks and compare them against a model pretrained with 1330\nhours of short action-oriented video clips from YouTube (Kinetics-700). HVM-1\nmodels perform competitively against the Kinetics-700 pretrained model in\ndownstream evaluations despite substantial qualitative differences between the\nspatiotemporal characteristics of the corresponding pretraining datasets. HVM-1\nmodels also learn more accurate and more robust object representations compared\nto models pretrained with the image-based MAE algorithm on the same data,\ndemonstrating the potential benefits of learning to predict temporal\nregularities in natural videos for learning better object representations.","PeriodicalId":501517,"journal":{"name":"arXiv - QuanBio - Neurons and Cognition","volume":"53 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - QuanBio - Neurons and Cognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2407.18067","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
We introduce Human-like Video Models (HVM-1), large-scale video models pretrained with nearly 5000 hours of curated human-like video data (mostly egocentric, temporally extended, continuous video recordings), using the spatiotemporal masked autoencoder (ST-MAE) algorithm. We release two 633M parameter models trained at spatial resolutions of 224x224 and 448x448 pixels. We evaluate the performance of these models in downstream few-shot video and image recognition tasks and compare them against a model pretrained with 1330 hours of short action-oriented video clips from YouTube (Kinetics-700). HVM-1 models perform competitively against the Kinetics-700 pretrained model in downstream evaluations despite substantial qualitative differences between the spatiotemporal characteristics of the corresponding pretraining datasets. HVM-1 models also learn more accurate and more robust object representations compared to models pretrained with the image-based MAE algorithm on the same data, demonstrating the potential benefits of learning to predict temporal regularities in natural videos for learning better object representations.
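The abstract names the spatiotemporal masked autoencoder (ST-MAE) objective as the pretraining algorithm. The sketch below illustrates the general idea in PyTorch: a clip is split into spatiotemporal tubelets, a large random fraction of tokens is masked, a transformer encodes only the visible tokens, and a decoder reconstructs the pixels of the masked tubelets. Everything here is an illustrative assumption (the TinySTMAE class, the small 256-dim encoder, the 90% mask ratio, the linear decoder), not the paper's 633M-parameter HVM-1 configuration.

```python
# Minimal, illustrative sketch of ST-MAE-style pretraining (assumed settings,
# not the HVM-1 implementation).
import torch
import torch.nn as nn


class TinySTMAE(nn.Module):
    """Toy spatiotemporal masked autoencoder for short video clips."""

    def __init__(self, img_size=224, frames=16, patch=16, tubelet=2,
                 dim=256, depth=4, heads=8, mask_ratio=0.9):
        super().__init__()
        self.patch, self.tubelet, self.mask_ratio = patch, tubelet, mask_ratio
        self.num_patches = (frames // tubelet) * (img_size // patch) ** 2
        patch_dim = 3 * tubelet * patch * patch
        # Tubelet embedding: a 3D conv turns the clip into a token sequence.
        self.embed = nn.Conv3d(3, dim, kernel_size=(tubelet, patch, patch),
                               stride=(tubelet, patch, patch))
        self.pos = nn.Parameter(torch.zeros(1, self.num_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        # A full ST-MAE uses a small transformer decoder; a linear head keeps
        # this sketch short.
        self.decoder = nn.Linear(dim, patch_dim)

    def patchify(self, video):
        # (B, 3, T, H, W) -> (B, N, 3*tubelet*patch*patch) pixel targets,
        # in the same token order produced by the tubelet embedding.
        B, C, T, H, W = video.shape
        t, p = self.tubelet, self.patch
        x = video.reshape(B, C, T // t, t, H // p, p, W // p, p)
        x = x.permute(0, 2, 4, 6, 3, 5, 7, 1)
        return x.reshape(B, -1, t * p * p * C)

    def forward(self, video):  # video: (B, 3, T, H, W)
        tokens = self.embed(video).flatten(2).transpose(1, 2) + self.pos
        B, N, D = tokens.shape
        # Random masking: encode only a small visible subset of tokens.
        keep = int(N * (1 - self.mask_ratio))
        ids = torch.rand(B, N, device=video.device).argsort(dim=1)
        ids_keep, ids_mask = ids[:, :keep], ids[:, keep:]
        visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
        latent = self.encoder(visible)
        # Put encoded tokens back in place; masked slots get a learned token.
        full = self.mask_token.expand(B, N, D).clone()
        full.scatter_(1, ids_keep.unsqueeze(-1).expand(-1, -1, D), latent)
        pred = self.decoder(full)            # (B, N, patch_dim)
        target = self.patchify(video)
        # Reconstruction loss is computed on masked positions only.
        per_token = ((pred - target) ** 2).mean(dim=-1)
        mask = torch.zeros(B, N, device=video.device)
        mask.scatter_(1, ids_mask, 1.0)
        return (per_token * mask).sum() / mask.sum()


model = TinySTMAE()
clip = torch.randn(2, 3, 16, 224, 224)  # two random 16-frame 224x224 clips
loss = model(clip)
loss.backward()
```

The random clip at the end only exercises the module; in practice the masked-reconstruction loss would be minimized with a standard optimizer over clips sampled from the pretraining videos, and the pretrained encoder would then be evaluated on the downstream few-shot video and image recognition tasks described in the abstract.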