{"title":"HVM-1:使用近 5000 小时类人视频数据预训练的大规模视频模型","authors":"A. Emin Orhan","doi":"arxiv-2407.18067","DOIUrl":null,"url":null,"abstract":"We introduce Human-like Video Models (HVM-1), large-scale video models\npretrained with nearly 5000 hours of curated human-like video data (mostly\negocentric, temporally extended, continuous video recordings), using the\nspatiotemporal masked autoencoder (ST-MAE) algorithm. We release two 633M\nparameter models trained at spatial resolutions of 224x224 and 448x448 pixels.\nWe evaluate the performance of these models in downstream few-shot video and\nimage recognition tasks and compare them against a model pretrained with 1330\nhours of short action-oriented video clips from YouTube (Kinetics-700). HVM-1\nmodels perform competitively against the Kinetics-700 pretrained model in\ndownstream evaluations despite substantial qualitative differences between the\nspatiotemporal characteristics of the corresponding pretraining datasets. HVM-1\nmodels also learn more accurate and more robust object representations compared\nto models pretrained with the image-based MAE algorithm on the same data,\ndemonstrating the potential benefits of learning to predict temporal\nregularities in natural videos for learning better object representations.","PeriodicalId":501517,"journal":{"name":"arXiv - QuanBio - Neurons and Cognition","volume":"53 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"HVM-1: Large-scale video models pretrained with nearly 5000 hours of human-like video data\",\"authors\":\"A. Emin Orhan\",\"doi\":\"arxiv-2407.18067\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We introduce Human-like Video Models (HVM-1), large-scale video models\\npretrained with nearly 5000 hours of curated human-like video data (mostly\\negocentric, temporally extended, continuous video recordings), using the\\nspatiotemporal masked autoencoder (ST-MAE) algorithm. We release two 633M\\nparameter models trained at spatial resolutions of 224x224 and 448x448 pixels.\\nWe evaluate the performance of these models in downstream few-shot video and\\nimage recognition tasks and compare them against a model pretrained with 1330\\nhours of short action-oriented video clips from YouTube (Kinetics-700). HVM-1\\nmodels perform competitively against the Kinetics-700 pretrained model in\\ndownstream evaluations despite substantial qualitative differences between the\\nspatiotemporal characteristics of the corresponding pretraining datasets. 
HVM-1\\nmodels also learn more accurate and more robust object representations compared\\nto models pretrained with the image-based MAE algorithm on the same data,\\ndemonstrating the potential benefits of learning to predict temporal\\nregularities in natural videos for learning better object representations.\",\"PeriodicalId\":501517,\"journal\":{\"name\":\"arXiv - QuanBio - Neurons and Cognition\",\"volume\":\"53 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-07-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - QuanBio - Neurons and Cognition\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2407.18067\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - QuanBio - Neurons and Cognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2407.18067","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
HVM-1: Large-scale video models pretrained with nearly 5000 hours of human-like video data
We introduce Human-like Video Models (HVM-1), large-scale video models pretrained with nearly 5000 hours of curated human-like video data (mostly egocentric, temporally extended, continuous video recordings), using the spatiotemporal masked autoencoder (ST-MAE) algorithm. We release two 633M-parameter models trained at spatial resolutions of 224x224 and 448x448 pixels. We evaluate the performance of these models on downstream few-shot video and image recognition tasks and compare them against a model pretrained with 1330 hours of short action-oriented video clips from YouTube (Kinetics-700). HVM-1 models perform competitively against the Kinetics-700 pretrained model in downstream evaluations despite substantial qualitative differences between the spatiotemporal characteristics of the corresponding pretraining datasets. HVM-1 models also learn more accurate and more robust object representations than models pretrained with the image-based MAE algorithm on the same data, demonstrating the potential benefits of learning to predict temporal regularities in natural videos for learning better object representations.
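
The abstract names the ST-MAE pretraining objective but does not spell out the masking step. Below is a minimal, illustrative PyTorch sketch of spatiotemporal "tube" masking of the kind used by masked video autoencoders; the patch size (2x16x16), the 90% mask ratio, and the helper name `tube_masking` are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch of ST-MAE-style spatiotemporal masking (hypothetical patch size
# and mask ratio; not the authors' implementation). A video clip is split into
# non-overlapping spatiotemporal patches ("tubes"), most are masked out, and an
# encoder/decoder pair is trained to reconstruct the masked patches in pixel space.
import torch

def tube_masking(video, patch=(2, 16, 16), mask_ratio=0.9):
    """video: (B, C, T, H, W) tensor. Returns all patch tokens, the visible subset, and the mask."""
    B, C, T, H, W = video.shape
    pt, ph, pw = patch
    # Split into non-overlapping spatiotemporal patches and flatten each patch into a token.
    tokens = (video
              .unfold(2, pt, pt).unfold(3, ph, ph).unfold(4, pw, pw)  # (B, C, T/pt, H/ph, W/pw, pt, ph, pw)
              .permute(0, 2, 3, 4, 1, 5, 6, 7)
              .reshape(B, -1, C * pt * ph * pw))                      # (B, N, D)
    N = tokens.shape[1]
    n_keep = int(N * (1 - mask_ratio))
    # Random per-sample masking: keep only a small subset of patch tokens.
    noise = torch.rand(B, N)
    ids_keep = noise.argsort(dim=1)[:, :n_keep]
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, tokens.shape[-1]))
    mask = torch.ones(B, N)
    mask.scatter_(1, ids_keep, 0.0)  # 1 = masked, 0 = visible
    return tokens, visible, mask

# Example: a 16-frame 224x224 RGB clip; ~90% of patch tokens are hidden from the encoder.
clip = torch.randn(1, 3, 16, 224, 224)
tokens, visible, mask = tube_masking(clip)
# A ViT-style encoder would process only `visible`, a light decoder would predict all
# tokens, and the reconstruction loss would be computed only where mask == 1.
```

In an ST-MAE-style setup, only the visible tokens pass through the encoder, which is what makes such high mask ratios computationally attractive for long video pretraining.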