{"title":"Unified Neural Network Scaling Laws and Scale-time Equivalence","authors":"Akhilan Boopathy, Ila Fiete","doi":"arxiv-2409.05782","DOIUrl":null,"url":null,"abstract":"As neural networks continue to grow in size but datasets might not, it is\nvital to understand how much performance improvement can be expected: is it\nmore important to scale network size or data volume? Thus, neural network\nscaling laws, which characterize how test error varies with network size and\ndata volume, have become increasingly important. However, existing scaling laws\nare often applicable only in limited regimes and often do not incorporate or\npredict well-known phenomena such as double descent. Here, we present a novel\ntheoretical characterization of how three factors -- model size, training time,\nand data volume -- interact to determine the performance of deep neural\nnetworks. We first establish a theoretical and empirical equivalence between\nscaling the size of a neural network and increasing its training time\nproportionally. Scale-time equivalence challenges the current practice, wherein\nlarge models are trained for small durations, and suggests that smaller models\ntrained over extended periods could match their efficacy. It also leads to a\nnovel method for predicting the performance of large-scale networks from\nsmall-scale networks trained for extended epochs, and vice versa. We next\ncombine scale-time equivalence with a linear model analysis of double descent\nto obtain a unified theoretical scaling law, which we confirm with experiments\nacross vision benchmarks and network architectures. These laws explain several\npreviously unexplained phenomena: reduced data requirements for generalization\nin larger models, heightened sensitivity to label noise in overparameterized\nmodels, and instances where increasing model scale does not necessarily enhance\nperformance. Our findings hold significant implications for the practical\ndeployment of neural networks, offering a more accessible and efficient path to\ntraining and fine-tuning large models.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"6 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - STAT - Machine Learning","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.05782","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
As neural networks continue to grow in size while datasets may not, it is vital to understand how much performance improvement can be expected: is it more important to scale network size or data volume? Neural network scaling laws, which characterize how test error varies with network size and data volume, have therefore become increasingly important. However, existing scaling laws often apply only in limited regimes and do not incorporate or predict well-known phenomena such as double descent. Here, we present a novel theoretical characterization of how three factors -- model size, training time, and data volume -- interact to determine the performance of deep neural networks. We first establish a theoretical and empirical equivalence between scaling the size of a neural network and increasing its training time proportionally. This scale-time equivalence challenges the current practice of training large models for short durations and suggests that smaller models trained over extended periods could match their performance. It also leads to a novel method for predicting the performance of large-scale networks from small-scale networks trained for many epochs, and vice versa. We next combine scale-time equivalence with a linear-model analysis of double descent to obtain a unified theoretical scaling law, which we confirm with experiments across vision benchmarks and network architectures. These laws explain several previously unexplained phenomena: reduced data requirements for generalization in larger models, heightened sensitivity to label noise in overparameterized models, and instances where increasing model scale does not enhance performance. Our findings have significant implications for the practical deployment of neural networks, offering a more accessible and efficient path to training and fine-tuning large models.
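
To make the scale-time equivalence concrete, below is a minimal Python sketch. It assumes a purely hypothetical error curve in which test error depends on model size and training epochs only through their product (plus a data-volume term); the functional form, the exponents, and the names toy_error and predict_large_from_small are illustrative inventions, not the paper's fitted scaling law.

# Illustrative sketch of scale-time equivalence (NOT the paper's fitted law).
# Assumption (hypothetical): test error depends on model size N and training
# epochs T only through their product N * T, plus a data-dependent term, so a
# model that is k times smaller can be matched by training it k times longer.

def toy_error(n_params: float, epochs: float, n_data: float,
              noise_floor: float = 0.05) -> float:
    """Hypothetical test-error curve used only to illustrate the idea.

    Error decays with the product n_params * epochs and with data volume
    n_data, down to an irreducible floor; the exponents are invented.
    """
    effective_scale = n_params * epochs
    return noise_floor + effective_scale ** -0.3 + n_data ** -0.5


def predict_large_from_small(n_small: float, n_large: float,
                             epochs_large: float, n_data: float) -> float:
    """Predict a large model's error from a small model trained longer.

    Under the assumed equivalence, a model with n_small parameters trained
    for epochs_large * (n_large / n_small) epochs should match a model with
    n_large parameters trained for epochs_large epochs.
    """
    epochs_small = epochs_large * (n_large / n_small)
    return toy_error(n_small, epochs_small, n_data)


if __name__ == "__main__":
    n_data = 1e5
    direct = toy_error(n_params=1e8, epochs=10, n_data=n_data)
    proxy = predict_large_from_small(n_small=1e6, n_large=1e8,
                                     epochs_large=10, n_data=n_data)
    print(f"large model, 10 epochs:       {direct:.4f}")
    print(f"small model, rescaled epochs: {proxy:.4f}")  # equal by construction

Under this assumed form, the small model trained for proportionally more epochs reproduces the large model's predicted error by construction, which is the prediction-transfer idea described in the abstract.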