Unified Neural Network Scaling Laws and Scale-time Equivalence

Akhilan Boopathy, Ila Fiete
{"title":"Unified Neural Network Scaling Laws and Scale-time Equivalence","authors":"Akhilan Boopathy, Ila Fiete","doi":"arxiv-2409.05782","DOIUrl":null,"url":null,"abstract":"As neural networks continue to grow in size but datasets might not, it is\nvital to understand how much performance improvement can be expected: is it\nmore important to scale network size or data volume? Thus, neural network\nscaling laws, which characterize how test error varies with network size and\ndata volume, have become increasingly important. However, existing scaling laws\nare often applicable only in limited regimes and often do not incorporate or\npredict well-known phenomena such as double descent. Here, we present a novel\ntheoretical characterization of how three factors -- model size, training time,\nand data volume -- interact to determine the performance of deep neural\nnetworks. We first establish a theoretical and empirical equivalence between\nscaling the size of a neural network and increasing its training time\nproportionally. Scale-time equivalence challenges the current practice, wherein\nlarge models are trained for small durations, and suggests that smaller models\ntrained over extended periods could match their efficacy. It also leads to a\nnovel method for predicting the performance of large-scale networks from\nsmall-scale networks trained for extended epochs, and vice versa. We next\ncombine scale-time equivalence with a linear model analysis of double descent\nto obtain a unified theoretical scaling law, which we confirm with experiments\nacross vision benchmarks and network architectures. These laws explain several\npreviously unexplained phenomena: reduced data requirements for generalization\nin larger models, heightened sensitivity to label noise in overparameterized\nmodels, and instances where increasing model scale does not necessarily enhance\nperformance. Our findings hold significant implications for the practical\ndeployment of neural networks, offering a more accessible and efficient path to\ntraining and fine-tuning large models.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - STAT - Machine Learning","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.05782","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

As neural networks continue to grow in size while datasets may not keep pace, it is vital to understand how much performance improvement can be expected: is it more important to scale network size or data volume? Neural network scaling laws, which characterize how test error varies with network size and data volume, have therefore become increasingly important. However, existing scaling laws are often applicable only in limited regimes and frequently fail to capture well-known phenomena such as double descent. Here, we present a novel theoretical characterization of how three factors -- model size, training time, and data volume -- interact to determine the performance of deep neural networks. We first establish a theoretical and empirical equivalence between scaling the size of a neural network and increasing its training time proportionally. This scale-time equivalence challenges the current practice, wherein large models are trained for short durations, and suggests that smaller models trained over extended periods could match their efficacy. It also leads to a novel method for predicting the performance of large-scale networks from small-scale networks trained for extended epochs, and vice versa. We next combine scale-time equivalence with a linear-model analysis of double descent to obtain a unified theoretical scaling law, which we confirm with experiments across vision benchmarks and network architectures. These laws explain several previously unexplained phenomena: reduced data requirements for generalization in larger models, heightened sensitivity to label noise in overparameterized models, and instances where increasing model scale does not necessarily enhance performance. Our findings hold significant implications for the practical deployment of neural networks, offering a more accessible and efficient path to training and fine-tuning large models.
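The scale-time equivalence described in the abstract lends itself to a simple extrapolation recipe. The sketch below is not from the paper; it assumes, purely for illustration, that test error at fixed data volume depends mainly on the product of model width and training epochs, and it fits a power law in that "effective scale" to hypothetical small-model runs in order to estimate a wider model's error after a shorter training run. All widths, epoch counts, and error values are made up.

```python
# Minimal sketch (not the authors' code) of using scale-time equivalence
# to extrapolate performance: if scaling width by a factor k is roughly
# interchangeable with training k times longer, a wide model's test error
# can be estimated from a narrow model trained proportionally longer.
import numpy as np

# Hypothetical (width, epochs, test_error) measurements for a small model
# trained for increasing durations on a fixed dataset.
small_width = 64
small_runs = [
    (small_width, 10, 0.42),
    (small_width, 20, 0.35),
    (small_width, 40, 0.30),
    (small_width, 80, 0.27),
]

# Assumed functional form: test error is approximately a power law in an
# "effective scale" variable proportional to width * epochs.
eff = np.array([w * e for w, e, _ in small_runs], dtype=float)
err = np.array([r for _, _, r in small_runs], dtype=float)
slope, intercept = np.polyfit(np.log(eff), np.log(err), 1)

def predict_error(width: int, epochs: int) -> float:
    """Extrapolate test error for a (width, epochs) configuration."""
    return float(np.exp(intercept) * (width * epochs) ** slope)

# A 4x wider model trained for 20 epochs is treated as equivalent to the
# small model trained 4x longer (80 epochs).
print(predict_error(width=256, epochs=20))
print(predict_error(width=small_width, epochs=80))
```

Under the fitted law the two calls return the same estimate, since 256 * 20 = 64 * 80; the point is only to show how small, long-trained runs could stand in for large, short-trained ones, not to claim this specific functional form matches the paper's law.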