Towards a theory of how the structure of language is acquired by deep neural networks

arXiv - PHYS - Disordered Systems and Neural Networks. Pub Date: 2024-05-28. DOI: arxiv-2406.00048

Francesco Cagnetta, Matthieu Wyart



Abstract

How much data is required to learn the structure of a language via next-token prediction? We study this question for synthetic datasets generated via a Probabilistic Context-Free Grammar (PCFG), a hierarchical generative model that captures the tree-like structure of natural languages. We determine token-token correlations analytically in our model and show that they can be used to build a representation of the grammar's hidden variables: the longer the range of the correlations, the deeper the variable. In addition, a finite training set limits the resolution of correlations to an effective range, whose size grows with that of the training set. As a result, a Language Model trained with increasingly many examples can build a deeper representation of the grammar's structure, thus reaching good performance despite the high dimensionality of the problem. We conjecture that the relationship between training set size and effective range of correlations holds beyond our synthetic datasets. In particular, our conjecture predicts how the scaling law for the test loss as a function of training set size depends on the length of the context window, which we confirm empirically for a collection of lines from Shakespeare's plays.
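The idea of measuring token-token correlations in data sampled from a hierarchical grammar can be illustrated with a small sketch. This is not the authors' code or their exact model: the grammar below is a toy with randomly drawn production rules, and every name and size (`make_grammar`, `vocab_size=4`, `num_rules=2`, `branching=2`, `depth=3`) is an assumption chosen for illustration. The correlation measure, a sum of squared differences between joint and product-of-marginal frequencies, is likewise one simple choice among several.

```python
import random
from collections import Counter
from itertools import product


def make_grammar(vocab_size, num_rules, branching, depth, rng):
    """Draw random production rules for each level of a toy hierarchical grammar.

    grammar[level][symbol] is a list of `num_rules` tuples, each holding
    `branching` child symbols. All sizes are illustrative assumptions.
    """
    all_children = list(product(range(vocab_size), repeat=branching))
    grammar = []
    for _ in range(depth):
        level_rules = {}
        for symbol in range(vocab_size):
            # Each symbol gets its own random set of distinct expansions.
            level_rules[symbol] = rng.sample(all_children, num_rules)
        grammar.append(level_rules)
    return grammar


def sample_sentence(grammar, vocab_size, rng):
    """Expand a uniformly drawn root symbol level by level into leaf tokens."""
    symbols = [rng.randrange(vocab_size)]
    for level_rules in grammar:
        expanded = []
        for symbol in symbols:
            # Pick one of the symbol's rules uniformly at random.
            expanded.extend(rng.choice(level_rules[symbol]))
        symbols = expanded
    return symbols


def correlation_with_last_token(sentences, distance):
    """Empirical correlation between the last token and the one `distance` back.

    Returns the sum over (a, b) of (P(a, b) - P(a) P(b))^2, with all
    probabilities estimated from counts over `sentences`.
    """
    n = len(sentences)
    joint, left, right = Counter(), Counter(), Counter()
    for s in sentences:
        a, b = s[-1 - distance], s[-1]
        joint[(a, b)] += 1
        left[a] += 1
        right[b] += 1
    total = 0.0
    for a in left:
        for b in right:
            diff = joint[(a, b)] / n - (left[a] / n) * (right[b] / n)
            total += diff * diff
    return total


if __name__ == "__main__":
    rng = random.Random(0)
    grammar = make_grammar(vocab_size=4, num_rules=2, branching=2, depth=3, rng=rng)
    for num_sentences in (100, 10000):
        data = [sample_sentence(grammar, 4, rng) for _ in range(num_sentences)]
        for distance in (1, 2, 4):
            c = correlation_with_last_token(data, distance)
            print(num_sentences, distance, round(c, 5))
```

With `branching=2` and `depth=3` each sentence has 8 tokens, so distances 1, 2 and 4 from the last token pair it with tokens whose lowest common ancestor sits one, two and three levels up the tree, respectively. Because the frequencies are estimated from a finite sample, the measured value includes a sampling-noise contribution that shrinks as more sentences are used, which is one way to see why weaker, longer-range correlations need more data to be resolved.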
