The performance of large language models (LLMs) strongly depends on the temperature parameter. Empirically, at very low temperatures, LLMs generate sentences with clear repetitive structures, while at very high temperatures, generated sentences are often incomprehensible. In this study, using GPT-2, we numerically demonstrate that the difference between the two regimes is not just a smooth change but a phase transition with singular, divergent statistical quantities. Our extensive analysis shows that critical behaviors, such as a power-law decay of correlations in a text, emerge in the LLM at the transition temperature, as well as in a natural language dataset. We also argue that several statistical quantities characterizing the criticality should be useful for evaluating the performance of LLMs.
"Critical Phase Transition in a Large Language Model," Kai Nakaishi, Yoshihiko Nishikawa, Koji Hukushima. arXiv:2406.05335 (2024-06-08), arXiv - PHYS - Disordered Systems and Neural Networks.
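As a rough illustration of the temperature parameter the abstract refers to, here is a minimal sampling sketch (not the paper's GPT-2 setup; the logits below are invented): dividing the logits by the temperature before the softmax makes low-temperature sampling near-greedy and high-temperature sampling near-uniform.

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng):
    # Rescale logits by 1/T, softmax, then sample one token index.
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()                     # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return rng.choice(len(probs), p=probs)

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.5, -1.0])       # toy next-token logits

# Low T: near-greedy, repetitive. High T: near-uniform, incoherent.
low_T = [sample_with_temperature(logits, 0.05, rng) for _ in range(100)]
high_T = [sample_with_temperature(logits, 50.0, rng) for _ in range(100)]
```

The two regimes the paper studies correspond to the two extremes of this knob; the paper's claim is that the crossover between them is a genuine phase transition.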
In recent years, quantum Ising machines have drawn a lot of attention, but due to physical implementation constraints, it has been difficult to achieve dense coupling, such as full coupling with enough spins to handle practical large-scale applications. Consequently, classically computable equations have been derived from the quantum master equations of these quantum Ising machines. Parallel FPGA implementations of these algorithms have been used to rapidly find solutions to problems on a scale that is difficult to achieve in physical systems. We have developed an FPGA-implemented cyber coherent Ising machine (cyber CIM) that is much more versatile than previous FPGA implementations. Our architecture can be applied to the open-loop CIM, which was proposed when CIM research began, to the closed-loop CIM, which has been used more recently, and to the Jacobi successive over-relaxation method. By modifying the sequence control code of the calculation control module, other algorithms such as Simulated Bifurcation (SB) can also be implemented. Earlier large-scale FPGA implementations of SB and CIM used binary or ternary discrete values for connections, whereas the cyber CIM uses FP32 values. The cyber CIM also supports Zeeman terms represented in FP32, which were absent from other large-scale FPGA systems. Our implementation with continuous interactions realizes N=4096 on a single FPGA, comparable to the single-FPGA implementation of SB with binary interactions at N=4096. The cyber CIM enables applications such as CDMA multi-user detection and L0 compressed sensing, which were not possible with earlier FPGA systems, while achieving calculation speeds more than ten times faster than a GPU implementation. The calculation speed can be further improved by increasing parallelism, for example through clustering.
"Highly Versatile FPGA-Implemented Cyber Coherent Ising Machine," Toru Aonishi, Tatsuya Nagasawa, Toshiyuki Koizumi, Mastiyage Don Sudeera Hasaranga Gunathilaka, Kazushi Mimura, Masato Okada, Satoshi Kako, Yoshihisa Yamamoto. arXiv:2406.05377 (2024-06-08), arXiv - PHYS - Disordered Systems and Neural Networks.
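The classical CIM dynamics the abstract alludes to can be caricatured by the mean-field amplitude equations dx_i/dt = (p - 1 - x_i^2) x_i + sum_j J_ij x_j. The sketch below is a plain Euler integration on a toy 4-spin ferromagnet, not the authors' FP32 FPGA design; the pump and coupling values are made up. Above the bifurcation threshold the amplitudes grow and their signs settle into a low-energy Ising configuration.

```python
import numpy as np

def cim_relax(J, pump=1.5, dt=0.01, steps=2000, seed=1):
    # Euler-integrate the mean-field amplitude equations
    # dx_i/dt = (p - 1 - x_i^2) x_i + sum_j J_ij x_j  (toy sketch).
    rng = np.random.default_rng(seed)
    x = 0.01 * rng.standard_normal(len(J))   # small random initial amplitudes
    for _ in range(steps):
        x += dt * ((pump - 1.0 - x**2) * x + J @ x)
    return np.sign(x)                        # read out the Ising spins

N = 4
J = 0.1 * (np.ones((N, N)) - np.eye(N))      # ferromagnetic all-to-all couplings
spins = cim_relax(J)                         # ground state: all spins aligned
```

For this frustration-free instance the aligned configuration dominates the linear growth, so the readout recovers the ferromagnetic ground state.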
One of the most impressive applications of a quantum annealer was optimizing the routes of a group of Volkswagen cars to reduce traffic congestion using a D-Wave system. A simple quadratic cost term was proposed to reduce traffic congestion, and it was useful for determining the shortest routes among several candidates. The original formulation decreased both the total length of the car tours and the traffic congestion. In this study, we reformulated the cost function with the sole focus on reducing traffic congestion. We then found a unique cost function expressing a quadratic function with a dead zone and an inequality constraint.
"Reconsideration of optimization for reduction of traffic congestion," Masayuki Ohzeki. arXiv:2406.05448 (2024-06-08), arXiv - PHYS - Disordered Systems and Neural Networks.
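In the original formulation the quadratic congestion term is, in essence, a sum over street segments of the squared number of cars routed through that segment. A toy brute-force sketch of this objective (the cars, route names, and segments are invented; the actual one-hot QUBO encoding and annealer pipeline are not shown):

```python
from itertools import product

# Toy instance: 2 cars, each choosing between routes "A" and "B",
# where each route is a set of street segments (all names hypothetical).
routes = {
    ("car1", "A"): {"s1", "s2"},
    ("car1", "B"): {"s3", "s4"},
    ("car2", "A"): {"s1", "s2"},
    ("car2", "B"): {"s3", "s5"},
}

def congestion(assignment):
    # Quadratic congestion cost: sum over segments of (cars on segment)^2.
    load = {}
    for car, route in assignment:
        for seg in routes[(car, route)]:
            load[seg] = load.get(seg, 0) + 1
    return sum(n * n for n in load.values())

# Enumerate all one-route-per-car assignments (brute force; a QUBO solver
# would minimize the same objective on large instances).
best = min(
    (tuple((c, r) for c, r in zip(("car1", "car2"), choice))
     for choice in product("AB", repeat=2)),
    key=congestion,
)
```

The squared load penalizes two cars sharing a segment more than twice as much as one car alone, which is what spreads traffic across routes.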
With dimensionless conductances as scaling variables, the conventional one-parameter scaling theory of localization fails for non-reciprocal non-Hermitian systems such as the Hatano-Nelson model. Here, we propose a one-parameter scaling function using the participation ratio as the scaling variable. Employing a highly accurate numerical procedure based on exact diagonalization, we demonstrate that this one-parameter scaling function can describe Anderson localization transitions of non-reciprocal non-Hermitian systems in one and two dimensions for symmetry classes AI and A. The critical exponents of the correlation lengths depend only on symmetry and dimensionality, a typical feature of universality. Moreover, we derive a complex-gap equation based on the self-consistent Born approximation that determines the disorder at which the point gap closes. The obtained disorders match perfectly the critical disorders of the Anderson localization transitions obtained from the one-parameter scaling function. Finally, we show that the one-parameter scaling function is also valid for Anderson localization transitions in reciprocal non-Hermitian systems, such as the two-dimensional class AII$^\dagger$, and can thus serve as a unified scaling function for disordered non-Hermitian systems.
"Unified one-parameter scaling function for Anderson localization transitions in non-reciprocal non-Hermitian systems," C. Wang, Wenxue He, X. R. Wang, Hechen Ren. arXiv:2406.01984 (2024-06-04), arXiv - PHYS - Disordered Systems and Neural Networks.
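The participation ratio used as the scaling variable above has a simple definition: for a normalized eigenstate it interpolates between 1 (wavefunction concentrated on a single site, localized) and N (uniform over N sites, extended). A minimal sketch of the quantity itself (the example states are illustrative, not the paper's Hatano-Nelson eigenstates):

```python
import numpy as np

def participation_ratio(psi):
    # PR = (sum_i |psi_i|^2)^2 / sum_i |psi_i|^4:
    # 1 for a state on a single site, N for a state uniform over N sites.
    p = np.abs(psi) ** 2
    return p.sum() ** 2 / (p ** 2).sum()

N = 8
uniform = np.ones(N) / np.sqrt(N)   # fully extended state: PR = N
localized = np.zeros(N)
localized[0] = 1.0                  # fully localized state: PR = 1
```

How PR scales with system size near the critical disorder is what the paper's one-parameter scaling function collapses onto a single curve.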
We discuss prototype formation in the Hopfield network. Typically, Hebbian learning with highly correlated states leads to degraded memory performance. We show this type of learning can lead to prototype formation, where unlearned states emerge as representatives of large correlated subsets of states, alleviating capacity woes. This process has similarities to prototype learning in human cognition. We provide a substantial literature review of prototype learning in associative memories, covering contributions from psychology, statistical physics, and computer science. We analyze prototype formation from a theoretical perspective and derive a stability condition for these states based on the number of examples of the prototype presented for learning, the noise in those examples, and the number of non-example states presented. The stability condition is used to construct a probability of stability for a prototype state as the factors of stability change. We also note similarities to traditional network analysis, allowing us to find a prototype capacity. We corroborate these expectations of prototype formation with experiments using a simple Hopfield network with standard Hebbian learning. We extend our experiments to a Hopfield network trained on data with multiple prototypes and find the network is capable of stabilizing multiple prototypes concurrently. We measure the basins of attraction of the multiple prototype states, finding attractor strength grows with the number of examples and the agreement of examples. We link the stability and dominance of prototype states to the energy profile of these states, particularly when comparing the profile shape to target states or other spurious states.
"Prototype Analysis in Hopfield Networks with Hebbian Learning," Hayden McAlister, Anthony Robins, Lech Szymanski. arXiv:2407.03342 (2024-05-29), arXiv - PHYS - Disordered Systems and Neural Networks.
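The prototype-formation effect described above can be reproduced in a few lines: store only noisy examples with the standard Hebbian rule and check that the never-stored prototype is nevertheless a fixed point of the sign dynamics. A minimal sketch (the network size, example count, and noise level are illustrative, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)
N, n_examples, flip_p = 200, 20, 0.1

prototype = rng.choice([-1, 1], size=N)
# Noisy examples: each bit of the prototype flipped with probability flip_p.
flips = rng.random((n_examples, N)) < flip_p
examples = prototype * np.where(flips, -1, 1)

# Standard Hebbian learning on the examples only; the prototype itself
# is never presented to the network.
W = examples.T @ examples / N
np.fill_diagonal(W, 0.0)

# One synchronous update starting from the unlearned prototype state.
updated = np.sign(W @ prototype)
overlap = (updated == prototype).mean()   # fraction of stable bits
```

With enough, sufficiently clean examples the signal term aligned with the prototype dominates the crosstalk, so the prototype emerges as a stable attractor; this is the regime the paper's stability condition delineates.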
How much data is required to learn the structure of a language via next-token prediction? We study this question for synthetic datasets generated by a Probabilistic Context-Free Grammar (PCFG) -- a hierarchical generative model that captures the tree-like structure of natural languages. We determine token-token correlations analytically in our model and show that they can be used to build a representation of the grammar's hidden variables: the longer the range, the deeper the variable. In addition, a finite training set limits the resolution of correlations to an effective range whose size grows with that of the training set. As a result, a Language Model trained on increasingly many examples can build a deeper representation of the grammar's structure, thus reaching good performance despite the high dimensionality of the problem. We conjecture that this relationship between training set size and the effective range of correlations holds beyond our synthetic datasets. In particular, our conjecture predicts how the scaling of the test loss with training set size depends on the length of the context window, which we confirm empirically on a collection of lines from Shakespeare's plays.
"Towards a theory of how the structure of language is acquired by deep neural networks," Francesco Cagnetta, Matthieu Wyart. arXiv:2406.00048 (2024-05-28), arXiv - PHYS - Disordered Systems and Neural Networks.
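To make the PCFG setup concrete, here is a tiny hypothetical grammar and sampler (the rules below are invented for illustration and are not the grammar used in the paper): each nonterminal expands into a weighted choice of right-hand sides, and recursive expansion yields a token sequence whose statistics reflect the tree structure.

```python
import random

# A tiny hypothetical PCFG: nonterminals map to weighted expansions;
# lowercase symbols are terminal tokens.
RULES = {
    "S":  [(["NP", "VP"], 1.0)],
    "NP": [(["det", "noun"], 0.7), (["det", "adj", "noun"], 0.3)],
    "VP": [(["verb", "NP"], 0.5), (["verb"], 0.5)],
}

def expand(symbol, rng):
    # Recursively expand a symbol into a flat list of terminal tokens.
    if symbol not in RULES:
        return [symbol]
    expansions, weights = zip(*RULES[symbol])
    chosen = rng.choices(expansions, weights=weights)[0]
    return [tok for s in chosen for tok in expand(s, rng)]

rng = random.Random(0)
sentences = [expand("S", rng) for _ in range(5)]
```

Token-token correlations in such sequences decay with distance in a way controlled by the depth of the latent tree, which is the mechanism behind the paper's link between training set size and the depth of the learned representation.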
Multi-index models -- functions that depend on the covariates only through a non-linear transformation of their projection onto a subspace -- are a useful benchmark for investigating feature learning with neural networks. This paper examines the theoretical boundaries of learnability in this hypothesis class, focusing on the minimum sample complexity required for weakly recovering the low-dimensional structure with first-order iterative algorithms, in the high-dimensional regime where the number of samples $n = \alpha d$ is proportional to the covariate dimension $d$. Our findings unfold in three parts: (i) first, we identify the conditions under which a \textit{trivial subspace} can be learned with a single step of a first-order algorithm for any $\alpha > 0$; (ii) second, when the trivial subspace is empty, we provide necessary and sufficient conditions for the existence of an \textit{easy subspace}, consisting of directions that can be learned only above a certain sample complexity $\alpha > \alpha_{c}$. The critical threshold $\alpha_{c}$ marks the presence of a computational phase transition, in the sense that no efficient iterative algorithm can succeed for $\alpha < \alpha_{c}$.
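Weak recovery with a single first-order step can be illustrated on a single-index toy model, the simplest special case of the setting above (the teacher direction, link function, and sample ratios below are invented for illustration, not the paper's construction): one gradient step from zero initialization already correlates with the hidden direction, and the correlation strengthens as the sample ratio alpha grows.

```python
import numpy as np

def one_step_overlap(d, alpha, seed=0):
    # Overlap between the hidden teacher direction and a single
    # first-order (gradient) step from zero initialization.
    rng = np.random.default_rng(seed)
    n = int(alpha * d)
    w_star = np.zeros(d)
    w_star[0] = 1.0                      # hidden direction (illustrative)
    X = rng.standard_normal((n, d))      # n = alpha * d Gaussian covariates
    y = np.sign(X @ w_star)              # single-index labels y = sign(w* . x)
    g = X.T @ y / n                      # first-step direction (correlation loss)
    return abs(g[0]) / np.linalg.norm(g)

# More samples per dimension -> stronger recovery of the hidden direction.
low, high = one_step_overlap(200, 0.5), one_step_overlap(200, 8.0)
```

The noise part of the gradient has norm of order sqrt(d/n) = sqrt(1/alpha), so the overlap with the teacher improves as alpha grows, in line with the sample-complexity picture the paper formalizes.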