In recent years, models based on the transformer architecture have seen widespread applications and have become one of the core tools in the field of deep learning. Numerous successful and efficient techniques, such as parameter-efficient fine-tuning and efficient scaling, have been proposed surrounding their applications to further enhance performance. However, the success of these strategies has always lacked the support of rigorous mathematical theory. To study the underlying mechanisms behind transformers and related techniques, we first propose a transformer learning framework motivated by distribution regression, with distributions being inputs, connect a two-stage sampling process with natural language processing, and present a mathematical formulation of the attention mechanism called attention operator. We demonstrate that by the attention operator, transformers can compress distributions into function representations without loss of information. Moreover, with the advantages of our novel attention operator, transformers exhibit a stronger capability to learn functionals with more complex structures than convolutional neural networks and fully connected networks. Finally, we obtain a generalization bound within the distribution regression framework. Throughout theoretical results, we further discuss some successful techniques emerging with large language models (LLMs), such as prompt tuning, parameter-efficient fine-tuning, and efficient scaling. We also provide theoretical insights behind these techniques within our novel analysis framework.
{"title":"Generalization Analysis of Transformers in Distribution Regression","authors":"Peilin Liu;Ding-Xuan Zhou","doi":"10.1162/neco_a_01726","DOIUrl":"10.1162/neco_a_01726","url":null,"abstract":"In recent years, models based on the transformer architecture have seen widespread applications and have become one of the core tools in the field of deep learning. Numerous successful and efficient techniques, such as parameter-efficient fine-tuning and efficient scaling, have been proposed surrounding their applications to further enhance performance. However, the success of these strategies has always lacked the support of rigorous mathematical theory. To study the underlying mechanisms behind transformers and related techniques, we first propose a transformer learning framework motivated by distribution regression, with distributions being inputs, connect a two-stage sampling process with natural language processing, and present a mathematical formulation of the attention mechanism called attention operator. We demonstrate that by the attention operator, transformers can compress distributions into function representations without loss of information. Moreover, with the advantages of our novel attention operator, transformers exhibit a stronger capability to learn functionals with more complex structures than convolutional neural networks and fully connected networks. Finally, we obtain a generalization bound within the distribution regression framework. Throughout theoretical results, we further discuss some successful techniques emerging with large language models (LLMs), such as prompt tuning, parameter-efficient fine-tuning, and efficient scaling. We also provide theoretical insights behind these techniques within our novel analysis framework.","PeriodicalId":54731,"journal":{"name":"Neural Computation","volume":"37 2","pages":"260-293"},"PeriodicalIF":2.7,"publicationDate":"2025-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142669939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hebbian learning theory is rooted in Pavlov’s classical conditioning While mathematical models of the former have been proposed and studied in the past decades, especially in spin glass theory, only recently has it been numerically shown that it is possible to write neural and synaptic dynamics that mirror Pavlov conditioning mechanisms and also give rise to synaptic weights that correspond to the Hebbian learning rule. In this article we show that the same dynamics can be derived with equilibrium statistical mechanics tools and basic and motivated modeling assumptions. Then we show how to study the resulting system of coupled stochastic differential equations assuming the reasonable separation of neural and synaptic timescale. In particular, we analytically demonstrate that this synaptic evolution converges to the Hebbian learning rule in various settings and compute the variance of the stochastic process. Finally, drawing from evidence on pure memory reinforcement during sleep stages, we show how the proposed model can simulate neural networks that undergo sleep-associated memory consolidation processes, thereby proving the compatibility of Pavlovian learning with dreaming mechanisms.
{"title":"Learning in Associative Networks Through Pavlovian Dynamics","authors":"Daniele Lotito;Miriam Aquaro;Chiara Marullo","doi":"10.1162/neco_a_01730","DOIUrl":"10.1162/neco_a_01730","url":null,"abstract":"Hebbian learning theory is rooted in Pavlov’s classical conditioning While mathematical models of the former have been proposed and studied in the past decades, especially in spin glass theory, only recently has it been numerically shown that it is possible to write neural and synaptic dynamics that mirror Pavlov conditioning mechanisms and also give rise to synaptic weights that correspond to the Hebbian learning rule. In this article we show that the same dynamics can be derived with equilibrium statistical mechanics tools and basic and motivated modeling assumptions. Then we show how to study the resulting system of coupled stochastic differential equations assuming the reasonable separation of neural and synaptic timescale. In particular, we analytically demonstrate that this synaptic evolution converges to the Hebbian learning rule in various settings and compute the variance of the stochastic process. Finally, drawing from evidence on pure memory reinforcement during sleep stages, we show how the proposed model can simulate neural networks that undergo sleep-associated memory consolidation processes, thereby proving the compatibility of Pavlovian learning with dreaming mechanisms.","PeriodicalId":54731,"journal":{"name":"Neural Computation","volume":"37 2","pages":"311-343"},"PeriodicalIF":2.7,"publicationDate":"2025-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142774845","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Significant progress has been made recently in understanding the generalization of neural networks (NNs) trained by gradient descent (GD) using the algorithmic stability approach. However, most of the existing research has focused on one-hidden-layer NNs and has not addressed the impact of different network scaling. Here, network scaling corresponds to the normalization of the layers. In this article, we greatly extend the previous work (Lei et al., 2022; Richards & Kuzborskij, 2021) by conducting a comprehensive stability and generalization analysis of GD for two-layer and three-layer NNs. For two-layer NNs, our results are established under general network scaling, relaxing previous conditions. In the case of three-layer NNs, our technical contribution lies in demonstrating its nearly co-coercive property by utilizing a novel induction strategy that thoroughly explores the effects of overparameterization. As a direct application of our general findings, we derive the excess risk rate of O(1/n) for GD in both two-layer and three-layer NNs. This sheds light on sufficient or necessary conditions for underparameterized and overparameterized NNs trained by GD to attain the desired risk rate of O(1/n). Moreover, we demonstrate that as the scaling factor increases or the network complexity decreases, less overparameterization is required for GD to achieve the desired error rates. Additionally, under a low-noise condition, we obtain a fast risk rate of O(1/n) for GD in both two-layer and three-layer NNs.
近来,在利用算法稳定性方法理解通过梯度下降(GD)训练的神经网络(NN)的泛化方面取得了重大进展。然而,现有研究大多集中于单隐层神经网络,并未涉及不同网络规模的影响。在这里,网络缩放相当于层的规范化。在本文中,我们大大扩展了之前的工作(Lei 等人,2022;Richards & Kuzborskij,2021),对两层和三层 NN 的 GD 进行了全面的稳定性和泛化分析。对于两层 NN,我们的结果是在一般网络缩放条件下建立的,放宽了之前的条件。对于三层网络,我们的技术贡献在于利用一种新颖的归纳策略,彻底探讨了过参数化的影响,从而证明了其近乎协迫的特性。作为我们一般发现的直接应用,我们得出了两层和三层网络中 GD 的超额风险率为 O(1/n)。这揭示了通过 GD 训练的欠参数化和过参数化 NN 达到 O(1/n) 期望风险率的充分或必要条件。此外,我们还证明,随着缩放因子的增加或网络复杂度的降低,GD 所需的过参数化程度也会降低,从而达到所需的错误率。此外,在低噪声条件下,我们在两层和三层 NN 中都获得了 O(1/n)的快速风险率。
{"title":"Generalization Guarantees of Gradient Descent for Shallow Neural Networks","authors":"Puyu Wang;Yunwen Lei;Di Wang;Yiming Ying;Ding-Xuan Zhou","doi":"10.1162/neco_a_01725","DOIUrl":"10.1162/neco_a_01725","url":null,"abstract":"Significant progress has been made recently in understanding the generalization of neural networks (NNs) trained by gradient descent (GD) using the algorithmic stability approach. However, most of the existing research has focused on one-hidden-layer NNs and has not addressed the impact of different network scaling. Here, network scaling corresponds to the normalization of the layers. In this article, we greatly extend the previous work (Lei et al., 2022; Richards & Kuzborskij, 2021) by conducting a comprehensive stability and generalization analysis of GD for two-layer and three-layer NNs. For two-layer NNs, our results are established under general network scaling, relaxing previous conditions. In the case of three-layer NNs, our technical contribution lies in demonstrating its nearly co-coercive property by utilizing a novel induction strategy that thoroughly explores the effects of overparameterization. As a direct application of our general findings, we derive the excess risk rate of O(1/n) for GD in both two-layer and three-layer NNs. This sheds light on sufficient or necessary conditions for underparameterized and overparameterized NNs trained by GD to attain the desired risk rate of O(1/n). Moreover, we demonstrate that as the scaling factor increases or the network complexity decreases, less overparameterization is required for GD to achieve the desired error rates. Additionally, under a low-noise condition, we obtain a fast risk rate of O(1/n) for GD in both two-layer and three-layer NNs.","PeriodicalId":54731,"journal":{"name":"Neural Computation","volume":"37 2","pages":"344-402"},"PeriodicalIF":2.7,"publicationDate":"2025-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142666383","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Complex information processing systems that are capable of a wide variety of tasks, such as the human brain, are composed of specialized units that collaborate and communicate with each other. An important property of such information processing networks is locality: there is no single global unit controlling the modules, but information is exchanged locally. Here, we consider a decision-theoretic approach to study networks of bounded rational decision makers that are allowed to specialize and communicate with each other. In contrast to previous work that has focused on feedforward communication between decision-making agents, we consider cyclical information processing paths allowing for back-and-forth communication. We adapt message-passing algorithms to suit this purpose, essentially allowing for local information flow between units and thus enabling circular dependency structures. We provide examples that show how repeated communication can increase performance given that each unit’s information processing capability is limited and that decision-making systems with too few or too many connections and feedback loops achieve suboptimal utility.
{"title":"Bounded Rational Decision Networks With Belief Propagation","authors":"Gerrit Schmid;Sebastian Gottwald;Daniel A. Braun","doi":"10.1162/neco_a_01719","DOIUrl":"10.1162/neco_a_01719","url":null,"abstract":"Complex information processing systems that are capable of a wide variety of tasks, such as the human brain, are composed of specialized units that collaborate and communicate with each other. An important property of such information processing networks is locality: there is no single global unit controlling the modules, but information is exchanged locally. Here, we consider a decision-theoretic approach to study networks of bounded rational decision makers that are allowed to specialize and communicate with each other. In contrast to previous work that has focused on feedforward communication between decision-making agents, we consider cyclical information processing paths allowing for back-and-forth communication. We adapt message-passing algorithms to suit this purpose, essentially allowing for local information flow between units and thus enabling circular dependency structures. We provide examples that show how repeated communication can increase performance given that each unit’s information processing capability is limited and that decision-making systems with too few or too many connections and feedback loops achieve suboptimal utility.","PeriodicalId":54731,"journal":{"name":"Neural Computation","volume":"37 1","pages":"76-127"},"PeriodicalIF":2.7,"publicationDate":"2024-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10810330","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142395372","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Max Dabagia;Christos H. Papadimitriou;Santosh S. Vempala
Even as machine learning exceeds human-level performance on many applications, the generality, robustness, and rapidity of the brain’s learning capabilities remain unmatched. How cognition arises from neural activity is the central open question in neuroscience, inextricable from the study of intelligence itself. A simple formal model of neural activity was proposed in Papadimitriou et al. (2020) and has been subsequently shown, through both mathematical proofs and simulations, to be capable of implementing certain simple cognitive operations via the creation and manipulation of assemblies of neurons. However, many intelligent behaviors rely on the ability to recognize, store, and manipulate temporal sequences of stimuli (planning, language, navigation, to list a few). Here we show that in the same model, sequential precedence can be captured naturally through synaptic weights and plasticity, and, as a result, a range of computations on sequences of assemblies can be carried out. In particular, repeated presentation of a sequence of stimuli leads to the memorization of the sequence through corresponding neural assemblies: upon future presentation of any stimulus in the sequence, the corresponding assembly and its subsequent ones will be activated, one after the other, until the end of the sequence. If the stimulus sequence is presented to two brain areas simultaneously, a scaffolded representation is created, resulting in more efficient memorization and recall, in agreement with cognitive experiments. Finally, we show that any finite state machine can be learned in a similar way, through the presentation of appropriate patterns of sequences. Through an extension of this mechanism, the model can be shown to be capable of universal computation. Taken together, these results provide a concrete hypothesis for the basis of the brain’s remarkable abilities to compute and learn, with sequences playing a vital role.
{"title":"Computation With Sequences of Assemblies in a Model of the Brain","authors":"Max Dabagia;Christos H. Papadimitriou;Santosh S. Vempala","doi":"10.1162/neco_a_01720","DOIUrl":"10.1162/neco_a_01720","url":null,"abstract":"Even as machine learning exceeds human-level performance on many applications, the generality, robustness, and rapidity of the brain’s learning capabilities remain unmatched. How cognition arises from neural activity is the central open question in neuroscience, inextricable from the study of intelligence itself. A simple formal model of neural activity was proposed in Papadimitriou et al. (2020) and has been subsequently shown, through both mathematical proofs and simulations, to be capable of implementing certain simple cognitive operations via the creation and manipulation of assemblies of neurons. However, many intelligent behaviors rely on the ability to recognize, store, and manipulate temporal sequences of stimuli (planning, language, navigation, to list a few). Here we show that in the same model, sequential precedence can be captured naturally through synaptic weights and plasticity, and, as a result, a range of computations on sequences of assemblies can be carried out. In particular, repeated presentation of a sequence of stimuli leads to the memorization of the sequence through corresponding neural assemblies: upon future presentation of any stimulus in the sequence, the corresponding assembly and its subsequent ones will be activated, one after the other, until the end of the sequence. If the stimulus sequence is presented to two brain areas simultaneously, a scaffolded representation is created, resulting in more efficient memorization and recall, in agreement with cognitive experiments. Finally, we show that any finite state machine can be learned in a similar way, through the presentation of appropriate patterns of sequences. Through an extension of this mechanism, the model can be shown to be capable of universal computation. Taken together, these results provide a concrete hypothesis for the basis of the brain’s remarkable abilities to compute and learn, with sequences playing a vital role.","PeriodicalId":54731,"journal":{"name":"Neural Computation","volume":"37 1","pages":"193-233"},"PeriodicalIF":2.7,"publicationDate":"2024-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142395373","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Christopher J. Kymn;Denis Kleyko;E. Paxon Frady;Connor Bybee;Pentti Kanerva;Friedrich T. Sommer;Bruno A. Olshausen
We introduce residue hyperdimensional computing, a computing framework that unifies residue number systems with an algebra defined over random, high-dimensional vectors. We show how residue numbers can be represented as high-dimensional vectors in a manner that allows algebraic operations to be performed with component-wise, parallelizable operations on the vector elements. The resulting framework, when combined with an efficient method for factorizing high-dimensional vectors, can represent and operate on numerical values over a large dynamic range using resources that scale only logarithmically with the range, a vast improvement over previous methods. It also exhibits impressive robustness to noise. We demonstrate the potential for this framework to solve computationally difficult problems in visual perception and combinatorial optimization, showing improvement over baseline methods. More broadly, the framework provides a possible account for the computational operations of grid cells in the brain, and it suggests new machine learning architectures for representing and manipulating numerical data.
{"title":"Computing With Residue Numbers in High-Dimensional Representation","authors":"Christopher J. Kymn;Denis Kleyko;E. Paxon Frady;Connor Bybee;Pentti Kanerva;Friedrich T. Sommer;Bruno A. Olshausen","doi":"10.1162/neco_a_01723","DOIUrl":"10.1162/neco_a_01723","url":null,"abstract":"We introduce residue hyperdimensional computing, a computing framework that unifies residue number systems with an algebra defined over random, high-dimensional vectors. We show how residue numbers can be represented as high-dimensional vectors in a manner that allows algebraic operations to be performed with component-wise, parallelizable operations on the vector elements. The resulting framework, when combined with an efficient method for factorizing high-dimensional vectors, can represent and operate on numerical values over a large dynamic range using resources that scale only logarithmically with the range, a vast improvement over previous methods. It also exhibits impressive robustness to noise. We demonstrate the potential for this framework to solve computationally difficult problems in visual perception and combinatorial optimization, showing improvement over baseline methods. More broadly, the framework provides a possible account for the computational operations of grid cells in the brain, and it suggests new machine learning architectures for representing and manipulating numerical data.","PeriodicalId":54731,"journal":{"name":"Neural Computation","volume":"37 1","pages":"1-37"},"PeriodicalIF":2.7,"publicationDate":"2024-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142669937","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tomohiro Shiraishi;Daiki Miwa;Vo Nguyen Le Duy;Ichiro Takeuchi
In this study, we investigate the quantification of the statistical reliability of detected change points (CPs) in time series using a recurrent neural network (RNN). Thanks to its flexibility, RNN holds the potential to effectively identify CPs in time series characterized by complex dynamics. However, there is an increased risk of erroneously detecting random noise fluctuations as CPs. The primary goal of this study is to rigorously control the risk of false detections by providing theoretically valid p-values to the CPs detected by RNN. To achieve this, we introduce a novel method based on the framework of selective inference (SI). SI enables valid inferences by conditioning on the event of hypothesis selection, thus mitigating bias from generating and testing hypotheses on the same data. In this study, we apply an SI framework to RNN-based CP detection, where characterizing the complex process of RNN selecting CPs is our main technical challenge. We demonstrate the validity and effectiveness of the proposed method through artificial and real data experiments.
{"title":"Selective Inference for Change Point Detection by Recurrent Neural Network","authors":"Tomohiro Shiraishi;Daiki Miwa;Vo Nguyen Le Duy;Ichiro Takeuchi","doi":"10.1162/neco_a_01724","DOIUrl":"10.1162/neco_a_01724","url":null,"abstract":"In this study, we investigate the quantification of the statistical reliability of detected change points (CPs) in time series using a recurrent neural network (RNN). Thanks to its flexibility, RNN holds the potential to effectively identify CPs in time series characterized by complex dynamics. However, there is an increased risk of erroneously detecting random noise fluctuations as CPs. The primary goal of this study is to rigorously control the risk of false detections by providing theoretically valid p-values to the CPs detected by RNN. To achieve this, we introduce a novel method based on the framework of selective inference (SI). SI enables valid inferences by conditioning on the event of hypothesis selection, thus mitigating bias from generating and testing hypotheses on the same data. In this study, we apply an SI framework to RNN-based CP detection, where characterizing the complex process of RNN selecting CPs is our main technical challenge. We demonstrate the validity and effectiveness of the proposed method through artificial and real data experiments.","PeriodicalId":54731,"journal":{"name":"Neural Computation","volume":"37 1","pages":"160-192"},"PeriodicalIF":2.7,"publicationDate":"2024-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142666703","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Michele Garibbo;Casimir J. H. Ludwig;Nathan F. Lepora;Laurence Aitchison
In human error–based learning, the size and direction of a scalar error (i.e., the “directed error”) are used to update future actions. Modern deep reinforcement learning (RL) methods perform a similar operation but in terms of scalar rewards. Despite this similarity, the relationship between action updates of deep RL and human error–based learning has yet to be investigated. Here, we systematically compare the three major families of deep RL algorithms to human error–based learning. We show that all three deep RL approaches are qualitatively different from human error–based learning, as assessed by a mirror-reversal perturbation experiment. To bridge this gap, we developed an alternative deep RL algorithm inspired by human error–based learning, model-based deterministic policy gradients (MB-DPG). We showed that MB-DPG captures human error–based learning under mirror-reversal and rotational perturbations and that MB-DPG learns faster than canonical model-free algorithms on complex arm-based reaching tasks, while being more robust to (forward) model misspecification than model-based RL.
{"title":"Relating Human Error–Based Learning to Modern Deep RL Algorithms","authors":"Michele Garibbo;Casimir J. H. Ludwig;Nathan F. Lepora;Laurence Aitchison","doi":"10.1162/neco_a_01721","DOIUrl":"10.1162/neco_a_01721","url":null,"abstract":"In human error–based learning, the size and direction of a scalar error (i.e., the “directed error”) are used to update future actions. Modern deep reinforcement learning (RL) methods perform a similar operation but in terms of scalar rewards. Despite this similarity, the relationship between action updates of deep RL and human error–based learning has yet to be investigated. Here, we systematically compare the three major families of deep RL algorithms to human error–based learning. We show that all three deep RL approaches are qualitatively different from human error–based learning, as assessed by a mirror-reversal perturbation experiment. To bridge this gap, we developed an alternative deep RL algorithm inspired by human error–based learning, model-based deterministic policy gradients (MB-DPG). We showed that MB-DPG captures human error–based learning under mirror-reversal and rotational perturbations and that MB-DPG learns faster than canonical model-free algorithms on complex arm-based reaching tasks, while being more robust to (forward) model misspecification than model-based RL.","PeriodicalId":54731,"journal":{"name":"Neural Computation","volume":"37 1","pages":"128-159"},"PeriodicalIF":2.7,"publicationDate":"2024-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142395376","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The free energy principle (FEP) describes (biological) agents as minimizing a variational free energy (FE) with respect to a generative model of their environment. Active inference (AIF) is a corollary of the FEP that describes how agents explore and exploit their environment by minimizing an expected FE objective. In two related papers, we describe a scalable, epistemic approach to synthetic AIF by message passing on free-form Forney-style factor graphs (FFGs). A companion paper (part I of this article; Koudahl et al., 2023) introduces a constrained FFG (CFFG) notation that visually represents (generalized) FE objectives for AIF. This article (part II) derives message-passing algorithms that minimize (generalized) FE objectives on a CFFG by variational calculus. A comparison between simulated Bethe and generalized FE agents illustrates how the message-passing approach to synthetic AIF induces epistemic behavior on a T-maze navigation task. Extension of the T-maze simulation to learning goal statistics and a multiagent bargaining setting illustrate how this approach encourages reuse of nodes and updates in alternative settings. With a full message-passing account of synthetic AIF agents, it becomes possible to derive and reuse message updates across models and move closer to industrial applications of synthetic AIF.
自由能原理(FEP)将(生物)代理描述为相对于其环境的生成模型最小化可变自由能(FE)。主动推理(AIF)是自由能原理的必然结果,它描述了生物体如何通过最小化预期自由能目标来探索和利用其环境。在两篇相关论文中,我们描述了通过在自由形式的福尼式因子图(FFGs)上进行消息传递来合成 AIF 的可扩展认识论方法。另一篇相关论文(本文第一部分;Koudahl 等人,2023 年)介绍了一种受限 FFG(CFFG)符号,它能直观地表示 AIF 的(广义)FE 目标。本文(第二部分)通过变分法推导了在 CFFG 上最小化(广义)FE 目标的消息传递算法。模拟贝特代理和广义 FE 代理之间的比较说明了合成 AIF 的信息传递方法如何在 T 型迷宫导航任务中诱导认识行为。将 T 型迷宫模拟扩展到学习目标统计和多代理讨价还价设置,说明了这种方法如何鼓励在其他设置中重复使用节点和更新。有了合成 AIF 代理的完整消息传递账户,就有可能在不同模型中推导和重用消息更新,并更接近合成 AIF 的工业应用。
{"title":"Realizing Synthetic Active Inference Agents, Part II: Variational Message Updates","authors":"Thijs van de Laar;Magnus Koudahl;Bert de Vries","doi":"10.1162/neco_a_01713","DOIUrl":"10.1162/neco_a_01713","url":null,"abstract":"The free energy principle (FEP) describes (biological) agents as minimizing a variational free energy (FE) with respect to a generative model of their environment. Active inference (AIF) is a corollary of the FEP that describes how agents explore and exploit their environment by minimizing an expected FE objective. In two related papers, we describe a scalable, epistemic approach to synthetic AIF by message passing on free-form Forney-style factor graphs (FFGs). A companion paper (part I of this article; Koudahl et al., 2023) introduces a constrained FFG (CFFG) notation that visually represents (generalized) FE objectives for AIF. This article (part II) derives message-passing algorithms that minimize (generalized) FE objectives on a CFFG by variational calculus. A comparison between simulated Bethe and generalized FE agents illustrates how the message-passing approach to synthetic AIF induces epistemic behavior on a T-maze navigation task. Extension of the T-maze simulation to learning goal statistics and a multiagent bargaining setting illustrate how this approach encourages reuse of nodes and updates in alternative settings. With a full message-passing account of synthetic AIF agents, it becomes possible to derive and reuse message updates across models and move closer to industrial applications of synthetic AIF.","PeriodicalId":54731,"journal":{"name":"Neural Computation","volume":"37 1","pages":"38-75"},"PeriodicalIF":2.7,"publicationDate":"2024-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142309093","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Given a directed graph of nodes and edges connecting them, a common problem is to find the shortest path between any two nodes. Here we show that the shortest path distances can be found by a simple matrix inversion: if the edges are given by the adjacency matrix Aij, then with a suitably small value of γ, the shortest path distances are Dij=ceil(logγ[(I-γA)-1]ij).We derive several graph-theoretic bounds on the value of γ and explore its useful range with numerics on different graph types. Even when the distance function is not globally accurate across the entire graph, it still works locally to instruct pursuit of the shortest path. In this mode, it also extends to weighted graphs with positive edge weights. For a wide range of dense graphs, this distance function is computationally faster than the best available alternative. Finally, we show that this method leads naturally to a neural network solution of the all-pairs-shortest-path problem.
{"title":"A Fast Algorithm for All-Pairs-Shortest-Paths Suitable for Neural Networks","authors":"Zeyu Jing;Markus Meister","doi":"10.1162/neco_a_01716","DOIUrl":"10.1162/neco_a_01716","url":null,"abstract":"Given a directed graph of nodes and edges connecting them, a common problem is to find the shortest path between any two nodes. Here we show that the shortest path distances can be found by a simple matrix inversion: if the edges are given by the adjacency matrix Aij, then with a suitably small value of γ, the shortest path distances are Dij=ceil(logγ[(I-γA)-1]ij).We derive several graph-theoretic bounds on the value of γ and explore its useful range with numerics on different graph types. Even when the distance function is not globally accurate across the entire graph, it still works locally to instruct pursuit of the shortest path. In this mode, it also extends to weighted graphs with positive edge weights. For a wide range of dense graphs, this distance function is computationally faster than the best available alternative. Finally, we show that this method leads naturally to a neural network solution of the all-pairs-shortest-path problem.","PeriodicalId":54731,"journal":{"name":"Neural Computation","volume":"36 12","pages":"2710-2733"},"PeriodicalIF":2.7,"publicationDate":"2024-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142395362","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}