Meta-learning is characterized by its ability to learn how to learn, enabling the adaptation of learning strategies across different tasks. Recent research introduced Meta-Thompson Sampling (Meta-TS), which meta-learns an unknown prior distribution sampled from a meta-prior by interacting with bandit instances drawn from it. However, its analysis was limited to the Gaussian bandit. The linear contextual bandit framework extends the Gaussian bandit by challenging the agent to use context vectors to predict the most valuable arms, optimally balancing exploration and exploitation to minimize regret over time. This paper introduces the Meta-TSLB algorithm, a modified Meta-TS for linear contextual bandits. We theoretically analyze Meta-TSLB and derive an $O\left(\left(m+\log\left(m\right)\right)\sqrt{n\log\left(n\right)}\right)$ bound on its Bayes regret, where $m$ is the number of bandit instances and $n$ the number of rounds of Thompson Sampling. Additionally, our work complements the analysis of Meta-TS for linear contextual bandits. The performance of Meta-TSLB is evaluated experimentally under different settings, and we experimentally analyze the generalization capability of Meta-TSLB, showcasing its potential to adapt to unseen instances.
{"title":"Modified Meta-Thompson Sampling for Linear Bandits and Its Bayes Regret Analysis","authors":"Hao Li, Dong Liang, Zheng Xie","doi":"arxiv-2409.06329","DOIUrl":"https://doi.org/arxiv-2409.06329","url":null,"abstract":"Meta-learning is characterized by its ability to learn how to learn, enabling\u0000the adaptation of learning strategies across different tasks. Recent research\u0000introduced the Meta-Thompson Sampling (Meta-TS), which meta-learns an unknown\u0000prior distribution sampled from a meta-prior by interacting with bandit\u0000instances drawn from it. However, its analysis was limited to Gaussian bandit.\u0000The contextual multi-armed bandit framework is an extension of the Gaussian\u0000Bandit, which challenges agent to utilize context vectors to predict the most\u0000valuable arms, optimally balancing exploration and exploitation to minimize\u0000regret over time. This paper introduces Meta-TSLB algorithm, a modified Meta-TS\u0000for linear contextual bandits. We theoretically analyze Meta-TSLB and derive an\u0000$ Oleft( left( m+log left( m right) right) sqrt{nlog left( n right)}\u0000right)$ bound on its Bayes regret, in which $m$ represents the number of\u0000bandit instances, and $n$ the number of rounds of Thompson Sampling.\u0000Additionally, our work complements the analysis of Meta-TS for linear\u0000contextual bandits. The performance of Meta-TSLB is evaluated experimentally\u0000under different settings, and we experimente and analyze the generalization\u0000capability of Meta-TSLB, showcasing its potential to adapt to unseen instances.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"16 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142206648","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
As Large Language Models become more ubiquitous across domains, it becomes important to examine their inherent limitations critically. This work argues that hallucinations in language models are not just occasional errors but an inevitable feature of these systems. We demonstrate that hallucinations stem from the fundamental mathematical and logical structure of LLMs. It is, therefore, impossible to eliminate them through architectural improvements, dataset enhancements, or fact-checking mechanisms. Our analysis draws on computational theory and Gödel's First Incompleteness Theorem, which references the undecidability of problems like the Halting, Emptiness, and Acceptance Problems. We demonstrate that every stage of the LLM process, from training data compilation to fact retrieval, intent classification, and text generation, will have a non-zero probability of producing hallucinations. This work introduces the concept of Structural Hallucination as an intrinsic property of these systems. By establishing the mathematical certainty of hallucinations, we challenge the prevailing notion that they can be fully mitigated.
{"title":"LLMs Will Always Hallucinate, and We Need to Live With This","authors":"Sourav Banerjee, Ayushi Agarwal, Saloni Singla","doi":"arxiv-2409.05746","DOIUrl":"https://doi.org/arxiv-2409.05746","url":null,"abstract":"As Large Language Models become more ubiquitous across domains, it becomes\u0000important to examine their inherent limitations critically. This work argues\u0000that hallucinations in language models are not just occasional errors but an\u0000inevitable feature of these systems. We demonstrate that hallucinations stem\u0000from the fundamental mathematical and logical structure of LLMs. It is,\u0000therefore, impossible to eliminate them through architectural improvements,\u0000dataset enhancements, or fact-checking mechanisms. Our analysis draws on\u0000computational theory and Godel's First Incompleteness Theorem, which references\u0000the undecidability of problems like the Halting, Emptiness, and Acceptance\u0000Problems. We demonstrate that every stage of the LLM process-from training data\u0000compilation to fact retrieval, intent classification, and text generation-will\u0000have a non-zero probability of producing hallucinations. This work introduces\u0000the concept of Structural Hallucination as an intrinsic nature of these\u0000systems. By establishing the mathematical certainty of hallucinations, we\u0000challenge the prevailing notion that they can be fully mitigated.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"32 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142206692","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multitask learning is a widely used paradigm for training models on diverse tasks, with applications ranging from graph neural networks to language model fine-tuning. Since tasks may interfere with each other, a key notion for modeling their relationships is task affinity. This includes pairwise task affinity, computed among pairs of tasks, and higher-order affinity, computed among subsets of tasks. Naively computing either of them requires repeatedly training on data from various task combinations, which is computationally intensive. We present a new algorithm, Grad-TAG, that can estimate task affinities without this repeated training. The key idea of Grad-TAG is to train a "base" model for all tasks and then use a linearization technique to estimate the loss of the model for a specific task combination. The linearization works by computing a gradient-based approximation of the loss, using low-dimensional projections of gradients as features in a logistic regression to predict labels for the task combination. We show that the linearized model can provably approximate the loss when the gradient-based approximation is accurate, and we empirically verify this on several large models. Then, given the estimated task affinity, we design a semi-definite program for clustering similar tasks by maximizing the average density of clusters. We evaluate Grad-TAG's performance across seven datasets, including multi-label classification on graphs and instruction fine-tuning of language models. Our task affinity estimates are within 2.7% of the true affinities while requiring only 3% of the FLOPs of full training. On our largest graph, with 21M edges and 500 labeling tasks, our algorithm delivers estimates within 5% of the true affinities using only 112 GPU hours. Our results show that Grad-TAG achieves excellent performance and runtime tradeoffs compared to existing approaches.
{"title":"Scalable Multitask Learning Using Gradient-based Estimation of Task Affinity","authors":"Dongyue Li, Aneesh Sharma, Hongyang R. Zhang","doi":"arxiv-2409.06091","DOIUrl":"https://doi.org/arxiv-2409.06091","url":null,"abstract":"Multitask learning is a widely used paradigm for training models on diverse\u0000tasks, with applications ranging from graph neural networks to language model\u0000fine-tuning. Since tasks may interfere with each other, a key notion for\u0000modeling their relationships is task affinity. This includes pairwise task\u0000affinity, computed among pairs of tasks, and higher-order affinity, computed\u0000among subsets of tasks. Naively computing either of them requires repeatedly\u0000training on data from various task combinations, which is computationally\u0000intensive. We present a new algorithm Grad-TAG that can estimate task\u0000affinities without this repeated training. The key idea of Grad-TAG is to train a \"base\" model for all tasks and then\u0000use a linearization technique to estimate the loss of the model for a specific\u0000task combination. The linearization works by computing a gradient-based\u0000approximation of the loss, using low-dimensional projections of gradients as\u0000features in a logistic regression to predict labels for the task combination.\u0000We show that the linearized model can provably approximate the loss when the\u0000gradient-based approximation is accurate, and also empirically verify that on\u0000several large models. Then, given the estimated task affinity, we design a\u0000semi-definite program for clustering similar tasks by maximizing the average\u0000density of clusters. We evaluate Grad-TAG's performance across seven datasets, including\u0000multi-label classification on graphs, and instruction fine-tuning of language\u0000models. Our task affinity estimates are within 2.7% distance to the true\u0000affinities while needing only 3% of FLOPs in full training. On our largest\u0000graph with 21M edges and 500 labeling tasks, our algorithm delivers estimates\u0000within 5% distance to the true affinities, using only 112 GPU hours. Our\u0000results show that Grad-TAG achieves excellent performance and runtime tradeoffs\u0000compared to existing approaches.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142206743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tristan Thrush, Christopher Potts, Tatsunori Hashimoto
Quality pretraining data is often seen as the key to high-performance language models. However, progress in understanding pretraining data has been slow due to the costly pretraining runs required for data selection experiments. We present a framework that avoids these costs and selects high-quality pretraining data without any LLM training of our own. Our work is based on a simple observation: LLM losses on many pretraining texts are correlated with downstream benchmark performance, and selecting high-correlation documents is an effective pretraining data selection method. We build a new statistical framework for data selection centered around estimates of perplexity-benchmark correlations and perform data selection using a sample of 90 LLMs taken from the Open LLM Leaderboard on texts from tens of thousands of web domains. In controlled pretraining experiments at the 160M parameter scale on 8 benchmarks, our approach outperforms DSIR on every benchmark, while matching the best data selector found in DataComp-LM, a hand-engineered bigram classifier.
{"title":"Improving Pretraining Data Using Perplexity Correlations","authors":"Tristan Thrush, Christopher Potts, Tatsunori Hashimoto","doi":"arxiv-2409.05816","DOIUrl":"https://doi.org/arxiv-2409.05816","url":null,"abstract":"Quality pretraining data is often seen as the key to high-performance\u0000language models. However, progress in understanding pretraining data has been\u0000slow due to the costly pretraining runs required for data selection\u0000experiments. We present a framework that avoids these costs and selects\u0000high-quality pretraining data without any LLM training of our own. Our work is\u0000based on a simple observation: LLM losses on many pretraining texts are\u0000correlated with downstream benchmark performance, and selecting\u0000high-correlation documents is an effective pretraining data selection method.\u0000We build a new statistical framework for data selection centered around\u0000estimates of perplexity-benchmark correlations and perform data selection using\u0000a sample of 90 LLMs taken from the Open LLM Leaderboard on texts from tens of\u0000thousands of web domains. In controlled pretraining experiments at the 160M\u0000parameter scale on 8 benchmarks, our approach outperforms DSIR on every\u0000benchmark, while matching the best data selector found in DataComp-LM, a\u0000hand-engineered bigram classifier.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"9 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142206742","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
As neural networks continue to grow in size while datasets may not keep pace, it is vital to understand how much performance improvement can be expected: is it more important to scale network size or data volume? Thus, neural network scaling laws, which characterize how test error varies with network size and data volume, have become increasingly important. However, existing scaling laws are often applicable only in limited regimes and often do not incorporate or predict well-known phenomena such as double descent. Here, we present a novel theoretical characterization of how three factors -- model size, training time, and data volume -- interact to determine the performance of deep neural networks. We first establish a theoretical and empirical equivalence between scaling the size of a neural network and increasing its training time proportionally. Scale-time equivalence challenges the current practice, wherein large models are trained for small durations, and suggests that smaller models trained over extended periods could match their efficacy. It also leads to a novel method for predicting the performance of large-scale networks from small-scale networks trained for extended epochs, and vice versa. We next combine scale-time equivalence with a linear model analysis of double descent to obtain a unified theoretical scaling law, which we confirm with experiments across vision benchmarks and network architectures. These laws explain several previously unexplained phenomena: reduced data requirements for generalization in larger models, heightened sensitivity to label noise in overparameterized models, and instances where increasing model scale does not necessarily enhance performance. Our findings hold significant implications for the practical deployment of neural networks, offering a more accessible and efficient path to training and fine-tuning large models.
{"title":"Unified Neural Network Scaling Laws and Scale-time Equivalence","authors":"Akhilan Boopathy, Ila Fiete","doi":"arxiv-2409.05782","DOIUrl":"https://doi.org/arxiv-2409.05782","url":null,"abstract":"As neural networks continue to grow in size but datasets might not, it is\u0000vital to understand how much performance improvement can be expected: is it\u0000more important to scale network size or data volume? Thus, neural network\u0000scaling laws, which characterize how test error varies with network size and\u0000data volume, have become increasingly important. However, existing scaling laws\u0000are often applicable only in limited regimes and often do not incorporate or\u0000predict well-known phenomena such as double descent. Here, we present a novel\u0000theoretical characterization of how three factors -- model size, training time,\u0000and data volume -- interact to determine the performance of deep neural\u0000networks. We first establish a theoretical and empirical equivalence between\u0000scaling the size of a neural network and increasing its training time\u0000proportionally. Scale-time equivalence challenges the current practice, wherein\u0000large models are trained for small durations, and suggests that smaller models\u0000trained over extended periods could match their efficacy. It also leads to a\u0000novel method for predicting the performance of large-scale networks from\u0000small-scale networks trained for extended epochs, and vice versa. We next\u0000combine scale-time equivalence with a linear model analysis of double descent\u0000to obtain a unified theoretical scaling law, which we confirm with experiments\u0000across vision benchmarks and network architectures. These laws explain several\u0000previously unexplained phenomena: reduced data requirements for generalization\u0000in larger models, heightened sensitivity to label noise in overparameterized\u0000models, and instances where increasing model scale does not necessarily enhance\u0000performance. Our findings hold significant implications for the practical\u0000deployment of neural networks, offering a more accessible and efficient path to\u0000training and fine-tuning large models.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142206744","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shuvayan Banerjee, Radhendushka Srivastava, James Saunderson, Ajit Rajwade
Given $p$ samples, each of which may or may not be defective, group testing (GT) aims to determine their defect status by performing tests on $n < p$ `groups', where a group is formed by mixing a subset of the $p$ samples. Assuming that the number of defective samples is very small compared to $p$, GT algorithms have provided excellent recovery of the status of all $p$ samples with even a small number of groups. Most existing methods, however, assume that the group memberships are accurately specified. This assumption may not always hold in practice, due to various resource constraints. Such errors could occur, e.g., when a technician preparing the groups in a laboratory unknowingly mixes together an incorrect subset of samples compared to what was specified. We develop a new GT method, the Debiased Robust Lasso Test Method (DRLT), that handles such group membership specification errors. The proposed DRLT method is based on an approach to debias, or reduce the inherent bias in, estimates produced by Lasso, a popular and effective sparse regression technique. We also provide theoretical upper bounds on the reconstruction error produced by our estimator. Our approach is then combined with two carefully designed hypothesis tests, respectively for (i) the identification of defective samples in the presence of errors in group membership specifications, and (ii) the identification of groups with erroneous membership specifications. The DRLT approach extends the literature on bias mitigation of statistical estimators such as the Lasso to handle the important case when some of the measurements contain outliers, due to factors such as group membership specification errors. We present numerical results which show that our approach outperforms several baselines and robust regression techniques for the identification of defective samples as well as erroneously specified groups.
{"title":"Robust Non-adaptive Group Testing under Errors in Group Membership Specifications","authors":"Shuvayan Banerjee, Radhendushka Srivastava, James Saunderson, Ajit Rajwade","doi":"arxiv-2409.05345","DOIUrl":"https://doi.org/arxiv-2409.05345","url":null,"abstract":"Given $p$ samples, each of which may or may not be defective, group testing\u0000(GT) aims to determine their defect status by performing tests on $n < p$\u0000`groups', where a group is formed by mixing a subset of the $p$ samples.\u0000Assuming that the number of defective samples is very small compared to $p$, GT\u0000algorithms have provided excellent recovery of the status of all $p$ samples\u0000with even a small number of groups. Most existing methods, however, assume that\u0000the group memberships are accurately specified. This assumption may not always\u0000be true in all applications, due to various resource constraints. Such errors\u0000could occur, eg, when a technician, preparing the groups in a laboratory,\u0000unknowingly mixes together an incorrect subset of samples as compared to what\u0000was specified. We develop a new GT method, the Debiased Robust Lasso Test\u0000Method (DRLT), that handles such group membership specification errors. The\u0000proposed DRLT method is based on an approach to debias, or reduce the inherent\u0000bias in, estimates produced by Lasso, a popular and effective sparse regression\u0000technique. We also provide theoretical upper bounds on the reconstruction error\u0000produced by our estimator. Our approach is then combined with two carefully\u0000designed hypothesis tests respectively for (i) the identification of defective\u0000samples in the presence of errors in group membership specifications, and (ii)\u0000the identification of groups with erroneous membership specifications. The DRLT\u0000approach extends the literature on bias mitigation of statistical estimators\u0000such as the LASSO, to handle the important case when some of the measurements\u0000contain outliers, due to factors such as group membership specification errors.\u0000We present numerical results which show that our approach outperforms several\u0000baselines and robust regression techniques for identification of defective\u0000samples as well as erroneously specified groups.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142206700","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This research aims to propose and evaluate a novel model named K-Fold Causal Bayesian Additive Regression Trees (K-Fold Causal BART) for improved estimation of Average Treatment Effects (ATE) and Conditional Average Treatment Effects (CATE). The study employs synthetic and semi-synthetic datasets, including the widely recognized Infant Health and Development Program (IHDP) benchmark dataset, to validate the model's performance. Despite promising results in synthetic scenarios, the IHDP dataset reveals that the proposed model is not state-of-the-art for ATE and CATE estimation. Nonetheless, the research provides several novel insights: (1) the ps-BART model is likely the preferred choice for CATE and ATE estimation due to better generalization compared to the other benchmark models, including the Bayesian Causal Forest (BCF) model, which many consider the current best model for CATE estimation; (2) the BCF model's performance deteriorates significantly with increasing treatment effect heterogeneity, while the ps-BART model remains robust; (3) models tend to be overconfident in CATE uncertainty quantification when treatment effect heterogeneity is low; (4) a second K-Fold method is unnecessary for avoiding overfitting in CATE estimation, as it adds computational cost without improving performance; (5) detailed analysis reveals the importance of understanding dataset characteristics and using nuanced evaluation methods; and (6) the conclusion of Curth et al. (2021) that indirect strategies for CATE estimation are superior for the IHDP dataset is contradicted by the results of this research. These findings challenge existing assumptions and suggest directions for future research to enhance causal inference methodologies.
{"title":"K-Fold Causal BART for CATE Estimation","authors":"Hugo Gobato Souto, Francisco Louzada Neto","doi":"arxiv-2409.05665","DOIUrl":"https://doi.org/arxiv-2409.05665","url":null,"abstract":"This research aims to propose and evaluate a novel model named K-Fold Causal\u0000Bayesian Additive Regression Trees (K-Fold Causal BART) for improved estimation\u0000of Average Treatment Effects (ATE) and Conditional Average Treatment Effects\u0000(CATE). The study employs synthetic and semi-synthetic datasets, including the\u0000widely recognized Infant Health and Development Program (IHDP) benchmark\u0000dataset, to validate the model's performance. Despite promising results in\u0000synthetic scenarios, the IHDP dataset reveals that the proposed model is not\u0000state-of-the-art for ATE and CATE estimation. Nonetheless, the research\u0000provides several novel insights: 1. The ps-BART model is likely the preferred\u0000choice for CATE and ATE estimation due to better generalization compared to the\u0000other benchmark models - including the Bayesian Causal Forest (BCF) model,\u0000which is considered by many the current best model for CATE estimation, 2. The\u0000BCF model's performance deteriorates significantly with increasing treatment\u0000effect heterogeneity, while the ps-BART model remains robust, 3. Models tend to\u0000be overconfident in CATE uncertainty quantification when treatment effect\u0000heterogeneity is low, 4. A second K-Fold method is unnecessary for avoiding\u0000overfitting in CATE estimation, as it adds computational costs without\u0000improving performance, 5. Detailed analysis reveals the importance of\u0000understanding dataset characteristics and using nuanced evaluation methods, 6.\u0000The conclusion of Curth et al. (2021) that indirect strategies for CATE\u0000estimation are superior for the IHDP dataset is contradicted by the results of\u0000this research. These findings challenge existing assumptions and suggest\u0000directions for future research to enhance causal inference methodologies.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"23 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142206693","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gianmarco Genalti, Marco Mussi, Nicola Gatti, Marcello Restelli, Matteo Castiglioni, Alberto Maria Metelli
Rested and Restless Bandits are two well-known bandit settings that are useful for modeling real-world sequential decision-making problems in which the expected reward of an arm evolves over time, either due to the actions we perform or due to the environment itself. In this work, we propose Graph-Triggered Bandits (GTBs), a unifying framework to generalize and extend rested and restless bandits. In this setting, the evolution of the arms' expected rewards is governed by a graph defined over the arms. An edge connecting a pair of arms $(i,j)$ represents the fact that a pull of arm $i$ triggers the evolution of arm $j$, and vice versa. Interestingly, rested and restless bandits are both special cases of our model for suitable (degenerate) graphs. As relevant case studies for this setting, we focus on two specific types of monotonic bandits: rising, where the expected reward of an arm grows as the number of triggers increases, and rotting, where the opposite behavior occurs. For these cases, we study the optimal policies. We provide suitable algorithms for all scenarios and discuss their theoretical guarantees, highlighting how the complexity of the learning problem depends on instance-dependent terms that encode specific properties of the underlying graph structure.
{"title":"Bridging Rested and Restless Bandits with Graph-Triggering: Rising and Rotting","authors":"Gianmarco Genalti, Marco Mussi, Nicola Gatti, Marcello Restelli, Matteo Castiglioni, Alberto Maria Metelli","doi":"arxiv-2409.05980","DOIUrl":"https://doi.org/arxiv-2409.05980","url":null,"abstract":"Rested and Restless Bandits are two well-known bandit settings that are\u0000useful to model real-world sequential decision-making problems in which the\u0000expected reward of an arm evolves over time due to the actions we perform or\u0000due to the nature. In this work, we propose Graph-Triggered Bandits (GTBs), a\u0000unifying framework to generalize and extend rested and restless bandits. In\u0000this setting, the evolution of the arms' expected rewards is governed by a\u0000graph defined over the arms. An edge connecting a pair of arms $(i,j)$\u0000represents the fact that a pull of arm $i$ triggers the evolution of arm $j$,\u0000and vice versa. Interestingly, rested and restless bandits are both special\u0000cases of our model for some suitable (degenerated) graph. As relevant case\u0000studies for this setting, we focus on two specific types of monotonic bandits:\u0000rising, where the expected reward of an arm grows as the number of triggers\u0000increases, and rotting, where the opposite behavior occurs. For these cases, we\u0000study the optimal policies. We provide suitable algorithms for all scenarios\u0000and discuss their theoretical guarantees, highlighting the complexity of the\u0000learning problem concerning instance-dependent terms that encode specific\u0000properties of the underlying graph structure.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142206650","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Probability estimation of tree topologies is one of the fundamental tasks in phylogenetic inference. The recently proposed subsplit Bayesian networks (SBNs) provide a powerful probabilistic graphical model for tree topology probability estimation by properly leveraging the hierarchical structure of phylogenetic trees. However, the expectation maximization (EM) method currently used for learning SBN parameters does not scale to large data sets. In this paper, we introduce several computationally efficient methods for training SBNs and show that variance reduction could be the key to better performance. Furthermore, we introduce the variance reduction technique to improve the optimization of SBN parameters for variational Bayesian phylogenetic inference (VBPI). Extensive synthetic and real data experiments demonstrate that our methods outperform previous baseline methods on the tasks of tree topology probability estimation as well as Bayesian phylogenetic inference using SBNs.
{"title":"Improving Tree Probability Estimation with Stochastic Optimization and Variance Reduction","authors":"Tianyu Xie, Musu Yuan, Minghua Deng, Cheng Zhang","doi":"arxiv-2409.05282","DOIUrl":"https://doi.org/arxiv-2409.05282","url":null,"abstract":"Probability estimation of tree topologies is one of the fundamental tasks in\u0000phylogenetic inference. The recently proposed subsplit Bayesian networks (SBNs)\u0000provide a powerful probabilistic graphical model for tree topology probability\u0000estimation by properly leveraging the hierarchical structure of phylogenetic\u0000trees. However, the expectation maximization (EM) method currently used for\u0000learning SBN parameters does not scale up to large data sets. In this paper, we\u0000introduce several computationally efficient methods for training SBNs and show\u0000that variance reduction could be the key for better performance. Furthermore,\u0000we also introduce the variance reduction technique to improve the optimization\u0000of SBN parameters for variational Bayesian phylogenetic inference (VBPI).\u0000Extensive synthetic and real data experiments demonstrate that our methods\u0000outperform previous baseline methods on the tasks of tree topology probability\u0000estimation as well as Bayesian phylogenetic inference using SBNs.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"58 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142206748","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Akhilan Boopathy, Sunshine Jiang, William Yue, Jaedong Hwang, Abhiram Iyer, Ila Fiete
Modular neural networks outperform nonmodular neural networks on tasks ranging from visual question answering to robotics. These performance improvements are thought to be due to modular networks' superior ability to model the compositional and combinatorial structure of real-world problems. However, a theoretical explanation of how modularity improves generalizability, and of how to leverage task modularity while training networks, remains elusive. Using recent theoretical progress in explaining neural network generalization, we investigate how the amount of training data required to generalize on a task varies with the intrinsic dimensionality of a task's input. We show theoretically that when applied to modularly structured tasks, nonmodular networks require a number of samples that grows exponentially with task dimensionality, whereas modular networks' sample complexity is independent of task dimensionality: modular networks can generalize in high dimensions. We then develop a novel learning rule for modular networks to exploit this advantage and empirically show the improved generalization of the rule, both in- and out-of-distribution, on high-dimensional, modular tasks.
{"title":"Breaking Neural Network Scaling Laws with Modularity","authors":"Akhilan Boopathy, Sunshine Jiang, William Yue, Jaedong Hwang, Abhiram Iyer, Ila Fiete","doi":"arxiv-2409.05780","DOIUrl":"https://doi.org/arxiv-2409.05780","url":null,"abstract":"Modular neural networks outperform nonmodular neural networks on tasks\u0000ranging from visual question answering to robotics. These performance\u0000improvements are thought to be due to modular networks' superior ability to\u0000model the compositional and combinatorial structure of real-world problems.\u0000However, a theoretical explanation of how modularity improves generalizability,\u0000and how to leverage task modularity while training networks remains elusive.\u0000Using recent theoretical progress in explaining neural network generalization,\u0000we investigate how the amount of training data required to generalize on a task\u0000varies with the intrinsic dimensionality of a task's input. We show\u0000theoretically that when applied to modularly structured tasks, while nonmodular\u0000networks require an exponential number of samples with task dimensionality,\u0000modular networks' sample complexity is independent of task dimensionality:\u0000modular networks can generalize in high dimensions. We then develop a novel\u0000learning rule for modular networks to exploit this advantage and empirically\u0000show the improved generalization of the rule, both in- and out-of-distribution,\u0000on high-dimensional, modular tasks.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142206745","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}