Meta-learning is characterized by its ability to learn how to learn, enabling the adaptation of learning strategies across different tasks. Recent research introduced Meta-Thompson Sampling (Meta-TS), which meta-learns an unknown prior distribution sampled from a meta-prior by interacting with bandit instances drawn from it. However, its analysis was limited to the Gaussian bandit. The linear contextual bandit framework extends the Gaussian bandit by challenging the agent to use context vectors to predict the most valuable arms, optimally balancing exploration and exploitation to minimize regret over time. This paper introduces the Meta-TSLB algorithm, a modified Meta-TS for linear contextual bandits. We theoretically analyze Meta-TSLB and derive an $O\left(\left(m+\log\left(m\right)\right)\sqrt{n\log\left(n\right)}\right)$ bound on its Bayes regret, where $m$ is the number of bandit instances and $n$ the number of rounds of Thompson Sampling. Additionally, our work complements the analysis of Meta-TS for linear contextual bandits. The performance of Meta-TSLB is evaluated experimentally under different settings, and we experimentally analyze the generalization capability of Meta-TSLB, showcasing its potential to adapt to unseen instances.
{"title":"Modified Meta-Thompson Sampling for Linear Bandits and Its Bayes Regret Analysis","authors":"Hao Li, Dong Liang, Zheng Xie","doi":"arxiv-2409.06329","DOIUrl":"https://doi.org/arxiv-2409.06329","url":null,"abstract":"Meta-learning is characterized by its ability to learn how to learn, enabling\u0000the adaptation of learning strategies across different tasks. Recent research\u0000introduced the Meta-Thompson Sampling (Meta-TS), which meta-learns an unknown\u0000prior distribution sampled from a meta-prior by interacting with bandit\u0000instances drawn from it. However, its analysis was limited to Gaussian bandit.\u0000The contextual multi-armed bandit framework is an extension of the Gaussian\u0000Bandit, which challenges agent to utilize context vectors to predict the most\u0000valuable arms, optimally balancing exploration and exploitation to minimize\u0000regret over time. This paper introduces Meta-TSLB algorithm, a modified Meta-TS\u0000for linear contextual bandits. We theoretically analyze Meta-TSLB and derive an\u0000$ Oleft( left( m+log left( m right) right) sqrt{nlog left( n right)}\u0000right)$ bound on its Bayes regret, in which $m$ represents the number of\u0000bandit instances, and $n$ the number of rounds of Thompson Sampling.\u0000Additionally, our work complements the analysis of Meta-TS for linear\u0000contextual bandits. The performance of Meta-TSLB is evaluated experimentally\u0000under different settings, and we experimente and analyze the generalization\u0000capability of Meta-TSLB, showcasing its potential to adapt to unseen instances.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"16 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142206648","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
As Large Language Models become more ubiquitous across domains, it becomes important to examine their inherent limitations critically. This work argues that hallucinations in language models are not just occasional errors but an inevitable feature of these systems. We demonstrate that hallucinations stem from the fundamental mathematical and logical structure of LLMs. It is, therefore, impossible to eliminate them through architectural improvements, dataset enhancements, or fact-checking mechanisms. Our analysis draws on computational theory and Gödel's First Incompleteness Theorem, which references the undecidability of problems like the Halting, Emptiness, and Acceptance Problems. We demonstrate that every stage of the LLM process, from training data compilation to fact retrieval, intent classification, and text generation, will have a non-zero probability of producing hallucinations. This work introduces the concept of Structural Hallucination as an intrinsic property of these systems. By establishing the mathematical certainty of hallucinations, we challenge the prevailing notion that they can be fully mitigated.
{"title":"LLMs Will Always Hallucinate, and We Need to Live With This","authors":"Sourav Banerjee, Ayushi Agarwal, Saloni Singla","doi":"arxiv-2409.05746","DOIUrl":"https://doi.org/arxiv-2409.05746","url":null,"abstract":"As Large Language Models become more ubiquitous across domains, it becomes\u0000important to examine their inherent limitations critically. This work argues\u0000that hallucinations in language models are not just occasional errors but an\u0000inevitable feature of these systems. We demonstrate that hallucinations stem\u0000from the fundamental mathematical and logical structure of LLMs. It is,\u0000therefore, impossible to eliminate them through architectural improvements,\u0000dataset enhancements, or fact-checking mechanisms. Our analysis draws on\u0000computational theory and Godel's First Incompleteness Theorem, which references\u0000the undecidability of problems like the Halting, Emptiness, and Acceptance\u0000Problems. We demonstrate that every stage of the LLM process-from training data\u0000compilation to fact retrieval, intent classification, and text generation-will\u0000have a non-zero probability of producing hallucinations. This work introduces\u0000the concept of Structural Hallucination as an intrinsic nature of these\u0000systems. By establishing the mathematical certainty of hallucinations, we\u0000challenge the prevailing notion that they can be fully mitigated.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"32 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142206692","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multitask learning is a widely used paradigm for training models on diverse tasks, with applications ranging from graph neural networks to language model fine-tuning. Since tasks may interfere with each other, a key notion for modeling their relationships is task affinity. This includes pairwise task affinity, computed among pairs of tasks, and higher-order affinity, computed among subsets of tasks. Naively computing either of them requires repeatedly training on data from various task combinations, which is computationally intensive. We present a new algorithm, Grad-TAG, that can estimate task affinities without this repeated training. The key idea of Grad-TAG is to train a "base" model for all tasks and then use a linearization technique to estimate the loss of the model for a specific task combination. The linearization works by computing a gradient-based approximation of the loss, using low-dimensional projections of gradients as features in a logistic regression to predict labels for the task combination. We show that the linearized model can provably approximate the loss when the gradient-based approximation is accurate, and we empirically verify this on several large models. Then, given the estimated task affinity, we design a semi-definite program for clustering similar tasks by maximizing the average density of clusters. We evaluate Grad-TAG's performance across seven datasets, including multi-label classification on graphs and instruction fine-tuning of language models. Our task affinity estimates are within 2.7% of the true affinities while requiring only 3% of the FLOPs of full training. On our largest graph, with 21M edges and 500 labeling tasks, our algorithm delivers estimates within 5% of the true affinities using only 112 GPU hours. Our results show that Grad-TAG achieves excellent performance and runtime tradeoffs compared to existing approaches.
{"title":"Scalable Multitask Learning Using Gradient-based Estimation of Task Affinity","authors":"Dongyue Li, Aneesh Sharma, Hongyang R. Zhang","doi":"arxiv-2409.06091","DOIUrl":"https://doi.org/arxiv-2409.06091","url":null,"abstract":"Multitask learning is a widely used paradigm for training models on diverse\u0000tasks, with applications ranging from graph neural networks to language model\u0000fine-tuning. Since tasks may interfere with each other, a key notion for\u0000modeling their relationships is task affinity. This includes pairwise task\u0000affinity, computed among pairs of tasks, and higher-order affinity, computed\u0000among subsets of tasks. Naively computing either of them requires repeatedly\u0000training on data from various task combinations, which is computationally\u0000intensive. We present a new algorithm Grad-TAG that can estimate task\u0000affinities without this repeated training. The key idea of Grad-TAG is to train a \"base\" model for all tasks and then\u0000use a linearization technique to estimate the loss of the model for a specific\u0000task combination. The linearization works by computing a gradient-based\u0000approximation of the loss, using low-dimensional projections of gradients as\u0000features in a logistic regression to predict labels for the task combination.\u0000We show that the linearized model can provably approximate the loss when the\u0000gradient-based approximation is accurate, and also empirically verify that on\u0000several large models. Then, given the estimated task affinity, we design a\u0000semi-definite program for clustering similar tasks by maximizing the average\u0000density of clusters. We evaluate Grad-TAG's performance across seven datasets, including\u0000multi-label classification on graphs, and instruction fine-tuning of language\u0000models. Our task affinity estimates are within 2.7% distance to the true\u0000affinities while needing only 3% of FLOPs in full training. On our largest\u0000graph with 21M edges and 500 labeling tasks, our algorithm delivers estimates\u0000within 5% distance to the true affinities, using only 112 GPU hours. Our\u0000results show that Grad-TAG achieves excellent performance and runtime tradeoffs\u0000compared to existing approaches.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142206743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tristan Thrush, Christopher Potts, Tatsunori Hashimoto
Quality pretraining data is often seen as the key to high-performance language models. However, progress in understanding pretraining data has been slow due to the costly pretraining runs required for data selection experiments. We present a framework that avoids these costs and selects high-quality pretraining data without any LLM training of our own. Our work is based on a simple observation: LLM losses on many pretraining texts are correlated with downstream benchmark performance, and selecting high-correlation documents is an effective pretraining data selection method. We build a new statistical framework for data selection centered around estimates of perplexity-benchmark correlations and perform data selection using a sample of 90 LLMs taken from the Open LLM Leaderboard on texts from tens of thousands of web domains. In controlled pretraining experiments at the 160M parameter scale on 8 benchmarks, our approach outperforms DSIR on every benchmark, while matching the best data selector found in DataComp-LM, a hand-engineered bigram classifier.
{"title":"Improving Pretraining Data Using Perplexity Correlations","authors":"Tristan Thrush, Christopher Potts, Tatsunori Hashimoto","doi":"arxiv-2409.05816","DOIUrl":"https://doi.org/arxiv-2409.05816","url":null,"abstract":"Quality pretraining data is often seen as the key to high-performance\u0000language models. However, progress in understanding pretraining data has been\u0000slow due to the costly pretraining runs required for data selection\u0000experiments. We present a framework that avoids these costs and selects\u0000high-quality pretraining data without any LLM training of our own. Our work is\u0000based on a simple observation: LLM losses on many pretraining texts are\u0000correlated with downstream benchmark performance, and selecting\u0000high-correlation documents is an effective pretraining data selection method.\u0000We build a new statistical framework for data selection centered around\u0000estimates of perplexity-benchmark correlations and perform data selection using\u0000a sample of 90 LLMs taken from the Open LLM Leaderboard on texts from tens of\u0000thousands of web domains. In controlled pretraining experiments at the 160M\u0000parameter scale on 8 benchmarks, our approach outperforms DSIR on every\u0000benchmark, while matching the best data selector found in DataComp-LM, a\u0000hand-engineered bigram classifier.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"9 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142206742","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
As neural networks continue to grow in size while datasets may not keep pace, it is vital to understand how much performance improvement can be expected: is it more important to scale network size or data volume? Thus, neural network scaling laws, which characterize how test error varies with network size and data volume, have become increasingly important. However, existing scaling laws are often applicable only in limited regimes and often do not incorporate or predict well-known phenomena such as double descent. Here, we present a novel theoretical characterization of how three factors -- model size, training time, and data volume -- interact to determine the performance of deep neural networks. We first establish a theoretical and empirical equivalence between scaling the size of a neural network and increasing its training time proportionally. Scale-time equivalence challenges the current practice, wherein large models are trained for small durations, and suggests that smaller models trained over extended periods could match their efficacy. It also leads to a novel method for predicting the performance of large-scale networks from small-scale networks trained for extended epochs, and vice versa. We next combine scale-time equivalence with a linear model analysis of double descent to obtain a unified theoretical scaling law, which we confirm with experiments across vision benchmarks and network architectures. These laws explain several previously unexplained phenomena: reduced data requirements for generalization in larger models, heightened sensitivity to label noise in overparameterized models, and instances where increasing model scale does not necessarily enhance performance. Our findings hold significant implications for the practical deployment of neural networks, offering a more accessible and efficient path to training and fine-tuning large models.
{"title":"Unified Neural Network Scaling Laws and Scale-time Equivalence","authors":"Akhilan Boopathy, Ila Fiete","doi":"arxiv-2409.05782","DOIUrl":"https://doi.org/arxiv-2409.05782","url":null,"abstract":"As neural networks continue to grow in size but datasets might not, it is\u0000vital to understand how much performance improvement can be expected: is it\u0000more important to scale network size or data volume? Thus, neural network\u0000scaling laws, which characterize how test error varies with network size and\u0000data volume, have become increasingly important. However, existing scaling laws\u0000are often applicable only in limited regimes and often do not incorporate or\u0000predict well-known phenomena such as double descent. Here, we present a novel\u0000theoretical characterization of how three factors -- model size, training time,\u0000and data volume -- interact to determine the performance of deep neural\u0000networks. We first establish a theoretical and empirical equivalence between\u0000scaling the size of a neural network and increasing its training time\u0000proportionally. Scale-time equivalence challenges the current practice, wherein\u0000large models are trained for small durations, and suggests that smaller models\u0000trained over extended periods could match their efficacy. It also leads to a\u0000novel method for predicting the performance of large-scale networks from\u0000small-scale networks trained for extended epochs, and vice versa. We next\u0000combine scale-time equivalence with a linear model analysis of double descent\u0000to obtain a unified theoretical scaling law, which we confirm with experiments\u0000across vision benchmarks and network architectures. These laws explain several\u0000previously unexplained phenomena: reduced data requirements for generalization\u0000in larger models, heightened sensitivity to label noise in overparameterized\u0000models, and instances where increasing model scale does not necessarily enhance\u0000performance. Our findings hold significant implications for the practical\u0000deployment of neural networks, offering a more accessible and efficient path to\u0000training and fine-tuning large models.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142206744","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shuvayan Banerjee, Radhendushka Srivastava, James Saunderson, Ajit Rajwade
Given $p$ samples, each of which may or may not be defective, group testing (GT) aims to determine their defect status by performing tests on $n < p$ `groups', where a group is formed by mixing a subset of the $p$ samples. Assuming that the number of defective samples is very small compared to $p$, GT algorithms have provided excellent recovery of the status of all $p$ samples with even a small number of groups. Most existing methods, however, assume that the group memberships are accurately specified. This assumption may not always hold in practice, due to various resource constraints. Such errors could occur, e.g., when a technician preparing the groups in a laboratory unknowingly mixes together an incorrect subset of samples compared to what was specified. We develop a new GT method, the Debiased Robust Lasso Test Method (DRLT), that handles such group membership specification errors. The proposed DRLT method is based on an approach to debias, or reduce the inherent bias in, estimates produced by Lasso, a popular and effective sparse regression technique. We also provide theoretical upper bounds on the reconstruction error produced by our estimator. Our approach is then combined with two carefully designed hypothesis tests, respectively for (i) the identification of defective samples in the presence of errors in group membership specifications, and (ii) the identification of groups with erroneous membership specifications. The DRLT approach extends the literature on bias mitigation of statistical estimators such as the Lasso to handle the important case when some of the measurements contain outliers, due to factors such as group membership specification errors. We present numerical results which show that our approach outperforms several baselines and robust regression techniques for the identification of defective samples as well as erroneously specified groups.
{"title":"Robust Non-adaptive Group Testing under Errors in Group Membership Specifications","authors":"Shuvayan Banerjee, Radhendushka Srivastava, James Saunderson, Ajit Rajwade","doi":"arxiv-2409.05345","DOIUrl":"https://doi.org/arxiv-2409.05345","url":null,"abstract":"Given $p$ samples, each of which may or may not be defective, group testing\u0000(GT) aims to determine their defect status by performing tests on $n < p$\u0000`groups', where a group is formed by mixing a subset of the $p$ samples.\u0000Assuming that the number of defective samples is very small compared to $p$, GT\u0000algorithms have provided excellent recovery of the status of all $p$ samples\u0000with even a small number of groups. Most existing methods, however, assume that\u0000the group memberships are accurately specified. This assumption may not always\u0000be true in all applications, due to various resource constraints. Such errors\u0000could occur, eg, when a technician, preparing the groups in a laboratory,\u0000unknowingly mixes together an incorrect subset of samples as compared to what\u0000was specified. We develop a new GT method, the Debiased Robust Lasso Test\u0000Method (DRLT), that handles such group membership specification errors. The\u0000proposed DRLT method is based on an approach to debias, or reduce the inherent\u0000bias in, estimates produced by Lasso, a popular and effective sparse regression\u0000technique. We also provide theoretical upper bounds on the reconstruction error\u0000produced by our estimator. Our approach is then combined with two carefully\u0000designed hypothesis tests respectively for (i) the identification of defective\u0000samples in the presence of errors in group membership specifications, and (ii)\u0000the identification of groups with erroneous membership specifications. The DRLT\u0000approach extends the literature on bias mitigation of statistical estimators\u0000such as the LASSO, to handle the important case when some of the measurements\u0000contain outliers, due to factors such as group membership specification errors.\u0000We present numerical results which show that our approach outperforms several\u0000baselines and robust regression techniques for identification of defective\u0000samples as well as erroneously specified groups.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142206700","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This research aims to propose and evaluate a novel model named K-Fold Causal Bayesian Additive Regression Trees (K-Fold Causal BART) for improved estimation of Average Treatment Effects (ATE) and Conditional Average Treatment Effects (CATE). The study employs synthetic and semi-synthetic datasets, including the widely recognized Infant Health and Development Program (IHDP) benchmark dataset, to validate the model's performance. Despite promising results in synthetic scenarios, the IHDP dataset reveals that the proposed model is not state-of-the-art for ATE and CATE estimation. Nonetheless, the research provides several novel insights: (1) the ps-BART model is likely the preferred choice for CATE and ATE estimation due to better generalization compared to the other benchmark models, including the Bayesian Causal Forest (BCF) model, which many consider the current best model for CATE estimation; (2) the BCF model's performance deteriorates significantly with increasing treatment effect heterogeneity, while the ps-BART model remains robust; (3) models tend to be overconfident in CATE uncertainty quantification when treatment effect heterogeneity is low; (4) a second K-Fold method is unnecessary for avoiding overfitting in CATE estimation, as it adds computational cost without improving performance; (5) detailed analysis reveals the importance of understanding dataset characteristics and using nuanced evaluation methods; and (6) the conclusion of Curth et al. (2021) that indirect strategies for CATE estimation are superior for the IHDP dataset is contradicted by the results of this research. These findings challenge existing assumptions and suggest directions for future research to enhance causal inference methodologies.
{"title":"K-Fold Causal BART for CATE Estimation","authors":"Hugo Gobato Souto, Francisco Louzada Neto","doi":"arxiv-2409.05665","DOIUrl":"https://doi.org/arxiv-2409.05665","url":null,"abstract":"This research aims to propose and evaluate a novel model named K-Fold Causal\u0000Bayesian Additive Regression Trees (K-Fold Causal BART) for improved estimation\u0000of Average Treatment Effects (ATE) and Conditional Average Treatment Effects\u0000(CATE). The study employs synthetic and semi-synthetic datasets, including the\u0000widely recognized Infant Health and Development Program (IHDP) benchmark\u0000dataset, to validate the model's performance. Despite promising results in\u0000synthetic scenarios, the IHDP dataset reveals that the proposed model is not\u0000state-of-the-art for ATE and CATE estimation. Nonetheless, the research\u0000provides several novel insights: 1. The ps-BART model is likely the preferred\u0000choice for CATE and ATE estimation due to better generalization compared to the\u0000other benchmark models - including the Bayesian Causal Forest (BCF) model,\u0000which is considered by many the current best model for CATE estimation, 2. The\u0000BCF model's performance deteriorates significantly with increasing treatment\u0000effect heterogeneity, while the ps-BART model remains robust, 3. Models tend to\u0000be overconfident in CATE uncertainty quantification when treatment effect\u0000heterogeneity is low, 4. A second K-Fold method is unnecessary for avoiding\u0000overfitting in CATE estimation, as it adds computational costs without\u0000improving performance, 5. Detailed analysis reveals the importance of\u0000understanding dataset characteristics and using nuanced evaluation methods, 6.\u0000The conclusion of Curth et al. (2021) that indirect strategies for CATE\u0000estimation are superior for the IHDP dataset is contradicted by the results of\u0000this research. These findings challenge existing assumptions and suggest\u0000directions for future research to enhance causal inference methodologies.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"23 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142206693","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gianmarco Genalti, Marco Mussi, Nicola Gatti, Marcello Restelli, Matteo Castiglioni, Alberto Maria Metelli
Rested and Restless Bandits are two well-known bandit settings that are useful for modeling real-world sequential decision-making problems in which the expected reward of an arm evolves over time, either due to the actions we perform or due to the environment itself. In this work, we propose Graph-Triggered Bandits (GTBs), a unifying framework to generalize and extend rested and restless bandits. In this setting, the evolution of the arms' expected rewards is governed by a graph defined over the arms. An edge connecting a pair of arms $(i,j)$ represents the fact that a pull of arm $i$ triggers the evolution of arm $j$, and vice versa. Interestingly, rested and restless bandits are both special cases of our model for suitable (degenerate) graphs. As relevant case studies for this setting, we focus on two specific types of monotonic bandits: rising, where the expected reward of an arm grows as the number of triggers increases, and rotting, where the opposite behavior occurs. For these cases, we study the optimal policies. We provide suitable algorithms for all scenarios and discuss their theoretical guarantees, highlighting how the complexity of the learning problem depends on instance-dependent terms that encode specific properties of the underlying graph structure.
{"title":"Bridging Rested and Restless Bandits with Graph-Triggering: Rising and Rotting","authors":"Gianmarco Genalti, Marco Mussi, Nicola Gatti, Marcello Restelli, Matteo Castiglioni, Alberto Maria Metelli","doi":"arxiv-2409.05980","DOIUrl":"https://doi.org/arxiv-2409.05980","url":null,"abstract":"Rested and Restless Bandits are two well-known bandit settings that are\u0000useful to model real-world sequential decision-making problems in which the\u0000expected reward of an arm evolves over time due to the actions we perform or\u0000due to the nature. In this work, we propose Graph-Triggered Bandits (GTBs), a\u0000unifying framework to generalize and extend rested and restless bandits. In\u0000this setting, the evolution of the arms' expected rewards is governed by a\u0000graph defined over the arms. An edge connecting a pair of arms $(i,j)$\u0000represents the fact that a pull of arm $i$ triggers the evolution of arm $j$,\u0000and vice versa. Interestingly, rested and restless bandits are both special\u0000cases of our model for some suitable (degenerated) graph. As relevant case\u0000studies for this setting, we focus on two specific types of monotonic bandits:\u0000rising, where the expected reward of an arm grows as the number of triggers\u0000increases, and rotting, where the opposite behavior occurs. For these cases, we\u0000study the optimal policies. We provide suitable algorithms for all scenarios\u0000and discuss their theoretical guarantees, highlighting the complexity of the\u0000learning problem concerning instance-dependent terms that encode specific\u0000properties of the underlying graph structure.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142206650","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Probability estimation of tree topologies is one of the fundamental tasks in phylogenetic inference. The recently proposed subsplit Bayesian networks (SBNs) provide a powerful probabilistic graphical model for tree topology probability estimation by properly leveraging the hierarchical structure of phylogenetic trees. However, the expectation maximization (EM) method currently used for learning SBN parameters does not scale to large data sets. In this paper, we introduce several computationally efficient methods for training SBNs and show that variance reduction could be the key to better performance. Furthermore, we introduce the variance reduction technique to improve the optimization of SBN parameters for variational Bayesian phylogenetic inference (VBPI). Extensive synthetic and real data experiments demonstrate that our methods outperform previous baseline methods on the tasks of tree topology probability estimation as well as Bayesian phylogenetic inference using SBNs.
{"title":"Improving Tree Probability Estimation with Stochastic Optimization and Variance Reduction","authors":"Tianyu Xie, Musu Yuan, Minghua Deng, Cheng Zhang","doi":"arxiv-2409.05282","DOIUrl":"https://doi.org/arxiv-2409.05282","url":null,"abstract":"Probability estimation of tree topologies is one of the fundamental tasks in\u0000phylogenetic inference. The recently proposed subsplit Bayesian networks (SBNs)\u0000provide a powerful probabilistic graphical model for tree topology probability\u0000estimation by properly leveraging the hierarchical structure of phylogenetic\u0000trees. However, the expectation maximization (EM) method currently used for\u0000learning SBN parameters does not scale up to large data sets. In this paper, we\u0000introduce several computationally efficient methods for training SBNs and show\u0000that variance reduction could be the key for better performance. Furthermore,\u0000we also introduce the variance reduction technique to improve the optimization\u0000of SBN parameters for variational Bayesian phylogenetic inference (VBPI).\u0000Extensive synthetic and real data experiments demonstrate that our methods\u0000outperform previous baseline methods on the tasks of tree topology probability\u0000estimation as well as Bayesian phylogenetic inference using SBNs.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"58 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142206748","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Akhilan Boopathy, Sunshine Jiang, William Yue, Jaedong Hwang, Abhiram Iyer, Ila Fiete
Modular neural networks outperform nonmodular neural networks on tasks ranging from visual question answering to robotics. These performance improvements are thought to be due to modular networks' superior ability to model the compositional and combinatorial structure of real-world problems. However, a theoretical explanation of how modularity improves generalizability, and of how to leverage task modularity while training networks, remains elusive. Using recent theoretical progress in explaining neural network generalization, we investigate how the amount of training data required to generalize on a task varies with the intrinsic dimensionality of a task's input. We show theoretically that when applied to modularly structured tasks, nonmodular networks require a number of samples that grows exponentially with task dimensionality, whereas modular networks' sample complexity is independent of task dimensionality: modular networks can generalize in high dimensions. We then develop a novel learning rule for modular networks to exploit this advantage and empirically show the improved generalization of the rule, both in- and out-of-distribution, on high-dimensional, modular tasks.
{"title":"Breaking Neural Network Scaling Laws with Modularity","authors":"Akhilan Boopathy, Sunshine Jiang, William Yue, Jaedong Hwang, Abhiram Iyer, Ila Fiete","doi":"arxiv-2409.05780","DOIUrl":"https://doi.org/arxiv-2409.05780","url":null,"abstract":"Modular neural networks outperform nonmodular neural networks on tasks\u0000ranging from visual question answering to robotics. These performance\u0000improvements are thought to be due to modular networks' superior ability to\u0000model the compositional and combinatorial structure of real-world problems.\u0000However, a theoretical explanation of how modularity improves generalizability,\u0000and how to leverage task modularity while training networks remains elusive.\u0000Using recent theoretical progress in explaining neural network generalization,\u0000we investigate how the amount of training data required to generalize on a task\u0000varies with the intrinsic dimensionality of a task's input. We show\u0000theoretically that when applied to modularly structured tasks, while nonmodular\u0000networks require an exponential number of samples with task dimensionality,\u0000modular networks' sample complexity is independent of task dimensionality:\u0000modular networks can generalize in high dimensions. We then develop a novel\u0000learning rule for modular networks to exploit this advantage and empirically\u0000show the improved generalization of the rule, both in- and out-of-distribution,\u0000on high-dimensional, modular tasks.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142206745","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}