
arXiv - STAT - Methodology: Latest Publications

Causal Analysis of Shapley Values: Conditional vs. Marginal
Pub Date: 2024-09-10 | DOI: arxiv-2409.06157
Ilya Rozenfeld
Shapley values, a game-theoretic concept, have been one of the most popular tools for explaining Machine Learning (ML) models in recent years. Unfortunately, the two most common approaches to calculating Shapley values, conditional and marginal, can lead to different results along with some undesirable side effects when features are correlated. This in turn has led to the situation in the literature where contradictory recommendations regarding the choice of an approach are provided by different authors. In this paper we aim to resolve this controversy through the use of causal arguments. We show that the differences arise from the implicit assumptions that are made within each method to deal with missing causal information. We also demonstrate that the conditional approach is fundamentally unsound from a causal perspective. This, together with previous work in [1], leads to the conclusion that the marginal approach should be preferred over the conditional one.
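As an illustration of the distinction discussed in this abstract, the following minimal sketch contrasts marginal (interventional) and conditional value functions for a two-feature linear model with correlated Gaussian inputs; the model, correlation level, and instance are illustrative assumptions made here, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two correlated standard-normal features and a linear model f(x) = x1 (feature 2 is inert).
rho = 0.9
X = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=100_000)
beta = np.array([1.0, 0.0])

def f(x):
    return x @ beta

x_star = np.array([1.0, 1.0])             # instance to explain

def v_marginal(S, x):
    """Marginal (interventional) value: fix x_S, draw the remaining features from their marginal."""
    Z = X.copy()
    Z[:, S] = x[S]
    return f(Z).mean()

def v_conditional(S, x):
    """Conditional value E[f(X) | X_S = x_S], using exact bivariate-Gaussian conditioning."""
    if len(S) == 0:
        return f(X).mean()
    if len(S) == 2:
        return f(x)
    i, j = S[0], 1 - S[0]                 # E[X_j | X_i = x_i] = rho * x_i
    z = x.copy()
    z[j] = rho * x[i]
    return f(z)

def shapley_two_features(v, x):
    """Exact Shapley values for two features (average over the two orderings)."""
    phi1 = 0.5 * ((v([0], x) - v([], x)) + (v([0, 1], x) - v([1], x)))
    phi2 = 0.5 * ((v([1], x) - v([], x)) + (v([0, 1], x) - v([0], x)))
    return phi1, phi2

print("marginal Shapley values:   ", shapley_two_features(v_marginal, x_star))
print("conditional Shapley values:", shapley_two_features(v_conditional, x_star))
```

The marginal version attributes the whole prediction to the feature the model actually uses, while the conditional version spreads credit onto the correlated but inert feature, which is the kind of disagreement the abstract refers to.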
Citations: 0
Efficient nonparametric estimators of discrimination measures with censored survival data
Pub Date: 2024-09-09 | DOI: arxiv-2409.05632
Marie S. Breum, Torben Martinussen
Discrimination measures such as concordance statistics (e.g. the c-index or the concordance probability) and the cumulative-dynamic time-dependent area under the ROC curve (AUC) are widely used in the medical literature for evaluating the predictive accuracy of a scoring rule which relates a set of prognostic markers to the risk of experiencing a particular event. Often the scoring rule being evaluated in terms of discriminatory ability is the linear predictor of a survival regression model such as the Cox proportional hazards model. This has the undesirable feature that the scoring rule depends on the censoring distribution when the model is misspecified. In this work we focus on linear scoring rules where the coefficient vector is a nonparametric estimand defined in the setting where there is no censoring. We propose so-called debiased estimators of the aforementioned discrimination measures for this class of scoring rules. The proposed estimators make efficient use of the data and minimize bias by allowing for the use of data-adaptive methods for model fitting. Moreover, the estimators do not rely on correct specification of the censoring model to produce consistent estimation. We compare the estimators to existing methods in a simulation study, and we illustrate the method by an application to a brain cancer study.
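For context, here is a minimal sketch of the classical concordance statistic the abstract takes as its point of departure, Harrell's c-index for right-censored data; it is the censoring-dependent baseline, not one of the debiased estimators proposed in the paper, and the toy data are made up.

```python
import numpy as np

def harrell_c_index(time, event, risk):
    """Harrell's concordance index for right-censored data.
    A pair (i, j) is usable when the subject with the shorter observed time had an
    event; it is concordant when that subject also has the higher risk score.
    Pairs with identical observed times are ignored for simplicity."""
    time, event, risk = map(np.asarray, (time, event, risk))
    concordant, usable = 0.0, 0
    n = len(time)
    for i in range(n):
        for j in range(n):
            if time[i] < time[j] and event[i] == 1:   # i fails first and is uncensored
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5                 # tied risk scores count as half
    return concordant / usable

# Toy example: risk scores from a hypothetical Cox-type linear predictor.
time  = [5, 8, 3, 12, 7, 9]
event = [1, 0, 1, 1, 0, 1]
risk  = [2.1, 0.3, 2.5, 0.1, 1.0, 0.8]
print(harrell_c_index(time, event, risk))
```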
Citations: 0
An Eigengap Ratio Test for Determining the Number of Communities in Network Data
Pub Date: 2024-09-09 | DOI: arxiv-2409.05276
Yujia Wu, Jingfei Zhang, Wei Lan, Chih-Ling Tsai
To characterize the community structure in network data, researchers have introduced various block-type models, including the stochastic block model, degree-corrected stochastic block model, mixed membership block model, degree-corrected mixed membership block model, and others. A critical step in applying these models effectively is determining the number of communities in the network. However, to our knowledge, existing methods for estimating the number of network communities often require model estimation or are unable to simultaneously account for network sparsity and a divergent number of communities. In this paper, we propose an eigengap-ratio based test that addresses these challenges. The test is straightforward to compute, requires no parameter tuning, and can be applied to a wide range of block models without the need to estimate network distribution parameters. Furthermore, it is effective for both dense and sparse networks with a divergent number of communities. We show that the proposed test statistic converges to a function of the type-I Tracy-Widom distributions under the null hypothesis, and that the test is asymptotically powerful under alternatives. Simulation studies on both dense and sparse networks demonstrate the efficacy of the proposed method. Three real-world examples are presented to illustrate the usefulness of the proposed test.
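A simplified illustration of the eigengap idea (not the paper's calibrated test statistic, which relies on Tracy-Widom asymptotics): simulate a stochastic block model and pick the number of communities where the ratio of successive eigengaps of the adjacency spectrum peaks. The block probabilities and sizes below are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stochastic block model with K = 3 communities of 150 nodes each.
K_true, n_per = 3, 150
labels = np.repeat(np.arange(K_true), n_per)
P = np.full((K_true, K_true), 0.05)
np.fill_diagonal(P, [0.40, 0.30, 0.20])           # within-block > between-block probabilities
probs = P[labels][:, labels]
n = K_true * n_per
upper = np.triu(rng.random((n, n)) < probs, k=1)
A = (upper + upper.T).astype(float)               # symmetric adjacency, no self-loops

# Eigengap-ratio heuristic on the leading eigenvalues of A.
eig = np.sort(np.linalg.eigvalsh(A))[::-1]        # eigenvalues in decreasing order
gaps = eig[:-1] - eig[1:]
k_max = 10
ratios = gaps[:k_max] / gaps[1:k_max + 1]         # gap_k / gap_{k+1}
k_hat = int(np.argmax(ratios)) + 1
print("estimated number of communities:", k_hat)  # should recover 3 at this signal strength
```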
Citations: 0
Priors from Envisioned Posterior Judgments: A Novel Elicitation Approach With Application to Bayesian Clinical Trials
Pub Date: 2024-09-09 | DOI: arxiv-2409.05271
Yongdong Ouyang, Janice J Eng, Denghuang Zhan, Hubert Wong
The uptake of formalized prior elicitation from experts in Bayesian clinical trials has been limited, largely due to the challenges associated with complex statistical modeling, the lack of practical tools, and the cognitive burden on experts required to quantify their uncertainty using probabilistic language. Additionally, existing methods do not address prior-posterior coherence, i.e., does the posterior distribution, obtained mathematically from combining the estimated prior with the trial data, reflect the expert's actual posterior beliefs? We propose a new elicitation approach that seeks to ensure prior-posterior coherence and reduce the expert's cognitive burden. This is achieved by eliciting responses about the expert's envisioned posterior judgments under various potential data outcomes and inferring the prior distribution by minimizing the discrepancies between these responses and the expected responses obtained from the posterior distribution. The feasibility and potential value of the new approach are illustrated through an application to a real trial currently underway.
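To make the general idea concrete, here is a toy sketch in a Beta-Binomial setting: the expert states the posterior mean they would hold under several hypothetical trial outcomes, and a Beta prior is back-fitted by minimizing the squared discrepancy between those envisioned judgments and the model-implied posterior means. The scenarios, stated values, and squared-error discrepancy are illustrative assumptions, not the paper's procedure.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical elicitation for a response probability in a Beta-Binomial model.
# The expert is shown several potential trial outcomes (x responders out of n)
# and states the posterior mean response rate they would then believe in.
n = 20
scenarios = np.array([2, 6, 10, 14, 18])                       # hypothetical numbers of responders
envisioned_means = np.array([0.18, 0.33, 0.48, 0.62, 0.76])    # expert's stated judgments

def discrepancy(log_ab):
    a, b = np.exp(log_ab)                       # keep a, b > 0
    implied = (a + scenarios) / (a + b + n)     # Beta-Binomial posterior means
    return np.sum((implied - envisioned_means) ** 2)

res = minimize(discrepancy, x0=np.log([1.0, 1.0]), method="Nelder-Mead")
a_hat, b_hat = np.exp(res.x)
print(f"inferred prior: Beta({a_hat:.2f}, {b_hat:.2f})")
```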
Citations: 0
Recursive Nested Filtering for Efficient Amortized Bayesian Experimental Design
Pub Date: 2024-09-09 | DOI: arxiv-2409.05354
Sahel Iqbal, Hany Abdulsamad, Sara Pérez-Vieites, Simo Särkkä, Adrien Corenflos
This paper introduces the Inside-Out Nested Particle Filter (IO-NPF), a novel, fully recursive, algorithm for amortized sequential Bayesian experimental design in the non-exchangeable setting. We frame policy optimization as maximum likelihood estimation in a non-Markovian state-space model, achieving (at most) $\mathcal{O}(T^2)$ computational complexity in the number of experiments. We provide theoretical convergence guarantees and introduce a backward sampling algorithm to reduce trajectory degeneracy. IO-NPF offers a practical, extensible, and provably consistent approach to sequential Bayesian experimental design, demonstrating improved efficiency over existing methods.
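The quantity such sequential design methods target can be illustrated with a nested Monte Carlo estimate of the expected information gain (EIG) of a single design in a toy linear-Gaussian model; this sketches the objective only, not the IO-NPF algorithm, and the model and sample sizes are assumptions made here.

```python
import numpy as np

rng = np.random.default_rng(5)

def expected_information_gain(d, n_outer=2000, n_inner=2000):
    """Nested Monte Carlo EIG for the toy model theta ~ N(0,1), y | theta, d ~ N(theta*d, 1)."""
    theta = rng.normal(size=n_outer)
    y = theta * d + rng.normal(size=n_outer)
    log_lik = -0.5 * (y - theta * d) ** 2                            # log-likelihood up to a constant
    theta_prior = rng.normal(size=(n_inner, 1))
    log_lik_inner = -0.5 * (y[None, :] - theta_prior * d) ** 2
    log_evidence = np.log(np.mean(np.exp(log_lik_inner), axis=0))    # same constant cancels
    return float(np.mean(log_lik - log_evidence))

for d in [0.1, 0.5, 1.0, 2.0]:
    # Closed form for this conjugate model: 0.5 * log(1 + d^2).
    print(f"design d = {d:3.1f}:  EIG estimate = {expected_information_gain(d):.3f}, "
          f"exact = {0.5 * np.log(1 + d**2):.3f}")
```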
Citations: 0
Multilevel testing of constraints induced by structural equation modeling in fMRI effective connectivity analysis: A proof of concept
Pub Date: 2024-09-09 | DOI: arxiv-2409.05630
G. Marrelec, A. Giron
In functional MRI (fMRI), effective connectivity analysis aims at inferring the causal influences that brain regions exert on one another. A common method for this type of analysis is structural equation modeling (SEM). We here propose a novel method to test the validity of a given structural equation model. Given a structural model in the form of a directed graph, the method extracts the set of all constraints of conditional independence induced by the absence of links between pairs of regions in the model and tests for their validity in a Bayesian framework, either individually (constraint by constraint), jointly (e.g., by gathering all constraints associated with a given missing link), or globally (i.e., all constraints associated with the structural model). This approach has two main advantages. First, it only tests what is testable from observational data and does not allow for false causal interpretation. Second, it makes it possible to test each constraint (or group of constraints) separately and, therefore, quantify to what extent each constraint (or, e.g., missing link) is respected in the data. We validate our approach using a simulation study and illustrate its potential benefits through the reanalysis of published data.
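As a small illustration of testing a single conditional-independence constraint implied by a missing link, the sketch below applies a Fisher-z partial-correlation test to simulated data from a chain x -> m -> y; this is a frequentist stand-in for the Bayesian machinery described in the abstract, and the toy structural model is an assumption made here.

```python
import numpy as np
from scipy.stats import norm

def fisher_z_ci_test(data, i, j, cond):
    """Test X_i independent of X_j given X_cond via partial correlation (Fisher z).
    Returns a two-sided p-value; a frequentist stand-in used here for illustration only."""
    idx = [i, j] + list(cond)
    sub = np.cov(data[:, idx], rowvar=False)
    prec = np.linalg.inv(sub)                      # partial correlations come from the precision matrix
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])
    n, k = data.shape[0], len(cond)
    z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - k - 3)
    return 2 * norm.sf(abs(z))

# Toy structural model x -> m -> y (no direct x -> y link):
# the missing x -> y edge implies x is independent of y given m.
rng = np.random.default_rng(2)
n = 2000
x = rng.normal(size=n)
m = 0.8 * x + rng.normal(size=n)
y = 0.7 * m + rng.normal(size=n)
data = np.column_stack([x, m, y])

print("x indep y | m :", fisher_z_ci_test(data, 0, 2, [1]))   # large p-value expected
print("x indep y     :", fisher_z_ci_test(data, 0, 2, []))    # small p-value expected
```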
Citations: 0
Predicting Electricity Consumption with Random Walks on Gaussian Processes
Pub Date: 2024-09-09 | DOI: arxiv-2409.05934
Chloé Hashimoto-Cullen, Benjamin Guedj
We consider time-series forecasting problems where data is scarce, difficult to gather, or induces a prohibitive computational cost. As a first attempt, we focus on short-term electricity consumption in France, which is of strategic importance for energy suppliers and public stakeholders. The complexity of this problem and the many levels of geospatial granularity motivate the use of an ensemble of Gaussian Processes (GPs). Whilst GPs are remarkable predictors, they are computationally expensive to train, which calls for a frugal few-shot learning approach. By taking into account performance on GPs trained on a dataset and designing a random walk on these, we mitigate the training cost of our entire Bayesian decision-making procedure. We introduce our algorithm called \textsc{Domino} (ranDOM walk on gaussIaN prOcesses) and present numerical experiments to support its merits.
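For context, the sketch below fits a single Gaussian Process, one building block of such an ensemble, to a synthetic daily-cycle consumption series with scikit-learn and produces a short-horizon forecast with uncertainty; it does not reproduce the paper's Domino random-walk procedure, and the data and kernel choices are assumptions made here.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ExpSineSquared, WhiteKernel

rng = np.random.default_rng(3)

# Synthetic hourly "consumption": a daily cycle plus noise (illustrative only).
t = np.arange(0.0, 72.0)                          # three days of hourly observations
y = 50 + 10 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 1.5, t.size)

# Periodic kernel for the daily cycle plus a white-noise term.
kernel = 1.0 * ExpSineSquared(length_scale=5.0, periodicity=24.0) + WhiteKernel(noise_level=1.0)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(t[:60, None], y[:60])                      # train on the first 60 hours

mean, std = gp.predict(t[60:, None], return_std=True)
for ti, m, s in zip(t[60:], mean, std):
    print(f"t = {ti:4.0f} h   forecast = {m:6.2f} +/- {1.96 * s:4.2f}")
```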
Citations: 0
An unbiased rank-based estimator of the Mann-Whitney variance including the case of ties
Pub Date: 2024-09-08 | DOI: arxiv-2409.05038
Edgar Brunner, Frank Konietschke
Many estimators of the variance of the well-known unbiased and uniformly most powerful estimator $\hat\theta$ of the Mann-Whitney effect, $\theta = P(X < Y) + \frac{1}{2} P(X=Y)$, are considered in the literature. Some of these estimators are only valid in case of no ties or are biased in case of small sample sizes, where the amount of the bias is not discussed. Here we derive an unbiased estimator that is based on different rankings, the so-called 'placements' (Orban and Wolfe, 1980), and is therefore easy to compute. This estimator does not require the assumption of continuous distribution functions and is also valid in the case of ties. Moreover, it is shown that this estimator is non-negative and has a sharp upper bound which may be considered an empirical version of the well-known Birnbaum-Klose inequality. The derivation of this estimator provides an option to compute the biases of some commonly used estimators in the literature. Simulations demonstrate that, for small sample sizes, the biases of these estimators depend on the underlying distribution functions and thus are not under control. This means that, in the case of a biased estimator, simulation results for the type-I error of a test or the coverage probability of a confidence interval depend not only on the quality of the approximation of $\hat\theta$ by a normal distribution but also on an additional unknown bias caused by the variance estimator. Finally, it is shown that this estimator is $L_2$-consistent.
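To show what 'placements' are, the sketch below computes the Mann-Whitney effect and a standard placement-based (DeLong-type) variance estimate with the 1/2 tie convention; this is the conventional estimator, not the unbiased estimator derived in the paper, and the toy samples are made up.

```python
import numpy as np

def mann_whitney_effect(x, y):
    """Point estimate and a placement-based (DeLong-type) variance estimate of
    theta = P(X < Y) + 0.5 * P(X = Y), handling ties via the 1/2 convention."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    m, n = len(x), len(y)
    # pairwise "wins": 1 if x_i < y_j, 1/2 if tied, 0 otherwise
    wins = (x[:, None] < y[None, :]) + 0.5 * (x[:, None] == y[None, :])
    theta_hat = wins.mean()
    v10 = wins.mean(axis=1)                       # placement of each x_i among the y's
    v01 = wins.mean(axis=0)                       # placement of each y_j among the x's
    var_hat = np.var(v10, ddof=1) / m + np.var(v01, ddof=1) / n
    return theta_hat, var_hat

x = [1.2, 3.4, 2.2, 2.2, 5.0, 0.7]
y = [2.2, 4.1, 3.3, 5.5, 2.9]
theta_hat, var_hat = mann_whitney_effect(x, y)
print(f"theta_hat = {theta_hat:.3f}, se = {np.sqrt(var_hat):.3f}")
```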
Citations: 0
Estimating velocities of infectious disease spread through spatio-temporal log-Gaussian Cox point processes
Pub Date: 2024-09-08 | DOI: arxiv-2409.05036
Fernando Rodriguez Avellaneda, Jorge Mateu, Paula Moraga
Understanding the spread of infectious diseases such as COVID-19 is crucial for informed decision-making and resource allocation. A critical component of disease behavior is the velocity with which disease spreads, defined as the rate of change between time and space. In this paper, we propose a spatio-temporal modeling approach to determine the velocities of infectious disease spread. Our approach assumes that the locations and times of people infected can be considered as a spatio-temporal point pattern that arises as a realization of a spatio-temporal log-Gaussian Cox process. The intensity of this process is estimated using fast Bayesian inference by employing the integrated nested Laplace approximation (INLA) and the Stochastic Partial Differential Equations (SPDE) approaches. The velocity is then calculated using finite differences that approximate the derivatives of the intensity function. Finally, the directions and magnitudes of the velocities can be mapped at specific times to better examine the spread of the disease throughout the region. We demonstrate our method by analyzing COVID-19 spread in Cali, Colombia, during the 2020-2021 pandemic.
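A minimal sketch of the finite-difference step on a gridded intensity: for a toy Gaussian intensity bump drifting east at 0.5 space-units per time-unit, the front speed of intensity level sets, |dλ/dt| / ||∇λ||, recovers the drift. The level-set definition of velocity and the toy surface are assumptions made here, not the paper's exact construction.

```python
import numpy as np

# A toy space-time intensity surface on a grid: a Gaussian bump whose centre
# drifts east at 0.5 space-units per time-unit (so the true front speed is 0.5).
xs = np.linspace(0, 10, 101)
ys = np.linspace(0, 10, 101)
ts = np.linspace(0, 4, 21)
T, Y, X = np.meshgrid(ts, ys, xs, indexing="ij")
lam = np.exp(-((X - 2 - 0.5 * T) ** 2 + (Y - 5) ** 2) / 2.0)

# Finite-difference derivatives of the intensity in time and space.
dlam_dt, dlam_dy, dlam_dx = np.gradient(lam, ts, ys, xs)

# Front speed of intensity level sets: |dλ/dt| / ||∇λ||, direction along the spatial gradient.
grad_norm = np.sqrt(dlam_dx**2 + dlam_dy**2) + 1e-12
speed = np.abs(dlam_dt) / grad_norm

# Inspect the speed near the bump's leading edge at the middle time point.
it, iy = 10, 50
ix = np.argmin(np.abs(xs - (2 + 0.5 * ts[it] + 1.5)))
print(f"estimated speed = {speed[it, iy, ix]:.2f} (true drift speed = 0.5)")
```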
Citations: 0
Inference for Large Scale Regression Models with Dependent Errors
Pub Date: 2024-09-08 | DOI: arxiv-2409.05160
Lionel Voirol, Haotian Xu, Yuming Zhang, Luca Insolia, Roberto Molinari, Stéphane Guerrier
The exponential growth in data sizes and storage costs has brought considerable challenges to the data science community, requiring solutions to run learning methods on such data. While machine learning has scaled to achieve predictive accuracy in big data settings, statistical inference and uncertainty quantification tools are still lagging. Priority scientific fields collect vast data to understand phenomena typically studied with statistical methods like regression. In this setting, regression parameter estimation can benefit from efficient computational procedures, but the main challenge lies in computing error process parameters with complex covariance structures. Identifying and estimating these structures is essential for inference and often used for uncertainty quantification in machine learning with Gaussian Processes. However, estimating these structures becomes burdensome as data scales, requiring approximations that compromise the reliability of outputs. These approximations are even more unreliable when complexities like long-range dependencies or missing data are present. This work defines and proves the statistical properties of the Generalized Method of Wavelet Moments with Exogenous variables (GMWMX), a highly scalable, stable, and statistically valid method for estimating and delivering inference for linear models using stochastic processes in the presence of data complexities like latent dependence structures and missing data. Applied examples from Earth Sciences and extensive simulations highlight the advantages of the GMWMX.
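The moment that wavelet-moment methods of this kind match is the wavelet variance; the sketch below computes an empirical Haar wavelet variance at dyadic scales and checks it against the closed form sigma^2 / 2^j for white noise. The simple overlapping-window implementation is an illustration of that building block only, not the GMWMX method or software.

```python
import numpy as np

def haar_wavelet_variance(x, max_scale):
    """Empirical Haar (MODWT-style) wavelet variance at dyadic scales 2^(j-1).
    Coefficients are half the difference between adjacent block means of length 2^(j-1)."""
    x = np.asarray(x, float)
    nu2 = []
    for j in range(1, max_scale + 1):
        h = 2 ** (j - 1)
        csum = np.concatenate(([0.0], np.cumsum(x)))
        means = (csum[h:] - csum[:-h]) / h        # running means of length h
        w = 0.5 * (means[h:] - means[:-h])        # Haar coefficients (overlapping windows)
        nu2.append(np.mean(w ** 2))
    return np.array(nu2)

# Check against white noise, where the theoretical wavelet variance is sigma^2 / 2^j.
rng = np.random.default_rng(4)
sigma2 = 2.0
x = rng.normal(0, np.sqrt(sigma2), 100_000)
emp = haar_wavelet_variance(x, max_scale=6)
theo = sigma2 / 2.0 ** np.arange(1, 7)
for j, (e, t) in enumerate(zip(emp, theo), start=1):
    print(f"scale 2^{j-1}:  empirical {e:.4f}   white-noise theory {t:.4f}")
```

A moment-matching estimator of this kind then picks the model parameters whose implied wavelet variances best match the empirical ones.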
Citations: 0