Shapley values, a game-theoretic concept, have become one of the most popular tools for explaining Machine Learning (ML) models in recent years. Unfortunately, the two most common approaches to calculating Shapley values, conditional and marginal, can lead to different results along with some undesirable side effects when features are correlated. This in turn has led to a situation in the literature where different authors provide contradictory recommendations on the choice of approach. In this paper we aim to resolve this controversy through the use of causal arguments. We show that the differences arise from the implicit assumptions that are made within each method to deal with missing causal information. We also demonstrate that the conditional approach is fundamentally unsound from a causal perspective. This, together with previous work in [1], leads to the conclusion that the marginal approach should be preferred over the conditional one.
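The disagreement between the two approaches is easy to exhibit numerically. Below is a minimal sketch (not from the paper; the bivariate-Gaussian setup, the correlation `rho`, and the instance `x` are all illustrative) contrasting the marginal and conditional value functions for the coalition containing only x2, under a model that ignores x2 entirely: the conditional version credits the unused feature through its correlation with x1, while the marginal version does not.

```python
# Hedged sketch: marginal vs. conditional value functions for a model f that
# uses only x1, with (X1, X2) bivariate Gaussian with correlation rho.
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8
f = lambda x1, x2: x1            # the model ignores x2 entirely

x = np.array([1.0, 1.0])         # instance being explained (illustrative)
n = 100_000

# v({2}): fix x2 at its observed value, integrate out x1
x1_marg = rng.normal(0.0, 1.0, n)                          # marginal: correlation dropped
x1_cond = rng.normal(rho * x[1], np.sqrt(1 - rho**2), n)   # conditional on X2 = x2

v2_marginal = f(x1_marg, x[1]).mean()       # ~ 0: x2 gets no credit
v2_conditional = f(x1_cond, x[1]).mean()    # ~ rho * x2 = 0.8: credit x2 never earned
print(v2_marginal, v2_conditional)
```

Since f never reads x2, the marginal value of adding x2 to the empty coalition is (approximately) zero, whereas the conditional value is rho * x2, which is the side effect under correlation that the abstract refers to.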
{"title":"Causal Analysis of Shapley Values: Conditional vs. Marginal","authors":"Ilya Rozenfeld","doi":"arxiv-2409.06157","DOIUrl":"https://doi.org/arxiv-2409.06157","url":null,"abstract":"Shapley values, a game theoretic concept, has been one of the most popular\u0000tools for explaining Machine Learning (ML) models in recent years.\u0000Unfortunately, the two most common approaches, conditional and marginal, to\u0000calculating Shapley values can lead to different results along with some\u0000undesirable side effects when features are correlated. This in turn has led to\u0000the situation in the literature where contradictory recommendations regarding\u0000choice of an approach are provided by different authors. In this paper we aim\u0000to resolve this controversy through the use of causal arguments. We show that\u0000the differences arise from the implicit assumptions that are made within each\u0000method to deal with missing causal information. We also demonstrate that the\u0000conditional approach is fundamentally unsound from a causal perspective. This,\u0000together with previous work in [1], leads to the conclusion that the marginal\u0000approach should be preferred over the conditional one.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"192 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196634","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Discrimination measures such as concordance statistics (e.g. the c-index or the concordance probability) and the cumulative-dynamic time-dependent area under the ROC curve (AUC) are widely used in the medical literature for evaluating the predictive accuracy of a scoring rule which relates a set of prognostic markers to the risk of experiencing a particular event. Often the scoring rule being evaluated in terms of discriminatory ability is the linear predictor of a survival regression model such as the Cox proportional hazards model. This has the undesirable feature that the scoring rule depends on the censoring distribution when the model is misspecified. In this work we focus on linear scoring rules where the coefficient vector is a nonparametric estimand defined in the setting where there is no censoring. We propose so-called debiased estimators of the aforementioned discrimination measures for this class of scoring rules. The proposed estimators make efficient use of the data and minimize bias by allowing for the use of data-adaptive methods for model fitting. Moreover, the estimators do not rely on correct specification of the censoring model to produce consistent estimation. We compare the estimators to existing methods in a simulation study, and we illustrate the method by an application to a brain cancer study.
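For readers unfamiliar with the estimand, the following sketch (our own illustration, not the authors' debiased estimator) computes the uncensored concordance probability of a scoring rule over all comparable pairs. It assumes, as is conventional for a Cox linear predictor, that a higher score means higher risk and hence an earlier event.

```python
# Minimal sketch of the uncensored concordance estimand the paper targets:
# P(score_i > score_j | T_i < T_j), estimated over all comparable pairs.
import numpy as np

def concordance_uncensored(time, score):
    """Fraction of pairs ordered concordantly by the scoring rule (ties count 1/2)."""
    t, s = np.asarray(time, float), np.asarray(score, float)
    i, j = np.triu_indices(len(t), k=1)
    comparable = t[i] != t[j]
    # orient each pair so the first element fails earlier
    earlier_score = np.where(t[i] < t[j], s[i], s[j])
    later_score = np.where(t[i] < t[j], s[j], s[i])
    conc = earlier_score > later_score
    ties = earlier_score == later_score
    return (conc[comparable].sum() + 0.5 * ties[comparable].sum()) / comparable.sum()
```

With censored data this naive estimator is no longer computable for all pairs, which is where the dependence on the censoring distribution, and the paper's debiased construction, enter.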
{"title":"Efficient nonparametric estimators of discriminationmeasures with censored survival data","authors":"Marie S. Breum, Torben Martinussen","doi":"arxiv-2409.05632","DOIUrl":"https://doi.org/arxiv-2409.05632","url":null,"abstract":"Discrimination measures such as concordance statistics (e.g. the c-index or\u0000the concordance probability) and the cumulative-dynamic time-dependent area\u0000under the ROC-curve (AUC) are widely used in the medical literature for\u0000evaluating the predictive accuracy of a scoring rule which relates a set of\u0000prognostic markers to the risk of experiencing a particular event. Often the\u0000scoring rule being evaluated in terms of discriminatory ability is the linear\u0000predictor of a survival regression model such as the Cox proportional hazards\u0000model. This has the undesirable feature that the scoring rule depends on the\u0000censoring distribution when the model is misspecified. In this work we focus on\u0000linear scoring rules where the coefficient vector is a nonparametric estimand\u0000defined in the setting where there is no censoring. We propose so-called\u0000debiased estimators of the aforementioned discrimination measures for this\u0000class of scoring rules. The proposed estimators make efficient use of the data\u0000and minimize bias by allowing for the use of data-adaptive methods for model\u0000fitting. Moreover, the estimators do not rely on correct specification of the\u0000censoring model to produce consistent estimation. We compare the estimators to\u0000existing methods in a simulation study, and we illustrate the method by an\u0000application to a brain cancer study.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"42 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196639","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
To characterize the community structure in network data, researchers have introduced various block-type models, including the stochastic block model, degree-corrected stochastic block model, mixed membership block model, degree-corrected mixed membership block model, and others. A critical step in applying these models effectively is determining the number of communities in the network. However, to our knowledge, existing methods for estimating the number of network communities often require model estimation or are unable to simultaneously account for network sparsity and a divergent number of communities. In this paper, we propose an eigengap-ratio based test that addresses these challenges. The test is straightforward to compute, requires no parameter tuning, and can be applied to a wide range of block models without the need to estimate network distribution parameters. Furthermore, it is effective for both dense and sparse networks with a divergent number of communities. We show that the proposed test statistic converges to a function of the type-I Tracy-Widom distributions under the null hypothesis, and that the test is asymptotically powerful under alternatives. Simulation studies on both dense and sparse networks demonstrate the efficacy of the proposed method. Three real-world examples are presented to illustrate the usefulness of the proposed test.
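As a rough illustration of the quantity such a test is built on (the paper's exact statistic and its Tracy-Widom calibration are not reproduced here), one can compute ratios of consecutive spectral gaps of the adjacency matrix; the ratio tends to spike at the true number of communities.

```python
# Hedged sketch of an eigengap-ratio heuristic for choosing K; illustrative only.
import numpy as np

def eigengap_ratios(A, k_max=10):
    """Ratios of consecutive spectral gaps of a symmetric adjacency matrix A."""
    lam = np.sort(np.abs(np.linalg.eigvalsh(A)))[::-1]  # |eigenvalues|, descending
    gaps = lam[:k_max] - lam[1:k_max + 1]               # consecutive spectral gaps
    return gaps[:-1] / np.maximum(gaps[1:], 1e-12)      # spikes near k = #communities

# a simple heuristic read-off: K_hat = np.argmax(eigengap_ratios(A)) + 1
```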
{"title":"An Eigengap Ratio Test for Determining the Number of Communities in Network Data","authors":"Yujia Wu, Jingfei Zhang, Wei Lan, Chih-Ling Tsai","doi":"arxiv-2409.05276","DOIUrl":"https://doi.org/arxiv-2409.05276","url":null,"abstract":"To characterize the community structure in network data, researchers have\u0000introduced various block-type models, including the stochastic block model,\u0000degree-corrected stochastic block model, mixed membership block model,\u0000degree-corrected mixed membership block model, and others. A critical step in\u0000applying these models effectively is determining the number of communities in\u0000the network. However, to our knowledge, existing methods for estimating the\u0000number of network communities often require model estimations or are unable to\u0000simultaneously account for network sparsity and a divergent number of\u0000communities. In this paper, we propose an eigengap-ratio based test that\u0000address these challenges. The test is straightforward to compute, requires no\u0000parameter tuning, and can be applied to a wide range of block models without\u0000the need to estimate network distribution parameters. Furthermore, it is\u0000effective for both dense and sparse networks with a divergent number of\u0000communities. We show that the proposed test statistic converges to a function\u0000of the type-I Tracy-Widom distributions under the null hypothesis, and that the\u0000test is asymptotically powerful under alternatives. Simulation studies on both\u0000dense and sparse networks demonstrate the efficacy of the proposed method.\u0000Three real-world examples are presented to illustrate the usefulness of the\u0000proposed test.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"25 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196636","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The uptake of formalized prior elicitation from experts in Bayesian clinical trials has been limited, largely due to the challenges associated with complex statistical modeling, the lack of practical tools, and the cognitive burden on experts required to quantify their uncertainty using probabilistic language. Additionally, existing methods do not address prior-posterior coherence, i.e., does the posterior distribution, obtained mathematically from combining the estimated prior with the trial data, reflect the expert's actual posterior beliefs? We propose a new elicitation approach that seeks to ensure prior-posterior coherence and reduce the expert's cognitive burden. This is achieved by eliciting responses about the expert's envisioned posterior judgments under various potential data outcomes and inferring the prior distribution by minimizing the discrepancies between these responses and the expected responses obtained from the posterior distribution. The feasibility and potential value of the new approach are illustrated through an application to a real trial currently underway.
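A toy version of this inversion is easy to write down for a binomial endpoint with a conjugate Beta prior (entirely our illustration; all numbers are hypothetical and the paper's method is more general). The expert states envisioned posterior means under two potential trial outcomes, and the prior is chosen to minimize the discrepancy with the posterior means those outcomes would actually imply.

```python
# Hedged sketch: recover a Beta(a, b) prior from envisioned posterior means
# under hypothetical data outcomes (successes s out of n patients).
import numpy as np
from scipy.optimize import minimize

# (successes, n, expert's envisioned posterior mean) -- hypothetical numbers
scenarios = [(3, 10, 0.28), (7, 10, 0.62)]

def loss(params):
    a, b = np.exp(params)                    # keep a, b positive
    # Beta-binomial posterior mean is (a + s) / (a + b + n)
    return sum(((a + s) / (a + b + n) - m) ** 2 for s, n, m in scenarios)

res = minimize(loss, x0=[0.0, 0.0])
a, b = np.exp(res.x)
print(f"implied prior: Beta({a:.2f}, {b:.2f})")  # prior coherent with both judgments
```

The paper's coherence criterion is exactly this kind of consistency requirement: the inferred prior should reproduce the expert's envisioned posterior judgments when pushed through Bayes' rule.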
{"title":"Priors from Envisioned Posterior Judgments: A Novel Elicitation Approach With Application to Bayesian Clinical Trials","authors":"Yongdong Ouyang, Janice J Eng, Denghuang Zhan, Hubert Wong","doi":"arxiv-2409.05271","DOIUrl":"https://doi.org/arxiv-2409.05271","url":null,"abstract":"The uptake of formalized prior elicitation from experts in Bayesian clinical\u0000trials has been limited, largely due to the challenges associated with complex\u0000statistical modeling, the lack of practical tools, and the cognitive burden on\u0000experts required to quantify their uncertainty using probabilistic language.\u0000Additionally, existing methods do not address prior-posterior coherence, i.e.,\u0000does the posterior distribution, obtained mathematically from combining the\u0000estimated prior with the trial data, reflect the expert's actual posterior\u0000beliefs? We propose a new elicitation approach that seeks to ensure\u0000prior-posterior coherence and reduce the expert's cognitive burden. This is\u0000achieved by eliciting responses about the expert's envisioned posterior\u0000judgments under various potential data outcomes and inferring the prior\u0000distribution by minimizing the discrepancies between these responses and the\u0000expected responses obtained from the posterior distribution. The feasibility\u0000and potential value of the new approach are illustrated through an application\u0000to a real trial currently underway.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"170 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196638","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper introduces the Inside-Out Nested Particle Filter (IO-NPF), a novel, fully recursive algorithm for amortized sequential Bayesian experimental design in the non-exchangeable setting. We frame policy optimization as maximum likelihood estimation in a non-Markovian state-space model, achieving (at most) $\mathcal{O}(T^2)$ computational complexity in the number of experiments. We provide theoretical convergence guarantees and introduce a backward sampling algorithm to reduce trajectory degeneracy. IO-NPF offers a practical, extensible, and provably consistent approach to sequential Bayesian experimental design, demonstrating improved efficiency over existing methods.
{"title":"Recursive Nested Filtering for Efficient Amortized Bayesian Experimental Design","authors":"Sahel Iqbal, Hany Abdulsamad, Sara Pérez-Vieites, Simo Särkkä, Adrien Corenflos","doi":"arxiv-2409.05354","DOIUrl":"https://doi.org/arxiv-2409.05354","url":null,"abstract":"This paper introduces the Inside-Out Nested Particle Filter (IO-NPF), a\u0000novel, fully recursive, algorithm for amortized sequential Bayesian\u0000experimental design in the non-exchangeable setting. We frame policy\u0000optimization as maximum likelihood estimation in a non-Markovian state-space\u0000model, achieving (at most) $mathcal{O}(T^2)$ computational complexity in the\u0000number of experiments. We provide theoretical convergence guarantees and\u0000introduce a backward sampling algorithm to reduce trajectory degeneracy. IO-NPF\u0000offers a practical, extensible, and provably consistent approach to sequential\u0000Bayesian experimental design, demonstrating improved efficiency over existing\u0000methods.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"25 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196668","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In functional MRI (fMRI), effective connectivity analysis aims at inferring the causal influences that brain regions exert on one another. A common method for this type of analysis is structural equation modeling (SEM). We here propose a novel method to test the validity of a given structural equation model. Given a structural model in the form of a directed graph, the method extracts the set of all constraints of conditional independence induced by the absence of links between pairs of regions in the model and tests for their validity in a Bayesian framework, either individually (constraint by constraint), jointly (e.g., by gathering all constraints associated with a given missing link), or globally (i.e., all constraints associated with the structural model). This approach has two main advantages. First, it only tests what is testable from observational data and does not allow for false causal interpretation. Second, it makes it possible to test each constraint (or group of constraints) separately and, therefore, to quantify to what extent each constraint (or, e.g., missing link) is respected in the data. We validate our approach using a simulation study and illustrate its potential benefits through the reanalysis of published data.
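As a schematic of the constraint-testing idea, consider the chain X -> Y -> Z, whose single missing link implies the constraint X independent of Z given Y. The sketch below uses a frequentist Fisher-z partial-correlation test as a stand-in; the paper itself evaluates such constraints in a Bayesian framework.

```python
# Hedged sketch: test one conditional-independence constraint implied by a
# missing link, via the partial correlation read off the precision matrix.
import numpy as np
from scipy import stats

def fisher_z_test(data, i, j, cond):
    """p-value for zero partial correlation between columns i, j given cond."""
    idx = [i, j] + list(cond)
    prec = np.linalg.inv(np.corrcoef(data[:, idx].T))   # precision of the submatrix
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])  # partial correlation
    z = 0.5 * np.log((1 + r) / (1 - r))                 # Fisher z-transform
    se = 1.0 / np.sqrt(data.shape[0] - len(cond) - 3)
    return 2 * stats.norm.sf(abs(z / se))

# simulate the chain X -> Y -> Z and test its single induced constraint
rng = np.random.default_rng(1)
x = rng.normal(size=5000)
y = 0.8 * x + rng.normal(size=5000)
z = 0.8 * y + rng.normal(size=5000)
print(fisher_z_test(np.column_stack([x, y, z]), 0, 2, [1]))  # large p: constraint holds
```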
{"title":"Multilevel testing of constraints induced by structural equation modeling in fMRI effective connectivity analysis: A proof of concept","authors":"G. Marrelec, A. Giron","doi":"arxiv-2409.05630","DOIUrl":"https://doi.org/arxiv-2409.05630","url":null,"abstract":"In functional MRI (fMRI), effective connectivity analysis aims at inferring\u0000the causal influences that brain regions exert on one another. A common method\u0000for this type of analysis is structural equation modeling (SEM). We here\u0000propose a novel method to test the validity of a given model of structural\u0000equation. Given a structural model in the form of a directed graph, the method\u0000extracts the set of all constraints of conditional independence induced by the\u0000absence of links between pairs of regions in the model and tests for their\u0000validity in a Bayesian framework, either individually (constraint by\u0000constraint), jointly (e.g., by gathering all constraints associated with a\u0000given missing link), or globally (i.e., all constraints associated with the\u0000structural model). This approach has two main advantages. First, it only tests\u0000what is testable from observational data and does allow for false causal\u0000interpretation. Second, it makes it possible to test each constraint (or group\u0000of constraints) separately and, therefore, quantify in what measure each\u0000constraint (or, e..g., missing link) is respected in the data. We validate our\u0000approach using a simulation study and illustrate its potential benefits through\u0000the reanalysis of published data.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"67 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196665","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We consider time-series forecasting problems where data is scarce, difficult to gather, or induces a prohibitive computational cost. As a first attempt, we focus on short-term electricity consumption in France, which is of strategic importance for energy suppliers and public stakeholders. The complexity of this problem and the many levels of geospatial granularity motivate the use of an ensemble of Gaussian Processes (GPs). Whilst GPs are remarkable predictors, they are computationally expensive to train, which calls for a frugal few-shot learning approach. By taking into account performance on GPs trained on a dataset and designing a random walk on these, we mitigate the training cost of our entire Bayesian decision-making procedure. We introduce our algorithm, called Domino (ranDOM walk on gaussIaN prOcesses), and present numerical experiments to support its merits.
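The abstract does not spell out Domino's transition rule, so the following is only a loose, hypothetical reading rather than the authors' algorithm: a Metropolis-style random walk over an indexed family of already-fitted GPs, preferring moves that lower held-out error so that expensive retraining is avoided.

```python
# Speculative sketch only: random walk on GP indices guided by validation error.
import numpy as np

def performance_guided_walk(val_errors, n_steps=200, temp=0.1, seed=0):
    """Walk on indices 0..K-1 of fitted GPs; val_errors[k] is GP k's held-out error."""
    rng = np.random.default_rng(seed)
    k = len(val_errors)
    current = rng.integers(k)
    for _ in range(n_steps):
        proposal = (current + rng.choice([-1, 1])) % k  # neighbour on the index ring
        # accept better models always, worse ones with exponentially small probability
        if rng.random() < np.exp((val_errors[current] - val_errors[proposal]) / temp):
            current = proposal
    return current

# e.g. performance_guided_walk(np.array([0.9, 0.4, 0.35, 0.7])) -> likely 2
```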
{"title":"Predicting Electricity Consumption with Random Walks on Gaussian Processes","authors":"Chloé Hashimoto-Cullen, Benjamin Guedj","doi":"arxiv-2409.05934","DOIUrl":"https://doi.org/arxiv-2409.05934","url":null,"abstract":"We consider time-series forecasting problems where data is scarce, difficult\u0000to gather, or induces a prohibitive computational cost. As a first attempt, we\u0000focus on short-term electricity consumption in France, which is of strategic\u0000importance for energy suppliers and public stakeholders. The complexity of this\u0000problem and the many levels of geospatial granularity motivate the use of an\u0000ensemble of Gaussian Processes (GPs). Whilst GPs are remarkable predictors,\u0000they are computationally expensive to train, which calls for a frugal few-shot\u0000learning approach. By taking into account performance on GPs trained on a\u0000dataset and designing a random walk on these, we mitigate the training cost of\u0000our entire Bayesian decision-making procedure. We introduce our algorithm\u0000called textsc{Domino} (ranDOM walk on gaussIaN prOcesses) and present\u0000numerical experiments to support its merits.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"16 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196637","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Many estimators of the variance of the well-known unbiased and uniformly most powerful estimator $\hat{\theta}$ of the Mann-Whitney effect, $\theta = P(X < Y) + \frac{1}{2} P(X=Y)$, are considered in the literature. Some of these estimators are only valid in the case of no ties, or are biased in the case of small sample sizes, where the amount of the bias is not discussed. Here we derive an unbiased estimator that is based on different rankings, the so-called 'placements' (Orban and Wolfe, 1980), and is therefore easy to compute. This estimator does not require the assumption of continuous distribution functions (dfs) and is also valid in the case of ties. Moreover, it is shown that this estimator is non-negative and has a sharp upper bound, which may be considered an empirical version of the well-known Birnbaum-Klose inequality. The derivation of this estimator provides an option to compute the biases of some commonly used estimators in the literature. Simulations demonstrate that, for small sample sizes, the biases of these estimators depend on the underlying dfs and thus are not under control. This means that, in the case of a biased estimator, simulation results for the type-I error of a test or the coverage probability of a confidence interval depend not only on the quality of the approximation of $\hat{\theta}$ by a normal distribution but also on an additional unknown bias caused by the variance estimator. Finally, it is shown that this estimator is $L_2$-consistent.
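The placement machinery is compact enough to sketch (our illustration; the paper's exactly unbiased variance estimator is not reproduced here, only $\hat{\theta}$ itself and the classical placement-based variance it improves on).

```python
# Sketch of placements (Orban and Wolfe, 1980): theta-hat from placements plus
# the standard placement-based variance, which is known to be slightly biased.
import numpy as np

def mann_whitney_via_placements(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    m, n = len(x), len(y)
    # placement of x_i: fraction of y's below it, ties counted with weight 1/2
    p_x = np.array([(np.sum(y < xi) + 0.5 * np.sum(y == xi)) / n for xi in x])
    p_y = np.array([(np.sum(x < yi) + 0.5 * np.sum(x == yi)) / m for yi in y])
    theta = p_y.mean()                                  # P(X < Y) + 0.5 P(X = Y)
    var = p_x.var(ddof=1) / m + p_y.var(ddof=1) / n     # classical, slightly biased
    return theta, var
```

Because placements are just within-sample ranks relative to the other sample, both $\hat{\theta}$ and its variance estimate remain valid under ties, which is the setting the paper emphasizes.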
{"title":"An unbiased rank-based estimator of the Mann-Whitney variance including the case of ties","authors":"Edgar Brunner, Frank Konietschke","doi":"arxiv-2409.05038","DOIUrl":"https://doi.org/arxiv-2409.05038","url":null,"abstract":"Many estimators of the variance of the well-known unbiased and uniform most\u0000powerful estimator $htheta$ of the Mann-Whitney effect, $theta = P(X < Y) +\u0000nfrac12 P(X=Y)$, are considered in the literature. Some of these estimators\u0000are only valid in case of no ties or are biased in case of small sample sizes\u0000where the amount of the bias is not discussed. Here we derive an unbiased\u0000estimator that is based on different rankings, the so-called 'placements'\u0000(Orban and Wolfe, 1980), and is therefore easy to compute. This estimator does\u0000not require the assumption of continuous dfs and is also valid in the case of\u0000ties. Moreover, it is shown that this estimator is non-negative and has a sharp\u0000upper bound which may be considered an empirical version of the well-known\u0000Birnbaum-Klose inequality. The derivation of this estimator provides an option\u0000to compute the biases of some commonly used estimators in the literature.\u0000Simulations demonstrate that, for small sample sizes, the biases of these\u0000estimators depend on the underlying dfs and thus are not under control. This\u0000means that in the case of a biased estimator, simulation results for the type-I\u0000error of a test or the coverage probability of a ci do not only depend on the\u0000quality of the approximation of $htheta$ by a normal db but also an\u0000additional unknown bias caused by the variance estimator. Finally, it is shown\u0000that this estimator is $L_2$-consistent.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196642","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Understanding the spread of infectious diseases such as COVID-19 is crucial for informed decision-making and resource allocation. A critical component of disease behavior is the velocity with which disease spreads, defined as the rate of change between time and space. In this paper, we propose a spatio-temporal modeling approach to determine the velocities of infectious disease spread. Our approach assumes that the locations and times of people infected can be considered as a spatio-temporal point pattern that arises as a realization of a spatio-temporal log-Gaussian Cox process. The intensity of this process is estimated using fast Bayesian inference by employing the integrated nested Laplace approximation (INLA) and the Stochastic Partial Differential Equations (SPDE) approaches. The velocity is then calculated using finite differences that approximate the derivatives of the intensity function. Finally, the directions and magnitudes of the velocities can be mapped at specific times to better examine the spread of the disease throughout the region. We demonstrate our method by analyzing COVID-19 spread in Cali, Colombia, during the 2020-2021 pandemic.
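The final step is purely numerical and can be sketched directly (our illustration; the INLA-SPDE fit that produces the intensity is not reproduced). Given a posterior intensity on a space-time grid, finite differences combined with the level-set identity d(lambda)/dt + grad(lambda) . v = 0 yield a velocity field; the exact velocity definition used in the paper may differ.

```python
# Hedged sketch: velocity field from finite differences of an intensity grid.
import numpy as np

def front_velocity(lam, dt=1.0, dx=1.0, dy=1.0):
    """lam: intensity on a (T, NX, NY) grid; returns (vx, vy) velocity fields."""
    dldt, dldx, dldy = np.gradient(lam, dt, dx, dy)  # finite-difference derivatives
    grad_sq = dldx**2 + dldy**2 + 1e-12
    # a moving contour lam(x(t), t) = c satisfies dldt + grad(lam) . v = 0,
    # so the velocity along the spatial gradient is -dldt * grad(lam) / |grad|^2
    vx = -dldt * dldx / grad_sq
    vy = -dldt * dldy / grad_sq
    return vx, vy
```

Mapping the magnitude sqrt(vx**2 + vy**2) and the direction of (vx, vy) at fixed times gives exactly the kind of velocity maps the abstract describes.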
{"title":"Estimating velocities of infectious disease spread through spatio-temporal log-Gaussian Cox point processes","authors":"Fernando Rodriguez Avellaneda, Jorge Mateu, Paula Moraga","doi":"arxiv-2409.05036","DOIUrl":"https://doi.org/arxiv-2409.05036","url":null,"abstract":"Understanding the spread of infectious diseases such as COVID-19 is crucial\u0000for informed decision-making and resource allocation. A critical component of\u0000disease behavior is the velocity with which disease spreads, defined as the\u0000rate of change between time and space. In this paper, we propose a\u0000spatio-temporal modeling approach to determine the velocities of infectious\u0000disease spread. Our approach assumes that the locations and times of people\u0000infected can be considered as a spatio-temporal point pattern that arises as a\u0000realization of a spatio-temporal log-Gaussian Cox process. The intensity of\u0000this process is estimated using fast Bayesian inference by employing the\u0000integrated nested Laplace approximation (INLA) and the Stochastic Partial\u0000Differential Equations (SPDE) approaches. The velocity is then calculated using\u0000finite differences that approximate the derivatives of the intensity function.\u0000Finally, the directions and magnitudes of the velocities can be mapped at\u0000specific times to examine better the spread of the disease throughout the\u0000region. We demonstrate our method by analyzing COVID-19 spread in Cali,\u0000Colombia, during the 2020-2021 pandemic.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"113 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The exponential growth in data sizes and storage costs has brought considerable challenges to the data science community, requiring solutions to run learning methods on such data. While machine learning has scaled to achieve predictive accuracy in big data settings, statistical inference and uncertainty quantification tools are still lagging. Priority scientific fields collect vast data to understand phenomena typically studied with statistical methods like regression. In this setting, regression parameter estimation can benefit from efficient computational procedures, but the main challenge lies in computing error process parameters with complex covariance structures. Identifying and estimating these structures is essential for inference and often used for uncertainty quantification in machine learning with Gaussian Processes. However, estimating these structures becomes burdensome as data scales, requiring approximations that compromise the reliability of outputs. These approximations are even more unreliable when complexities like long-range dependencies or missing data are present. This work defines and proves the statistical properties of the Generalized Method of Wavelet Moments with Exogenous variables (GMWMX), a highly scalable, stable, and statistically valid method for estimating and delivering inference for linear models using stochastic processes in the presence of data complexities like latent dependence structures and missing data. Applied examples from Earth Sciences and extensive simulations highlight the advantages of the GMWMX.
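For background on the wavelet-moment idea (a sketch under our own conventions, not the GMWMX implementation): the GMWM family matches the empirical Haar wavelet variance of the error process to its model-implied counterpart. Below is the empirical side only, computable in linear time per scale, which is the ingredient that makes the approach scalable; the normalization convention is one of several in use.

```python
# Hedged sketch: empirical Haar wavelet variance, scale by scale.
import numpy as np

def haar_wavelet_variance(x, scales):
    """Mean squared Haar coefficients of x at dyadic scales tau_j = 2**j."""
    x = np.asarray(x, float)
    nu = []
    for j in scales:
        tau = 2 ** j
        # difference of adjacent block means of length tau (up to normalization)
        filt = np.r_[np.ones(tau), -np.ones(tau)] / tau
        w = np.convolve(x, filt, mode="valid")   # Haar wavelet coefficients
        nu.append(np.mean(w ** 2))
    return np.array(nu)

# estimation idea: minimize || nu_hat - nu(theta) ||^2 over model parameters theta,
# where nu(theta) is the wavelet variance implied by the candidate error model
```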
{"title":"Inference for Large Scale Regression Models with Dependent Errors","authors":"Lionel Voirol, Haotian Xu, Yuming Zhang, Luca Insolia, Roberto Molinari, Stéphane Guerrier","doi":"arxiv-2409.05160","DOIUrl":"https://doi.org/arxiv-2409.05160","url":null,"abstract":"The exponential growth in data sizes and storage costs has brought\u0000considerable challenges to the data science community, requiring solutions to\u0000run learning methods on such data. While machine learning has scaled to achieve\u0000predictive accuracy in big data settings, statistical inference and uncertainty\u0000quantification tools are still lagging. Priority scientific fields collect vast\u0000data to understand phenomena typically studied with statistical methods like\u0000regression. In this setting, regression parameter estimation can benefit from\u0000efficient computational procedures, but the main challenge lies in computing\u0000error process parameters with complex covariance structures. Identifying and\u0000estimating these structures is essential for inference and often used for\u0000uncertainty quantification in machine learning with Gaussian Processes.\u0000However, estimating these structures becomes burdensome as data scales,\u0000requiring approximations that compromise the reliability of outputs. These\u0000approximations are even more unreliable when complexities like long-range\u0000dependencies or missing data are present. This work defines and proves the\u0000statistical properties of the Generalized Method of Wavelet Moments with\u0000Exogenous variables (GMWMX), a highly scalable, stable, and statistically valid\u0000method for estimating and delivering inference for linear models using\u0000stochastic processes in the presence of data complexities like latent\u0000dependence structures and missing data. Applied examples from Earth Sciences\u0000and extensive simulations highlight the advantages of the GMWMX.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196640","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}