Shapley values, a game-theoretic concept, have become one of the most popular tools for explaining Machine Learning (ML) models in recent years. Unfortunately, the two most common approaches to calculating Shapley values, conditional and marginal, can lead to different results, along with some undesirable side effects, when features are correlated. This in turn has led to a situation in the literature where different authors give contradictory recommendations about which approach to choose. In this paper we aim to resolve this controversy through the use of causal arguments. We show that the differences arise from the implicit assumptions each method makes to deal with missing causal information. We also demonstrate that the conditional approach is fundamentally unsound from a causal perspective. This, together with previous work in [1], leads to the conclusion that the marginal approach should be preferred over the conditional one.
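As a toy illustration of the distinction discussed above (not taken from the paper), the following Python sketch computes the value of the coalition {x2} for a model that ignores x2, under both approaches; the bivariate Gaussian setup, the correlation of 0.9, and the linear model are assumptions made purely for illustration. The marginal approach gives x2 no credit, whereas the conditional approach does, which is one of the undesirable side effects mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting (assumed purely for illustration): two correlated standard
# Gaussian features with correlation rho, and a model that uses x1 only.
rho = 0.9
f = lambda x1, x2: 2.0 * x1          # the model never looks at x2

x = np.array([1.0, 1.0])             # instance being explained
n = 100_000

# Value of the coalition {x2}: hold x2 at its observed value and average the
# model output over the "absent" feature x1.

# Marginal approach: draw x1 from its marginal N(0, 1), ignoring the correlation.
x1_marg = rng.normal(0.0, 1.0, n)
v_marginal = f(x1_marg, x[1]).mean()        # close to 0: x2 receives no credit

# Conditional approach: draw x1 from p(x1 | x2 = 1) = N(rho * x2, 1 - rho^2).
x1_cond = rng.normal(rho * x[1], np.sqrt(1.0 - rho**2), n)
v_conditional = f(x1_cond, x[1]).mean()     # close to 2 * rho: x2 receives credit

print(v_marginal, v_conditional)
```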
{"title":"Causal Analysis of Shapley Values: Conditional vs. Marginal","authors":"Ilya Rozenfeld","doi":"arxiv-2409.06157","DOIUrl":"https://doi.org/arxiv-2409.06157","url":null,"abstract":"Shapley values, a game theoretic concept, has been one of the most popular\u0000tools for explaining Machine Learning (ML) models in recent years.\u0000Unfortunately, the two most common approaches, conditional and marginal, to\u0000calculating Shapley values can lead to different results along with some\u0000undesirable side effects when features are correlated. This in turn has led to\u0000the situation in the literature where contradictory recommendations regarding\u0000choice of an approach are provided by different authors. In this paper we aim\u0000to resolve this controversy through the use of causal arguments. We show that\u0000the differences arise from the implicit assumptions that are made within each\u0000method to deal with missing causal information. We also demonstrate that the\u0000conditional approach is fundamentally unsound from a causal perspective. This,\u0000together with previous work in [1], leads to the conclusion that the marginal\u0000approach should be preferred over the conditional one.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"192 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196634","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Discrimination measures such as concordance statistics (e.g. the c-index or the concordance probability) and the cumulative-dynamic time-dependent area under the ROC-curve (AUC) are widely used in the medical literature for evaluating the predictive accuracy of a scoring rule which relates a set of prognostic markers to the risk of experiencing a particular event. Often the scoring rule being evaluated in terms of discriminatory ability is the linear predictor of a survival regression model such as the Cox proportional hazards model. This has the undesirable feature that the scoring rule depends on the censoring distribution when the model is misspecified. In this work we focus on linear scoring rules where the coefficient vector is a nonparametric estimand defined in the setting where there is no censoring. We propose so-called debiased estimators of the aforementioned discrimination measures for this class of scoring rules. The proposed estimators make efficient use of the data and minimize bias by allowing for the use of data-adaptive methods for model fitting. Moreover, the estimators do not rely on correct specification of the censoring model to produce consistent estimation. We compare the estimators to existing methods in a simulation study, and we illustrate the method by an application to a brain cancer study.
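For concreteness, a minimal sketch of the uncensored concordance probability for a given scoring rule is shown below; it only illustrates the estimand targeted when there is no censoring, not the paper's debiased estimators, which additionally handle censoring through data-adaptive nuisance models.

```python
import numpy as np

def concordance_probability(score, time):
    """Uncensored concordance probability P(score_i > score_j | time_i < time_j),
    with ties in the score counted as 1/2. This is only the estimand defined in
    the absence of censoring; the debiased estimators that handle censoring are
    not reproduced here."""
    score = np.asarray(score, dtype=float)
    time = np.asarray(time, dtype=float)
    earlier = time[:, None] < time[None, :]            # subject i fails before j
    higher = score[:, None] > score[None, :]
    tied = score[:, None] == score[None, :]
    return (earlier * (higher + 0.5 * tied)).sum() / earlier.sum()

# Example with a linear scoring rule score = X @ beta on simulated data.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
beta = np.array([1.0, -0.5, 0.2])
time = rng.exponential(scale=np.exp(-X @ beta))        # higher score, earlier event
print(concordance_probability(X @ beta, time))
```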
{"title":"Efficient nonparametric estimators of discriminationmeasures with censored survival data","authors":"Marie S. Breum, Torben Martinussen","doi":"arxiv-2409.05632","DOIUrl":"https://doi.org/arxiv-2409.05632","url":null,"abstract":"Discrimination measures such as concordance statistics (e.g. the c-index or\u0000the concordance probability) and the cumulative-dynamic time-dependent area\u0000under the ROC-curve (AUC) are widely used in the medical literature for\u0000evaluating the predictive accuracy of a scoring rule which relates a set of\u0000prognostic markers to the risk of experiencing a particular event. Often the\u0000scoring rule being evaluated in terms of discriminatory ability is the linear\u0000predictor of a survival regression model such as the Cox proportional hazards\u0000model. This has the undesirable feature that the scoring rule depends on the\u0000censoring distribution when the model is misspecified. In this work we focus on\u0000linear scoring rules where the coefficient vector is a nonparametric estimand\u0000defined in the setting where there is no censoring. We propose so-called\u0000debiased estimators of the aforementioned discrimination measures for this\u0000class of scoring rules. The proposed estimators make efficient use of the data\u0000and minimize bias by allowing for the use of data-adaptive methods for model\u0000fitting. Moreover, the estimators do not rely on correct specification of the\u0000censoring model to produce consistent estimation. We compare the estimators to\u0000existing methods in a simulation study, and we illustrate the method by an\u0000application to a brain cancer study.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"42 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196639","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
To characterize the community structure in network data, researchers have introduced various block-type models, including the stochastic block model, degree-corrected stochastic block model, mixed membership block model, degree-corrected mixed membership block model, and others. A critical step in applying these models effectively is determining the number of communities in the network. However, to our knowledge, existing methods for estimating the number of network communities often require model estimation or are unable to simultaneously account for network sparsity and a divergent number of communities. In this paper, we propose an eigengap-ratio-based test that addresses these challenges. The test is straightforward to compute, requires no parameter tuning, and can be applied to a wide range of block models without the need to estimate network distribution parameters. Furthermore, it is effective for both dense and sparse networks with a divergent number of communities. We show that the proposed test statistic converges to a function of the type-I Tracy-Widom distributions under the null hypothesis, and that the test is asymptotically powerful under alternatives. Simulation studies on both dense and sparse networks demonstrate the efficacy of the proposed method. Three real-world examples are presented to illustrate the usefulness of the proposed test.
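A minimal sketch of the eigengap-ratio idea follows; the choice of k_max, the use of the raw adjacency spectrum, and the argmax rule are simplifying assumptions for illustration, and the paper's actual test statistic, scaling, and Tracy-Widom calibration are not reproduced.

```python
import numpy as np

def estimate_num_communities(A, k_max=10):
    """Pick K by the largest ratio of consecutive eigengaps of the adjacency
    spectrum. This is only the intuition behind an eigengap-ratio test; the
    actual statistic, its scaling, and its null calibration differ."""
    vals = np.linalg.eigvalsh(A)                       # A assumed symmetric
    mags = np.sort(np.abs(vals))[::-1]                 # eigenvalue magnitudes, descending
    gaps = mags[:k_max] - mags[1:k_max + 1]            # consecutive eigengaps
    ratios = gaps[:-1] / np.maximum(gaps[1:], 1e-12)   # gap ratios
    return int(np.argmax(ratios)) + 1

# Example: a small two-block stochastic block model.
rng = np.random.default_rng(0)
n_nodes = 200
labels = rng.integers(0, 2, n_nodes)
P = np.where(labels[:, None] == labels[None, :], 0.20, 0.05)
A = (rng.random((n_nodes, n_nodes)) < P).astype(float)
A = np.triu(A, 1)
A = A + A.T
print(estimate_num_communities(A))                    # expected: 2 for this setup
```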
{"title":"An Eigengap Ratio Test for Determining the Number of Communities in Network Data","authors":"Yujia Wu, Jingfei Zhang, Wei Lan, Chih-Ling Tsai","doi":"arxiv-2409.05276","DOIUrl":"https://doi.org/arxiv-2409.05276","url":null,"abstract":"To characterize the community structure in network data, researchers have\u0000introduced various block-type models, including the stochastic block model,\u0000degree-corrected stochastic block model, mixed membership block model,\u0000degree-corrected mixed membership block model, and others. A critical step in\u0000applying these models effectively is determining the number of communities in\u0000the network. However, to our knowledge, existing methods for estimating the\u0000number of network communities often require model estimations or are unable to\u0000simultaneously account for network sparsity and a divergent number of\u0000communities. In this paper, we propose an eigengap-ratio based test that\u0000address these challenges. The test is straightforward to compute, requires no\u0000parameter tuning, and can be applied to a wide range of block models without\u0000the need to estimate network distribution parameters. Furthermore, it is\u0000effective for both dense and sparse networks with a divergent number of\u0000communities. We show that the proposed test statistic converges to a function\u0000of the type-I Tracy-Widom distributions under the null hypothesis, and that the\u0000test is asymptotically powerful under alternatives. Simulation studies on both\u0000dense and sparse networks demonstrate the efficacy of the proposed method.\u0000Three real-world examples are presented to illustrate the usefulness of the\u0000proposed test.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"25 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196636","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The uptake of formalized prior elicitation from experts in Bayesian clinical trials has been limited, largely due to the challenges associated with complex statistical modeling, the lack of practical tools, and the cognitive burden on experts required to quantify their uncertainty using probabilistic language. Additionally, existing methods do not address prior-posterior coherence, i.e., does the posterior distribution, obtained mathematically from combining the estimated prior with the trial data, reflect the expert's actual posterior beliefs? We propose a new elicitation approach that seeks to ensure prior-posterior coherence and reduce the expert's cognitive burden. This is achieved by eliciting responses about the expert's envisioned posterior judgments under various potential data outcomes and inferring the prior distribution by minimizing the discrepancies between these responses and the expected responses obtained from the posterior distribution. The feasibility and potential value of the new approach are illustrated through an application to a real trial currently underway.
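A hedged sketch of the general idea in a deliberately simple Beta-Binomial setting is given below; the hypothetical outcomes, the envisioned posterior means, and the squared-error discrepancy are illustrative assumptions, not the paper's elicitation protocol.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical elicitation for a response-rate parameter theta (all numbers below
# are illustrative assumptions): the expert states the posterior mean they would
# envision after observing y responses out of n = 20 patients, for several
# hypothetical outcomes y.
n = 20
scenario_y = np.array([2, 5, 10, 15])
envisioned_means = np.array([0.18, 0.30, 0.50, 0.68])

def discrepancy(log_ab):
    """Squared distance between the expert's envisioned posterior means and the
    posterior means implied by a Beta(a, b) prior combined with Binomial data."""
    a, b = np.exp(log_ab)                              # positivity via the log scale
    implied = (a + scenario_y) / (a + b + n)           # Beta-Binomial posterior means
    return np.sum((implied - envisioned_means) ** 2)

res = minimize(discrepancy, x0=np.log([1.0, 1.0]), method="Nelder-Mead")
a_hat, b_hat = np.exp(res.x)
print(f"implied prior: Beta({a_hat:.2f}, {b_hat:.2f})")
```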
{"title":"Priors from Envisioned Posterior Judgments: A Novel Elicitation Approach With Application to Bayesian Clinical Trials","authors":"Yongdong Ouyang, Janice J Eng, Denghuang Zhan, Hubert Wong","doi":"arxiv-2409.05271","DOIUrl":"https://doi.org/arxiv-2409.05271","url":null,"abstract":"The uptake of formalized prior elicitation from experts in Bayesian clinical\u0000trials has been limited, largely due to the challenges associated with complex\u0000statistical modeling, the lack of practical tools, and the cognitive burden on\u0000experts required to quantify their uncertainty using probabilistic language.\u0000Additionally, existing methods do not address prior-posterior coherence, i.e.,\u0000does the posterior distribution, obtained mathematically from combining the\u0000estimated prior with the trial data, reflect the expert's actual posterior\u0000beliefs? We propose a new elicitation approach that seeks to ensure\u0000prior-posterior coherence and reduce the expert's cognitive burden. This is\u0000achieved by eliciting responses about the expert's envisioned posterior\u0000judgments under various potential data outcomes and inferring the prior\u0000distribution by minimizing the discrepancies between these responses and the\u0000expected responses obtained from the posterior distribution. The feasibility\u0000and potential value of the new approach are illustrated through an application\u0000to a real trial currently underway.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"170 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196638","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper introduces the Inside-Out Nested Particle Filter (IO-NPF), a novel, fully recursive algorithm for amortized sequential Bayesian experimental design in the non-exchangeable setting. We frame policy optimization as maximum likelihood estimation in a non-Markovian state-space model, achieving (at most) $\mathcal{O}(T^2)$ computational complexity in the number of experiments. We provide theoretical convergence guarantees and introduce a backward sampling algorithm to reduce trajectory degeneracy. The IO-NPF offers a practical, extensible, and provably consistent approach to sequential Bayesian experimental design, demonstrating improved efficiency over existing methods.
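For readers unfamiliar with the building block, here is a minimal bootstrap particle filter in Python; it is only the standard sequential Monte Carlo ingredient, not the IO-NPF itself, which nests two such filters and adds the backward sampling step described above. The linear-Gaussian example is purely illustrative.

```python
import numpy as np

def bootstrap_particle_filter(y, n_particles, transition, likelihood, init, seed=0):
    """Minimal bootstrap particle filter: propagate, weight, resample, and
    accumulate the log-evidence. This is only the standard SMC building block;
    the nested structure and backward sampling of the IO-NPF are not shown."""
    rng = np.random.default_rng(seed)
    particles = init(rng, n_particles)
    log_evidence = 0.0
    for obs in y:
        particles = transition(rng, particles)                  # propagate
        logw = likelihood(obs, particles)                        # log-weights
        m = logw.max()
        w = np.exp(logw - m)
        log_evidence += m + np.log(w.mean())
        idx = rng.choice(n_particles, size=n_particles, p=w / w.sum())
        particles = particles[idx]                               # multinomial resampling
    return particles, log_evidence

# Tiny linear-Gaussian random-walk example.
rng = np.random.default_rng(1)
x_true = np.cumsum(rng.normal(0.0, 1.0, 50))
y_obs = x_true + rng.normal(0.0, 0.5, 50)
_, logZ = bootstrap_particle_filter(
    y_obs, 500,
    transition=lambda r, p: p + r.normal(0.0, 1.0, p.shape),
    likelihood=lambda o, p: -0.5 * ((o - p) / 0.5) ** 2,
    init=lambda r, k: r.normal(0.0, 1.0, k),
)
print(logZ)
```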
{"title":"Recursive Nested Filtering for Efficient Amortized Bayesian Experimental Design","authors":"Sahel Iqbal, Hany Abdulsamad, Sara Pérez-Vieites, Simo Särkkä, Adrien Corenflos","doi":"arxiv-2409.05354","DOIUrl":"https://doi.org/arxiv-2409.05354","url":null,"abstract":"This paper introduces the Inside-Out Nested Particle Filter (IO-NPF), a\u0000novel, fully recursive, algorithm for amortized sequential Bayesian\u0000experimental design in the non-exchangeable setting. We frame policy\u0000optimization as maximum likelihood estimation in a non-Markovian state-space\u0000model, achieving (at most) $mathcal{O}(T^2)$ computational complexity in the\u0000number of experiments. We provide theoretical convergence guarantees and\u0000introduce a backward sampling algorithm to reduce trajectory degeneracy. IO-NPF\u0000offers a practical, extensible, and provably consistent approach to sequential\u0000Bayesian experimental design, demonstrating improved efficiency over existing\u0000methods.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"25 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196668","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In functional MRI (fMRI), effective connectivity analysis aims at inferring the causal influences that brain regions exert on one another. A common method for this type of analysis is structural equation modeling (SEM). Here we propose a novel method to test the validity of a given structural equation model. Given a structural model in the form of a directed graph, the method extracts the set of all constraints of conditional independence induced by the absence of links between pairs of regions in the model and tests their validity in a Bayesian framework, either individually (constraint by constraint), jointly (e.g., by gathering all constraints associated with a given missing link), or globally (i.e., all constraints associated with the structural model). This approach has two main advantages. First, it tests only what is testable from observational data and does not allow for false causal interpretation. Second, it makes it possible to test each constraint (or group of constraints) separately and, therefore, to quantify to what extent each constraint (or, e.g., missing link) is respected in the data. We validate our approach using a simulation study and illustrate its potential benefits through the reanalysis of published data.
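As a simplified illustration (assuming Gaussian data and a frequentist partial-correlation test rather than the paper's Bayesian treatment), the sketch below tests the single conditional-independence constraint induced by the missing link in a toy chain model X1 -> X2 -> X3.

```python
import numpy as np
from scipy import stats

def partial_corr_pvalue(data, i, j, cond):
    """p-value for the constraint X_i _||_ X_j | X_cond, via the Fisher
    z-transform of the partial correlation (a frequentist stand-in for the
    Bayesian evaluation of such constraints)."""
    idx = [i, j] + list(cond)
    prec = np.linalg.inv(np.cov(data[:, idx], rowvar=False))
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])
    n_obs, k = data.shape[0], len(cond)
    z = np.arctanh(r) * np.sqrt(n_obs - k - 3)
    return 2.0 * stats.norm.sf(abs(z))

# Toy structural model X1 -> X2 -> X3: the missing link X1 -/- X3 induces the
# single constraint X1 _||_ X3 | X2, tested here on simulated data.
rng = np.random.default_rng(2)
x1 = rng.normal(size=2000)
x2 = 0.8 * x1 + rng.normal(size=2000)
x3 = 0.7 * x2 + rng.normal(size=2000)
data = np.column_stack([x1, x2, x3])
print(partial_corr_pvalue(data, 0, 2, [1]))   # the constraint holds in this model
```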
{"title":"Multilevel testing of constraints induced by structural equation modeling in fMRI effective connectivity analysis: A proof of concept","authors":"G. Marrelec, A. Giron","doi":"arxiv-2409.05630","DOIUrl":"https://doi.org/arxiv-2409.05630","url":null,"abstract":"In functional MRI (fMRI), effective connectivity analysis aims at inferring\u0000the causal influences that brain regions exert on one another. A common method\u0000for this type of analysis is structural equation modeling (SEM). We here\u0000propose a novel method to test the validity of a given model of structural\u0000equation. Given a structural model in the form of a directed graph, the method\u0000extracts the set of all constraints of conditional independence induced by the\u0000absence of links between pairs of regions in the model and tests for their\u0000validity in a Bayesian framework, either individually (constraint by\u0000constraint), jointly (e.g., by gathering all constraints associated with a\u0000given missing link), or globally (i.e., all constraints associated with the\u0000structural model). This approach has two main advantages. First, it only tests\u0000what is testable from observational data and does allow for false causal\u0000interpretation. Second, it makes it possible to test each constraint (or group\u0000of constraints) separately and, therefore, quantify in what measure each\u0000constraint (or, e..g., missing link) is respected in the data. We validate our\u0000approach using a simulation study and illustrate its potential benefits through\u0000the reanalysis of published data.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"67 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196665","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We consider time-series forecasting problems where data is scarce, difficult to gather, or induces a prohibitive computational cost. As a first attempt, we focus on short-term electricity consumption in France, which is of strategic importance for energy suppliers and public stakeholders. The complexity of this problem and the many levels of geospatial granularity motivate the use of an ensemble of Gaussian Processes (GPs). Whilst GPs are remarkable predictors, they are computationally expensive to train, which calls for a frugal few-shot learning approach. By taking into account the performance of GPs trained on a dataset and designing a random walk over these, we mitigate the training cost of our entire Bayesian decision-making procedure. We introduce our algorithm, called \textsc{Domino} (ranDOM walk on gaussIaN prOcesses), and present numerical experiments to support its merits.
{"title":"Predicting Electricity Consumption with Random Walks on Gaussian Processes","authors":"Chloé Hashimoto-Cullen, Benjamin Guedj","doi":"arxiv-2409.05934","DOIUrl":"https://doi.org/arxiv-2409.05934","url":null,"abstract":"We consider time-series forecasting problems where data is scarce, difficult\u0000to gather, or induces a prohibitive computational cost. As a first attempt, we\u0000focus on short-term electricity consumption in France, which is of strategic\u0000importance for energy suppliers and public stakeholders. The complexity of this\u0000problem and the many levels of geospatial granularity motivate the use of an\u0000ensemble of Gaussian Processes (GPs). Whilst GPs are remarkable predictors,\u0000they are computationally expensive to train, which calls for a frugal few-shot\u0000learning approach. By taking into account performance on GPs trained on a\u0000dataset and designing a random walk on these, we mitigate the training cost of\u0000our entire Bayesian decision-making procedure. We introduce our algorithm\u0000called textsc{Domino} (ranDOM walk on gaussIaN prOcesses) and present\u0000numerical experiments to support its merits.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"16 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196637","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Many estimators of the variance of the well-known unbiased and uniformly most powerful estimator $\hat{\theta}$ of the Mann-Whitney effect, $\theta = P(X < Y) + \frac{1}{2} P(X = Y)$, are considered in the literature. Some of these estimators are valid only in the case of no ties, or are biased for small sample sizes, where the amount of the bias is not discussed. Here we derive an unbiased estimator that is based on different rankings, the so-called 'placements' (Orban and Wolfe, 1980), and is therefore easy to compute. This estimator does not require the assumption of continuous distribution functions and is also valid in the case of ties. Moreover, it is shown that this estimator is non-negative and has a sharp upper bound which may be considered an empirical version of the well-known Birnbaum-Klose inequality. The derivation of this estimator provides an option to compute the biases of some commonly used estimators in the literature. Simulations demonstrate that, for small sample sizes, the biases of these estimators depend on the underlying distribution functions and thus are not under control. This means that, in the case of a biased estimator, simulation results for the type-I error of a test or the coverage probability of a confidence interval depend not only on the quality of the normal approximation of $\hat{\theta}$ but also on an additional unknown bias caused by the variance estimator. Finally, it is shown that this estimator is $L_2$-consistent.
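For context, the sketch below computes the placement-based point estimate and a classical DeLong/Sen-type variance estimate; the exactly unbiased variant derived in the paper refines this construction, and its precise formula is not reproduced here.

```python
import numpy as np

def mann_whitney_effect(x, y):
    """Point estimate of theta = P(X < Y) + 0.5 * P(X = Y) and a placement-based
    variance estimate of the classical DeLong/Sen form; the paper's exactly
    unbiased estimator is a refinement of this construction."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n1, n2 = len(x), len(y)
    # Pairwise kernel with ties counted as 1/2.
    h = (x[:, None] < y[None, :]) + 0.5 * (x[:, None] == y[None, :])
    theta_hat = h.mean()
    p = h.mean(axis=1)                        # placements of the X's among the Y's
    q = h.mean(axis=0)                        # placements of the Y's among the X's
    var_hat = p.var(ddof=1) / n1 + q.var(ddof=1) / n2
    return theta_hat, var_hat

# Example with ties.
x = np.array([1, 2, 2, 3, 5, 5, 7])
y = np.array([2, 4, 4, 5, 6, 8])
print(mann_whitney_effect(x, y))
```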
{"title":"An unbiased rank-based estimator of the Mann-Whitney variance including the case of ties","authors":"Edgar Brunner, Frank Konietschke","doi":"arxiv-2409.05038","DOIUrl":"https://doi.org/arxiv-2409.05038","url":null,"abstract":"Many estimators of the variance of the well-known unbiased and uniform most\u0000powerful estimator $htheta$ of the Mann-Whitney effect, $theta = P(X < Y) +\u0000nfrac12 P(X=Y)$, are considered in the literature. Some of these estimators\u0000are only valid in case of no ties or are biased in case of small sample sizes\u0000where the amount of the bias is not discussed. Here we derive an unbiased\u0000estimator that is based on different rankings, the so-called 'placements'\u0000(Orban and Wolfe, 1980), and is therefore easy to compute. This estimator does\u0000not require the assumption of continuous dfs and is also valid in the case of\u0000ties. Moreover, it is shown that this estimator is non-negative and has a sharp\u0000upper bound which may be considered an empirical version of the well-known\u0000Birnbaum-Klose inequality. The derivation of this estimator provides an option\u0000to compute the biases of some commonly used estimators in the literature.\u0000Simulations demonstrate that, for small sample sizes, the biases of these\u0000estimators depend on the underlying dfs and thus are not under control. This\u0000means that in the case of a biased estimator, simulation results for the type-I\u0000error of a test or the coverage probability of a ci do not only depend on the\u0000quality of the approximation of $htheta$ by a normal db but also an\u0000additional unknown bias caused by the variance estimator. Finally, it is shown\u0000that this estimator is $L_2$-consistent.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196642","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Understanding the spread of infectious diseases such as COVID-19 is crucial for informed decision-making and resource allocation. A critical component of disease behavior is the velocity with which the disease spreads, defined as the rate of change between time and space. In this paper, we propose a spatio-temporal modeling approach to determine the velocities of infectious disease spread. Our approach assumes that the locations and times of infected individuals can be considered as a spatio-temporal point pattern that arises as a realization of a spatio-temporal log-Gaussian Cox process. The intensity of this process is estimated using fast Bayesian inference, employing the integrated nested Laplace approximation (INLA) and the Stochastic Partial Differential Equations (SPDE) approaches. The velocity is then calculated using finite differences that approximate the derivatives of the intensity function. Finally, the directions and magnitudes of the velocities can be mapped at specific times to better examine the spread of the disease throughout the region. We demonstrate our method by analyzing COVID-19 spread in Cali, Colombia, during the 2020-2021 pandemic.
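A minimal sketch of the finite-difference step is given below; the level-set convention used to turn the temporal and spatial derivatives of the intensity into a velocity vector is an illustrative assumption, and the INLA-SPDE estimation of the intensity itself is not shown.

```python
import numpy as np

def spread_velocity(intensity, dt, dy, dx):
    """Finite-difference velocity field from an intensity grid lambda[t, y, x].

    Uses the level-set convention v = -(dlambda/dt) * grad(lambda) / |grad(lambda)|^2,
    i.e., the speed at which iso-intensity contours move; this is an illustrative
    convention, not necessarily the paper's exact definition."""
    dl_dt = np.gradient(intensity, dt, axis=0)
    dl_dy = np.gradient(intensity, dy, axis=1)
    dl_dx = np.gradient(intensity, dx, axis=2)
    grad_sq = dl_dx ** 2 + dl_dy ** 2 + 1e-12            # avoid division by zero
    vx = -dl_dt * dl_dx / grad_sq
    vy = -dl_dt * dl_dy / grad_sq
    speed = np.sqrt(vx ** 2 + vy ** 2)
    direction = np.arctan2(vy, vx)                        # angle of the velocity vector
    return speed, direction

# Example on a synthetic intensity surface that drifts eastwards over time.
t, yy, xx = np.meshgrid(np.arange(10.0), np.arange(50.0), np.arange(50.0), indexing="ij")
intensity = np.exp(-((xx - 2.0 * t - 10.0) ** 2 + (yy - 25.0) ** 2) / 50.0)
speed, direction = spread_velocity(intensity, dt=1.0, dy=1.0, dx=1.0)
print(speed[5, 25, 30], direction[5, 25, 30])             # speed ~2, direction ~0 rad (eastward)
```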
{"title":"Estimating velocities of infectious disease spread through spatio-temporal log-Gaussian Cox point processes","authors":"Fernando Rodriguez Avellaneda, Jorge Mateu, Paula Moraga","doi":"arxiv-2409.05036","DOIUrl":"https://doi.org/arxiv-2409.05036","url":null,"abstract":"Understanding the spread of infectious diseases such as COVID-19 is crucial\u0000for informed decision-making and resource allocation. A critical component of\u0000disease behavior is the velocity with which disease spreads, defined as the\u0000rate of change between time and space. In this paper, we propose a\u0000spatio-temporal modeling approach to determine the velocities of infectious\u0000disease spread. Our approach assumes that the locations and times of people\u0000infected can be considered as a spatio-temporal point pattern that arises as a\u0000realization of a spatio-temporal log-Gaussian Cox process. The intensity of\u0000this process is estimated using fast Bayesian inference by employing the\u0000integrated nested Laplace approximation (INLA) and the Stochastic Partial\u0000Differential Equations (SPDE) approaches. The velocity is then calculated using\u0000finite differences that approximate the derivatives of the intensity function.\u0000Finally, the directions and magnitudes of the velocities can be mapped at\u0000specific times to examine better the spread of the disease throughout the\u0000region. We demonstrate our method by analyzing COVID-19 spread in Cali,\u0000Colombia, during the 2020-2021 pandemic.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"113 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The exponential growth in data sizes and storage costs has brought considerable challenges to the data science community, requiring solutions to run learning methods on such data. While machine learning has scaled to achieve predictive accuracy in big data settings, statistical inference and uncertainty quantification tools are still lagging. Priority scientific fields collect vast amounts of data to understand phenomena typically studied with statistical methods like regression. In this setting, regression parameter estimation can benefit from efficient computational procedures, but the main challenge lies in computing error-process parameters with complex covariance structures. Identifying and estimating these structures is essential for inference and is often used for uncertainty quantification in machine learning with Gaussian Processes. However, estimating these structures becomes burdensome as data scales, requiring approximations that compromise the reliability of outputs. These approximations are even more unreliable when complexities like long-range dependencies or missing data are present. This work defines the Generalized Method of Wavelet Moments with Exogenous variables (GMWMX), a highly scalable, stable, and statistically valid method for estimating and delivering inference for linear models using stochastic processes in the presence of data complexities such as latent dependence structures and missing data, and establishes its statistical properties. Applied examples from Earth Sciences and extensive simulations highlight the advantages of the GMWMX.
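To make the moment-matching ingredient concrete, the sketch below computes an empirical Allan-type (Haar wavelet) variance at dyadic scales on detrended data; normalization conventions vary, and the GMWMX machinery for exogenous variables, latent dependence, and missing data is not reproduced.

```python
import numpy as np

def allan_type_variance(x, max_scale=6):
    """Empirical Allan-type (Haar wavelet) variance at dyadic scales tau = 2^j:
    half the mean squared difference between adjacent block means of length tau.
    Normalization conventions vary; GMWM-type estimators match such empirical
    wavelet-variance moments to their model-implied counterparts."""
    x = np.asarray(x, dtype=float)
    out = {}
    for j in range(1, max_scale + 1):
        tau = 2 ** j
        if len(x) < 2 * tau:
            break
        kernel = np.concatenate([np.full(tau, 1.0 / tau), np.full(tau, -1.0 / tau)])
        w = np.convolve(x, kernel, mode="valid")      # differences of adjacent block means
        out[tau] = 0.5 * np.mean(w ** 2)
    return out

# Example: AR(1) errors plus a linear (exogenous) trend; the variance is computed
# on the residual process after removing the fitted trend.
rng = np.random.default_rng(3)
n = 4000
t = np.arange(n)
eps = np.zeros(n)
for i in range(1, n):
    eps[i] = 0.8 * eps[i - 1] + rng.normal()
y = 0.001 * t + eps
residuals = y - np.polyval(np.polyfit(t, y, 1), t)
print(allan_type_variance(residuals))
```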
{"title":"Inference for Large Scale Regression Models with Dependent Errors","authors":"Lionel Voirol, Haotian Xu, Yuming Zhang, Luca Insolia, Roberto Molinari, Stéphane Guerrier","doi":"arxiv-2409.05160","DOIUrl":"https://doi.org/arxiv-2409.05160","url":null,"abstract":"The exponential growth in data sizes and storage costs has brought\u0000considerable challenges to the data science community, requiring solutions to\u0000run learning methods on such data. While machine learning has scaled to achieve\u0000predictive accuracy in big data settings, statistical inference and uncertainty\u0000quantification tools are still lagging. Priority scientific fields collect vast\u0000data to understand phenomena typically studied with statistical methods like\u0000regression. In this setting, regression parameter estimation can benefit from\u0000efficient computational procedures, but the main challenge lies in computing\u0000error process parameters with complex covariance structures. Identifying and\u0000estimating these structures is essential for inference and often used for\u0000uncertainty quantification in machine learning with Gaussian Processes.\u0000However, estimating these structures becomes burdensome as data scales,\u0000requiring approximations that compromise the reliability of outputs. These\u0000approximations are even more unreliable when complexities like long-range\u0000dependencies or missing data are present. This work defines and proves the\u0000statistical properties of the Generalized Method of Wavelet Moments with\u0000Exogenous variables (GMWMX), a highly scalable, stable, and statistically valid\u0000method for estimating and delivering inference for linear models using\u0000stochastic processes in the presence of data complexities like latent\u0000dependence structures and missing data. Applied examples from Earth Sciences\u0000and extensive simulations highlight the advantages of the GMWMX.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196640","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}