Identifying dynamical systems (DS) is a vital task in science and engineering. Traditional methods require numerous calls to the DS solver, rendering likelihood-based or least-squares inference frameworks impractical. For efficient parameter inference, two state-of-the-art techniques are the kernel method for modeling and the "one-step framework" for jointly inferring unknown parameters and hyperparameters. The kernel method is a quick and straightforward technique, but it cannot guarantee that the estimated solutions and their derivatives strictly adhere to physical laws. We propose a model-embedded "one-step" Bayesian framework for joint inference of unknown parameters and hyperparameters by maximizing the marginal likelihood. This approach models the solution and its derivatives using Gaussian process regression (GPR), taking into account smoothness and continuity properties, and treats differential equations as constraints that can be naturally integrated into the Bayesian framework in the linear case. Additionally, we prove the convergence of the model-embedded Gaussian process regression (ME-GPR) to support the theoretical development. Motivated by the Taylor expansion, we introduce a piecewise first-order linearization strategy to handle nonlinear dynamical systems. We derive estimates and confidence intervals, demonstrating that they exhibit low bias and good coverage properties for both simulated models and real data.
{"title":"Model-Embedded Gaussian Process Regression for Parameter Estimation in Dynamical System","authors":"Ying Zhou, Jinglai Li, Xiang Zhou, Hongqiao Wang","doi":"arxiv-2409.11745","DOIUrl":"https://doi.org/arxiv-2409.11745","url":null,"abstract":"Identifying dynamical system (DS) is a vital task in science and engineering.\u0000Traditional methods require numerous calls to the DS solver, rendering\u0000likelihood-based or least-squares inference frameworks impractical. For\u0000efficient parameter inference, two state-of-the-art techniques are the kernel\u0000method for modeling and the \"one-step framework\" for jointly inferring unknown\u0000parameters and hyperparameters. The kernel method is a quick and\u0000straightforward technique, but it cannot estimate solutions and their\u0000derivatives, which must strictly adhere to physical laws. We propose a\u0000model-embedded \"one-step\" Bayesian framework for joint inference of unknown\u0000parameters and hyperparameters by maximizing the marginal likelihood. This\u0000approach models the solution and its derivatives using Gaussian process\u0000regression (GPR), taking into account smoothness and continuity properties, and\u0000treats differential equations as constraints that can be naturally integrated\u0000into the Bayesian framework in the linear case. Additionally, we prove the\u0000convergence of the model-embedded Gaussian process regression (ME-GPR) for\u0000theoretical development. Motivated by Taylor expansion, we introduce a\u0000piecewise first-order linearization strategy to handle nonlinear dynamic\u0000systems. We derive estimates and confidence intervals, demonstrating that they\u0000exhibit low bias and good coverage properties for both simulated models and\u0000real data.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":"28 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142268547","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Juliette Mukangango, Amanda Muyskens, Benjamin W. Priest
Gaussian Process (GP) regression is a flexible modeling technique used to predict outputs and to capture uncertainty in the predictions. However, GP regression becomes computationally intensive when the training spatial dataset has a large number of observations. To address this challenge, we introduce a scalable GP algorithm, termed MuyGPs, which incorporates nearest-neighbor and leave-one-out cross-validation during training. This approach enables the evaluation of large spatial datasets with state-of-the-art accuracy and speed in certain spatial problems. Despite these advantages, conventional quadratic loss functions used in MuyGPs optimization, such as the Root Mean Squared Error (RMSE), are highly influenced by outliers. We explore the behavior of MuyGPs in cases involving outlying observations and subsequently develop a robust approach to handle and mitigate their impact. Specifically, we introduce a novel leave-one-out loss function based on the pseudo-Huber function (LOOPH) that effectively accounts for outliers in large spatial datasets within the MuyGPs framework. Our simulation study shows that the LOOPH loss maintains accuracy despite outlying observations, establishing MuyGPs as a powerful tool for mitigating the impact of unusual observations in the large-data regime. In an analysis of U.S. ozone data, MuyGPs provides accurate predictions and uncertainty quantification, demonstrating its utility in managing data anomalies. Through these efforts, we advance the understanding of GP regression in spatial contexts.
{"title":"A Robust Approach to Gaussian Processes Implementation","authors":"Juliette Mukangango, Amanda Muyskens, Benjamin W. Priest","doi":"arxiv-2409.11577","DOIUrl":"https://doi.org/arxiv-2409.11577","url":null,"abstract":"Gaussian Process (GP) regression is a flexible modeling technique used to\u0000predict outputs and to capture uncertainty in the predictions. However, the GP\u0000regression process becomes computationally intensive when the training spatial\u0000dataset has a large number of observations. To address this challenge, we\u0000introduce a scalable GP algorithm, termed MuyGPs, which incorporates nearest\u0000neighbor and leave-one-out cross-validation during training. This approach\u0000enables the evaluation of large spatial datasets with state-of-the-art accuracy\u0000and speed in certain spatial problems. Despite these advantages, conventional\u0000quadratic loss functions used in the MuyGPs optimization such as Root Mean\u0000Squared Error(RMSE), are highly influenced by outliers. We explore the behavior\u0000of MuyGPs in cases involving outlying observations, and subsequently, develop a\u0000robust approach to handle and mitigate their impact. Specifically, we introduce\u0000a novel leave-one-out loss function based on the pseudo-Huber function (LOOPH)\u0000that effectively accounts for outliers in large spatial datasets within the\u0000MuyGPs framework. Our simulation study shows that the \"LOOPH\" loss method\u0000maintains accuracy despite outlying observations, establishing MuyGPs as a\u0000powerful tool for mitigating unusual observation impacts in the large data\u0000regime. In the analysis of U.S. ozone data, MuyGPs provides accurate\u0000predictions and uncertainty quantification, demonstrating its utility in\u0000managing data anomalies. Through these efforts, we advance the understanding of\u0000GP regression in spatial contexts.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":"15 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142251169","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Anton Lebedev, Annika Möslein, Olha I. Yaman, Del Rajan, Philip Intallura
In this paper we show how different sources of random numbers influence the outcomes of Monte Carlo simulations. We compare industry-standard pseudo-random number generators (PRNGs) to a quantum random number generator (QRNG) and show, using examples of Monte Carlo simulations with exact solutions, that the QRNG yields statistically significantly better approximations than the PRNGs. Our results demonstrate that higher accuracy can be achieved in the commonly known Monte Carlo method for approximating $\pi$. For Buffon's needle experiment, we further quantify a potential reduction in approximation errors by up to $1.89\times$ for optimal parameter choices when using a QRNG and a reduction of the sample size by $\sim 8\times$ for sub-optimal parameter choices. We attribute the observed higher accuracy to the underlying differences in the random sampling, where a uniformity analysis reveals a tendency of the QRNG to sample the solution space more homogeneously. Additionally, we compare the results obtained with the QRNG and PRNG in solving the non-linear stochastic Schrödinger equation, benchmarked against the analytical solution. We observe higher accuracy of the approximations of the QRNG and demonstrate that equivalent results can be achieved at 1/3 to 1/10-th of the costs.
{"title":"Effects of the entropy source on Monte Carlo simulations","authors":"Anton Lebedev, Annika Möslein, Olha I. Yaman, Del Rajan, Philip Intallura","doi":"arxiv-2409.11539","DOIUrl":"https://doi.org/arxiv-2409.11539","url":null,"abstract":"In this paper we show how different sources of random numbers influence the\u0000outcomes of Monte Carlo simulations. We compare industry-standard pseudo-random\u0000number generators (PRNGs) to a quantum random number generator (QRNG) and show,\u0000using examples of Monte Carlo simulations with exact solutions, that the QRNG\u0000yields statistically significantly better approximations than the PRNGs. Our\u0000results demonstrate that higher accuracy can be achieved in the commonly known\u0000Monte Carlo method for approximating $pi$. For Buffon's needle experiment, we\u0000further quantify a potential reduction in approximation errors by up to\u0000$1.89times$ for optimal parameter choices when using a QRNG and a reduction of\u0000the sample size by $sim 8times$ for sub-optimal parameter choices. We\u0000attribute the observed higher accuracy to the underlying differences in the\u0000random sampling, where a uniformity analysis reveals a tendency of the QRNG to\u0000sample the solution space more homogeneously. Additionally, we compare the\u0000results obtained with the QRNG and PRNG in solving the non-linear stochastic\u0000Schr\"odinger equation, benchmarked against the analytical solution. We observe\u0000higher accuracy of the approximations of the QRNG and demonstrate that\u0000equivalent results can be achieved at 1/3 to 1/10-th of the costs.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":"53 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142251168","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tingwei Meng, Zongren Zou, Jérôme Darbon, George Em Karniadakis
The interplay between stochastic processes and optimal control has been extensively explored in the literature. With the recent surge in the use of diffusion models, stochastic processes have increasingly been applied to sample generation. This paper builds on the log transform, known as the Cole-Hopf transform in Brownian motion contexts, and extends it within a more abstract framework that includes a linear operator. Within this framework, we find that the well-known relationship between the Cole-Hopf transform and optimal transport is a particular instance where the linear operator acts as the infinitesimal generator of a stochastic process. We also introduce a novel scenario where the linear operator is the adjoint of the generator, linking to Bayesian inference under specific initial and terminal conditions. Leveraging this theoretical foundation, we develop a new algorithm, named the HJ-sampler, for Bayesian inference for the inverse problem of a stochastic differential equation with given terminal observations. The HJ-sampler involves two stages: (1) solving the viscous Hamilton-Jacobi partial differential equations, and (2) sampling from the associated stochastic optimal control problem. Our proposed algorithm naturally allows for flexibility in selecting the numerical solver for viscous HJ PDEs. We introduce two variants of the solver: the Riccati-HJ-sampler, based on the Riccati method, and the SGM-HJ-sampler, which utilizes diffusion models. We demonstrate the effectiveness and flexibility of the proposed methods by applying them to solve Bayesian inverse problems involving various stochastic processes and prior distributions, including applications that address model misspecification and quantify model uncertainty.
{"title":"HJ-sampler: A Bayesian sampler for inverse problems of a stochastic process by leveraging Hamilton-Jacobi PDEs and score-based generative models","authors":"Tingwei Meng, Zongren Zou, Jérôme Darbon, George Em Karniadakis","doi":"arxiv-2409.09614","DOIUrl":"https://doi.org/arxiv-2409.09614","url":null,"abstract":"The interplay between stochastic processes and optimal control has been\u0000extensively explored in the literature. With the recent surge in the use of\u0000diffusion models, stochastic processes have increasingly been applied to sample\u0000generation. This paper builds on the log transform, known as the Cole-Hopf\u0000transform in Brownian motion contexts, and extends it within a more abstract\u0000framework that includes a linear operator. Within this framework, we found that\u0000the well-known relationship between the Cole-Hopf transform and optimal\u0000transport is a particular instance where the linear operator acts as the\u0000infinitesimal generator of a stochastic process. We also introduce a novel\u0000scenario where the linear operator is the adjoint of the generator, linking to\u0000Bayesian inference under specific initial and terminal conditions. Leveraging\u0000this theoretical foundation, we develop a new algorithm, named the HJ-sampler,\u0000for Bayesian inference for the inverse problem of a stochastic differential\u0000equation with given terminal observations. The HJ-sampler involves two stages:\u0000(1) solving the viscous Hamilton-Jacobi partial differential equations, and (2)\u0000sampling from the associated stochastic optimal control problem. Our proposed\u0000algorithm naturally allows for flexibility in selecting the numerical solver\u0000for viscous HJ PDEs. We introduce two variants of the solver: the\u0000Riccati-HJ-sampler, based on the Riccati method, and the SGM-HJ-sampler, which\u0000utilizes diffusion models. We demonstrate the effectiveness and flexibility of\u0000the proposed methods by applying them to solve Bayesian inverse problems\u0000involving various stochastic processes and prior distributions, including\u0000applications that address model misspecifications and quantifying model\u0000uncertainty.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":"171 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142251171","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shape graphs are complex geometrical structures commonly found in biological and anatomical systems. A shape graph is a collection of nodes, some connected by curvilinear edges with arbitrary shapes. Their high complexity stems from the large number of nodes and edges and the complex shapes of edges. With an eye toward statistical analysis, one seeks low-complexity representations that retain as much of the global structure of the original shape graphs as possible. This paper develops a framework for reducing graph complexity using hierarchical clustering procedures that replace groups of nodes and edges with their simpler representatives. It demonstrates this framework using graphs of retinal blood vessels in two dimensions and neurons in three dimensions. The paper also presents experiments on classification of shape graphs using progressively reduced levels of graph complexity. The accuracy of disease detection in retinal blood vessels drops quickly when the complexity is reduced, with accuracy loss particularly associated with discarding terminal edges. Accuracy in identifying neural cell types remains stable with complexity reduction.
{"title":"Reducing Shape-Graph Complexity with Application to Classification of Retinal Blood Vessels and Neurons","authors":"Benjamin Beaudett, Anuj Srivastava","doi":"arxiv-2409.09168","DOIUrl":"https://doi.org/arxiv-2409.09168","url":null,"abstract":"Shape graphs are complex geometrical structures commonly found in biological\u0000and anatomical systems. A shape graph is a collection of nodes, some connected\u0000by curvilinear edges with arbitrary shapes. Their high complexity stems from\u0000the large number of nodes and edges and the complex shapes of edges. With an\u0000eye for statistical analysis, one seeks low-complexity representations that\u0000retain as much of the global structures of the original shape graphs as\u0000possible. This paper develops a framework for reducing graph complexity using\u0000hierarchical clustering procedures that replace groups of nodes and edges with\u0000their simpler representatives. It demonstrates this framework using graphs of\u0000retinal blood vessels in two dimensions and neurons in three dimensions. The\u0000paper also presents experiments on classifications of shape graphs using\u0000progressively reduced levels of graph complexity. The accuracy of disease\u0000detection in retinal blood vessels drops quickly when the complexity is\u0000reduced, with accuracy loss particularly associated with discarding terminal\u0000edges. Accuracy in identifying neural cell types remains stable with complexity\u0000reduction.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":"198 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142251170","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Alex Glyn-Davies, Connor Duffin, Ieva Kazlauskaite, Mark Girolami, Ö. Deniz Akyildiz
In this paper, we develop a class of interacting particle Langevin algorithms to solve inverse problems for partial differential equations (PDEs). In particular, we leverage the statistical finite elements (statFEM) formulation to obtain a finite-dimensional latent variable statistical model where the parameter is that of the (discretised) forward map and the latent variable is the statFEM solution of the PDE, which is assumed to be partially observed. We then adapt a recently proposed expectation-maximisation-like scheme, the interacting particle Langevin algorithm (IPLA), to this problem and obtain a joint estimation procedure for the parameters and the latent variables. We consider three main examples: (i) estimating the forcing for a linear Poisson PDE, (ii) estimating the forcing for a nonlinear Poisson PDE, and (iii) estimating diffusivity for a linear Poisson PDE. We provide computational complexity estimates for forcing estimation in the linear case. We also provide comprehensive numerical experiments and preconditioning strategies that significantly improve the performance, showing that the proposed class of methods can be a method of choice for parameter inference in PDE models.
{"title":"Statistical Finite Elements via Interacting Particle Langevin Dynamics","authors":"Alex Glyn-Davies, Connor Duffin, Ieva Kazlauskaite, Mark Girolami, Ö. Deniz Akyildiz","doi":"arxiv-2409.07101","DOIUrl":"https://doi.org/arxiv-2409.07101","url":null,"abstract":"In this paper, we develop a class of interacting particle Langevin algorithms\u0000to solve inverse problems for partial differential equations (PDEs). In\u0000particular, we leverage the statistical finite elements (statFEM) formulation\u0000to obtain a finite-dimensional latent variable statistical model where the\u0000parameter is that of the (discretised) forward map and the latent variable is\u0000the statFEM solution of the PDE which is assumed to be partially observed. We\u0000then adapt a recently proposed expectation-maximisation like scheme,\u0000interacting particle Langevin algorithm (IPLA), for this problem and obtain a\u0000joint estimation procedure for the parameters and the latent variables. We\u0000consider three main examples: (i) estimating the forcing for linear Poisson\u0000PDE, (ii) estimating the forcing for nonlinear Poisson PDE, and (iii)\u0000estimating diffusivity for linear Poisson PDE. We provide computational\u0000complexity estimates for forcing estimation in the linear case. We also provide\u0000comprehensive numerical experiments and preconditioning strategies that\u0000significantly improve the performance, showing that the proposed class of\u0000methods can be the choice for parameter inference in PDE models.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":"73 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142189454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
As networks continue to increase in size, current methods must be capable of handling large numbers of nodes and edges in order to be practically relevant. Instead of working directly with the entire (large) network, analyzing sub-networks has become a popular approach. Due to a network's inherent inter-connectedness, sub-sampling is not a trivial task. While this problem has gained attention in recent years, it has not received sufficient attention from the statistics community. In this work, we provide a thorough comparison of seven graph sub-sampling algorithms by applying them to divide-and-conquer algorithms for community structure and core-periphery (CP) structure. After discussing the various algorithms and sub-sampling routines, we derive theoretical results for the mis-classification rate of the divide-and-conquer algorithm for CP structure under various sub-sampling schemes. We then perform extensive experiments on both simulated and real-world data to compare the various methods. For the community detection task, we found that sampling nodes uniformly at random yields the best performance. For CP structure, on the other hand, there was no single winner, but algorithms which sampled core nodes at a higher rate consistently outperformed other sampling routines, e.g., random edge sampling and random walk sampling. The varying performance of the sampling algorithms on different tasks demonstrates the importance of carefully selecting a sub-sampling routine for the specific application.
{"title":"Graph sub-sampling for divide-and-conquer algorithms in large networks","authors":"Eric Yanchenko","doi":"arxiv-2409.06994","DOIUrl":"https://doi.org/arxiv-2409.06994","url":null,"abstract":"As networks continue to increase in size, current methods must be capable of\u0000handling large numbers of nodes and edges in order to be practically relevant.\u0000Instead of working directly with the entire (large) network, analyzing\u0000sub-networks has become a popular approach. Due to a network's inherent\u0000inter-connectedness, sub-sampling is not a trivial task. While this problem has\u0000gained attention in recent years, it has not received sufficient attention from\u0000the statistics community. In this work, we provide a thorough comparison of\u0000seven graph sub-sampling algorithms by applying them to divide-and-conquer\u0000algorithms for community structure and core-periphery (CP) structure. After\u0000discussing the various algorithms and sub-sampling routines, we derive\u0000theoretical results for the mis-classification rate of the divide-and-conquer\u0000algorithm for CP structure under various sub-sampling schemes. We then perform\u0000extensive experiments on both simulated and real-world data to compare the\u0000various methods. For the community detection task, we found that sampling nodes\u0000uniformly at random yields the best performance. For CP structure on the other\u0000hand, there was no single winner, but algorithms which sampled core nodes at a\u0000higher rate consistently outperformed other sampling routines, e.g., random\u0000edge sampling and random walk sampling. The varying performance of the sampling\u0000algorithms on different tasks demonstrates the importance of carefully\u0000selecting a sub-sampling routine for the specific application.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142189455","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Causal discovery is designed to identify causal relationships in data, a task that has become increasingly complex due to the computational demands of traditional methods such as VarLiNGAM, which combines the Vector Autoregressive Model with the Linear Non-Gaussian Acyclic Model for time series data. This study is dedicated to optimising causal discovery specifically for time series data, which are common in practical applications. Time series causal discovery is particularly challenging due to the need to account for temporal dependencies and potential time-lag effects. By designing a specialised dataset generator and reducing the computational complexity of the VarLiNGAM model from $O(m^3 \cdot n)$ to $O(m^3 + m^2 \cdot n)$, this study significantly improves the feasibility of processing large datasets. The proposed methods have been validated on advanced computational platforms and tested across simulated, real-world, and large-scale datasets, showcasing enhanced efficiency and performance. The optimised algorithm achieved a 7- to 13-fold speedup over the original algorithm and around a 4.5-fold speedup over the GPU-accelerated version on large-scale datasets with feature sizes between 200 and 400. Our methods aim to push the boundaries of current causal discovery capabilities, making them more robust, scalable, and applicable to real-world scenarios, thus facilitating breakthroughs in fields such as healthcare and finance.
{"title":"Optimizing VarLiNGAM for Scalable and Efficient Time Series Causal Discovery","authors":"Ziyang Jiao, Ce Guo, Wayne Luk","doi":"arxiv-2409.05500","DOIUrl":"https://doi.org/arxiv-2409.05500","url":null,"abstract":"Causal discovery is designed to identify causal relationships in data, a task\u0000that has become increasingly complex due to the computational demands of\u0000traditional methods such as VarLiNGAM, which combines Vector Autoregressive\u0000Model with Linear Non-Gaussian Acyclic Model for time series data. This study is dedicated to optimising causal discovery specifically for time\u0000series data, which is common in practical applications. Time series causal\u0000discovery is particularly challenging due to the need to account for temporal\u0000dependencies and potential time lag effects. By designing a specialised dataset\u0000generator and reducing the computational complexity of the VarLiNGAM model from\u0000( O(m^3 cdot n) ) to ( O(m^3 + m^2 cdot n) ), this study significantly\u0000improves the feasibility of processing large datasets. The proposed methods\u0000have been validated on advanced computational platforms and tested across\u0000simulated, real-world, and large-scale datasets, showcasing enhanced efficiency\u0000and performance. The optimised algorithm achieved 7 to 13 times speedup\u0000compared with the original algorithm and around 4.5 times speedup compared with\u0000the GPU-accelerated version on large-scale datasets with feature sizes between\u0000200 and 400. Our methods aim to push the boundaries of current causal discovery\u0000capabilities, making them more robust, scalable, and applicable to real-world\u0000scenarios, thus facilitating breakthroughs in various fields such as healthcare\u0000and finance.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":"113 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142224593","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jordan Awan, Adam Edwards, Paul Bartholomew, Andrew Sillers
In differential privacy (DP) mechanisms, it can be beneficial to release "redundant" outputs, in the sense that a quantity can be estimated from several different combinations of privatized values. Indeed, this structure is present in the DP 2020 Decennial Census products published by the U.S. Census Bureau. With this structure, the DP output can be improved by enforcing self-consistency (i.e., estimators obtained by combining different values result in the same estimate), and we show that the minimum-variance processing is a linear projection. However, standard projection algorithms are too computationally expensive in terms of both memory and execution time for applications such as the Decennial Census. We propose the Scalable Efficient Algorithm for Best Linear Unbiased Estimate (SEA BLUE), based on a two-step process of aggregation and differencing that 1) enforces self-consistency through a linear and unbiased procedure, 2) is computationally and memory efficient, 3) achieves the minimum-variance solution under certain structural assumptions, and 4) is empirically shown to be robust to violations of these structural assumptions. We propose three methods of calculating confidence intervals from our estimates, under various assumptions. We apply SEA BLUE to two 2010 Census demonstration products, illustrating its scalability and validity.
{"title":"Best Linear Unbiased Estimate from Privatized Histograms","authors":"Jordan Awan, Adam Edwards, Paul Bartholomew, Andrew Sillers","doi":"arxiv-2409.04387","DOIUrl":"https://doi.org/arxiv-2409.04387","url":null,"abstract":"In differential privacy (DP) mechanisms, it can be beneficial to release\u0000\"redundant\" outputs, in the sense that a quantity can be estimated by combining\u0000different combinations of privatized values. Indeed, this structure is present\u0000in the DP 2020 Decennial Census products published by the U.S. Census Bureau.\u0000With this structure, the DP output can be improved by enforcing\u0000self-consistency (i.e., estimators obtained by combining different values\u0000result in the same estimate) and we show that the minimum variance processing\u0000is a linear projection. However, standard projection algorithms are too\u0000computationally expensive in terms of both memory and execution time for\u0000applications such as the Decennial Census. We propose the Scalable Efficient\u0000Algorithm for Best Linear Unbiased Estimate (SEA BLUE), based on a two step\u0000process of aggregation and differencing that 1) enforces self-consistency\u0000through a linear and unbiased procedure, 2) is computationally and memory\u0000efficient, 3) achieves the minimum variance solution under certain structural\u0000assumptions, and 4) is empirically shown to be robust to violations of these\u0000structural assumptions. We propose three methods of calculating confidence\u0000intervals from our estimates, under various assumptions. We apply SEA BLUE to\u0000two 2010 Census demonstration products, illustrating its scalability and\u0000validity.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":"47 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142189456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Here, we introduce a novel framework for modelling the spatiotemporal dynamics of disease spread known as conditional logistic individual-level models (CL-ILMs). This framework alleviates much of the computational burden associated with traditional spatiotemporal individual-level models for epidemics, and facilitates the use of standard software for fitting logistic models when analysing spatiotemporal disease patterns. The models can be fitted in either a frequentist or Bayesian framework. We then apply the new spatial CL-ILM to both simulated and semi-real data from the UK 2001 foot-and-mouth disease epidemic.
{"title":"Conditional logistic individual-level models of spatial infectious disease dynamics","authors":"Tahmina Akter, Rob Deardon","doi":"arxiv-2409.02353","DOIUrl":"https://doi.org/arxiv-2409.02353","url":null,"abstract":"Here, we introduce a novel framework for modelling the spatiotemporal\u0000dynamics of disease spread known as conditional logistic individual-level\u0000models (CL-ILM's). This framework alleviates much of the computational burden\u0000associated with traditional spatiotemporal individual-level models for\u0000epidemics, and facilitates the use of standard software for fitting logistic\u0000models when analysing spatiotemporal disease patterns. The models can be fitted\u0000in either a frequentist or Bayesian framework. Here, we apply the new spatial\u0000CL-ILM to both simulated and semi-real data from the UK 2001 foot-and-mouth\u0000disease epidemic.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":"51 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142189458","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}