When the target of inference is a real-valued function of probability parameters in the k-sample multinomial problem, variance estimation may be challenging. In small samples, methods like the nonparametric bootstrap or delta method may perform poorly. We propose a novel general method in this setting for computing exact p-values and confidence intervals, meaning that type I error rates are correctly bounded and confidence intervals have at least nominal coverage at all sample sizes. Our method is applicable to any real-valued function of multinomial probabilities, accommodating an arbitrary number of samples with varying category counts. We describe the method and provide an implementation of it in R, with some computational optimization to ensure broad applicability. Simulations demonstrate our method's ability to maintain correct coverage rates in settings where the nonparametric bootstrap fails.
Exact confidence intervals for functions of parameters in the k-sample multinomial problem. Michael C Sachs, Erin E Gabriel, Michael P Fay. arXiv:2406.19141, arXiv - STAT - Computation, 2024-06-27.
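The "exactness" claimed above comes from test inversion: a confidence bound is found by inverting an exact test over the parameter space. A minimal sketch of that principle for a single binomial proportion via grid search (illustrative only, not the authors' k-sample multinomial method):

```python
from math import comb

def cp_upper(x, n, alpha=0.05, grid=2000):
    """Upper exact (Clopper-Pearson-style) confidence bound for a binomial
    proportion, obtained by inverting the exact one-sided test over a grid:
    keep the largest theta for which P(X <= x; theta) >= alpha."""
    best = 0.0
    for i in range(grid + 1):
        theta = i / grid
        tail = sum(comb(n, k) * theta**k * (1 - theta)**(n - k)
                   for k in range(x + 1))
        if tail >= alpha:  # theta is not rejected by the exact test
            best = theta
    return best
```

Inverting an exact test this way guarantees at least nominal coverage at every sample size, which is the property the paper extends to general functions of k-sample multinomial probabilities.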
Mathieu Fourment, Matthew Macaulay, Christiaan J Swanepoel, Xiang Ji, Marc A Suchard, Frederick A Matsen IV
Bayesian inference has predominantly relied on the Markov chain Monte Carlo (MCMC) algorithm for many years. However, MCMC is computationally laborious, especially for complex phylogenetic models of time trees. This bottleneck has led to the search for alternatives, such as variational Bayes, which can scale better to large datasets. In this paper, we introduce torchtree, a framework written in Python that allows developers to easily implement rich phylogenetic models and algorithms using a fixed tree topology. One can either use automatic differentiation, or leverage torchtree's plug-in system to compute gradients analytically for model components for which automatic differentiation is slow. We demonstrate that the torchtree variational inference framework performs similarly to BEAST in terms of speed and approximation accuracy. Furthermore, we explore the use of the forward KL divergence as an optimizing criterion for variational inference, which can handle discontinuous and non-differentiable models. Our experiments show that inference using the forward KL divergence tends to be faster per iteration compared to the evidence lower bound (ELBO) criterion, although the ELBO-based inference may converge faster in some cases. Overall, torchtree provides a flexible and efficient framework for phylogenetic model development and inference using PyTorch.
Torchtree: flexible phylogenetic model development and inference using PyTorch. arXiv:2406.18044, arXiv - STAT - Computation, 2024-06-26.
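The forward KL criterion discussed above differs from the ELBO, which minimises the reverse KL. A small sketch of that asymmetry using the closed-form KL between univariate Gaussians (illustrative only; torchtree's phylogenetic models are far richer than a single Gaussian):

```python
from math import log

def kl_gauss(m1, s1, m2, s2):
    """Closed-form KL( N(m1, s1^2) || N(m2, s2^2) )."""
    return log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2.0 * s2**2) - 0.5

# The two criteria optimise different directions of the divergence,
# which generally give different optima:
forward = kl_gauss(0.0, 1.0, 0.0, 2.0)   # KL(p || q): mass-covering direction
reverse = kl_gauss(0.0, 2.0, 0.0, 1.0)   # KL(q || p): mode-seeking direction
```

The forward direction also has the practical advantage noted in the abstract: it can be estimated without differentiating through the model, so discontinuous models remain usable.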
We propose a linear-complexity method for sampling from truncated multivariate normal (TMVN) distributions with high fidelity by applying nearest-neighbor approximations to a product-of-conditionals decomposition of the TMVN density. To make the sequential sampling based on the decomposition feasible, we introduce a novel method that avoids the intractable high-dimensional TMVN distribution by sampling sequentially from $m$-dimensional TMVN distributions, where $m$ is a tuning parameter controlling the fidelity. This allows us to overcome the existing methods' crucial problem of rapidly decreasing acceptance rates for increasing dimension. Throughout our experiments with up to tens of thousands of dimensions, we can produce high-fidelity samples with $m$ in the dozens, achieving superior scalability compared to existing state-of-the-art methods. We study a tetrachloroethylene concentration dataset that has $3{,}971$ observed responses and $20{,}730$ undetected responses, together modeled as a partially censored Gaussian process, where our method enables posterior inference for the censored responses through sampling a $20{,}730$-dimensional TMVN distribution.
Scalable Sampling of Truncated Multivariate Normals Using Sequential Nearest-Neighbor Approximation. Jian Cao, Matthias Katzfuss. arXiv:2406.17307, arXiv - STAT - Computation, 2024-06-25.
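The product-of-conditionals idea can be sketched in a toy setting: an AR(1)-correlated normal truncated to a box, sampled one coordinate at a time by truncating each conditional (a simplified m = 1 analogue of the paper's method; the exact conditionals would also account for the truncation of future coordinates, so this is an approximation):

```python
import random
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def Phi_inv(u):
    """Inverse standard normal CDF by bisection (slow but dependency-free)."""
    lo, hi = -10.0, 10.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if Phi(mid) < u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def rtruncnorm(mu, sigma, a, b, rng):
    """Inverse-CDF draw from N(mu, sigma^2) truncated to [a, b]."""
    ua, ub = Phi((a - mu) / sigma), Phi((b - mu) / sigma)
    x = mu + sigma * Phi_inv(rng.uniform(ua, ub))
    return min(b, max(a, x))  # clamp tiny bisection rounding error

def sample_tmvn_ar1(d, rho, a, b, rng):
    """One draw from a d-dimensional AR(1)-correlated normal truncated to
    [a, b]^d, built coordinate by coordinate: each step samples the truncated
    conditional given the previous coordinate."""
    x = [rtruncnorm(0.0, 1.0, a, b, rng)]
    s = sqrt(1.0 - rho * rho)
    for _ in range(d - 1):
        x.append(rtruncnorm(rho * x[-1], s, a, b, rng))
    return x
```

Because each step is a one-dimensional inverse-CDF draw, there is no rejection step, so there is no acceptance rate to collapse as the dimension grows; this is the property the paper's m-dimensional conditionals preserve at scale.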
Jere Koskela, Paul A. Jenkins, Adam M. Johansen, Dario Spano
We show that genealogical trees arising from a broad class of non-neutral models of population evolution converge to the Kingman coalescent under a suitable rescaling of time. As well as non-neutral biological evolution, our results apply to genetic algorithms encompassing the prominent class of sequential Monte Carlo (SMC) methods. The time rescaling we need differs slightly from that used in classical results for convergence to the Kingman coalescent, which has implications for the performance of different resampling schemes in SMC algorithms. In addition, our work substantially simplifies earlier proofs of convergence to the Kingman coalescent, and corrects an error common to several earlier results.
Genealogical processes of non-neutral population models under rapid mutation. arXiv:2406.16465, arXiv - STAT - Computation, 2024-06-24.
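For readers unfamiliar with the limiting object: under the Kingman coalescent, while k lineages remain, the next pairwise merger arrives after an exponential wait with rate k(k-1)/2. A minimal simulation of those waiting times (not tied to the paper's non-neutral models):

```python
import random

def kingman_times(n, rng):
    """Inter-coalescence times for a sample of n lineages: with k lineages
    remaining, the next merger occurs after an Exp(k*(k-1)/2) wait."""
    return [rng.expovariate(k * (k - 1) / 2.0) for k in range(n, 1, -1)]

def tmrca(n, rng):
    """Time to the most recent common ancestor; E[TMRCA] = 2*(1 - 1/n)."""
    return sum(kingman_times(n, rng))
```

The paper's result says that suitably time-rescaled genealogies of non-neutral models converge to exactly this process.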
Parameter inference for linear and non-Gaussian state space models is challenging because the likelihood function contains an intractable integral over the latent state variables. Exact inference using Markov chain Monte Carlo is computationally expensive, particularly for long time series data. Variational Bayes methods are useful when exact inference is infeasible. These methods approximate the posterior density of the parameters by a simple and tractable distribution found through optimisation. In this paper, we propose a novel sequential variational Bayes approach that makes use of the Whittle likelihood for computationally efficient parameter inference in this class of state space models. Our algorithm, which we call Recursive Variational Gaussian Approximation with the Whittle Likelihood (R-VGA-Whittle), updates the variational parameters by processing data in the frequency domain. At each iteration, R-VGA-Whittle requires the gradient and Hessian of the Whittle log-likelihood, which are available in closed form for a wide class of models. Through several examples using a linear Gaussian state space model and a univariate/bivariate non-Gaussian stochastic volatility model, we show that R-VGA-Whittle provides good approximations to posterior distributions of the parameters and is very computationally efficient when compared to asymptotically exact methods such as Hamiltonian Monte Carlo.
Recursive variational Gaussian approximation with the Whittle likelihood for linear non-Gaussian state space models. Bao Anh Vu, David Gunawan, Andrew Zammit-Mangion. arXiv:2406.15998, arXiv - STAT - Computation, 2024-06-23.
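The Whittle likelihood approximates the Gaussian time-domain likelihood by matching the periodogram to the model's spectral density at the Fourier frequencies. A sketch for an AR(1) spectral density (a simple stand-in for the paper's state space models; plain O(n^2) DFT for clarity):

```python
from cmath import exp as cexp
from math import cos, log, pi

def periodogram(x):
    """Raw periodogram I(w_j) = |DFT(x)_j|^2 / (2*pi*n) at the Fourier
    frequencies w_j = 2*pi*j/n, j = 1..floor((n-1)/2)."""
    n = len(x)
    out = []
    for j in range(1, (n - 1) // 2 + 1):
        w = 2.0 * pi * j / n
        d = sum(x[t] * cexp(-1j * w * t) for t in range(n))
        out.append((w, abs(d) ** 2 / (2.0 * pi * n)))
    return out

def ar1_spectrum(w, phi, sigma2):
    """AR(1) spectral density f(w) = sigma2 / (2*pi*(1 - 2*phi*cos(w) + phi^2))."""
    return sigma2 / (2.0 * pi * (1.0 - 2.0 * phi * cos(w) + phi * phi))

def whittle_loglik(x, phi, sigma2):
    """Whittle log-likelihood: -sum_j [ log f(w_j) + I(w_j) / f(w_j) ]."""
    return -sum(log(ar1_spectrum(w, phi, sigma2)) + Iw / ar1_spectrum(w, phi, sigma2)
                for w, Iw in periodogram(x))
```

Because the frequency-domain terms are independent sums of smooth functions of the parameters, the gradient and Hessian that R-VGA-Whittle needs at each iteration are available in closed form for many models.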
Estimating Monte Carlo error is critical to valid simulation results in Markov chain Monte Carlo (MCMC), and initial sequence estimators were among the first methods introduced for this purpose. Over the last few years, focus has shifted to multivariate assessment of simulation error, and many multivariate generalizations of univariate methods have been developed. The multivariate initial sequence estimator is known to exhibit superior finite-sample performance compared to its competitors. However, it can be prohibitively slow, limiting its widespread use. We provide an efficient alternative to the multivariate initial sequence estimator that inherits both its asymptotic properties and its superior finite-sample performance. The effectiveness of the proposed estimator is shown via some MCMC example implementations. Further, we present univariate and multivariate initial sequence estimators for when parallel MCMC chains are run and demonstrate their effectiveness over popular alternatives.
Efficient Multivariate Initial Sequence Estimators for MCMC. Arka Banerjee, Dootika Vats. arXiv:2406.15874, arXiv - STAT - Computation, 2024-06-22.
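The univariate construction being generalized here (Geyer's initial positive sequence estimator) sums adjacent pairs of sample autocovariances for as long as the pairs stay positive. A minimal sketch (the `max_lag` cap is an implementation convenience for speed, not part of the original construction):

```python
def autocov(x, k, m):
    """Lag-k sample autocovariance about a precomputed mean m."""
    n = len(x)
    return sum((x[t] - m) * (x[t + k] - m) for t in range(n - k)) / n

def initial_positive_variance(x, max_lag=200):
    """Geyer-style initial positive sequence estimate of the asymptotic
    variance sigma^2 = gamma_0 + 2*sum_{k>=1} gamma_k: accumulate the paired
    autocovariances Gamma_t = gamma_{2t} + gamma_{2t+1} while they remain
    positive, then truncate."""
    m = sum(x) / len(x)
    est = -autocov(x, 0, m)   # sigma^2 = 2*sum_t Gamma_t - gamma_0
    t = 0
    while 2 * t + 1 < min(len(x), max_lag):
        pair = autocov(x, 2 * t, m) + autocov(x, 2 * t + 1, m)
        if pair <= 0.0:
            break
        est += 2.0 * pair
        t += 1
    return est
```

The multivariate analogue replaces the scalar pair-positivity condition with a positive-definiteness condition on autocovariance matrices, which is what makes it expensive and motivates the paper's faster alternative.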
Khanh N. Dinh, Zijin Xiang, Zhihan Liu, Simon Tavaré
Approximate Bayesian Computation (ABC) is a popular inference method when likelihoods are hard to come by. Practical bottlenecks of ABC applications include selecting statistics that summarize the data without losing too much information or introducing uncertainty, and choosing distance functions and tolerance thresholds that balance accuracy and computational efficiency. Recent studies have shown that ABC methods using random forest (RF) methodology perform well while circumventing many of ABC's drawbacks. However, RF construction is computationally expensive for large numbers of trees and model simulations, and there can be high uncertainty in the posterior if the prior distribution is uninformative. Here we adapt distributional random forests to the ABC setting, and introduce Approximate Bayesian Computation sequential Monte Carlo with random forests (ABC-SMC-(D)RF). This updates the prior distribution iteratively to focus on the most likely regions in the parameter space. We show that ABC-SMC-(D)RF can accurately infer posterior distributions for a wide range of deterministic and stochastic models in different scientific areas.
Approximate Bayesian Computation sequential Monte Carlo via random forests. arXiv:2406.15865, arXiv - STAT - Computation, 2024-06-22.
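The rejection step that all ABC variants build on: draw from the prior, simulate, and keep draws whose summary statistic lands within a tolerance of the observed one. A toy sketch for a normal mean with a uniform prior (the basic mechanism only, not the authors' random-forest or SMC machinery):

```python
import random

def abc_rejection(obs_summary, prior_sample, simulate, distance, eps, n_draws, rng):
    """Plain ABC rejection: keep a prior draw when the summary of its
    simulated dataset lands within eps of the observed summary."""
    kept = []
    for _ in range(n_draws):
        theta = prior_sample(rng)
        if distance(simulate(theta, rng), obs_summary) < eps:
            kept.append(theta)
    return kept

# Toy problem: infer a normal mean from the sample mean of 50 observations.
rng = random.Random(0)
posterior = abc_rejection(
    obs_summary=2.0,
    prior_sample=lambda r: r.uniform(-5.0, 5.0),
    simulate=lambda th, r: sum(r.gauss(th, 1.0) for _ in range(50)) / 50,
    distance=lambda a, b: abs(a - b),
    eps=0.2, n_draws=5000, rng=rng)
```

The bottlenecks the abstract lists are visible even here: the choice of summary, distance, and eps all shape the accepted set, and a diffuse prior wastes most of the simulation budget, which is what the iterative SMC reweighting addresses.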
Roben Delos Reyes, Hugo Lyons Keenan, Cameron Zachreson
Behaviour change lies at the heart of many observable collective phenomena such as the transmission and control of infectious diseases, adoption of public health policies, and migration of animals to new habitats. Representing the process of individual behaviour change in computer simulations of these phenomena remains an open challenge. Often, computational models use phenomenological implementations with limited support from behavioural data. Without a strong connection to observable quantities, such models have limited utility for simulating observed and counterfactual scenarios of emergent phenomena because they cannot be validated or calibrated. Here, we present a simple stochastic individual-based model of reversal learning that captures fundamental properties of individual behaviour change, namely, the capacity to learn based on accumulated reward signals, and the transient persistence of learned behaviour after rewards are removed or altered. The model has only two parameters, and we use approximate Bayesian computation to demonstrate that they are fully identifiable from empirical reversal learning time series data. Finally, we demonstrate how the model can be extended to account for the increased complexity of behavioural dynamics over longer time scales involving fluctuating stimuli. This work is a step towards the development and evaluation of fully identifiable individual-level behaviour change models that can function as validated submodels for complex simulations of collective behaviour change.
An agent-based model of behaviour change calibrated to reversal learning data. arXiv:2406.14062, arXiv - STAT - Computation, 2024-06-20.
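A two-parameter learner of the general kind described (an assumed Rescorla-Wagner/softmax form for illustration; the paper's exact model may differ) exhibits both named properties: reward-driven learning and transient persistence of the learned choice after the contingency reverses.

```python
import random
from math import exp

def simulate_reversal(alpha, beta, n_trials, reversal_at, rng):
    """Two-parameter learner: value updates with learning rate alpha and
    softmax choice with inverse temperature beta. Option 0 pays off with
    probability 0.9 before the reversal, option 1 afterwards."""
    q = [0.0, 0.0]
    choices = []
    for t in range(n_trials):
        p1 = 1.0 / (1.0 + exp(-beta * (q[1] - q[0])))  # softmax over 2 options
        c = 1 if rng.random() < p1 else 0
        good = 0 if t < reversal_at else 1
        p_reward = 0.9 if c == good else 0.1
        r = 1.0 if rng.random() < p_reward else 0.0
        q[c] += alpha * (r - q[c])  # reward-prediction-error update
        choices.append(c)
    return choices
```

Simulated choice series like these are exactly the kind of output that can be compared to empirical reversal learning time series inside an approximate Bayesian computation loop to identify alpha and beta.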
Duy Nguyen, Ca Hoang, Phat K. Huynh, Tien Truong, Dang Nguyen, Abhay Sharma, Trung Q. Le
Cardiovascular diseases (CVDs) are notably prevalent among patients with obstructive sleep apnea (OSA), posing unique challenges in predicting CVD progression due to the intricate interactions of comorbidities. Traditional models typically lack the necessary dynamic and longitudinal scope to accurately forecast CVD trajectories in OSA patients. This study introduces a novel multi-level phenotypic model to analyze the progression and interplay of these conditions over time, utilizing data from the Wisconsin Sleep Cohort, which includes 1,123 participants followed for decades. Our methodology comprises three advanced steps: (1) Conducting feature importance analysis through tree-based models to underscore critical predictive variables like total cholesterol, low-density lipoprotein (LDL), and diabetes. (2) Developing a logistic mixed-effects model (LGMM) to track longitudinal transitions and pinpoint significant factors, which displayed a diagnostic accuracy of 0.9556. (3) Implementing t-distributed Stochastic Neighbor Embedding (t-SNE) alongside Gaussian Mixture Models (GMM) to segment patient data into distinct phenotypic clusters that reflect varied risk profiles and disease progression pathways. This phenotypic clustering revealed two main groups, with one showing a markedly increased risk of major adverse cardiovascular events (MACEs), underscored by the significant predictive role of nocturnal hypoxia and sympathetic nervous system activity from sleep data. Analysis of transitions and trajectories with t-SNE and GMM highlighted different progression rates within the cohort, with one cluster progressing more slowly towards severe CVD states than the other. This study offers a comprehensive understanding of the dynamic relationship between CVD and OSA, providing valuable tools for predicting disease onset and tailoring treatment approaches.
Multi-level Phenotypic Models of Cardiovascular Disease and Obstructive Sleep Apnea Comorbidities: A Longitudinal Wisconsin Sleep Cohort Study. arXiv:2406.18602, arXiv - STAT - Computation, 2024-06-19.
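The clustering mechanism in step (3) can be illustrated with a minimal EM fit of a two-component one-dimensional Gaussian mixture (the study applies GMMs to t-SNE embeddings of patient data; this sketch shows only the mixture-fitting machinery):

```python
import random
from math import exp, pi, sqrt

def npdf(x, m, s):
    """Normal density with mean m and standard deviation s."""
    return exp(-0.5 * ((x - m) / s) ** 2) / (s * sqrt(2.0 * pi))

def em_gmm_1d(data, iters=100):
    """EM for a two-component 1-D Gaussian mixture: the E-step computes
    responsibilities, the M-step re-estimates weight, means, and scales."""
    m1, m2 = min(data), max(data)
    s1 = s2 = (m2 - m1) / 4.0
    w = 0.5
    for _ in range(iters):
        r = [w * npdf(x, m1, s1) / (w * npdf(x, m1, s1) + (1.0 - w) * npdf(x, m2, s2))
             for x in data]
        n1 = sum(r)
        n2 = len(data) - n1
        w = n1 / len(data)
        m1 = sum(ri * x for ri, x in zip(r, data)) / n1
        m2 = sum((1.0 - ri) * x for ri, x in zip(r, data)) / n2
        s1 = max(sqrt(sum(ri * (x - m1) ** 2 for ri, x in zip(r, data)) / n1), 1e-6)
        s2 = max(sqrt(sum((1.0 - ri) * (x - m2) ** 2 for ri, x in zip(r, data)) / n2), 1e-6)
    return w, (m1, s1), (m2, s2)
```

In the study, the fitted component responsibilities play the role of soft phenotype assignments, from which the two risk groups are read off.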
In the era of Big Data, analyzing high-dimensional and large datasets presents significant computational challenges. Although Bayesian statistics is well-suited for these complex data structures, the Markov chain Monte Carlo (MCMC) methods essential for Bayesian estimation suffer from high computational cost because of their sequential nature. For faster and more effective computation, this paper introduces an algorithm that enhances a parallelized MCMC method to address this problem. We highlight the critical role of the overlapped area of posterior distributions after data partitioning, and propose a method using a machine learning classifier to effectively identify and extract MCMC draws from that area to approximate the actual posterior distribution. Our main contribution is the development of a Kullback-Leibler (KL) divergence-based criterion that simplifies hyperparameter tuning in training the classifier and makes the method nearly hyperparameter-free. Simulation studies validate the efficacy of our proposed methods.
Parallelizing MCMC with Machine Learning Classifier and Its Criterion Based on Kullback-Leibler Divergence. Tomoki Matsumoto, Yuichiro Kanazawa. arXiv:2406.11246, arXiv - STAT - Computation, 2024-06-17.
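One cheap way to turn KL divergence into a scalar criterion for comparing sets of MCMC draws is to moment-match each set with a Gaussian and use the closed-form divergence (an illustrative stand-in; the paper's criterion is tailored to classifier training, not to Gaussian approximations):

```python
from math import log

def mean_var(xs):
    """Sample mean and unbiased sample variance."""
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def gaussian_kl_from_draws(p_draws, q_draws):
    """Moment-match each draw set with a Gaussian and return KL(p || q):
    a scalar measure of how far one set of MCMC draws sits from a
    reference set."""
    mp, vp = mean_var(p_draws)
    mq, vq = mean_var(q_draws)
    return 0.5 * (log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0)
```

A criterion of this shape is near zero when two draw sets target the same distribution and grows as they separate, which is the behaviour needed to tune a classifier that separates overlapped from non-overlapped subposterior draws.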