Fitting models to data to obtain distributions of consistent parameter values is important for uncertainty quantification, model comparison, and prediction. Standard Markov chain Monte Carlo (MCMC) approaches for fitting ordinary differential equations (ODEs) to time-series data involve proposing trial parameter sets, numerically integrating the ODEs forward in time, and accepting or rejecting the trial parameter sets. When the model dynamics depend nonlinearly on the parameters, as is generally the case, trial parameter sets are often rejected, and MCMC approaches become prohibitively costly to converge. Here, we build on methods for numerical continuation and trajectory optimization to introduce an approach in which we use Langevin dynamics in the joint space of variables and parameters to sample models that satisfy constraints on the dynamics. We demonstrate the method by sampling Hopf bifurcations and limit cycles of a model of a biochemical oscillator in a Bayesian framework for parameter estimation, and we obtain a more than hundredfold speedup relative to a leading ensemble MCMC approach that requires numerically integrating the ODEs forward in time. We describe numerical experiments that provide insight into the speedup. The method is general and can be used in any framework for parameter estimation and model selection.
Sampling parameters of ordinary differential equations with Langevin dynamics that satisfy constraints. Chris Chi, Jonathan Weare, Aaron R. Dinner. arXiv - STAT - Computation, 2024-08-28. https://doi.org/arxiv-2408.15505
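To make the idea of sampling with Langevin dynamics in a joint space concrete, here is a minimal sketch of the unadjusted overdamped Langevin update. It is not the constrained sampler described in the abstract: the standard normal target, the step size, and the names `grad_log_density` and `langevin_chain` are all assumptions chosen for illustration.

```python
import math
import random


def grad_log_density(z):
    """Gradient of log p(z) for an assumed standard normal target."""
    return [-zi for zi in z]


def langevin_chain(z0, step, n_steps, rng):
    """Unadjusted overdamped Langevin: z <- z + h * grad log p(z) + sqrt(2h) * xi."""
    z = list(z0)
    noise_scale = math.sqrt(2.0 * step)
    samples = []
    for _ in range(n_steps):
        g = grad_log_density(z)
        z = [zi + step * gi + noise_scale * rng.gauss(0.0, 1.0)
             for zi, gi in zip(z, g)]
        samples.append(list(z))
    return samples


rng = random.Random(0)
# z = (state variable, parameter): both coordinates are updated together.
samples = langevin_chain([3.0, -3.0], step=0.05, n_steps=20000, rng=rng)
kept = samples[2000:]  # discard the initial transient
mean_x = sum(s[0] for s in kept) / len(kept)
var_x = sum((s[0] - mean_x) ** 2 for s in kept) / (len(kept) - 1)
print(mean_x, var_x)
```

Because no Metropolis correction is applied, the sampled variance carries a small step-size bias. No ODE is integrated forward in time at any point, which is the property the abstract contrasts with standard MCMC.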
Andrew Iskauskas, Jamie A. Cohen, Danny Scarponi, Ian Vernon, Michael Goldstein, Daniel Klein, Richard G. White, Nicky McCreesh
The study of the transmission and progression of human papillomavirus (HPV) is crucial for understanding the incidence of cervical cancers, and has been identified as a priority worldwide. The complexity of the disease necessitates a detailed model of HPV transmission and its progression to cancer; to infer properties of this system, we require a careful process that can match the model to imperfect or incomplete observational data. In this paper, we describe the HPVsim simulator to satisfy the former requirement; to satisfy the latter, we couple this stochastic simulator to a process of emulation and history matching using the R package hmer. With these tools, we are able to obtain a comprehensive collection of parameter combinations that could give rise to observed cancer data, and explore the implications of the variability of these parameter sets for future health interventions.
Investigating Complex HPV Dynamics Using Emulation and History Matching. Andrew Iskauskas, Jamie A. Cohen, Danny Scarponi, Ian Vernon, Michael Goldstein, Daniel Klein, Richard G. White, Nicky McCreesh. arXiv - STAT - Computation, 2024-08-28. https://doi.org/arxiv-2408.15805
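History matching typically rules out parameter sets with an implausibility measure: the distance between an observation and the emulator's prediction, scaled by the combined uncertainty. The sketch below shows that calculation in Python rather than through hmer. The function names, the candidate tuples, and the cutoff of 3 are assumptions for illustration, not the package's API.

```python
import math


def implausibility(observed, emulator_mean, emulator_var, obs_var,
                   discrepancy_var=0.0):
    """Standardized distance between an observation and an emulator prediction."""
    total_var = emulator_var + obs_var + discrepancy_var
    return abs(observed - emulator_mean) / math.sqrt(total_var)


def non_implausible(candidates, observed, obs_var, cutoff=3.0):
    """Return names of candidates (name, emulator_mean, emulator_var) not ruled out."""
    return [name for name, mean, var in candidates
            if implausibility(observed, mean, var, obs_var) <= cutoff]


# Hypothetical emulator outputs for two parameter sets.
candidates = [("set_a", 9.0, 1.0), ("set_b", 30.0, 1.0)]
kept = non_implausible(candidates, observed=10.0, obs_var=3.0)
print(kept)
```

Parameter sets that survive the cutoff are not declared good fits; they are only not yet excluded, which is why the procedure is usually repeated in waves with refitted emulators.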
Quantifying the predictive capacity of a neural system, understood as the capability to store information and actively use it for dynamic system evolution, is a key component of neural information processing. Information storage (IS), the main measure quantifying the active utilization of memory in a dynamic system, is only defined for discrete-time processes. While recent theoretical work laid the foundations for the continuous-time analysis of the predictive capacity stored in a process, methods for the effective computation of the related measures are needed to enable widespread use on neural data. This work introduces a method for the model-free estimation of the so-called memory utilization rate (MUR), the continuous-time counterpart of the IS, specifically designed to quantify the predictive capacity stored in neural point processes. The method applies nearest-neighbor entropy estimation to the inter-spike intervals measured from point-process realizations to quantify the extent of memory used by a spike train. An empirical procedure based on surrogate data is implemented to compensate for the estimation bias and detect statistically significant levels of memory. The method is validated on simulated Poisson processes and on realistic models of coupled cortical dynamics and heartbeat dynamics. It is then applied to real spike trains reflecting central and autonomic nervous system activities: in spontaneously growing cortical neuron cultures, the MUR detected increasing memory utilization across maturation stages, associated with emergent bursting synchronized activity; in the study of the neuro-autonomic modulation of human heartbeats, the MUR reflected the sympathetic activation occurring with postural but not with mental stress. The proposed approach offers a computationally reliable tool to analyze spike train data in computational neuroscience and physiology.
A Model-Free Method to Quantify Memory Utilization in Neural Point Processes. Gorana Mijatovic, Sebastiano Stramaglia, Luca Faes. arXiv - STAT - Computation, 2024-08-28. https://doi.org/arxiv-2408.15875
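The building block named in the abstract, nearest-neighbor entropy estimation on inter-spike intervals, can be sketched in one dimension with the Kozachenko-Leonenko estimator (k = 1). This is only the entropy step, not the MUR itself, which combines several such terms. The function name and the simulated Poisson spike train are assumptions for illustration.

```python
import math
import random


def nn_entropy_1d(values):
    """Kozachenko-Leonenko (k = 1) estimate of differential entropy, in nats.

    Assumes all values are distinct; a tie would give a zero distance.
    """
    xs = sorted(values)
    n = len(xs)
    log_dist_sum = 0.0
    for i, x in enumerate(xs):
        left = x - xs[i - 1] if i > 0 else math.inf
        right = xs[i + 1] - x if i < n - 1 else math.inf
        log_dist_sum += math.log(min(left, right))
    # psi(n) - psi(1) equals the harmonic number H_{n-1}.
    harmonic = sum(1.0 / j for j in range(1, n))
    # log(2) is the log-volume of the unit ball in one dimension.
    return harmonic + math.log(2.0) + log_dist_sum / n


rng = random.Random(1)
# Inter-spike intervals of a unit-rate Poisson process are Exponential(1).
isis = [rng.expovariate(1.0) for _ in range(2000)]
h_hat = nn_entropy_1d(isis)
print(h_hat)
```

For a unit-rate Poisson process the true differential entropy of the intervals is 1 nat, which gives a simple sanity check for the estimator before applying it to real spike trains.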
In many population-based medical studies, the specific cause of death is unidentified, unreliable or even unavailable. Relative survival analysis addresses this scenario, which falls outside standard (competing risks) survival analysis, to nevertheless estimate survival with respect to a specific cause. It separates the impact of the disease itself on mortality from other factors, such as age, sex, and general population trends. Different methods have been created with the aim of constructing consistent and efficient estimators for this purpose. The R package relsurv is the most commonly used today in applications. With Julia continuously proving itself to be an efficient and powerful programming language, we felt the need to write a pure Julia implementation, NetSurvival.jl, of the standard routines and estimators in the field. The proposed implementation is clean, future-proof, and well tested, and the package is documented inside the growing JuliaSurv GitHub organization, ensuring the trustworthiness of the results. Through a comprehensive comparison with relsurv in terms of performance and interface, we highlight the benefits of the Julia development environment.
NetSurvival.jl: A glimpse into relative survival analysis with Julia. Rim Alhajal, Oskar Laverny. arXiv - STAT - Computation, 2024-08-28. https://doi.org/arxiv-2408.15655
Adaptive Markov chain Monte Carlo (MCMC) algorithms, which automatically tune their parameters based on past samples, have proved extremely useful in practice. The self-tuning mechanism makes them `non-Markovian', which means that their validity cannot be ensured by standard Markov chain theory. Several different techniques have been suggested to analyse their theoretical properties, many of which are technically involved. The technical nature of the theory may make the methods unnecessarily unappealing. We discuss one technique -- based on a martingale decomposition -- with uniformly ergodic Markov transitions. We provide an accessible and self-contained treatment in this setting, and give detailed proofs of the results discussed in the paper, which only require a basic understanding of martingale theory and general state space Markov chain concepts. We illustrate how our conditions can accommodate different types of adaptation schemes, and can give useful insight into the requirements which ensure their validity.
An invitation to adaptive Markov chain Monte Carlo convergence theory. Pietari Laitinen, Matti Vihola. arXiv - STAT - Computation, 2024-08-27. https://doi.org/arxiv-2408.14903
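As an illustration of what a self-tuning sampler looks like, here is a minimal sketch of a random-walk Metropolis chain whose proposal scale is adapted from past acceptance probabilities with a step size that shrinks to zero. It is one common adaptation scheme, not a scheme taken from the paper; the target, the 0.44 acceptance goal, and the exponent 0.6 are assumptions.

```python
import math
import random


def adaptive_rwm(log_target, x0, n_steps, rng, target_accept=0.44):
    """Random-walk Metropolis with Robbins-Monro adaptation of the proposal scale."""
    x, log_scale = x0, math.log(0.1)
    samples, accepted = [], 0
    for i in range(1, n_steps + 1):
        proposal = x + math.exp(log_scale) * rng.gauss(0.0, 1.0)
        log_ratio = log_target(proposal) - log_target(x)
        accept_prob = math.exp(min(0.0, log_ratio))
        if rng.random() < accept_prob:
            x = proposal
            accepted += 1
        # The adaptation step i**-0.6 diminishes, so tuning settles over time.
        log_scale += (accept_prob - target_accept) / i ** 0.6
        samples.append(x)
    return samples, accepted / n_steps, math.exp(log_scale)


rng = random.Random(3)
samples, accept_rate, final_scale = adaptive_rwm(
    lambda x: -0.5 * x * x, x0=0.0, n_steps=20000, rng=rng)
mean_x = sum(samples) / len(samples)
print(accept_rate, final_scale, mean_x)
```

The proposal scale at step i depends on the whole history of the chain, which is exactly why the process is no longer a Markov chain and why a separate convergence theory is needed.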
Luke Duttweiler, Jonathan Klus, Brent Coull, Sally W. Thurston
MCMC algorithms are frequently used to perform inference under a Bayesian modeling framework. Convergence diagnostics, such as traceplots, the Gelman-Rubin potential scale reduction factor, and effective sample size, are used to visualize mixing and determine how long to run the sampler. However, these classic diagnostics can be ineffective when the sample space of the algorithm is highly discretized (e.g., Bayesian Networks or Dirichlet Process Mixture Models) or the sampler uses frequent non-Euclidean moves. In this article, we develop novel generalized convergence diagnostics produced by mapping the original space to the real line while respecting a relevant distance function and then evaluating the convergence diagnostics on the mapped values. Simulated examples are provided that demonstrate the success of this method in identifying failures to converge that other methods miss or cannot assess.
The Traceplot Thickens: MCMC Diagnostics for Non-Euclidean Spaces. Luke Duttweiler, Jonathan Klus, Brent Coull, Sally W. Thurston. arXiv - STAT - Computation, 2024-08-27. https://doi.org/arxiv-2408.15392
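The mapping idea can be sketched as follows: each discrete state is replaced by its distance to a fixed reference point, and the usual Gelman-Rubin factor is then computed on those real values. The choice of Hamming distance, the all-zeros reference, and the helper names are assumptions for illustration; the paper's own mapping may differ.

```python
import random
import statistics


def hamming(a, b):
    """Number of positions at which two equal-length tuples differ."""
    return sum(ai != bi for ai, bi in zip(a, b))


def psrf(chains):
    """Gelman-Rubin potential scale reduction factor for equal-length real chains."""
    n = len(chains[0])
    within = statistics.mean([statistics.variance(c) for c in chains])
    between_over_n = statistics.variance([statistics.mean(c) for c in chains])
    pooled = (n - 1) / n * within + between_over_n
    return (pooled / within) ** 0.5


def mapped_psrf(chains, reference, distance):
    """Map every state to its distance from `reference`, then compute the PSRF."""
    mapped = [[distance(state, reference) for state in chain] for chain in chains]
    return psrf(mapped)


def draw_chain(p, length, rng, n_bits=10):
    """Independent binary vectors standing in for samples from a discrete sampler."""
    return [tuple(int(rng.random() < p) for _ in range(n_bits))
            for _ in range(length)]


rng = random.Random(2)
reference = (0,) * 10
agreeing = [draw_chain(0.5, 500, rng), draw_chain(0.5, 500, rng)]
disagreeing = [draw_chain(0.5, 500, rng), draw_chain(0.9, 500, rng)]
print(mapped_psrf(agreeing, reference, hamming),
      mapped_psrf(disagreeing, reference, hamming))
```

Chains that target the same distribution give a factor close to 1, while chains stuck in different regions give a clearly larger value, even though the original states have no natural real-valued coordinates.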
This paper addresses the key challenge of estimating the asymptotic covariance associated with the Markov chain central limit theorem, which is essential for visualizing and terminating Markov Chain Monte Carlo (MCMC) simulations. We focus on summarizing batching, spectral, and initial sequence covariance estimation techniques. We emphasize practical recommendations for modern MCMC simulations, where positive correlation is common and leads to negatively biased covariance estimates. Our discussion is centered on computationally efficient methods that remain viable even when the number of iterations is large, offering insights into improving the reliability and accuracy of MCMC output in such scenarios.
Implementing MCMC: Multivariate estimation with confidence. James M. Flegal, Rebecca P. Kurtz-Garcia. arXiv - STAT - Computation, 2024-08-27. https://doi.org/arxiv-2408.15396
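Of the techniques listed in the abstract, batching is the simplest to sketch. The univariate example below estimates the asymptotic variance in the Markov chain central limit theorem from non-overlapping batch means; the AR(1) test chain, the batch size, and the function names are assumptions, and the paper's multivariate estimators are not reproduced here.

```python
import random
import statistics


def batch_means_variance(chain, batch_size):
    """Batch means estimate of the asymptotic variance of the chain average."""
    n_batches = len(chain) // batch_size
    means = [statistics.fmean(chain[k * batch_size:(k + 1) * batch_size])
             for k in range(n_batches)]
    # The variance of a batch mean is roughly sigma^2 / batch_size.
    return batch_size * statistics.variance(means)


def ar1_chain(rho, n, rng):
    """Positively correlated test chain: x_t = rho * x_{t-1} + N(0, 1) noise."""
    x, out = 0.0, []
    for _ in range(n):
        x = rho * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out


rng = random.Random(4)
chain = ar1_chain(rho=0.5, n=90000, rng=rng)
sigma2_hat = batch_means_variance(chain, batch_size=300)
naive_var = statistics.variance(chain)
print(sigma2_hat, naive_var)
```

For this chain the asymptotic variance is 1 / (1 - 0.5)^2 = 4, while the marginal variance is only about 1.33, so treating the draws as independent would understate the Monte Carlo error by a factor of three in variance.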
In medical research, understanding changes in outcome measurements is crucial for inferring shifts in a patient's underlying health condition. While data from clinical and administrative systems hold promise for advancing this understanding, traditional methods for modelling disease progression struggle with analyzing a large volume of longitudinal data collected irregularly and do not account for the phenomenon where the poorer an individual's health, the more frequently they interact with the healthcare system. In addition, data from the claim and health care system provide no information on terminating events, such as death. To address these challenges, we start from the continuous-time hidden Markov model to understand disease progression by modelling the observed data as an outcome whose distribution depends on the state of a latent Markov chain representing the underlying health state. However, we also allow the underlying health state to influence the timings of the observations via a point process. Furthermore, we create an additional "death" state and model the unobserved terminating event, a transition to this state, via an additional Poisson process whose rate depends on the latent state of the Markov chain. This extension allows us to model disease severity and death based not only on the types of care received but also on the temporal and frequency aspects of different observed events. We present an exact Gibbs sampler procedure that alternates sampling the complete path of the hidden chain (the latent health state throughout the observation window) conditional on the complete paths. When the unobserved, terminating event occurs early in the observation window, there are no more observed events, and naive use of a model with only "live" health states would lead to biases in parameter estimates; our inclusion of a "death" state mitigates this.
Bayesian inference for the Markov-modulated Poisson process with an outcome process. Yu Luo, Chris Sherlock. arXiv - STAT - Computation, 2024-08-27. https://doi.org/arxiv-2408.15314
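The generative structure described above can be sketched by simulation: a latent health state switches over time, observations arrive at a state-dependent rate, and an unobserved death event stops the record. This is a forward simulator under assumed rates and an assumed two-state chain, not the authors' Gibbs sampler, and outcome values are omitted.

```python
import random


def simulate_mmpp_with_death(switch_rate, obs_rate, death_rate, horizon, rng,
                             state=0):
    """Simulate observation times until death or the end of the window.

    Each rate argument is a pair indexed by latent state (0 healthier, 1 sicker).
    """
    t, obs_times, death_time = 0.0, [], None
    while True:
        total = switch_rate[state] + obs_rate[state] + death_rate[state]
        t += rng.expovariate(total)  # waiting time to the next event of any type
        if t >= horizon:
            break
        u = rng.random() * total
        if u < switch_rate[state]:
            state = 1 - state  # latent health state changes, nothing is observed
        elif u < switch_rate[state] + obs_rate[state]:
            obs_times.append(t)  # contact with the healthcare system
        else:
            death_time = t  # terminating event, not recorded in the data
            break
    return obs_times, death_time


rng = random.Random(5)
obs_alive, death_alive = simulate_mmpp_with_death(
    (0.2, 0.1), (1.0, 5.0), (0.0, 0.0), horizon=50.0, rng=rng)
obs_dead, death_dead = simulate_mmpp_with_death(
    (0.2, 0.1), (1.0, 5.0), (50.0, 50.0), horizon=100.0, rng=rng)
print(len(obs_alive), death_alive, len(obs_dead), death_dead)
```

In the second call the record simply stops early. An analyst who sees only the observation times cannot tell a death from a long healthy spell, which is the source of the bias the abstract describes.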
The focus of this paper is a key component of a methodology for understanding, interpolating, and predicting fish movement patterns based on spatiotemporal data recorded by spatially static acoustic receivers. For periods of time, fish may be far from the receivers, resulting in the absence of observations. The lack of information on the fish's location for extended time periods poses challenges to the understanding of fish movement patterns, and hence, the identification of proper statistical inference frameworks for modeling the trajectories. As the initial step in our methodology, in this paper, we implement an imputation strategy that relies on both Markov chain and Brownian motion principles to enhance our dataset over time. This methodology will be generalizable and applicable to all fish species with similar migration patterns or data with similar structures due to the use of static acoustic receivers.
A New Perspective to Fish Trajectory Imputation: A Methodology for Spatiotemporal Modeling of Acoustically Tagged Fish Data. Mahshid Ahmadian, Edward L. Boone, Grace S. Chiu. arXiv - STAT - Computation, 2024-08-23. https://doi.org/arxiv-2408.13220
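One way to picture imputation based on Brownian motion is a Brownian bridge: positions between two detections are sampled so that the path starts at the first detection and ends at the second. The one-dimensional sketch below covers only that ingredient, with an assumed diffusion scale `sigma` and a hypothetical function name; the paper's strategy also uses a Markov chain component that is not shown.

```python
import math
import random


def brownian_bridge_impute(t0, x0, t1, x1, times, sigma, rng):
    """Sample positions at sorted `times` in (t0, t1), conditioned on both detections."""
    path, s, xs = [], t0, x0
    for t in times:
        # Conditional law of X(t) given X(s) = xs and X(t1) = x1.
        frac = (t - s) / (t1 - s)
        mean = xs + frac * (x1 - xs)
        var = sigma ** 2 * (t - s) * (t1 - t) / (t1 - s)
        xs = mean + math.sqrt(var) * rng.gauss(0.0, 1.0)
        path.append(xs)
        s = t
    return path


rng = random.Random(6)
gap_times = [2.5, 5.0, 7.5]
noisy_path = brownian_bridge_impute(0.0, 0.0, 10.0, 20.0, gap_times, 1.5, rng)
straight_path = brownian_bridge_impute(0.0, 0.0, 10.0, 20.0, gap_times, 0.0, rng)
print(noisy_path, straight_path)
```

With `sigma` set to zero the imputed points fall on the straight line between the two detections; larger values spread the imputed positions most in the middle of the gap, where the fish's location is least constrained. For planar positions the same update could be applied to each coordinate separately.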
Micro and survey datasets often contain private information about individuals, like their health status, income or political preferences. Previous studies have shown that, even after data anonymization, a malicious intruder could still be able to identify individuals in the dataset by matching their variables to external information. Disclosure risk measures are statistical measures meant to quantify how big such a risk is for a specific dataset. One of the most common measures is the number of sample unique values that are also population-unique. \cite{Man12} have shown how mixed membership models can provide very accurate estimates of this measure. A limitation of that approach is that the number of extreme profiles has to be chosen by the modeller. In this article, we propose a non-parametric version of the model, based on the Hierarchical Dirichlet Process (HDP). The proposed approach does not require any tuning parameter or model selection step and provides accurate estimates of the disclosure risk measure, even with samples as small as 1% of the population size. Moreover, a data augmentation scheme to address the presence of structural zeros is presented. The proposed methodology is tested on a real dataset from the New York census.
Disclosure risk assessment with Bayesian non-parametric hierarchical modelling. Marco Battiston, Lorenzo Rimella. arXiv - STAT - Computation, 2024-08-22. https://doi.org/arxiv-2408.12521
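The risk measure itself is easy to state in code when the full population is available, which is what makes it a useful benchmark for model-based estimates. The sketch below counts sampled records whose combination of key variables is unique both in the sample and in the population; the toy records and the function name are assumptions. In practice the population is unknown, so this count has to be estimated, as the abstract describes.

```python
from collections import Counter


def sample_and_population_uniques(population, sample_indices):
    """Count sampled records that are unique in the sample and in the population."""
    pop_counts = Counter(population)
    sample = [population[i] for i in sample_indices]
    sample_counts = Counter(sample)
    return sum(1 for rec in sample
               if sample_counts[rec] == 1 and pop_counts[rec] == 1)


# Hypothetical key variables: (age band, region).
population = [
    ("30-39", "north"), ("30-39", "north"),
    ("40-49", "south"), ("50-59", "east"),
    ("60-69", "west"), ("60-69", "west"),
]
risk_count = sample_and_population_uniques(population, [0, 2, 3, 4])
print(risk_count)
```

Here all four sampled records are unique within the sample, but only two of them are also unique in the population, so only those two could be re-identified with certainty by matching on the key variables.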