Electronic health records and other sources of observational data are increasingly used for drawing causal inferences. Because these data were not collected for research purposes, estimating a causal effect from them is subject to confounding and to irregularly spaced, covariate-driven observation times, both of which affect the inference. A previously proposed doubly-weighted estimator accounts for these features but relies on the correct specification of two nuisance models used for the weights. In this work, we propose a novel consistent multiply robust estimator and demonstrate analytically and in comprehensive simulation studies that it is more flexible and more efficient than the only alternative estimator proposed for the same setting. We further apply it to data from the Add Health study in the United States to estimate the causal effect of therapy counseling on alcohol consumption in American adolescents.
Multiply robust estimation of marginal structural models in observational studies subject to covariate-driven observations. Janie Coulombe, Shu Yang. Biometrics, 2024-07-01. DOI: 10.1093/biomtc/ujae065. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11250490/pdf/
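As a rough illustration of the doubly-weighted idea mentioned in the abstract above, the sketch below multiplies an inverse-probability-of-treatment weight by an inverse-observation-intensity weight and compares weighted outcome means. It is not the multiply robust estimator proposed in the paper; the function name, the already-fitted nuisance values `propensity` and `visit_intensity`, and the toy numbers are all assumptions made for illustration.

```python
def doubly_weighted_difference(outcomes, treated, propensity, visit_intensity):
    """Weighted difference in mean outcome between treated and untreated records.

    Each observed record gets weight 1 / (P(treatment received) * visit intensity).
    All four inputs are hypothetical, already-fitted quantities.
    """
    num = {1: 0.0, 0: 0.0}
    den = {1: 0.0, 0: 0.0}
    for y, a, e, lam in zip(outcomes, treated, propensity, visit_intensity):
        # Probability of the treatment level that was actually received.
        p_received = e if a == 1 else 1.0 - e
        w = 1.0 / (p_received * lam)
        num[a] += w * y
        den[a] += w
    # Normalised (Hajek-type) weighted means in each treatment group.
    return num[1] / den[1] - num[0] / den[0]


# Toy data (made up): constant weights, so the result is a plain mean difference.
effect = doubly_weighted_difference(
    outcomes=[3.0, 5.0, 2.0, 4.0],
    treated=[1, 1, 0, 0],
    propensity=[0.5, 0.5, 0.5, 0.5],
    visit_intensity=[1.0, 1.0, 1.0, 1.0],
)
```

Both weights come from nuisance models, which is why an estimator of this form depends on both models being correctly specified.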
Many existing methodologies for analyzing spatiotemporal point patterns assume that the second-order intensity or pair correlation is stationary in both space and time. In practice, however, this assumption is often invalid or unrealistic. In this paper, we propose a novel and flexible nonparametric approach for estimating the second-order characteristics of spatiotemporal point processes, accommodating non-stationary temporal correlations. Our proposed method employs kernel smoothing and treats spatial and temporal correlations differently. Under a spatially increasing-domain asymptotic framework, we establish consistency of the proposed estimators, which can be constructed using different first-order intensity estimators to enhance practicality. Simulation results reveal that our method, in comparison with existing approaches, significantly improves statistical efficiency. An application to a COVID-19 dataset further illustrates the flexibility and interpretability of our procedure.
Nonparametric second-order estimation for spatiotemporal point patterns. Decai Liang, Jialing Liu, Ye Shen, Yongtao Guan. Biometrics, 2024-07-01. DOI: 10.1093/biomtc/ujae071.
Federico Castelletti, Guido Consonni, Marco L Della Vedova
The scope of this paper is a multivariate setting involving categorical variables. Following an external manipulation of one variable, the goal is to evaluate the causal effect on an outcome of interest. A typical scenario involves a system of variables representing lifestyle, physical and mental features, symptoms, and risk factors, with the outcome being the presence or absence of a disease. These variables are interconnected in complex ways, allowing the effect of an intervention to propagate through multiple paths. A distinctive feature of our approach is the estimation of causal effects while accounting for uncertainty in both the dependence structure, which we represent through a directed acyclic graph (DAG), and the DAG-model parameters. Specifically, we propose a Markov chain Monte Carlo algorithm that targets the joint posterior over DAGs and parameters, based on an efficient reversible-jump proposal scheme. We validate our method through extensive simulation studies and demonstrate that it outperforms current state-of-the-art procedures in terms of estimation accuracy. Finally, we apply our methodology to analyze a dataset on depression and anxiety in undergraduate students.
Joint structure learning and causal effect estimation for categorical graphical models. Federico Castelletti, Guido Consonni, Marco L Della Vedova. Biometrics, 2024-07-01. DOI: 10.1093/biomtc/ujae067.
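To make the notion of a causal effect under an external manipulation concrete, the sketch below applies the standard parent-adjustment formula, P(Y = 1 | do(X = x)) = sum over z of P(Y = 1 | X = x, Z = z) P(Z = z), for a single fixed DAG with binary variables. The three-node graph (Z -> X, Z -> Y, X -> Y) and every probability are hypothetical. The paper additionally averages over posterior draws of DAGs and parameters, which this sketch does not attempt.

```python
def interventional_probability(x, p_parent, p_outcome_given):
    """P(Y = 1 | do(X = x)), adjusting for Z, the only parent of X.

    p_parent[z]             : P(Z = z)
    p_outcome_given[(x, z)] : P(Y = 1 | X = x, Z = z)
    """
    return sum(p_outcome_given[(x, z)] * pz for z, pz in p_parent.items())


# Hypothetical probabilities for the DAG Z -> X, Z -> Y, X -> Y.
p_parent = {0: 0.6, 1: 0.4}
p_outcome_given = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.5}

# Causal effect of setting X = 1 versus X = 0 on the probability of Y = 1.
causal_effect = (
    interventional_probability(1, p_parent, p_outcome_given)
    - interventional_probability(0, p_parent, p_outcome_given)
)
```

If the graph were uncertain, one would repeat this calculation for each sampled DAG and parameter draw and summarise the resulting effects.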
Huimin Li, Bencong Zhu, Xi Jiang, Lei Guo, Yang Xie, Lin Xu, Qiwei Li
Recent breakthroughs in spatially resolved transcriptomics (SRT) technologies have enabled comprehensive molecular characterization at the spot or cellular level while preserving spatial information. Cells are the fundamental building blocks of tissues, organized into distinct yet connected components. Although many non-spatial and spatial clustering approaches have been used to partition the entire region into mutually exclusive spatial domains based on the SRT high-dimensional molecular profile, most require an ad hoc choice among dimension-reduction techniques that are hard to interpret. To overcome this challenge, we propose a zero-inflated negative binomial mixture model to cluster spots or cells based on their molecular profiles. To increase interpretability, we employ a feature selection mechanism to provide a low-dimensional summary of the SRT molecular profile in terms of discriminating genes that shed light on the clustering result. We further incorporate the SRT geospatial profile via a Markov random field prior. We demonstrate how this joint modeling strategy improves clustering accuracy, compared with alternative state-of-the-art approaches, through simulation studies and three real-data applications.
An interpretable Bayesian clustering approach with feature selection for analyzing spatially resolved transcriptomics data. Huimin Li, Bencong Zhu, Xi Jiang, Lei Guo, Yang Xie, Lin Xu, Qiwei Li. Biometrics, 2024-07-01. DOI: 10.1093/biomtc/ujae066. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11285114/pdf/
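For readers unfamiliar with the zero-inflated negative binomial (ZINB) distribution named in the abstract above, the sketch below evaluates its probability mass function for a single count. The mean/dispersion parameterization and the parameter names `mu`, `phi`, and `pi_zero` are assumptions chosen for illustration; the abstract does not state which parameterization the authors use.

```python
import math


def zinb_pmf(y, mu, phi, pi_zero):
    """Zero-inflated negative binomial probability of observing count y.

    mu      : mean of the negative binomial component (> 0)
    phi     : dispersion (size) parameter (> 0)
    pi_zero : probability of an extra, structural zero, in [0, 1)
    """
    # Negative binomial log-probability in the mean/dispersion form.
    log_nb = (
        math.lgamma(y + phi) - math.lgamma(phi) - math.lgamma(y + 1)
        + phi * math.log(phi / (mu + phi))
        + y * math.log(mu / (mu + phi))
    )
    nb = math.exp(log_nb)
    # Extra mass at zero, plus the down-weighted negative binomial part.
    return pi_zero * (y == 0) + (1.0 - pi_zero) * nb


# Sanity check: the probabilities over a long range of counts sum to about 1.
total = sum(zinb_pmf(y, mu=2.0, phi=1.5, pi_zero=0.3) for y in range(500))
```

In a mixture model for clustering, each cluster would carry its own ZINB parameters and each spot or cell would be assigned to the cluster under which its counts are most plausible.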
Willem van den Boom, Maria De Iorio, Fang Qian, Alessandra Guglielmi
Time-to-event data are often recorded on a discrete scale with multiple, competing risks as potential causes for the event. In this context, applying continuous survival analysis methods with a single risk leads to biased estimation. Therefore, we propose the multivariate Bernoulli detector for competing risks with discrete times, which involves a multivariate change point model on the cause-specific baseline hazards. Through the prior on the number of change points and their location, we impose dependence between change points across risks and allow for data-driven learning of their number. Then, conditionally on these change points, a multivariate Bernoulli prior is used to infer which risks are involved. Posterior inference focuses on cause-specific hazard rates and dependence across risks. Such dependence is often present due to subject-specific changes across time that affect all risks. Full posterior inference is performed through a tailored local-global Markov chain Monte Carlo (MCMC) algorithm, which exploits a data augmentation trick and MCMC updates from nonconjugate Bayesian nonparametric methods. We illustrate our model in simulations and on ICU data, comparing its performance with existing approaches.
The multivariate Bernoulli detector: change point estimation in discrete survival analysis. Willem van den Boom, Maria De Iorio, Fang Qian, Alessandra Guglielmi. Biometrics, 2024-07-01. DOI: 10.1093/biomtc/ujae075.
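The sketch below illustrates the basic objects named in the abstract above: discrete-time cause-specific hazards that are piecewise constant between change points, and the cumulative incidence they imply for each competing risk. It uses standard discrete-time survival identities only and is not the authors' Bayesian model or MCMC sampler; the helper names, hazard levels, and change point locations are hypothetical.

```python
def piecewise_constant(levels, change_points, n_times):
    """Hazard over times 0..n_times-1 that moves to the next level at each
    change point (change_points must be increasing)."""
    out = []
    segment = 0
    for t in range(n_times):
        while segment < len(change_points) and t >= change_points[segment]:
            segment += 1
        out.append(levels[segment])
    return out


def cumulative_incidence(hazards_by_time):
    """Cumulative incidence per risk from discrete cause-specific hazards.

    hazards_by_time[t][k] is P(event at time t from risk k | no event before t).
    Returns (cif, survival): cif[k] is the probability of failing from risk k
    by the last time point, survival the probability of no event at all.
    """
    cif = [0.0] * len(hazards_by_time[0])
    survival = 1.0
    for hazards in hazards_by_time:
        for k, h in enumerate(hazards):
            cif[k] += survival * h      # still event-free, then fail from risk k
        survival *= 1.0 - sum(hazards)  # remain event-free past this time
    return cif, survival


# Hypothetical example: risk A has a change point at time 3, risk B has none.
risk_a = piecewise_constant([0.05, 0.10], change_points=[3], n_times=6)
risk_b = piecewise_constant([0.02], change_points=[], n_times=6)
cif, survival = cumulative_incidence(list(zip(risk_a, risk_b)))
```

Here only risk A is affected by the change point; in the paper's setting, which risks are involved at each change point is itself inferred.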
Yuheng Wang, Juan Ye, Xiaohui Li, David L Borchers
Passive acoustic monitoring can be an effective way of monitoring wildlife populations that are acoustically active but difficult to survey visually, but identifying target species calls in recordings is non-trivial. Machine learning (ML) techniques can detect calls quickly but may miss calls and produce false positives, i.e., misidentify calls from other sources as being from the target species. While abundance estimation methods can address the former issue effectively, methods to deal with false positives are under-investigated. We propose an acoustic spatial capture-recapture (ASCR) method that deals with false positives by treating species identity as a latent variable. Individual-level outputs from ML techniques are treated as random variables whose distributions depend on the latent identity. This gives rise to a mixture model likelihood that we maximize to estimate call density. We compare our method to existing methods by applying it to an ASCR survey of frogs and simulated acoustic surveys of gibbons based on real gibbon acoustic data. Estimates from our method are closer to those from ASCR applied to the dataset without false positives than are estimates from a widely used false positive "correction factor" method. Simulations show that our method has bias close to zero and accurate coverage probabilities and performs substantially better than ASCR that does not account for false positives.
Towards automated animal density estimation with acoustic spatial capture-recapture. Yuheng Wang, Juan Ye, Xiaohui Li, David L Borchers. Biometrics, 2024-07-01. DOI: 10.1093/biomtc/ujae081.
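The latent-identity idea described above can be sketched with a two-component mixture: each detection's classifier score is drawn from one distribution if the call is truly from the target species and from another if it is a false positive. The example below estimates the fraction of true target calls by maximizing the mixture likelihood with EM. It deliberately omits the spatial capture-recapture part of the method, and the Gaussian score distributions, their parameters, and the scores themselves are all hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


# Hypothetical (mean, sd) of classifier scores for target calls and other sounds.
TARGET = (0.8, 0.1)
OTHER = (0.3, 0.1)


def mixture_log_likelihood(scores, p_target):
    """Log-likelihood when a score is from the target species with prob. p_target."""
    return sum(
        math.log(p_target * normal_pdf(s, *TARGET)
                 + (1.0 - p_target) * normal_pdf(s, *OTHER))
        for s in scores
    )


def estimate_target_fraction(scores, n_iter=100):
    """EM updates for the mixing proportion with fixed component densities."""
    p = 0.5
    for _ in range(n_iter):
        resp = []
        for s in scores:
            f_target = p * normal_pdf(s, *TARGET)
            f_other = (1.0 - p) * normal_pdf(s, *OTHER)
            # Posterior probability that this detection is a true target call.
            resp.append(f_target / (f_target + f_other))
        p = sum(resp) / len(resp)
    return p


scores = [0.82, 0.79, 0.85, 0.31, 0.28, 0.77]  # made-up classifier outputs
p_hat = estimate_target_fraction(scores)
```

Because species identity is never observed, no detection is hard-labelled; each contributes to the likelihood through both components.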
Niklas Hagemann, Giampiero Marra, Frank Bretz, Kathrin Möllenhoff
A common problem in clinical trials is to test whether the effect of an explanatory variable on a response of interest is similar between two groups, for example, patient or treatment groups. In this regard, similarity is defined as equivalence up to a pre-specified threshold that denotes an acceptable deviation between the two groups. Similarity is typically assessed using, for example, confidence intervals of differences or a suitable distance between two parametric regression models. Typically, these approaches build on the assumption of a univariate continuous or binary outcome variable. However, multivariate outcomes, especially beyond the case of bivariate binary responses, remain underexplored. This paper introduces an approach based on a generalized joint regression framework exploiting the Gaussian copula. Compared to existing methods, our approach accommodates various outcome variable scales, such as continuous, binary, categorical, and ordinal, including mixed outcomes in multi-dimensional spaces. We demonstrate the validity of this approach through a simulation study and an efficacy-toxicity case study, highlighting its practical relevance.
Testing for similarity of multivariate mixed outcomes using generalized joint regression models with application to efficacy-toxicity responses. Niklas Hagemann, Giampiero Marra, Frank Bretz, Kathrin Möllenhoff. Biometrics, 2024-07-01. DOI: 10.1093/biomtc/ujae077.
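The confidence-interval route to similarity testing mentioned in the abstract can be sketched for the simplest case, a single scalar difference with a normal approximation: similarity is concluded when the (1 - 2*alpha) confidence interval lies entirely inside (-threshold, threshold), which corresponds to two one-sided tests at level alpha. This is the classical univariate procedure, not the copula-based multivariate approach of the paper; the function name and the numbers are illustrative assumptions.

```python
from statistics import NormalDist


def similar_by_ci_inclusion(diff_estimate, std_error, threshold, alpha=0.05):
    """Return (decision, (lower, upper)) for an equivalence test by CI inclusion.

    Similarity is declared when the (1 - 2*alpha) normal-approximation
    confidence interval for the difference lies within (-threshold, threshold).
    """
    z = NormalDist().inv_cdf(1.0 - alpha)
    lower = diff_estimate - z * std_error
    upper = diff_estimate + z * std_error
    return (-threshold < lower) and (upper < threshold), (lower, upper)


# Hypothetical estimates of a between-group difference in effect.
decision_small, interval_small = similar_by_ci_inclusion(0.1, 0.1, threshold=0.5)
decision_large, interval_large = similar_by_ci_inclusion(0.4, 0.1, threshold=0.5)
```

The burden of proof is reversed relative to an ordinary difference test: a wide interval leads to no conclusion of similarity, however small the point estimate.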
Catherine Xinrui Yu, Jiaqi Gu, Zhaomeng Chen, Zihuai He
Testing multiple hypotheses of conditional independence with provable error rate control is a fundamental problem with various applications. To infer conditional independence with family-wise error rate (FWER) control when only summary statistics of marginal dependence are accessible, we adopt GhostKnockoff to directly generate knockoff copies of summary statistics and propose a new filter to select features conditionally dependent on the response. In addition, we develop a computationally efficient algorithm that greatly reduces the cost of generating knockoff copies without sacrificing power or FWER control. Experiments on simulated data and a real dataset of Alzheimer's disease genetics demonstrate the advantage of the proposed method over existing alternatives in both statistical power and computational efficiency.
Summary statistics knockoffs inference with family-wise error rate control. Catherine Xinrui Yu, Jiaqi Gu, Zhaomeng Chen, Zihuai He. Biometrics, 2024-07-01. DOI: 10.1093/biomtc/ujae082. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11367731/pdf/
Isaac H Goldstein, Daniel M Parker, Sunny Jiang, Volodymyr M Minin
Concentrations of pathogen genomes measured in wastewater have recently become available as a new data source to use when modeling the spread of infectious diseases. One promising use for this data source is inference of the effective reproduction number, the average number of individuals a newly infected person will infect. We propose a model in which new infections arrive according to a time-varying immigration rate that can be interpreted as the average number of secondary infections produced by one infectious individual per unit time. This model allows us to estimate the effective reproduction number from concentrations of pathogen genomes while avoiding difficult-to-verify assumptions about the dynamics of the susceptible population. As a byproduct of our primary goal, we also produce a new model for estimating the effective reproduction number from case data using the same framework. We test this modeling framework in an agent-based simulation study with a realistic data-generating mechanism that accounts for the time-varying dynamics of pathogen shedding. Finally, we apply our new model to estimating the effective reproduction number of SARS-CoV-2, the causative agent of COVID-19, in Los Angeles, CA, using pathogen RNA concentrations collected from a large wastewater treatment facility.
Semiparametric inference of effective reproduction number dynamics from wastewater pathogen surveillance data. Isaac H Goldstein, Daniel M Parker, Sunny Jiang, Volodymyr M Minin. Biometrics, 2024-07-01. DOI: 10.1093/biomtc/ujae074.
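To give a feel for what an effective reproduction number measures, the sketch below uses the widely known renewal-equation relationship, R_t = I_t / sum over s of g_s I_{t-s}, where I_t is incidence and g_s is the generation-interval distribution. This is a swapped-in textbook technique applied to known incidence, not the semiparametric wastewater model proposed in the paper; the function name, the incidence series, and the generation-interval values are made up.

```python
def renewal_reproduction_numbers(incidence, generation_pmf):
    """Plug-in R_t from the renewal equation.

    incidence[t]        : number of new infections at time t
    generation_pmf[s-1] : probability that a transmission occurs s steps
                          after the infector was infected (sums to 1)
    Returns R_t for every t with a full history of lags available.
    Assumes the lagged incidence is non-zero so that the ratio is defined.
    """
    estimates = []
    for t in range(len(generation_pmf), len(incidence)):
        # Infection pressure from earlier cohorts, weighted by the generation interval.
        pressure = sum(
            g * incidence[t - s] for s, g in enumerate(generation_pmf, start=1)
        )
        estimates.append(incidence[t] / pressure)
    return estimates


# Hypothetical series: flat incidence, and incidence doubling each step.
flat = renewal_reproduction_numbers([10, 10, 10, 10, 10, 10], [0.5, 0.3, 0.2])
doubling = renewal_reproduction_numbers([1, 2, 4, 8, 16], [1.0])
```

Flat incidence corresponds to a reproduction number of one, and incidence that doubles every generation corresponds to a reproduction number of two.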
The increasing availability and scale of biobanks and "omic" datasets bring new horizons for understanding biological mechanisms. PathGPS is an exploratory data analysis tool to discover genetic architectures using genome-wide association study (GWAS) summary data. PathGPS is based on a linear structural equation model where traits are regulated by both genetic and environmental pathways. PathGPS decouples the genetic and environmental components by contrasting the GWAS associations of "signal" genes with those of "noise" genes. From the estimated genetic component, PathGPS then extracts genetic pathways via principal component and factor analysis, leveraging the low-rank and sparse properties. In addition, we provide a bootstrap aggregating ("bagging") algorithm to improve stability under data perturbation and hyperparameter tuning. When applied to a metabolomics dataset and the UK Biobank, PathGPS confirms several known gene-trait clusters and suggests multiple new hypotheses for future investigations.
PathGPS: discover shared genetic architecture using GWAS summary data. Zijun Gao, Qingyuan Zhao, Trevor Hastie. Biometrics, 2024-07-01. DOI: 10.1093/biomtc/ujae060. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11247175/pdf/