Pub Date: 2024-10-01. DOI: 10.1093/biostatistics/kxae007
Shanghong Xie, R Todd Ogden
Linear and generalized linear scalar-on-function models are commonly used to understand the relationship between a scalar response variable (e.g. a continuous or binary outcome) and functional predictors. Such techniques are sensitive to model misspecification when the relationship between the response variable and the functional predictors is complex. On the other hand, support vector machines (SVMs) are among the most robust prediction models but do not account for the high correlations between repeated measurements and cannot be used for irregular data. In this work, we propose a novel method that integrates functional principal component analysis with SVM techniques for classification and regression, accounting for the continuous nature of functional data and the nonlinear relationship between the scalar response variable and the functional predictors. We demonstrate the performance of our method through extensive simulation experiments and two real data applications: the classification of alcoholics using electroencephalography signals and the prediction of glucobrassicin concentration using near-infrared reflectance spectroscopy. Our methods are especially advantageous when the measurement errors in the functional predictors are relatively large.
Title: Functional support vector machine. Journal: Biostatistics, pages 1178-1194. Publication type: Journal Article.
Pub Date: 2024-10-01. DOI: 10.1093/biostatistics/kxae022
Title: Correction to: Exponential family measurement error models for single-cell CRISPR screens. Journal: Biostatistics, page 1273. Publication type: Journal Article.
Pub Date: 2024-10-01. DOI: 10.1093/biostatistics/kxad035
Arman Oganisian, Kelly D Getz, Todd A Alonzo, Richard Aplenc, Jason A Roy
We develop a Bayesian semiparametric model for the impact of dynamic treatment rules on survival among patients diagnosed with pediatric acute myeloid leukemia (AML). The data consist of a subset of patients enrolled in a phase III clinical trial in which patients move through a sequence of four treatment courses. At each course, they undergo treatment that may or may not include anthracyclines (ACT). While ACT is known to be effective at treating AML, it is also cardiotoxic and can lead to early death for some patients. Our task is to estimate the potential survival probability under hypothetical dynamic ACT treatment strategies, but there are several impediments. First, since ACT is not randomized, its effect on survival is confounded over time. Second, subjects initiate the next course depending on when they recover from the previous course, making timing potentially informative of subsequent treatment and survival. Third, patients may die or drop out before ever completing the full treatment sequence. We develop a generative Bayesian semiparametric model based on Gamma Process priors to address these complexities. At each treatment course, the model captures subjects' transition to subsequent treatment or death in continuous time. G-computation is used to compute a posterior over potential survival probability that is adjusted for time-varying confounding. Using our approach, we estimate the efficacy of hypothetical treatment rules that dynamically modify ACT based on evolving cardiac function.
Title: Bayesian semiparametric model for sequential treatment decisions with informative timing. Journal: Biostatistics, pages 947-961. Publication type: Journal Article. Open access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11471958/pdf/
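The G-computation step mentioned in this abstract can be illustrated with a much simpler setting than the one the authors study. The sketch below assumes a single treatment decision, one discrete confounder, and plain stratified means; it is not the Bayesian, time-varying procedure of the paper. The function name `g_formula_mean` and the tuple layout of `records` are hypothetical choices made for this illustration.

```python
from collections import defaultdict


def g_formula_mean(records, treatment):
    """Standardized mean outcome under `treatment` (point-treatment g-formula).

    records: iterable of (confounder_level, treatment, outcome) tuples.
    Computes sum over levels l of E[Y | A=treatment, L=l] * P(L=l).
    """
    n = 0
    stratum_sizes = defaultdict(int)  # number of subjects at each confounder level
    outcome_sums = defaultdict(float)  # outcome total per (level, treatment)
    cell_counts = defaultdict(int)     # subjects per (level, treatment)

    for level, a, y in records:
        n += 1
        stratum_sizes[level] += 1
        outcome_sums[(level, a)] += y
        cell_counts[(level, a)] += 1

    total = 0.0
    for level, size in stratum_sizes.items():
        count = cell_counts[(level, treatment)]
        if count == 0:
            # Positivity fails: nobody in this stratum received the treatment.
            raise ValueError(f"no subjects with treatment={treatment!r} in stratum {level!r}")
        conditional_mean = outcome_sums[(level, treatment)] / count
        total += conditional_mean * (size / n)
    return total
```

Averaging the stratum-specific means over the marginal confounder distribution, rather than over the distribution among the treated, is what removes the confounding in this toy version.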
Pub Date: 2024-10-01. DOI: 10.1093/biostatistics/kxae006
Jin Jin, Guanghao Qi, Zhi Yu, Nilanjan Chatterjee
Mendelian randomization (MR) analysis is increasingly popular for testing the causal effect of exposures on disease outcomes using data from genome-wide association studies (GWAS). In some settings, the underlying exposure, such as systemic inflammation, may not be directly observable, but measurements can be available on multiple biomarkers or other types of traits that are co-regulated by the exposure. We propose a method for MR analysis on latent exposures (MRLE), which tests the significance and the direction of the effect of a latent exposure by leveraging information from multiple related traits. The method is developed by constructing a set of estimating functions based on the second-order moments of GWAS summary association statistics for the observable traits, under a structural equation model where genetic variants are assumed to have indirect effects through the latent exposure and potentially direct effects on the traits. Simulation studies show that MRLE has well-controlled type I error rates and enhanced power compared to single-trait MR tests under various types of pleiotropy. Applications of MRLE using genetic association statistics across five inflammatory biomarkers (CRP, IL-6, IL-8, TNF-α, and MCP-1) provide evidence for potential causal effects of inflammation on increasing the risk of coronary artery disease, colorectal cancer, and rheumatoid arthritis, while standard MR analysis for individual biomarkers fails to detect consistent evidence for such effects.
Title: Mendelian randomization analysis using multiple biomarkers of an underlying common exposure. Journal: Biostatistics, pages 1015-1033. Publication type: Journal Article.
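For orientation, the single-trait MR analysis that this abstract uses as a comparator is often carried out with a fixed-effect inverse-variance weighted (IVW) estimator on GWAS summary statistics. The sketch below shows that standard estimator only; it is not the MRLE method, which relies on second-order moments across several traits. The function name `ivw_estimate` and its argument names are hypothetical.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW causal estimate from per-variant summary statistics.

    beta_exposure: variant-exposure association estimates.
    beta_outcome:  variant-outcome association estimates.
    se_outcome:    standard errors of the variant-outcome estimates.
    Returns (estimate, standard_error), using first-order weights that
    ignore uncertainty in the variant-exposure estimates.
    """
    if not (len(beta_exposure) == len(beta_outcome) == len(se_outcome)):
        raise ValueError("all inputs must have the same length")

    numerator = 0.0
    denominator = 0.0
    for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome):
        weight = 1.0 / se ** 2          # inverse variance of the outcome estimate
        numerator += weight * bx * by
        denominator += weight * bx * bx

    estimate = numerator / denominator
    standard_error = math.sqrt(1.0 / denominator)
    return estimate, standard_error
```

This is equivalent to a weighted regression of the variant-outcome estimates on the variant-exposure estimates through the origin, which is why pleiotropic direct effects can bias it.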
Pub Date: 2024-10-01. DOI: 10.1093/biostatistics/kxae001
Wei Jin, Yang Ni, Amanda B Spence, Leah H Rubin, Yanxun Xu
Combination antiretroviral therapy (ART) with at least three different drugs has become the standard of care for people with HIV (PWH) due to its exceptional effectiveness in viral suppression. However, many ART drugs have been reported to be associated with neuropsychiatric adverse effects including depression, especially when certain genetic polymorphisms exist. Pharmacogenetics is an important consideration for administering combination ART as it may influence drug efficacy and increase the risk of neuropsychiatric conditions. Large-scale longitudinal HIV databases provide researchers with opportunities to investigate the pharmacogenetics of combination ART in a data-driven manner. However, with more than 30 FDA-approved ART drugs, the interplay between the large number of possible ART drug combinations and genetic polymorphisms poses statistical modeling challenges. We develop a Bayesian approach to examine the longitudinal effects of combination ART and their interactions with genetic polymorphisms on depressive symptoms in PWH. The proposed method utilizes a Gaussian process with a composite kernel function to capture the longitudinal combination ART effects by directly incorporating individuals' treatment histories, and a Bayesian classification and regression tree to account for individual heterogeneity. Through both simulation studies and an application to a dataset from the Women's Interagency HIV Study, we demonstrate the clinical utility of the proposed approach in investigating the pharmacogenetics of combination ART and assisting physicians in making effective individualized treatment decisions that can improve health outcomes for PWH.
Title: A Bayesian approach for investigating the pharmacogenetics of combination antiretroviral therapy in people with HIV. Journal: Biostatistics, pages 1034-1048. Publication type: Journal Article.
Pub Date: 2024-10-01. DOI: 10.1093/biostatistics/kxae005
Zhenke Wu, Zehang R Li, Irena Chen, Mengbing Li
Determining the causes of death (CODs) for deaths that occur outside civil registration and vital statistics systems is challenging. In practice, a technique called verbal autopsy (VA) is widely adopted to gather information on such deaths. A VA consists of interviewing relatives of a deceased person about symptoms of the deceased in the period leading to the death, often resulting in multivariate binary responses. While statistical methods have been devised for estimating the cause-specific mortality fractions (CSMFs) for a study population, continued expansion of VA to new populations (or "domains") necessitates approaches that recognize between-domain differences while capitalizing on potential similarities. In this article, we propose such a domain-adaptive method that integrates external between-domain similarity information encoded by a prespecified rooted weighted tree. Given a cause, we use latent class models to characterize the conditional distributions of the responses that may vary by domain. We specify a logistic stick-breaking Gaussian diffusion process prior along the tree for class mixing weights with node-specific spike-and-slab priors to pool information between the domains in a data-driven way. The posterior inference is conducted via a scalable variational Bayes algorithm. Simulation studies show that the domain adaptation enabled by the proposed method improves CSMF estimation and individual COD assignment. We also illustrate and evaluate the method using a validation dataset. The article concludes with a discussion of limitations and future directions.
Title: Tree-informed Bayesian multi-source domain adaptation: cross-population probabilistic cause-of-death assignment using verbal autopsy. Journal: Biostatistics, pages 1233-1253. Publication type: Journal Article. Open access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11471964/pdf/
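The logistic stick-breaking construction named in this abstract maps unconstrained real values to a vector of class mixing weights. The sketch below shows only that generic mapping, assuming K - 1 real inputs for K classes; the tree-structured Gaussian diffusion prior and the spike-and-slab components of the paper are not represented. The function name `logistic_stick_breaking` is a hypothetical choice.

```python
import math


def logistic_stick_breaking(etas):
    """Map K-1 real numbers to K mixing weights that sum to one.

    Each eta is passed through the logistic function to get the fraction
    of the remaining "stick" assigned to the current class; the last class
    receives whatever is left.
    """
    remaining = 1.0
    weights = []
    for eta in etas:
        fraction = 1.0 / (1.0 + math.exp(-eta))  # logistic (sigmoid) transform
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    weights.append(remaining)
    return weights
```

Because the inputs are unconstrained, a Gaussian prior can be placed on them directly, which is the usual motivation for this parameterization of mixing weights.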
DNA methylation is an important epigenetic mark that modulates gene expression by inhibiting the binding of transcriptional proteins to DNA. As in many other omics experiments, missing values are an important issue, and appropriate imputation techniques are needed both to avoid an unnecessary reduction in sample size and to optimally leverage the information collected. We consider the case where relatively few samples are processed via an expensive high-density whole genome bisulfite sequencing (WGBS) strategy and a larger number of samples is processed using more affordable low-density, array-based technologies. In such cases, one can impute the low-coverage (array-based) methylation data using the high-density information provided by the WGBS samples. In this paper, we propose an efficient Linear Model of Coregionalisation with informative Covariates (LMCC) to predict missing values based on observed values and covariates. Our model assumes that at each site, the methylation vector of all samples is linked to a set of fixed factors (covariates) and a set of latent factors. Furthermore, we exploit the functional nature of the data and the spatial correlation across sites by placing Gaussian processes on the fixed and latent coefficient vectors, respectively. Our simulations show that the use of covariates can significantly improve the accuracy of imputed values, especially in cases where missing data contain some relevant information about the explanatory variable. We also show that our proposed model is particularly efficient when the number of columns is much greater than the number of rows, which is usually the case in methylation data analysis. Finally, we apply and compare our proposed method with alternative approaches on two real methylation datasets, showing how covariates such as cell type, tissue type or age can enhance the accuracy of imputed values.
Title: Fast matrix completion in epigenetic methylation studies with informative covariates. Authors: Mélina Ribaud, Aurélie Labbe, Khaled Fouda, Karim Oualkacha. Pub Date: 2024-10-01. DOI: 10.1093/biostatistics/kxae016. Journal: Biostatistics, pages 1062-1078. Publication type: Journal Article. Open access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11471954/pdf/
Pub Date: 2024-10-01. DOI: 10.1093/biostatistics/kxae008
Xinyuan Tian, Yiting Wang, Selena Wang, Yi Zhao, Yize Zhao
Genetic association studies for brain connectivity phenotypes have gained prominence due to advances in noninvasive imaging techniques and quantitative genetics. Brain connectivity traits, characterized by network configurations and unique biological structures, present distinct challenges compared to other quantitative phenotypes. Furthermore, the presence of sample relatedness in most imaging genetics studies limits the feasibility of adopting existing network-response modeling. In this article, we fill this gap by proposing a Bayesian network-response mixed-effect model that considers a network-variate phenotype and incorporates population structures including pedigrees and unknown sample relatedness. To accommodate the inherent topological architecture associated with the genetic contributions to the phenotype, we model the effect components via a set of effect network configurations and impose inter-network sparsity and intra-network shrinkage to dissect the phenotypic network configurations affected by the risk genetic variant. A Markov chain Monte Carlo (MCMC) algorithm is further developed to facilitate uncertainty quantification. We evaluate the performance of our model through extensive simulations. By further applying the method to study the genetic bases of brain structural connectivity, using data from the Human Connectome Project with extensive family structures, we obtain plausible and interpretable results. Beyond brain connectivity genetic studies, our proposed model also provides a general linear mixed-effect regression framework for network-variate outcomes.
Title: Bayesian mixed model inference for genetic association under related samples with brain network phenotype. Journal: Biostatistics, pages 1195-1209. Publication type: Journal Article.
Pub Date: 2024-10-01. DOI: 10.1093/biostatistics/kxae003
Bao Le, Xiaoyue Niu, Tim Brown, Jeffrey W Imai-Eaton
Dynamic models have been successfully used in producing estimates of HIV epidemics at the national level due to their epidemiological nature and their ability to estimate prevalence, incidence, and mortality rates simultaneously. Recently, HIV interventions and policies have required more information at sub-national levels to support local planning, decision-making and resource allocation. Unfortunately, many areas lack sufficient data for deriving stable and reliable results, and this is a critical technical barrier to more stratified estimates. One solution is to borrow information from other areas within the same country. However, directly assuming hierarchical structures within the HIV dynamic models is complicated and computationally time-consuming. In this article, we propose a simple and innovative way to incorporate hierarchical information into the dynamical systems by using auxiliary data. The proposed method efficiently uses information from multiple areas within each country without increasing the computational burden. As a result, the new model improves predictive ability and uncertainty assessment.
{"title":"Dynamic models augmented by hierarchical data: an application of estimating HIV epidemics at sub-national level.","authors":"Bao Le, Xiaoyue Niu, Tim Brown, Jeffrey W Imai-Eaton","doi":"10.1093/biostatistics/kxae003","DOIUrl":"10.1093/biostatistics/kxae003","url":null,"abstract":"<p><p>Dynamic models have been successfully used in producing estimates of HIV epidemics at the national level due to their epidemiological nature and their ability to estimate prevalence, incidence, and mortality rates simultaneously. Recently, HIV interventions and policies have required more information at sub-national levels to support local planning, decision-making and resource allocation. Unfortunately, many areas lack sufficient data for deriving stable and reliable results, and this is a critical technical barrier to more stratified estimates. One solution is to borrow information from other areas within the same country. However, directly assuming hierarchical structures within the HIV dynamic models is complicated and computationally time-consuming. In this article, we propose a simple and innovative way to incorporate hierarchical information into the dynamical systems by using auxiliary data. The proposed method efficiently uses information from multiple areas within each country without increasing the computational burden. 
As a result, the new model improves predictive ability and uncertainty assessment.</p>","PeriodicalId":55357,"journal":{"name":"Biostatistics","volume":" ","pages":"1049-1061"},"PeriodicalIF":1.8,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11471966/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139998375","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-10-01DOI: 10.1093/biostatistics/kxae024
Yifan Yu, Rosario Pintos Lobo, Michael Cody Riedel, Katherine Bottenhorn, Angela R Laird, Thomas E Nichols
Coordinate-based meta-analysis combines evidence from a collection of neuroimaging studies to estimate brain activation. In such analyses, a key practical challenge is to find a computationally efficient approach with good statistical interpretability to model the locations of activation foci. In this article, we propose a generative coordinate-based meta-regression (CBMR) framework to approximate a smooth activation intensity function and investigate the effect of study-level covariates (e.g. year of publication, sample size). We employ a spline parameterization to model the spatial structure of brain activation and consider four stochastic models for modeling the random variation in foci. To examine the validity of CBMR, we estimate brain activation on 20 meta-analytic datasets, conduct spatial homogeneity tests at the voxel level, and compare the results to those generated by existing kernel-based and model-based approaches.
Keywords: generalized linear models; meta-analysis; spatial statistics; statistical modeling.
{"title":"Neuroimaging meta regression for coordinate based meta analysis data with a spatial model.","authors":"Yifan Yu, Rosario Pintos Lobo, Michael Cody Riedel, Katherine Bottenhorn, Angela R Laird, Thomas E Nichols","doi":"10.1093/biostatistics/kxae024","DOIUrl":"10.1093/biostatistics/kxae024","url":null,"abstract":"<p><p>Coordinate-based meta-analysis combines evidence from a collection of neuroimaging studies to estimate brain activation. In such analyses, a key practical challenge is to find a computationally efficient approach with good statistical interpretability to model the locations of activation foci. In this article, we propose a generative coordinate-based meta-regression (CBMR) framework to approximate a smooth activation intensity function and investigate the effect of study-level covariates (e.g. year of publication, sample size). We employ a spline parameterization to model the spatial structure of brain activation and consider four stochastic models for modeling the random variation in foci. To examine the validity of CBMR, we estimate brain activation on 20 meta-analytic datasets, conduct spatial homogeneity tests at the voxel level, and compare the results to those generated by existing kernel-based and model-based approaches. 
Keywords: generalized linear models; meta-analysis; spatial statistics; statistical modeling.</p>","PeriodicalId":55357,"journal":{"name":"Biostatistics","volume":" ","pages":"1210-1232"},"PeriodicalIF":1.8,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11471956/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141604512","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}