
Annals of Applied Statistics: Latest Articles

BIVARIATE FUNCTIONAL PATTERNS OF LIFETIME MEDICARE COSTS AMONG ESRD PATIENTS.
IF 1.4; Zone 4 (Mathematics); Q2 Statistics & Probability; Pub Date: 2024-09-01; Epub Date: 2024-08-05; DOI: 10.1214/24-aoas1897
Yue Wang, Bin Nan, John D Kalbfleisch

In this work we study the lifetime Medicare spending patterns of patients with end-stage renal disease (ESRD). We extract the information of patients who started their ESRD services in 2007-2011 from the United States Renal Data System (USRDS). Patients are partitioned into three groups based on their kidney transplant status: 1-unwaitlisted and never transplanted, 2-waitlisted but never transplanted, and 3-waitlisted and then transplanted. To study their Medicare cost trajectories, we use a semiparametric regression model with both fixed and bivariate time-varying coefficients to compare groups 1 and 2, and a bivariate time-varying coefficient model with different starting times (time since the first ESRD service and time since the kidney transplant) to compare groups 2 and 3. In addition to demographics and other medical conditions, these regression models are conditional on the survival time, which ideally depict the lifetime Medicare spending patterns. For estimation, we extend the profile weighted least squares (PWLS) estimator to longitudinal data for the first comparison and propose a two-stage estimating method for the second comparison. We use sandwich variance estimators to construct confidence intervals and validate inference procedures through simulations. Our analysis of the Medicare claims data reveals that waitlisting is associated with a lower daily medical cost at the beginning of ESRD service among waitlisted patients which gradually increases over time. Averaging over lifespan, however, there is no difference between waitlisted and unwaitlisted groups. A kidney transplant, on the other hand, reduces the medical cost significantly after an initial spike.
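
The abstract above mentions sandwich variance estimators for confidence intervals. As a minimal sketch of that general idea only, the following computes a heteroscedasticity-robust (HC0) sandwich standard error for a simple least-squares slope. It is not the authors' PWLS or two-stage estimator; the function name `ols_with_sandwich`, the toy data, and the use of plain OLS are all assumptions made for illustration.

```python
import math
import random


def ols_with_sandwich(x, y):
    """Fit y = b0 + b1*x by least squares; return slope, model-based SE, sandwich SE."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]

    # Model-based variance assumes one common error variance for all observations.
    sigma2 = sum(e * e for e in resid) / (n - 2)
    naive_se = math.sqrt(sigma2 / sxx)

    # Sandwich (HC0) variance: bread * meat * bread, with bread = 1 / sxx.
    meat = sum(((xi - x_bar) ** 2) * e * e for xi, e in zip(x, resid))
    sandwich_se = math.sqrt(meat) / sxx
    return slope, naive_se, sandwich_se


random.seed(0)
x = [random.uniform(-1.0, 1.0) for _ in range(5000)]
# Hypothetical toy data: the error spread grows with |x|, so constant variance fails.
y = [1.0 + 2.0 * xi + random.gauss(0.0, abs(xi)) for xi in x]

slope, naive_se, sandwich_se = ols_with_sandwich(x, y)
ci = (slope - 1.96 * sandwich_se, slope + 1.96 * sandwich_se)
print(f"slope={slope:.3f}, naive SE={naive_se:.4f}, sandwich SE={sandwich_se:.4f}")
print(f"95% CI from sandwich SE: ({ci[0]:.3f}, {ci[1]:.3f})")
```

In this toy setting the sandwich standard error is larger than the model-based one because the residual variance is largest where the covariate has the most leverage. For longitudinal claims data a cluster-robust version, summing score contributions within each patient, would be the more natural analogue.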

Annals of Applied Statistics, vol. 18, no. 3, pp. 2596-2614. Journal Article. PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11488692/pdf/
Citations: 0
INTEGRATING MENDELIAN RANDOMIZATION WITH CAUSAL MEDIATION ANALYSES FOR CHARACTERIZING DIRECT AND INDIRECT EXPOSURE-TO-OUTCOME EFFECTS.
IF 1.3; Zone 4 (Mathematics); Q2 Statistics & Probability; Pub Date: 2024-09-01; Epub Date: 2024-08-05; DOI: 10.1214/24-aoas1901
Fan Yang, Lin S Chen, Shahram Oveisgharan, Dawood Darbar, David A Bennett

Mendelian randomization (MR) assesses the total effect of exposure on outcome. With the rapidly increasing availability of summary statistics from genome-wide association studies (GWASs), MR leverages existing summary statistics and is widely used to study the causal effects among complex traits and diseases. The total effect in the population is a sum of indirect and direct effects. For complex disease outcomes with complicated etiologies, and/or for modifiable exposure traits, there may exist more than one pathway between exposure and outcome. The direct effect and the indirect effect via a mediator of interest could be of opposite directions, and the total effect estimates may not be informative for treatment and prevention decision-making or may be even misleading for different subgroups of patients. Causal mediation analysis delineates the indirect effect of exposure on outcome operating through the mediator and the direct effect transmitted through other mechanisms. However, causal mediation analysis often requires individual-level data measured on exposure, outcome, mediator and confounding variables, and the power of the mediation analysis is restricted by sample size. In this work, motivated by a study of the effects of atrial fibrillation (AF) on Alzheimer's dementia, we propose a framework for Integrative Mendelian randomization and Mediation Analysis (IMMA). The proposed method integrates the total effect estimates from MR analyses based on large-scale GWASs with the direct and indirect effect estimates from mediation analysis based on individual-level data of a limited sample size. We introduce a series of IMMA models, under the scenarios with or without exposure-mediator interaction and/or study heterogeneity. The proposed IMMA models improve the estimation and the power of inference on the direct and indirect effects in the population, as well as the characterization of the variation of effects. 
Our analyses showed a significant positive direct effect of AF on Alzheimer's dementia risk not through the use of the oral anticoagulant treatment and a significant indirect effect of AF-induced anticoagulant treatment in reducing Alzheimer's dementia risk. The results suggested potential Alzheimer's dementia risk prediction and prevention strategies for AF patients, and paved the way for future re-evaluation of anticoagulant treatment guidelines for AF patients. A sensitivity analysis was conducted to assess the sensitivity of the conclusions to a key assumption of the IMMA approach.
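
The statement in the abstract above that the total effect is the sum of direct and indirect effects, which may point in opposite directions, can be illustrated with a small linear sketch. This is a product-of-coefficients decomposition on simulated data, assuming a continuous mediator and no exposure-mediator interaction; it is not the IMMA method, and the function name and all coefficients are hypothetical toy values.

```python
import random


def mean(v):
    return sum(v) / len(v)


def cov(u, v):
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)


def mediation_decomposition(x, m, y):
    """Linear mediation without interaction: return (total, direct, indirect)."""
    sxx, smm, sxm = cov(x, x), cov(m, m), cov(x, m)
    sxy, smy = cov(x, y), cov(m, y)

    total = sxy / sxx                        # slope of y ~ x
    a = sxm / sxx                            # slope of m ~ x
    det = sxx * smm - sxm ** 2               # 2x2 normal equations for y ~ x + m
    direct = (smm * sxy - sxm * smy) / det   # coefficient of x given m
    b = (sxx * smy - sxm * sxy) / det        # coefficient of m given x
    return total, direct, a * b


rng = random.Random(42)
n = 5000
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
# Toy structure: exposure raises the mediator, and the mediator lowers the outcome.
m = [0.8 * xi + rng.gauss(0.0, 1.0) for xi in x]
y = [0.5 * xi - 0.6 * mi + rng.gauss(0.0, 1.0) for xi, mi in zip(x, m)]

total, direct, indirect = mediation_decomposition(x, m, y)
print(f"total={total:.3f}, direct={direct:.3f}, indirect={indirect:.3f}")
```

With these toy coefficients the direct effect is positive and the indirect effect is negative, so the total effect is close to zero even though both pathways are strong. That is the situation in which a total-effect estimate alone is uninformative.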

Annals of Applied Statistics, vol. 18, no. 3, pp. 2656-2677. Journal Article. PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11845245/pdf/
Citations: 0
A bootstrap model comparison test for identifying genes with context-specific patterns of genetic regulation.
IF 1.3; Zone 4 (Mathematics); Q2 Statistics & Probability; Pub Date: 2024-09-01; Epub Date: 2024-08-05; DOI: 10.1214/23-aoas1859
Mykhaylo M Malakhov, Ben Dai, Xiaotong T Shen, Wei Pan

Understanding how genetic variation affects gene expression is essential for a complete picture of the functional pathways that give rise to complex traits. Although numerous studies have established that many genes are differentially expressed in distinct human tissues and cell types, no tools exist for identifying the genes whose expression is differentially regulated. Here we introduce DRAB (differential regulation analysis by bootstrapping), a gene-based method for testing whether patterns of genetic regulation are significantly different between tissues or other biological contexts. DRAB first leverages the elastic net to learn context-specific models of local genetic regulation and then applies a novel bootstrap-based model comparison test to check their equivalency. Unlike previous model comparison tests, our proposed approach can determine whether population-level models have equal predictive performance by accounting for the variability of feature selection and model training. We validated DRAB on mRNA expression data from a variety of human tissues in the Genotype-Tissue Expression (GTEx) Project. DRAB yielded biologically reasonable results and had sufficient power to detect genes with tissue-specific regulatory profiles while effectively controlling false positives. By providing a framework that facilitates the prioritization of differentially regulated genes, our study enables future discoveries on the genetic architecture of molecular phenotypes.
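
The abstract above describes a bootstrap comparison that accounts for the variability of model training. The following is a heavily simplified sketch of that general idea, not the DRAB procedure: it uses a single simulated variant and a simple regression line in place of the elastic net, and it refits both context-specific models inside every bootstrap replicate. All function names, sample sizes, and effect sizes are hypothetical.

```python
import random


def fit_line(g, e):
    """Least-squares intercept and slope of expression e on genotype g."""
    n = len(g)
    g_bar, e_bar = sum(g) / n, sum(e) / n
    sgg = sum((gi - g_bar) ** 2 for gi in g)
    sge = sum((gi - g_bar) * (ei - e_bar) for gi, ei in zip(g, e))
    slope = sge / sgg if sgg > 0 else 0.0
    return e_bar - slope * g_bar, slope


def mse(model, g, e):
    b0, b1 = model
    return sum((ei - b0 - b1 * gi) ** 2 for gi, ei in zip(g, e)) / len(g)


def simulate(n, slope, rng):
    """Toy data: genotype in {0, 1, 2} with allele frequency 0.3, unit-variance noise."""
    out = []
    for _ in range(n):
        g = (rng.random() < 0.3) + (rng.random() < 0.3)
        out.append((g, slope * g + rng.gauss(0.0, 1.0)))
    return out


def resample(pairs, rng):
    return [rng.choice(pairs) for _ in pairs]


def mse_gap(train_a, train_b, test_a):
    """How much worse the context-B model predicts context-A data than the A model."""
    model_a = fit_line(*zip(*train_a))
    model_b = fit_line(*zip(*train_b))
    g, e = zip(*test_a)
    return mse(model_b, g, e) - mse(model_a, g, e)


rng = random.Random(1)
train_a = simulate(300, 1.0, rng)   # variant regulates expression in context A
train_b = simulate(300, 0.0, rng)   # no regulation in context B
test_a = simulate(300, 1.0, rng)

observed_gap = mse_gap(train_a, train_b, test_a)
# Refit both models on resampled training sets so training variability is included.
boot = sorted(
    mse_gap(resample(train_a, rng), resample(train_b, rng), resample(test_a, rng))
    for _ in range(500)
)
ci_low, ci_high = boot[12], boot[487]   # approximate 95% percentile interval
print(f"gap={observed_gap:.3f}, 95% CI=({ci_low:.3f}, {ci_high:.3f})")
```

An interval that excludes zero suggests the two contexts need different regulatory models for this toy gene. Resampling only the test errors, without refitting, would ignore the training variability the abstract emphasizes.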

Annals of Applied Statistics, vol. 18, no. 3, pp. 1840-1857. Journal Article. PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11484521/pdf/
Citations: 0
AN INTEGRATIVE NETWORK-BASED MEDIATION MODEL (NMM) TO ESTIMATE MULTIPLE GENETIC EFFECTS ON OUTCOMES MEDIATED BY FUNCTIONAL CONNECTIVITY.
IF 1.4; Zone 4 (Mathematics); Q2 Statistics & Probability; Pub Date: 2024-09-01; Epub Date: 2024-08-05; DOI: 10.1214/24-aoas1880
Wei Dai, Heping Zhang

Functional connectivity of the brain, characterized by interconnected neural circuits across functional networks, is a cutting-edge feature in neuroimaging. It has the potential to mediate the effect of genetic variants on behavioral outcomes or diseases. Existing mediation analysis methods can evaluate the impact of genetics and brain structure/function on cognitive behavior or disorders, but they tend to be limited to single genetic variants or univariate mediators, without considering cumulative genetic effects and the complex matrix, group, and network structures of functional connectivity. To address this gap, the paper presents an integrative network-based mediation model (NMM) that estimates the effect of multiple genetic variants on behavioral outcomes or diseases mediated by functional connectivity. The model incorporates group information of inter-regions at the broad network level and imposes low-rank and sparse assumptions to reflect the complex structures of functional connectivity while simultaneously selecting network mediators. We adopt a block coordinate descent algorithm to implement a fast and efficient solution to our model. Simulation results indicate the efficacy of the model in selecting active mediators and reducing bias in effect estimation. With application to the Human Connectome Project Young Adult (HCP-YA) study of 493 young adults, two genetic variants (rs769448 and rs769449) on the APOE4 gene are identified that lead to deficits in functional connectivity within visual networks and fluid intelligence.

Annals of Applied Statistics, vol. 18, no. 3, pp. 2277-2294. Journal Article. PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11616023/pdf/
Citations: 0
PATIENT RECRUITMENT USING ELECTRONIC HEALTH RECORDS UNDER SELECTION BIAS: A TWO-PHASE SAMPLING FRAMEWORK.
IF 1.3; Zone 4 (Mathematics); Q2 Statistics & Probability; Pub Date: 2024-09-01; Epub Date: 2024-08-05; DOI: 10.1214/23-aoas1860
Guanghao Zhang, Lauren J Beesley, Bhramar Mukherjee, X U Shi

Electronic health records (EHRs) are increasingly recognized as a cost-effective resource for patient recruitment in clinical research. However, how to optimally select a cohort from millions of individuals to answer a scientific question of interest remains unclear. Consider a study to estimate the mean or mean difference of an expensive outcome. Inexpensive auxiliary covariates predictive of the outcome may often be available in patients' health records, presenting an opportunity to recruit patients selectively, which may improve efficiency in downstream analyses. In this paper we propose a two-phase sampling design that leverages available information on auxiliary covariates in EHR data. A key challenge in using EHR data for multiphase sampling is the potential selection bias, because EHR data are not necessarily representative of the target population. Extending existing literature on two-phase sampling design, we derive an optimal two-phase sampling method that improves efficiency over random sampling while accounting for the potential selection bias in EHR data. We demonstrate the efficiency gain from our sampling design via simulation studies and an application evaluating the prevalence of hypertension among U.S. adults leveraging data from the Michigan Genomics Initiative, a longitudinal biorepository in Michigan Medicine.
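
The two-phase idea in the abstract above, using a cheap auxiliary covariate to decide whom to measure and then correcting for unequal selection, can be sketched with inverse-probability weighting. This toy example assumes the phase-two selection probabilities are known by design and uses a Hajek-type weighted mean; it does not implement the authors' optimal design or their correction for EHR selection bias. All variable names and numbers are hypothetical.

```python
import random

rng = random.Random(7)

# Phase 1: a large record pool with a cheap auxiliary covariate x for everyone.
N = 20000
x = [rng.random() for _ in range(N)]
# The expensive outcome y is simulated for everyone only so the sketch can check itself.
y = [2.0 + 3.0 * xi + rng.gauss(0.0, 1.0) for xi in x]

# Phase 2: measure y on a subsample, with known probabilities that favor large x.
prob = [0.05 + 0.4 * xi for xi in x]
sampled = [i for i in range(N) if rng.random() < prob[i]]

# Unweighted mean of the subsample ignores the unequal selection.
naive_mean = sum(y[i] for i in sampled) / len(sampled)

# Hajek estimator: weight each sampled unit by the inverse of its selection probability.
weights = [1.0 / prob[i] for i in sampled]
ipw_mean = sum(w * y[i] for w, i in zip(weights, sampled)) / sum(weights)

pool_mean = sum(y) / N
print(f"pool mean={pool_mean:.3f}, naive={naive_mean:.3f}, weighted={ipw_mean:.3f}")
```

Because units with large x are oversampled and y increases with x, the unweighted subsample mean overshoots the pool mean, while the weighted mean stays close to it. Choosing the selection probabilities to minimize the variance of the weighted estimator is the design question the paper addresses.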

Annals of Applied Statistics, vol. 18, no. 3, pp. 1858-1878. Journal Article. PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11323140/pdf/
Citations: 0
A NONPARAMETRIC MIXED-EFFECTS MIXTURE MODEL FOR PATTERNS OF CLINICAL MEASUREMENTS ASSOCIATED WITH COVID-19.
IF 1.3; Zone 4 (Mathematics); Q2 Statistics & Probability; Pub Date: 2024-09-01; Epub Date: 2024-08-05; DOI: 10.1214/23-aoas1871
Xiaoran Ma, Wensheng Guo, Mengyang Gu, Len Usvyat, Peter Kotanko, Yuedong Wang

Some patients with COVID-19 show changes in signs and symptoms such as temperature and oxygen saturation days before being positively tested for SARS-CoV-2, while others remain asymptomatic. It is important to identify these subgroups and to understand what biological and clinical predictors are related to these subgroups. This information will provide insights into how the immune system may respond differently to infection and can further be used to identify infected individuals. We propose a flexible nonparametric mixed-effects mixture model that identifies risk factors and classifies patients with biological changes. We model the latent probability of biological changes using a logistic regression model and trajectories in the latent groups using smoothing splines. We developed an EM algorithm to maximize the penalized likelihood for estimating all parameters and mean functions. We evaluate our methods by simulations and apply the proposed model to investigate changes in temperature in a cohort of COVID-19-infected hemodialysis patients.
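
The EM algorithm mentioned in the abstract above alternates between computing each patient's probability of belonging to a latent group and updating the group parameters. The sketch below shows those two steps for a plain two-component Gaussian mixture on scalar values. It is a parametric stand-in, assuming a constant mixing probability and no covariates, and does not include the logistic regression, smoothing splines, or penalized likelihood of the proposed model. The simulated "temperature change" values are hypothetical.

```python
import math
import random


def normal_pdf(v, mu, sd):
    return math.exp(-0.5 * ((v - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def em_two_groups(values, n_iter=100):
    """EM for a two-component Gaussian mixture; returns (p1, means, sds)."""
    mu = [min(values), max(values)]   # crude but well-separated starting means
    sd = [1.0, 1.0]
    p1 = 0.5                          # mixing proportion of component 1
    for _ in range(n_iter):
        # E-step: posterior probability that each value belongs to component 1.
        resp = []
        for v in values:
            f0 = (1.0 - p1) * normal_pdf(v, mu[0], sd[0])
            f1 = p1 * normal_pdf(v, mu[1], sd[1])
            resp.append(f1 / (f0 + f1))
        # M-step: weighted updates of the proportion, means and standard deviations.
        n1 = sum(resp)
        n0 = len(values) - n1
        p1 = n1 / len(values)
        mu[0] = sum((1.0 - r) * v for r, v in zip(resp, values)) / n0
        mu[1] = sum(r * v for r, v in zip(resp, values)) / n1
        sd[0] = math.sqrt(sum((1.0 - r) * (v - mu[0]) ** 2 for r, v in zip(resp, values)) / n0)
        sd[1] = math.sqrt(sum(r * (v - mu[1]) ** 2 for r, v in zip(resp, values)) / n1)
    return p1, mu, sd


rng = random.Random(3)
# Toy data: 700 patients with no change and 300 with a rise of about 1.5 units.
values = [rng.gauss(0.0, 0.3) for _ in range(700)] + [rng.gauss(1.5, 0.3) for _ in range(300)]

p1, mu, sd = em_two_groups(values)
print(f"share with change={p1:.3f}, group means=({mu[0]:.3f}, {mu[1]:.3f})")
```

In the model described in the abstract, the constant `p1` would be replaced by a logistic regression on patient covariates and the scalar means by smooth trajectories over time.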

Annals of Applied Statistics, vol. 18, no. 3, pp. 2080-2095. Journal Article. PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11460989/pdf/
Citations: 0
OUTCOME-GUIDED DISEASE SUBTYPING BY GENERATIVE MODEL AND WEIGHTED JOINT LIKELIHOOD IN TRANSCRIPTOMIC APPLICATIONS.
IF 1.4; Zone 4 (Mathematics); Q2 Statistics & Probability; Pub Date: 2024-09-01; Epub Date: 2024-08-05; DOI: 10.1214/23-aoas1865
Yujia Li, Peng Liu, Wenjia Wang, Wei Zong, Yusi Fang, Zhao Ren, Lu Tang, Juan C Celedón, Steffi Oesterreich, George C Tseng

With advances in high-throughput technology, molecular disease subtyping by high-dimensional omics data has been recognized as an effective approach for identifying subtypes of complex diseases with distinct disease mechanisms and prognoses. Conventional cluster analysis takes omics data as input and generates patient clusters with similar gene expression patterns. The omics data, however, usually contain multi-faceted cluster structures that can be defined by different sets of genes. If the gene set associated with irrelevant clinical variables (e.g., sex or age) dominates the clustering process, the resulting clusters may not capture clinically meaningful disease subtypes. This motivates the development, in this paper, of a clustering framework with guidance from a pre-specified disease outcome, such as lung function measurement or survival. We propose two disease subtyping methods by omics data with outcome guidance using a generative model or a weighted joint likelihood. Both methods connect an outcome association model and a disease subtyping model by a latent variable of cluster labels. Compared to the generative model, weighted joint likelihood contains a data-driven weight parameter to balance the likelihood contributions from outcome association and gene cluster separation, which improves generalizability in independent validation but requires heavier computing. Extensive simulations and two real applications in lung disease and triple-negative breast cancer demonstrate superior disease subtyping performance of the outcome-guided clustering methods in terms of disease subtyping accuracy, gene selection and outcome association. Unlike existing clustering methods, the outcome-guided disease subtyping framework creates a new precision medicine paradigm to directly identify patient subgroups with clinical association.

Annals of Applied Statistics, vol. 18, no. 3, pp. 1947-1964. Journal Article. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12309773/pdf/
Citations: 0
QUANTILE REGRESSION DECOMPOSITION ANALYSIS OF DISPARITY RESEARCH USING COMPLEX SURVEY DATA: APPLICATION TO DISPARITIES IN BMI AND TELOMERE LENGTH BETWEEN U.S. MINORITY AND WHITE POPULATION GROUPS.
IF 1.4 Zone 4, Mathematics Q2 STATISTICS & PROBABILITY Pub Date: 2024-09-01 Epub Date: 2024-08-05 DOI: 10.1214/23-aoas1868
Hyokyoung G Hong, Barry I Graubard, Joseph L Gastwirth, Mi-Ok Kim

We develop a quantile regression decomposition (QRD) method for analyzing observed disparities (OD) between population groups in socioeconomic and health-related outcomes for complex survey data. The conventional decomposition approaches use the conditional mean regression to decompose the disparity into two parts: the part explained by differences in the distributions of the explanatory covariates, and the remaining part, which is unexplained by the covariates. Many socioeconomic and health outcomes exhibit heteroscedastic distributions, where the magnitude of observed disparities varies across different quantiles of these outcomes. Thus, differences in the explanatory covariates may account for varying differences in the OD across the quantiles of the outcome. The QRD can identify where there are greater differences in the outcome distribution, for example, at the 90th percentile, and how important the covariates are in explaining those differences. Much socioeconomic and health research relies on complex surveys, such as the National Health and Nutrition Examination Survey (NHANES), that oversample individuals from disadvantaged/minority population groups in order to provide improved precision. QRD has not been extended to the complex survey setting. We improve the QRD approach proposed in Machado and Mata (2005) to yield more reliable estimates at quantiles where the data are sparse, and extend it to the complex survey setting. We also propose a perturbation-based variance estimation method. Simulation studies indicate that the estimates of the unexplained portions of the OD across quantiles are unbiased and that the coverage of the confidence intervals is close to the nominal value. This methodology is used to study disparities in body mass index (BMI) and telomere length between race/ethnic groups estimated from the NHANES data.
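
To make the "explained" and "unexplained" parts concrete, here is a minimal sketch of the conventional mean-based decomposition that the abstract contrasts with QRD. It is not the authors' QRD method: it uses a single covariate, ordinary least squares, and no survey weights. The function names `ols_line` and `mean_decomposition`, the toy data, and the choice of group B's slope as the reference coefficient are all assumptions made for illustration.

```python
def ols_line(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mean_decomposition(x_a, y_a, x_b, y_b):
    """Split the mean outcome gap between groups A and B into two parts.

    explained   = (mean_x_A - mean_x_B) * slope_B
    unexplained = (intercept_A - intercept_B) + mean_x_A * (slope_A - slope_B)
    Using group B's slope as the reference is an illustrative assumption.
    """
    int_a, slope_a = ols_line(x_a, y_a)
    int_b, slope_b = ols_line(x_b, y_b)
    mean_x_a = sum(x_a) / len(x_a)
    mean_x_b = sum(x_b) / len(x_b)
    observed = sum(y_a) / len(y_a) - sum(y_b) / len(y_b)
    explained = (mean_x_a - mean_x_b) * slope_b
    unexplained = (int_a - int_b) + mean_x_a * (slope_a - slope_b)
    return observed, explained, unexplained


# Toy data (invented): group A follows y = 1 + 2x, group B follows y = 1 + x.
observed, explained, unexplained = mean_decomposition(
    [1, 2, 3, 4], [3, 5, 7, 9],
    [0, 1, 2, 3], [1, 2, 3, 4],
)
print(observed, explained, unexplained)
```

Because each fitted line passes through its group's means, the two parts add up to the observed gap exactly. A quantile-based decomposition replaces the group means with quantiles of the outcome distribution, so the split can differ from one quantile to another.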

Annals of Applied Statistics, vol. 18, no. 3, pp. 2012-2033. Journal Article. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12456447/pdf/
Citations: 0
JOINT MODELING OF MULTISTATE AND NONPARAMETRIC MULTIVARIATE LONGITUDINAL DATA.
IF 1.8 Zone 4, Mathematics Q2 STATISTICS & PROBABILITY Pub Date: 2024-08-05 DOI: 10.1214/24-aoas1889
Lu You, Falastin Salami, Carina Törn, Åke Lernmark, Roy Tamura
It is oftentimes the case in studies of disease progression that subjects can move into one of several disease states of interest. Multistate models are an indispensable tool to analyze data from such studies. The Environmental Determinants of Diabetes in the Young (TEDDY) is an observational study of at-risk children from birth to onset of type-1 diabetes (T1D) up through the age of 15. A joint model for simultaneous inference of multistate and multivariate nonparametric longitudinal data is proposed to analyze data and answer the research questions brought up in the study. The proposed method allows us to make statistical inferences, test hypotheses, and make predictions about future state occupation in the TEDDY study. The performance of the proposed method is evaluated by simulation studies. The proposed method is applied to the motivating example to demonstrate the capabilities of the method.
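
As a rough illustration of what "predictions about future state occupation" means, the sketch below propagates state occupation probabilities through a discrete-time Markov multistate model. This is a deliberate simplification and not the joint model proposed in the paper: the three unnamed states, the transition probabilities, and the function name `occupation_probabilities` are all hypothetical.

```python
def occupation_probabilities(start, transition, n_steps):
    """Return the probability of occupying each state after n_steps.

    start      : list of initial state probabilities (sums to 1)
    transition : square matrix, transition[i][j] = P(move from state i to j)
    """
    probs = list(start)
    n_states = len(probs)
    for _ in range(n_steps):
        # One step of the chain: new_probs[j] = sum_i probs[i] * transition[i][j]
        probs = [
            sum(probs[i] * transition[i][j] for i in range(n_states))
            for j in range(n_states)
        ]
    return probs


# Hypothetical three-state progressive model; state 2 is absorbing.
transition = [
    [0.90, 0.08, 0.02],
    [0.00, 0.85, 0.15],
    [0.00, 0.00, 1.00],
]
print(occupation_probabilities([1.0, 0.0, 0.0], transition, 2))
```

In a joint model of the kind the abstract describes, the transition behaviour would additionally depend on longitudinal measurements, so the matrix would not be a fixed constant as it is here.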
Annals of Applied Statistics, vol. 21, no. 1, pp. 2444-2461 (as listed by the source). Journal Article. https://doi.org/10.1214/24-aoas1889
Citations: 0
ACCURATE ESTIMATION OF RARE CELL-TYPE FRACTIONS FROM TISSUE OMICS DATA VIA HIERARCHICAL DECONVOLUTION.
IF 1.4 Zone 4, Mathematics Q2 STATISTICS & PROBABILITY Pub Date: 2024-06-01 Epub Date: 2024-04-05 DOI: 10.1214/23-aoas1829
Penghui Huang, Manqi Cai, Xinghua Lu, Chris McKennan, Jiebiao Wang

Bulk transcriptomics in tissue samples reflects the average expression levels across different cell types and is highly influenced by cellular fractions. As such, it is critical to estimate cellular fractions to both deconfound differential expression analyses and infer cell type-specific differential expression. Since experimentally counting cells is infeasible in most tissues and studies, in silico cellular deconvolution methods have been developed as an alternative. However, existing methods are designed for tissues consisting of clearly distinguishable cell types and have difficulties estimating highly correlated or rare cell types. To address this challenge, we propose hierarchical deconvolution (HiDecon) that uses single-cell RNA sequencing references and a hierarchical cell-type tree, which models the similarities among cell types and cell differentiation relationships, to estimate cellular fractions in bulk data. By coordinating cell fractions across layers of the hierarchical tree, cellular fraction information is passed up and down the tree, which helps correct estimation biases by pooling information across related cell types. The flexible hierarchical tree structure also enables estimating rare cell fractions by splitting the tree to higher resolutions. Through simulations and real data applications with the ground truth of measured cellular fractions, we demonstrate that HiDecon outperforms existing methods and accurately estimates cellular fractions. Finally, we show the utility of HiDecon estimates in identifying the associations between cellular fractions and Alzheimer's disease.
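
For readers unfamiliar with cellular deconvolution, the following is a minimal single-layer sketch of the basic idea: bulk expression is modelled as a mixture of cell-type signature profiles, and the mixing fractions are estimated by non-negative least squares. It is not HiDecon and uses no hierarchical tree. The function name `deconvolve`, the projected-gradient solver, and the toy signature matrix are assumptions chosen for illustration.

```python
def deconvolve(signature, bulk, n_iter=5000):
    """Estimate cell-type fractions p >= 0 such that signature @ p ~ bulk.

    signature : list of rows, signature[g][k] = expression of gene g in type k
    bulk      : list of bulk expression values, one per gene
    Solved by projected gradient descent on 0.5 * ||signature @ p - bulk||^2.
    """
    n_genes = len(signature)
    n_types = len(signature[0])
    # Step size 1 / ||S||_F^2; the squared Frobenius norm bounds the Lipschitz constant.
    step = 1.0 / sum(v * v for row in signature for v in row)
    p = [1.0 / n_types] * n_types
    for _ in range(n_iter):
        resid = [
            sum(signature[g][k] * p[k] for k in range(n_types)) - bulk[g]
            for g in range(n_genes)
        ]
        grad = [
            sum(signature[g][k] * resid[g] for g in range(n_genes))
            for k in range(n_types)
        ]
        # Gradient step followed by projection onto the non-negative orthant.
        p = [max(0.0, p[k] - step * grad[k]) for k in range(n_types)]
    total = sum(p)
    # Rescale so that the estimated fractions sum to one.
    return [v / total for v in p] if total > 0 else p


# Toy example (invented numbers): 4 genes, 3 cell types, one rare type.
signature = [[10, 1, 1], [1, 8, 1], [1, 1, 6], [2, 2, 2]]
true_fractions = [0.6, 0.3, 0.1]
bulk = [sum(s * f for s, f in zip(row, true_fractions)) for row in signature]
print(deconvolve(signature, bulk))
```

With noise-free data and well-separated signatures the fractions are recovered closely. The difficulty the abstract points to arises when signature columns are highly correlated or a fraction is very small, which is where pooling information across related cell types is intended to help.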

Annals of Applied Statistics, vol. 18, no. 2, pp. 1178-1194. Journal Article. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12530111/pdf/
Citations: 0