Pub Date: 2024-09-01 | Epub Date: 2024-08-05 | DOI: 10.1214/23-aoas1865
Yujia Li, Peng Liu, Wenjia Wang, Wei Zong, Yusi Fang, Zhao Ren, Lu Tang, Juan C Celedón, Steffi Oesterreich, George C Tseng
With advances in high-throughput technology, molecular disease subtyping by high-dimensional omics data has been recognized as an effective approach for identifying subtypes of complex diseases with distinct disease mechanisms and prognoses. Conventional cluster analysis takes omics data as input and generates patient clusters with similar gene expression patterns. The omics data, however, usually contain multi-faceted cluster structures that can be defined by different sets of genes. If a gene set associated with irrelevant clinical variables (e.g., sex or age) dominates the clustering process, the resulting clusters may not capture clinically meaningful disease subtypes. This motivates the clustering framework developed in this paper, which is guided by a pre-specified disease outcome, such as a lung function measurement or survival. We propose two outcome-guided disease subtyping methods for omics data, using a generative model or a weighted joint likelihood. Both methods connect an outcome association model and a disease subtyping model through a latent variable of cluster labels. Compared to the generative model, the weighted joint likelihood contains a data-driven weight parameter that balances the likelihood contributions from outcome association and gene cluster separation, which improves generalizability in independent validation but requires heavier computing. Extensive simulations and two real applications in lung disease and triple-negative breast cancer demonstrate superior disease subtyping performance of the outcome-guided clustering methods in terms of disease subtyping accuracy, gene selection, and outcome association. Unlike existing clustering methods, the outcome-guided disease subtyping framework creates a new precision medicine paradigm that directly identifies patient subgroups with clinical association.
OUTCOME-GUIDED DISEASE SUBTYPING BY GENERATIVE MODEL AND WEIGHTED JOINT LIKELIHOOD IN TRANSCRIPTOMIC APPLICATIONS
Annals of Applied Statistics 18(3): 1947-1964. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12309773/pdf/
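The weighted joint likelihood idea can be illustrated with a toy objective. A minimal sketch, assuming Gaussian cluster and outcome terms; the function name, data shapes, and weight value are illustrative, not the paper's specification:

```python
import numpy as np

def weighted_joint_loglik(X, y, labels, w):
    """Toy weighted joint log-likelihood: w balances the outcome-association
    term against the gene-cluster-separation term (illustrative only)."""
    ll_cluster = ll_outcome = 0.0
    for k in np.unique(labels):
        idx = labels == k
        mu = X[idx].mean(axis=0)            # cluster-specific gene means
        ll_cluster += -0.5 * ((X[idx] - mu) ** 2).sum()
        nu = y[idx].mean()                  # cluster-specific outcome mean
        ll_outcome += -0.5 * ((y[idx] - nu) ** 2).sum()
    return w * ll_outcome + (1 - w) * ll_cluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(3, 1, (20, 5))])
y = np.concatenate([rng.normal(0, 1, 20), rng.normal(5, 1, 20)])
outcome_guided = np.repeat([0, 1], 20)      # separates both genes and outcome
shuffled = rng.permutation(outcome_guided)  # ignores both
ll_good = weighted_joint_loglik(X, y, outcome_guided, 0.5)
ll_bad = weighted_joint_loglik(X, y, shuffled, 0.5)
```

A labeling that separates both the omics features and the outcome scores higher than a shuffled one, which is the behavior the weight parameter trades off.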
Pub Date: 2024-09-01 | Epub Date: 2024-08-05 | DOI: 10.1214/23-aoas1871
Xiaoran Ma, Wensheng Guo, Mengyang Gu, Len Usvyat, Peter Kotanko, Yuedong Wang
Some patients with COVID-19 show changes in signs and symptoms, such as temperature and oxygen saturation, days before testing positive for SARS-CoV-2, while others remain asymptomatic. It is important to identify these subgroups and to understand what biological and clinical predictors are related to them. This information will provide insights into how the immune system may respond differently to infection and can further be used to identify infected individuals. We propose a flexible nonparametric mixed-effects mixture model that identifies risk factors and classifies patients with biological changes. We model the latent probability of biological changes using a logistic regression model and the trajectories within the latent groups using smoothing splines. We develop an EM algorithm to maximize the penalized likelihood for estimating all parameters and mean functions. We evaluate our methods by simulations and apply the proposed model to investigate changes in temperature in a cohort of COVID-19-infected hemodialysis patients.
A NONPARAMETRIC MIXED-EFFECTS MIXTURE MODEL FOR PATTERNS OF CLINICAL MEASUREMENTS ASSOCIATED WITH COVID-19
Annals of Applied Statistics 18(3): 2080-2095. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11460989/pdf/
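The E-step/M-step alternation described above can be sketched for a two-group trajectory mixture. Plain group mean curves stand in for the paper's penalized smoothing splines, and the bump shape, noise level, and sample sizes are invented:

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 30)
n = 40
z_true = rng.integers(0, 2, n)            # latent "biological change" indicator
bump = np.sin(np.pi * t)                  # pre-diagnosis temperature bump
Y = z_true[:, None] * bump + rng.normal(0, 0.3, (n, t.size))

# EM: alternate soft assignment (E-step) and mean-curve updates (M-step)
pi, m0, m1 = 0.5, np.zeros_like(t), np.full_like(t, 0.1)
for _ in range(50):
    ll0 = -0.5 * ((Y - m0) ** 2).sum(axis=1)
    ll1 = -0.5 * ((Y - m1) ** 2).sum(axis=1)
    # responsibility of the "change" group for each subject
    r = 1.0 / (1.0 + np.exp(np.log(1 - pi) + ll0 - np.log(pi) - ll1))
    pi = r.mean()
    m0 = (1 - r) @ Y / (1 - r).sum()
    m1 = r @ Y / r.sum()

pred = (r > 0.5).astype(int)
acc = max((pred == z_true).mean(), (pred != z_true).mean())  # label-switch safe
```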
Pub Date: 2024-08-05 | DOI: 10.1214/24-aoas1889
L U You, Falastin Salami, Carina Törn, Åke Lernmark, Roy Tamura
In studies of disease progression, subjects often move through one of several disease states of interest, and multistate models are an indispensable tool for analyzing data from such studies. The Environmental Determinants of Diabetes in the Young (TEDDY) study is an observational study that follows at-risk children from birth to the onset of type 1 diabetes (T1D), up through the age of 15. A joint model for simultaneous inference of multistate and multivariate nonparametric longitudinal data is proposed to analyze the data and answer the research questions raised in the study. The proposed method allows us to make statistical inferences, test hypotheses, and make predictions about future state occupation in the TEDDY study. The performance of the proposed method is evaluated by simulation studies, and the method is applied to the motivating example to demonstrate its capabilities.
JOINT MODELING OF MULTISTATE AND NONPARAMETRIC MULTIVARIATE LONGITUDINAL DATA
Annals of Applied Statistics, pp. 2444-2461.
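A discrete-time toy version of the multistate component conveys the core idea: estimate a transition matrix from observed state sequences, then predict future state occupation. The three states, chain length, and transition probabilities below are invented for illustration; the paper's model is far richer and is fit jointly with the longitudinal data:

```python
import numpy as np

rng = np.random.default_rng(2)
# toy 3-state chain (healthy -> at-risk -> T1D), with T1D absorbing
P_true = np.array([[0.90, 0.08, 0.02],
                   [0.00, 0.85, 0.15],
                   [0.00, 0.00, 1.00]])
n, T = 500, 10
states = np.zeros((n, T), dtype=int)
for i in range(n):
    for t in range(1, T):
        states[i, t] = rng.choice(3, p=P_true[states[i, t - 1]])

# MLE of the transition matrix: row-normalized transition counts
C = np.zeros((3, 3))
for prev, nxt in zip(states[:, :-1].ravel(), states[:, 1:].ravel()):
    C[prev, nxt] += 1
P_hat = C / C.sum(axis=1, keepdims=True)

# predicted state-occupation probabilities five visits ahead, from healthy
occ5 = np.linalg.matrix_power(P_hat, 5)[0]
```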
Pub Date: 2024-06-01 | Epub Date: 2024-04-05 | DOI: 10.1214/23-aoas1829
Penghui Huang, Manqi Cai, Xinghua Lu, Chris McKennan, Jiebiao Wang
Bulk transcriptomics in tissue samples reflects the average expression levels across different cell types and is highly influenced by cellular fractions. As such, it is critical to estimate cellular fractions both to deconfound differential expression analyses and to infer cell type-specific differential expression. Since experimentally counting cells is infeasible in most tissues and studies, in silico cellular deconvolution methods have been developed as an alternative. However, existing methods are designed for tissues consisting of clearly distinguishable cell types and have difficulties estimating highly correlated or rare cell types. To address this challenge, we propose hierarchical deconvolution (HiDecon), which uses single-cell RNA sequencing references and a hierarchical cell-type tree that models the similarities among cell types and their differentiation relationships to estimate cellular fractions in bulk data. By coordinating cell fractions across layers of the hierarchical tree, cellular fraction information is passed up and down the tree, which helps correct estimation biases by pooling information across related cell types. The flexible hierarchical tree structure also enables estimating rare cell fractions by splitting the tree to higher resolutions. Through simulations and real data applications with ground-truth measured cellular fractions, we demonstrate that HiDecon outperforms existing methods and accurately estimates cellular fractions.
Finally, we show the utility of HiDecon estimates in identifying associations between cellular fractions and Alzheimer's disease.
ACCURATE ESTIMATION OF RARE CELL-TYPE FRACTIONS FROM TISSUE OMICS DATA VIA HIERARCHICAL DECONVOLUTION
Annals of Applied Statistics 18(2): 1178-1194. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12530111/pdf/
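Why pooling correlated cell types into a parent node helps can be sketched with ordinary least-squares deconvolution. The signature matrix, fractions, and noise level below are invented, and this shows only the tree-aggregation idea, not HiDecon's coordination across layers:

```python
import numpy as np

rng = np.random.default_rng(3)
# hypothetical signature matrix: 200 genes x 4 cell types, with types
# 2 and 3 nearly collinear (e.g. two closely related subtypes)
S = rng.gamma(2.0, 1.0, (200, 4))
S[:, 3] = S[:, 2] + rng.normal(0, 0.05, 200)
f_true = np.array([0.5, 0.3, 0.15, 0.05])
bulk = S @ f_true + rng.normal(0, 0.1, 200)

# leaf-level least squares: individual estimates of the collinear pair
# are unstable, but their sum is well identified
f_hat = np.linalg.lstsq(S, bulk, rcond=None)[0]

# coarse level: average the collinear pair into one parent profile,
# whose coefficient estimates the pair's combined fraction
S_coarse = np.column_stack([S[:, 0], S[:, 1], 0.5 * (S[:, 2] + S[:, 3])])
g_hat = np.linalg.lstsq(S_coarse, bulk, rcond=None)[0]
```

The combined fraction of the correlated pair (true value 0.2) is recovered both by summing the leaf estimates and by the parent-node coefficient, even though the individual leaf estimates are unreliable.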
Pub Date: 2024-06-01 | Epub Date: 2024-04-05 | DOI: 10.1214/23-aoas1826
Zehang Richard Li, Zhenke Wu, Irena Chen, Samuel J Clark
Understanding cause-specific mortality rates is crucial for monitoring population health and designing public health interventions. Worldwide, two-thirds of deaths do not have a cause assigned. Verbal autopsy (VA) is a well-established tool for collecting information describing deaths outside of hospitals by surveying caregivers of a deceased person; it is routinely implemented in many low- and middle-income countries. Statistical algorithms that assign cause of death using VAs are typically vulnerable to the distribution shift between the data used to train the model and the target population. This presents a major challenge for analyzing VAs, as labeled data are usually unavailable in the target population. This article proposes a latent class model framework for VA data (LCVA) that jointly models VAs collected over multiple heterogeneous domains, assigns causes of death for out-of-domain observations, and estimates cause-specific mortality fractions for a new domain. We introduce a parsimonious representation of the joint distribution of the collected symptoms using nested latent class models and develop a computationally efficient algorithm for posterior inference. We demonstrate that LCVA outperforms existing methods in predictive performance and scalability. Supplementary Material and reproducible analysis code are available online. The R package LCVA implementing the method is available on GitHub (https://github.com/richardli/LCVA).
BAYESIAN NESTED LATENT CLASS MODELS FOR CAUSE-OF-DEATH ASSIGNMENT USING VERBAL AUTOPSIES ACROSS MULTIPLE DOMAINS
Annals of Applied Statistics 18(2): 1137-1159. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11484295/pdf/
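A minimal latent class model with conditionally independent binary symptoms illustrates the building block that LCVA nests; the class count, symptom probabilities, and sample size below are invented, and LCVA's nested structure and multi-domain modeling are not shown:

```python
import numpy as np

rng = np.random.default_rng(4)
# two hypothetical causes of death, 10 binary VA symptoms
theta_true = np.array([np.full(10, 0.8), np.full(10, 0.2)])
z = rng.integers(0, 2, 600)
X = (rng.random((600, 10)) < theta_true[z]).astype(float)

# EM for a two-class latent class model (conditional independence)
pi, theta = 0.5, np.array([np.full(10, 0.6), np.full(10, 0.4)])
for _ in range(100):
    # E-step: class responsibilities from symptom likelihoods
    ll = X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
    ll += np.log(np.array([pi, 1 - pi]))
    r = np.exp(ll - ll.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    # M-step: class prevalence and symptom probabilities
    pi = r[:, 0].mean()
    theta = np.clip((r.T @ X) / r.sum(axis=0)[:, None], 1e-6, 1 - 1e-6)

# recovery error, accounting for label switching
err = min(np.abs(theta - theta_true).max(),
          np.abs(theta - theta_true[::-1]).max())
```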
Pub Date: 2024-06-01 | Epub Date: 2024-04-05 | DOI: 10.1214/23-aoas1850
Bora Jin, Amy H Herring, David Dunson
In this paper we predict sea surface salinity (SSS) in the Arctic Ocean from satellite measurements. SSS is a crucial indicator of ongoing changes in the Arctic Ocean and can offer important insights into climate change. We focus in particular on areas of water mistakenly flagged as ice by satellite algorithms: to remove bias in the retrieval of salinity near sea ice, the algorithms use conservative ice masks, which result in considerable loss of data. We aim to produce realistic SSS values for such regions to obtain a more complete understanding of the SSS surface over the Arctic Ocean and to benefit future applications that may require SSS measurements near the edges of sea ice or coasts. We propose a class of scalable nonstationary processes that can handle large data from satellite products and the complex geometry of the Arctic Ocean. The barrier overlap-removal acyclic directed graph GP (BORA-GP) constructs sparse directed acyclic graphs (DAGs) with neighbors conforming to barriers and boundaries, enabling characterization of dependence in constrained domains. The BORA-GP models produce more sensible SSS values in regions without satellite measurements and, in simulation studies, show improved performance in various constrained domains compared to state-of-the-art alternatives. An R package is available at https://github.com/jinbora0720/boraGP.
SPATIAL PREDICTIONS ON PHYSICALLY CONSTRAINED DOMAINS: APPLICATIONS TO ARCTIC SEA SALINITY DATA
Annals of Applied Statistics 18(2): 1596-1617. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12391905/pdf/
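The "neighbors conforming to barriers" idea can be sketched with a planar segment-intersection test: a candidate neighbor is kept only if the straight path to it does not cross the barrier. This is a toy geometric check on invented points, not BORA-GP's actual DAG construction:

```python
import numpy as np

def crosses(p, q, a, b):
    """True if segment p-q properly intersects barrier segment a-b."""
    def ccw(u, v, w):
        return (w[1] - u[1]) * (v[0] - u[0]) > (v[1] - u[1]) * (w[0] - u[0])
    return ccw(p, a, b) != ccw(q, a, b) and ccw(p, q, a) != ccw(p, q, b)

def barrier_neighbors(x, pts, barrier, k):
    """Indices of the k nearest points whose straight path to x
    does not cross the barrier segment."""
    order = np.argsort(np.linalg.norm(pts - x, axis=1))
    keep = [i for i in order if not crosses(x, pts[i], *barrier)]
    return keep[:k]

# a vertical "wall" at x=0 between y=-1 and y=1 (like a land barrier)
barrier = (np.array([0.0, -1.0]), np.array([0.0, 1.0]))
pts = np.array([[-0.1, 0.0],   # nearest in distance, but behind the wall
                [0.1, 0.1],
                [0.2, -0.2],
                [0.1, 2.0]])
nb = barrier_neighbors(np.array([0.3, 0.0]), pts, barrier, 2)
```

The closest point in Euclidean distance (index 0) is excluded because the wall separates it from the prediction location, so dependence is not borrowed "through land".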
Pub Date: 2024-06-01 | Epub Date: 2024-04-05 | DOI: 10.1214/23-aoas1835
Zihuan Liu, Cheuk Yin Lee, Heping Zhang
Neuroimaging studies often involve predicting a scalar outcome from an array of images, collectively called a tensor. Magnetic resonance imaging (MRI) provides a unique opportunity to investigate the structures of the brain. To learn the association between MRI images and human intelligence, we formulate a scalar-on-image quantile regression framework. However, the high dimensionality of the tensor makes estimating the coefficients for all of its elements computationally challenging. To address this, we propose a low-rank coefficient array estimation algorithm based on tensor train (TT) decomposition, which we demonstrate can effectively reduce the dimensionality of the coefficient tensor to a feasible level while maintaining an adequate fit to the data. Our method is more stable and efficient than the commonly used method based on Canonical Polyadic (CP) rank approximation. We also propose a generalized Lasso penalty on the coefficient tensor to take advantage of the spatial structure of the tensor, further reduce its dimensionality, and improve the interpretability of the model. The consistency and asymptotic normality of the TT estimator are established under mild conditions on the covariates and random errors in quantile regression models. The rate of convergence is obtained with regularization under the total variation penalty. Extensive numerical studies, including both synthetic and real MRI imaging data, are conducted to examine the empirical performance of the proposed method and its competitors.
TENSOR QUANTILE REGRESSION WITH LOW-RANK TENSOR TRAIN ESTIMATION
Annals of Applied Statistics 18(2): 1294-1318. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11046526/pdf/
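The TT decomposition underlying the estimator can be sketched via sequential truncated SVDs (the generic TT-SVD construction). This is an illustration of the TT format only, not the authors' estimation algorithm, which couples the TT structure with a quantile loss and a generalized Lasso penalty:

```python
import numpy as np

def tt_svd(T, ranks):
    """Factor a d-way array into tensor-train cores by sequential
    truncated SVDs (generic TT-SVD sketch)."""
    cores, r_prev = [], 1
    dims = T.shape
    M = T.reshape(dims[0], -1)
    for k, n_k in enumerate(dims[:-1]):
        M = M.reshape(r_prev * n_k, -1)
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        r = min(ranks[k], len(s))
        cores.append(U[:, :r].reshape(r_prev, n_k, r))
        M = s[:r, None] * Vt[:r]     # carry the remainder forward
        r_prev = r
    cores.append(M.reshape(r_prev, dims[-1], 1))
    return cores

def tt_reconstruct(cores):
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=([out.ndim - 1], [0]))
    return out.reshape([c.shape[1] for c in cores])

rng = np.random.default_rng(7)
# build an exactly TT-rank-(2,2) tensor, then recover it
G = [rng.normal(size=(1, 4, 2)), rng.normal(size=(2, 5, 2)),
     rng.normal(size=(2, 6, 1))]
T = tt_reconstruct(G)
cores = tt_svd(T, ranks=[2, 2])
T_hat = tt_reconstruct(cores)
```

A 4x5x6 array with 120 entries is represented by cores holding 8 + 20 + 12 = 40 numbers, and an exactly low-TT-rank tensor is recovered without loss.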
Pub Date: 2024-06-01 | Epub Date: 2024-04-05 | DOI: 10.1214/23-aoas1838
Sunyi Chi, Christopher R Flowers, Ziyi Li, Xuelin Huang, Peng Wei
Environmental exposures such as cigarette smoking influence health outcomes through intermediate molecular phenotypes, such as the methylome, transcriptome, and metabolome. Mediation analysis is a useful tool for investigating the role of potentially high-dimensional intermediate phenotypes in the relationship between environmental exposures and health outcomes. However, little work has been done on mediation analysis when the mediators are high-dimensional and the outcome is a survival endpoint, and none of it has provided a robust measure of the total mediation effect. To this end, we propose an estimation procedure for Mediation Analysis of Survival outcome and High-dimensional omics mediators (MASH), based on sure independence screening for putative mediator variable selection and a second-moment-based measure of the total mediation effect for survival data, analogous to the R² measure in a linear model. Extensive simulations showed good performance of MASH in estimating the total mediation effect and identifying true mediators. By applying MASH to the metabolomics data of 1919 subjects in the Framingham Heart Study, we identified five metabolites as mediators of the effect of cigarette smoking on coronary heart disease risk (total mediation effect, 51.1%) and two metabolites as mediators between smoking and risk of cancer (total mediation effect, 50.7%). Application of MASH to a diffuse large B-cell lymphoma genomics data set identified copy-number variations for eight genes as mediators between the baseline International Prognostic Index score and overall survival.
MASH: MEDIATION ANALYSIS OF SURVIVAL OUTCOME AND HIGH-DIMENSIONAL OMICS MEDIATORS WITH APPLICATION TO COMPLEX DISEASES
Annals of Applied Statistics 18(2): 1360-1377. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11426188/pdf/
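Sure independence screening, the mediator-selection step named above, can be sketched as marginal association ranking on a continuous surrogate outcome. The exposure, effect sizes, and screening size below are invented, and the survival-specific machinery of MASH is not shown:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 300, 1000
smoking = rng.integers(0, 2, n).astype(float)   # hypothetical binary exposure
M = rng.normal(0, 1, (n, p))                    # high-dimensional mediators
M[:, :5] += 1.5 * smoking[:, None]              # first 5 are true mediators
outcome = M[:, :5].sum(axis=1) + rng.normal(0, 1, n)

# sure independence screening: keep the d mediators with the largest
# marginal (absolute) correlation with the outcome
score = np.abs(np.corrcoef(M.T, outcome)[-1, :-1])
keep = np.argsort(score)[::-1][:20]
```

With 1000 candidates and a modest sample, the top-20 screened set still retains all five true mediators, which is the "sure screening" property the procedure relies on.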
Pub Date : 2024-06-01Epub Date: 2024-04-05DOI: 10.1214/23-aoas1852
Irena Chen, Zhenke Wu, Siobán D Harlow, Carrie A Karvonen-Gutierrez, Michelle M Hood, Michael R Elliott
Longitudinal biomarker data and cross-sectional outcomes are routinely collected in modern epidemiology studies, often with the goal of informing tailored early intervention decisions. For example, hormones, such as estradiol (E2) and follicle-stimulating hormone (FSH), may predict changes in women's health during midlife. Most existing methods focus on constructing predictors from mean marker trajectories. However, subject-level biomarker variability may also provide critical information about disease risks and health outcomes. Current literature does not provide statistical models to investigate such relationships with valid uncertainty quantification. In this paper we develop a fully Bayesian joint model that estimates subject-level means, variances, and covariances of multiple longitudinal biomarkers and uses these as predictors to evaluate their respective associations with a cross-sectional health outcome. Simulations demonstrate excellent recovery of true model parameters. The proposed method provides less biased and more efficient estimates, relative to alternative approaches that either ignore subject-level differences in variances or perform two-stage estimation where estimated marker variances are treated as observed. Empowered by the model, analyses of women's health data reveal, for the first time, that larger variability of E2 was associated with slower increases in waist circumference across the menopausal transition.
{"title":"VARIANCE AS A PREDICTOR OF HEALTH OUTCOMES: SUBJECT-LEVEL TRAJECTORIES AND VARIABILITY OF SEX HORMONES TO PREDICT BODY FAT CHANGES IN PERI- AND POSTMENOPAUSAL WOMEN.","authors":"Irena Chen, Zhenke Wu, Siobán D Harlow, Carrie A Karvonen-Gutierrez, Michelle M Hood, Michael R Elliott","doi":"10.1214/23-aoas1852","DOIUrl":"10.1214/23-aoas1852","url":null,"abstract":"<p><p>Longitudinal biomarker data and cross-sectional outcomes are routinely collected in modern epidemiology studies, often with the goal of informing tailored early intervention decisions. For example, hormones, such as estradiol (E2) and follicle-stimulating hormone (FSH), may predict changes in womens' health during the midlife. Most existing methods focus on constructing predictors from mean marker trajectories. However, subject-level biomarker variability may also provide critical information about disease risks and health outcomes. Current literature does not provide statistical models to investigate such relationships with valid uncertainty quantification. In this paper we develop a fully Bayesian joint model that estimates subject-level means, variances, and covariances of multiple longitudinal biomarkers and uses these as predictors to evaluate their respective associations with a cross-sectional health outcome. Simulations demonstrate excellent recovery of true model parameters. The proposed method provides less biased and more efficient estimates, relative to alternative approaches that either ignore subject-level differences in variances or perform two-stage estimation where estimated marker variances are treated as observed. 
Empowered by the model, analyses of women's health data reveal, for the first time, that larger variability of E2 was associated with slower increases in waist circumference across the menopausal transition.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"18 2","pages":"1642-1667"},"PeriodicalIF":1.4,"publicationDate":"2024-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12309625/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144755011","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
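The two-stage alternative that the abstract above critiques (estimate each subject's marker mean and variance, then plug the estimates into an outcome regression as if observed) is easy to sketch. The joint Bayesian model improves on this by propagating the stage-1 uncertainty; the function name and simulation here are illustrative assumptions, not code from the paper:

```python
import numpy as np

def two_stage_variance_predictor(marker_series, outcome):
    """Two-stage plug-in sketch: (1) summarize each subject's longitudinal
    marker by its sample mean and sample SD; (2) regress the cross-sectional
    outcome on both summaries via OLS.
    Returns [intercept, mean effect, variability effect].
    The estimated SDs are treated as if observed, which is exactly the
    simplification that biases and de-efficiencies this approach relative
    to a joint model."""
    means = np.array([np.mean(m) for m in marker_series])
    sds = np.array([np.std(m, ddof=1) for m in marker_series])
    X = np.column_stack([np.ones(len(outcome)), means, sds])
    beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    return beta
```

Even in this crude form, simulating subjects whose outcome depends negatively on their true marker SD yields a clearly negative estimated variability effect, though attenuated toward zero by the measurement error in the plug-in SDs, which is the bias the joint model removes.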
Pub Date : 2024-06-01Epub Date: 2024-04-05DOI: 10.1214/23-aoas1849
Emily N Peterson, Rachel C Nethery, Tullia Padellini, Jarvis T Chen, Brent A Coull, Frédéric B Piel, Jon Wakefield, Marta Blangiardo, Lance A Waller
Small area population counts are necessary for many epidemiological studies, yet their quality and accuracy are often not assessed. In the United States, small area population counts are published by the United States Census Bureau (USCB) in the form of the decennial census counts, intercensal population projections (PEP), and American Community Survey (ACS) estimates. Although there are significant relationships between these three data sources, there are important contrasts in data collection, data availability, and processing methodologies, such that each set of reported population counts may be subject to different sources and magnitudes of error. Additionally, these data sources do not report identical small area population counts due to post-survey adjustments specific to each data source. Consequently, in public health studies, small area disease/mortality rates may differ depending on which data source is used for denominator data. To accurately estimate annual small area population counts and their associated uncertainties, we present a Bayesian population (BPop) model, which fuses information from all three USCB sources, accounting for data-source-specific methodologies and associated errors. We produce comprehensive small area race-stratified estimates of the true population, and associated uncertainties, given the observed trends in all three USCB population estimates. The main features of our framework are: (1) a single model integrating multiple data sources, (2) accounting for data-source-specific data-generating mechanisms and, specifically, data-source-specific errors, and (3) prediction of population counts for years without USCB-reported data. We focus our study on the Black and White only populations for 159 counties of Georgia and produce estimates for years 2006-2023. We compare BPop population estimates to decennial census counts, PEP annual counts, and ACS multi-year estimates. Additionally, we illustrate and explain the different types of data-source-specific errors. Lastly, we compare model performance using simulations and validation exercises. Our Bayesian population model can be extended to other applications at smaller spatial granularity and for demographic subpopulations defined further by race, age, and sex, and/or for other geographical regions.
{"title":"A BAYESIAN HIERARCHICAL SMALL AREA POPULATION MODEL ACCOUNTING FOR DATA SOURCE SPECIFIC METHODOLOGIES FROM AMERICAN COMMUNITY SURVEY, POPULATION ESTIMATES PROGRAM, AND DECENNIAL CENSUS DATA.","authors":"Emily N Peterson, Rachel C Nethery, Tullia Padellini, Jarvis T Chen, Brent A Coull, Frédéric B Piel, Jon Wakefield, Marta Blangiardo, Lance A Waller","doi":"10.1214/23-aoas1849","DOIUrl":"https://doi.org/10.1214/23-aoas1849","url":null,"abstract":"<p><p>Small area population counts are necessary for many epidemiological studies, yet their quality and accuracy are often not assessed. In the United States, small area population counts are published by the United States Census Bureau (USCB) in the form of the decennial census counts, intercensal population projections (PEP), and American Community Survey (ACS) estimates. Although there are significant relationships between these three data sources, there are important contrasts in data collection, data availability, and processing methodologies such that each set of reported population counts may be subject to different sources and magnitudes of error. Additionally, these data sources do not report identical small area population counts due to post-survey adjustments specific to each data source. Consequently, in public health studies, small area disease/mortality rates may differ depending on which data source is used for denominator data. To accurately estimate annual small area population counts <i>and their</i> associated uncertainties, we present a Bayesian population (BPop) model, which fuses information from all three USCB sources, accounting for data source specific methodologies and associated errors. We produce comprehensive small area race-stratified estimates of the true population, and associated uncertainties, given the observed trends in all three USCB population estimates. 
The main features of our framework are: (1) a single model integrating multiple data sources, (2) accounting for data source specific data generating mechanisms and specifically accounting for data source specific errors, and (3) prediction of population counts for years without USCB reported data. We focus our study on the Black and White only populations for 159 counties of Georgia and produce estimates for years 2006-2023. We compare BPop population estimates to decennial census counts, PEP annual counts, and ACS multi-year estimates. Additionally, we illustrate and explain the different types of data source specific errors. Lastly, we compare model performance using simulations and validation exercises. Our Bayesian population model can be extended to other applications at smaller spatial granularity and for demographic subpopulations defined further by race, age, and sex, and/or for other geographical regions.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"18 2","pages":"1565-1595"},"PeriodicalIF":1.3,"publicationDate":"2024-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11423836/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142331820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
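The core intuition behind fusing several data sources with source-specific error models, as in the BPop framework above, can be sketched with a one-area, one-year toy. This is a drastic simplification of the paper's hierarchical, race-stratified, time-indexed model: each source is assumed to report truth plus independent Normal error, and the function name is an illustrative assumption:

```python
import numpy as np

def fuse_population_counts(estimates, error_sds):
    """Precision-weighted fusion of several noisy counts of one population.
    Under independent Normal(0, sd_source) errors, the posterior mean of
    the true count (flat prior) is the inverse-variance-weighted average,
    so sources with smaller error variance pull the fused estimate harder,
    mirroring how a low-error source like the decennial census dominates a
    higher-error survey estimate."""
    w = 1.0 / np.asarray(error_sds, dtype=float) ** 2   # precisions
    est = np.asarray(estimates, dtype=float)
    fused = float((w * est).sum() / w.sum())
    fused_sd = float(1.0 / np.sqrt(w.sum()))            # fused uncertainty
    return fused, fused_sd
```

For example, fusing counts of 100, 110, and 90 with error SDs of 1, 10, and 10 returns a fused estimate essentially equal to the precise source's 100, with uncertainty below any single source's.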