
Annals of Applied Statistics: Latest Articles

OUTCOME-GUIDED DISEASE SUBTYPING BY GENERATIVE MODEL AND WEIGHTED JOINT LIKELIHOOD IN TRANSCRIPTOMIC APPLICATIONS.
IF 1.4, Zone 4 (Mathematics), Q2 STATISTICS & PROBABILITY. Pub Date: 2024-09-01. Epub Date: 2024-08-05. DOI: 10.1214/23-aoas1865
Yujia Li, Peng Liu, Wenjia Wang, Wei Zong, Yusi Fang, Zhao Ren, Lu Tang, Juan C Celedón, Steffi Oesterreich, George C Tseng

With advances in high-throughput technology, molecular disease subtyping by high-dimensional omics data has been recognized as an effective approach for identifying subtypes of complex diseases with distinct disease mechanisms and prognoses. Conventional cluster analysis takes omics data as input and generates patient clusters with similar gene expression patterns. The omics data, however, usually contain multi-faceted cluster structures that can be defined by different sets of genes. If the gene set associated with irrelevant clinical variables (e.g., sex or age) dominates the clustering process, the resulting clusters may not capture clinically meaningful disease subtypes. This motivates the development, in this paper, of a clustering framework guided by a pre-specified disease outcome, such as lung function measurement or survival. We propose two methods for disease subtyping from omics data with outcome guidance, using either a generative model or a weighted joint likelihood. Both methods connect an outcome association model and a disease subtyping model by a latent variable of cluster labels. Compared to the generative model, the weighted joint likelihood contains a data-driven weight parameter to balance the likelihood contributions from outcome association and gene cluster separation, which improves generalizability in independent validation but requires heavier computing. Extensive simulations and two real applications in lung disease and triple-negative breast cancer demonstrate superior disease subtyping performance of the outcome-guided clustering methods in terms of disease subtyping accuracy, gene selection and outcome association. Unlike existing clustering methods, the outcome-guided disease subtyping framework creates a new precision medicine paradigm to directly identify patient subgroups with clinical association.

Annals of Applied Statistics, vol. 18, no. 3, pp. 1947-1964. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12309773/pdf/
Citations: 0
A NONPARAMETRIC MIXED-EFFECTS MIXTURE MODEL FOR PATTERNS OF CLINICAL MEASUREMENTS ASSOCIATED WITH COVID-19.
IF 1.3, Zone 4 (Mathematics), Q2 STATISTICS & PROBABILITY. Pub Date: 2024-09-01. Epub Date: 2024-08-05. DOI: 10.1214/23-aoas1871
Xiaoran Ma, Wensheng Guo, Mengyang Gu, Len Usvyat, Peter Kotanko, Yuedong Wang

Some patients with COVID-19 show changes in signs and symptoms such as temperature and oxygen saturation days before testing positive for SARS-CoV-2, while others remain asymptomatic. It is important to identify these subgroups and to understand what biological and clinical predictors are related to these subgroups. This information will provide insights into how the immune system may respond differently to infection and can further be used to identify infected individuals. We propose a flexible nonparametric mixed-effects mixture model that identifies risk factors and classifies patients with biological changes. We model the latent probability of biological changes using a logistic regression model and trajectories in the latent groups using smoothing splines. We develop an EM algorithm to maximize the penalized likelihood for estimating all parameters and mean functions. We evaluate our methods by simulations and apply the proposed model to investigate changes in temperature in a cohort of COVID-19-infected hemodialysis patients.

Annals of Applied Statistics, vol. 18, no. 3, pp. 2080-2095. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11460989/pdf/
Citations: 0
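The abstract above combines a logistic regression for latent group membership with an EM algorithm. The sketch below illustrates only the E-step idea, under strong simplifying assumptions that are not from the paper: a single scalar measurement per patient, Gaussian noise with a shared standard deviation, and constant group means standing in for the smoothing-spline trajectories. All function and parameter names are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic function mapping a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def normal_pdf(y, mean, sd):
    """Density of a normal distribution at y."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def e_step_responsibility(y, x, beta0, beta1, mean_change, mean_stable, sd):
    """Posterior probability that a patient is in the 'change' group.

    The prior membership probability comes from a logistic regression on
    the covariate x (an assumed single risk factor); the likelihoods are
    Gaussian around an assumed constant mean for each latent group.
    """
    prior = sigmoid(beta0 + beta1 * x)
    lik_change = prior * normal_pdf(y, mean_change, sd)
    lik_stable = (1.0 - prior) * normal_pdf(y, mean_stable, sd)
    return lik_change / (lik_change + lik_stable)

# A temperature halfway between the two group means carries no information,
# so the posterior equals the prior; a clearly elevated value shifts it up.
r_neutral = e_step_responsibility(37.5, 0.0, 0.0, 0.0, 38.0, 37.0, 0.5)
r_elevated = e_step_responsibility(38.5, 0.0, 0.0, 0.0, 38.0, 37.0, 0.5)
```

In a full EM loop these responsibilities would weight the updates of the regression coefficients and the group mean functions in the M-step.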
JOINT MODELING OF MULTISTATE AND NONPARAMETRIC MULTIVARIATE LONGITUDINAL DATA.
IF 1.8, Zone 4 (Mathematics), Q2 STATISTICS & PROBABILITY. Pub Date: 2024-08-05. DOI: 10.1214/24-aoas1889
Lu You, Falastin Salami, Carina Törn, Åke Lernmark, Roy Tamura
It is oftentimes the case in studies of disease progression that subjects can move into one of several disease states of interest. Multistate models are an indispensable tool to analyze data from such studies. The Environmental Determinants of Diabetes in the Young (TEDDY) is an observational study of at-risk children from birth to onset of type-1 diabetes (T1D) up through the age of 15. A joint model for simultaneous inference of multistate and multivariate nonparametric longitudinal data is proposed to analyze data and answer the research questions brought up in the study. The proposed method allows us to make statistical inferences, test hypotheses, and make predictions about future state occupation in the TEDDY study. The performance of the proposed method is evaluated by simulation studies. The proposed method is applied to the motivating example to demonstrate the capabilities of the method.
Annals of Applied Statistics, vol. 21, no. 1, pp. 2444-2461. Journal Article. https://doi.org/10.1214/24-aoas1889
Citations: 0
ACCURATE ESTIMATION OF RARE CELL-TYPE FRACTIONS FROM TISSUE OMICS DATA VIA HIERARCHICAL DECONVOLUTION.
IF 1.4, Zone 4 (Mathematics), Q2 STATISTICS & PROBABILITY. Pub Date: 2024-06-01. Epub Date: 2024-04-05. DOI: 10.1214/23-aoas1829
Penghui Huang, Manqi Cai, Xinghua Lu, Chris McKennan, Jiebiao Wang

Bulk transcriptomics in tissue samples reflects the average expression levels across different cell types and is highly influenced by cellular fractions. As such, it is critical to estimate cellular fractions to both deconfound differential expression analyses and infer cell type-specific differential expression. Since experimentally counting cells is infeasible in most tissues and studies, in silico cellular deconvolution methods have been developed as an alternative. However, existing methods are designed for tissues consisting of clearly distinguishable cell types and have difficulties estimating highly correlated or rare cell types. To address this challenge, we propose hierarchical deconvolution (HiDecon) that uses single-cell RNA sequencing references and a hierarchical cell-type tree, which models the similarities among cell types and cell differentiation relationships, to estimate cellular fractions in bulk data. By coordinating cell fractions across layers of the hierarchical tree, cellular fraction information is passed up and down the tree, which helps correct estimation biases by pooling information across related cell types. The flexible hierarchical tree structure also enables estimating rare cell fractions by splitting the tree to higher resolutions. Through simulations and real data applications with the ground truth of measured cellular fractions, we demonstrate that HiDecon outperforms existing methods and accurately estimates cellular fractions. Finally, we show the utility of HiDecon estimates in identifying the associations between cellular fractions and Alzheimer's disease.

Annals of Applied Statistics, vol. 18, no. 2, pp. 1178-1194. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12530111/pdf/
Citations: 0
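The deconvolution idea in the abstract above rests on bulk expression being a fraction-weighted average of cell-type expression profiles. The sketch below shows that basic model for just two cell types with a closed-form least-squares estimate. It is not HiDecon and has no hierarchical tree; the signature values and the function name are made up for illustration.

```python
def estimate_fraction(bulk, sig_a, sig_b):
    """Least-squares estimate of the fraction of cell type A in a two-type mixture.

    Assumed model for each gene g: bulk[g] = p * sig_a[g] + (1 - p) * sig_b[g].
    Solving for p by least squares gives a ratio of sums; the result is
    clipped to [0, 1] so that it is a valid fraction.
    """
    num = sum((b - sb) * (sa - sb) for b, sa, sb in zip(bulk, sig_a, sig_b))
    den = sum((sa - sb) ** 2 for sa, sb in zip(sig_a, sig_b))
    if den == 0.0:
        # Identical signatures: the two cell types cannot be told apart.
        raise ValueError("signatures are identical; fraction is not identifiable")
    return min(1.0, max(0.0, num / den))

# Hypothetical reference signatures for a rare and a common cell type.
sig_rare = [10.0, 1.0, 5.0, 0.5]
sig_common = [1.0, 8.0, 5.0, 2.0]

# Simulate a noise-free bulk sample in which the rare type makes up 5%.
true_p = 0.05
bulk = [true_p * a + (1 - true_p) * b for a, b in zip(sig_rare, sig_common)]
p_hat = estimate_fraction(bulk, sig_rare, sig_common)
```

The denominator shrinks as the two signatures become more similar, which is one way to see why highly correlated cell types are hard to separate and why pooling information across related types can help.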
BAYESIAN NESTED LATENT CLASS MODELS FOR CAUSE-OF-DEATH ASSIGNMENT USING VERBAL AUTOPSIES ACROSS MULTIPLE DOMAINS.
IF 1.3, Zone 4 (Mathematics), Q2 STATISTICS & PROBABILITY. Pub Date: 2024-06-01. Epub Date: 2024-04-05. DOI: 10.1214/23-aoas1826
Zehang Richard Li, Zhenke Wu, Irena Chen, Samuel J Clark

Understanding cause-specific mortality rates is crucial for monitoring population health and designing public health interventions. Worldwide, two-thirds of deaths do not have a cause assigned. Verbal autopsy (VA) is a well-established tool to collect information describing deaths outside of hospitals by conducting surveys to caregivers of a deceased person. It is routinely implemented in many low- and middle-income countries. Statistical algorithms to assign cause of death using VAs are typically vulnerable to the distribution shift between the data used to train the model and the target population. This presents a major challenge for analyzing VAs, as labeled data are usually unavailable in the target population. This article proposes a latent class model framework for VA data (LCVA) that jointly models VAs collected over multiple heterogeneous domains, assigns causes of death for out-of-domain observations and estimates cause-specific mortality fractions for a new domain. We introduce a parsimonious representation of the joint distribution of the collected symptoms using nested latent class models and develop a computationally efficient algorithm for posterior inference. We demonstrate that LCVA outperforms existing methods in predictive performance and scalability. Supplementary Material and reproducible analysis codes are available online. The R package LCVA implementing the method is available on GitHub (https://github.com/richardli/LCVA).

Annals of Applied Statistics, vol. 18, no. 2, pp. 1137-1159. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11484295/pdf/
Citations: 0
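The abstract above distinguishes individual cause-of-death assignment from population-level cause-specific mortality fractions (CSMFs). As a rough illustration of how the two relate, the sketch below averages per-death cause probabilities into CSMFs. This is a simple plug-in summary under assumed inputs, not the LCVA model or the API of the LCVA R package; the cause names and probabilities are invented.

```python
def csmf(cause_probs):
    """Average per-death cause probabilities into cause-specific mortality fractions.

    cause_probs is a list with one dict per death, mapping each cause to the
    probability assigned to it; every dict is assumed to use the same causes.
    """
    n = len(cause_probs)
    causes = cause_probs[0].keys()
    return {c: sum(p[c] for p in cause_probs) / n for c in causes}

# Hypothetical assignment probabilities for four deaths in a new domain.
deaths = [
    {"malaria": 0.7, "pneumonia": 0.2, "other": 0.1},
    {"malaria": 0.1, "pneumonia": 0.8, "other": 0.1},
    {"malaria": 0.2, "pneumonia": 0.2, "other": 0.6},
    {"malaria": 0.6, "pneumonia": 0.3, "other": 0.1},
]
fractions = csmf(deaths)
```

Because each death's probabilities sum to one, the resulting fractions also sum to one.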
SPATIAL PREDICTIONS ON PHYSICALLY CONSTRAINED DOMAINS: APPLICATIONS TO ARCTIC SEA SALINITY DATA.
IF 1.4, Zone 4 (Mathematics), Q2 STATISTICS & PROBABILITY. Pub Date: 2024-06-01. Epub Date: 2024-04-05. DOI: 10.1214/23-aoas1850
Bora Jin, Amy H Herring, David Dunson

In this paper we predict sea surface salinity (SSS) in the Arctic Ocean based on satellite measurements. SSS is a crucial indicator for ongoing changes in the Arctic Ocean and can offer important insights about climate change. We particularly focus on areas of water mistakenly flagged as ice by satellite algorithms. To remove bias in the retrieval of salinity near sea ice, the algorithms use conservative ice masks, which result in considerable loss of data. We aim to produce realistic SSS values for such regions to obtain more complete understanding about the SSS surface over the Arctic Ocean and benefit future applications that may require SSS measurements near edges of sea ice or coasts. We propose a class of scalable nonstationary processes that can handle large data from satellite products and complex geometries of the Arctic Ocean. Barrier overlap-removal acyclic directed graph GP (BORA-GP) constructs sparse directed acyclic graphs (DAGs) with neighbors conforming to barriers and boundaries, enabling characterization of dependence in constrained domains. The BORA-GP models produce more sensible SSS values in regions without satellite measurements and show improved performance in various constrained domains in simulation studies compared to state-of-the-art alternatives. An R package is available at https://github.com/jinbora0720/boraGP.

Annals of Applied Statistics, vol. 18, no. 2, pp. 1596-1617. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12391905/pdf/
Citations: 0
TENSOR QUANTILE REGRESSION WITH LOW-RANK TENSOR TRAIN ESTIMATION.
IF 1.3, Zone 4 (Mathematics), Q2 STATISTICS & PROBABILITY. Pub Date: 2024-06-01. Epub Date: 2024-04-05. DOI: 10.1214/23-aoas1835
Zihuan Liu, Cheuk Yin Lee, Heping Zhang

Neuroimaging studies often involve predicting a scalar outcome from an array of images collectively called a tensor. The use of magnetic resonance imaging (MRI) provides a unique opportunity to investigate the structures of the brain. To learn the association between MRI images and human intelligence, we formulate a scalar-on-image quantile regression framework. However, the high dimensionality of the tensor makes estimating the coefficients for all elements computationally challenging. To address this, we propose a low-rank coefficient array estimation algorithm based on tensor train (TT) decomposition, which we demonstrate can effectively reduce the dimensionality of the coefficient tensor to a feasible level while ensuring adequacy to the data. Our method is more stable and efficient than the commonly used method based on canonical polyadic (CP) rank approximation. We also propose a generalized Lasso penalty on the coefficient tensor to take advantage of the spatial structure of the tensor, further reduce the dimensionality of the coefficient tensor, and improve the interpretability of the model. The consistency and asymptotic normality of the TT estimator are established under some mild conditions on the covariates and random errors in quantile regression models. The rate of convergence is obtained with regularization under the total variation penalty. Extensive numerical studies, including both synthetic and real MRI imaging data, are conducted to examine the empirical performance of the proposed method and its competitors.

Annals of Applied Statistics, vol. 18, no. 2, pp. 1294-1318. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11046526/pdf/
Citations: 0
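Quantile regression, as used in the abstract above, replaces squared error with the check (pinball) loss. The sketch below shows only that standard loss and verifies on a toy sample that the best constant fit at tau = 0.5 is the median. It does not implement the tensor train estimator or the generalized Lasso penalty, and the helper names are illustrative.

```python
def check_loss(u, tau):
    """Quantile (pinball) loss: rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))

def best_constant(ys, tau):
    """Return the observed value that minimizes the total check loss.

    Searching over the observed values is enough for this toy case because
    a sample quantile is always attained at one of the data points.
    """
    return min(ys, key=lambda q: sum(check_loss(y - q, tau) for y in ys))

# The outlier at 100 barely moves the tau = 0.5 fit, unlike the mean (22.0).
ys = [1.0, 2.0, 3.0, 4.0, 100.0]
median_fit = best_constant(ys, 0.5)
```

For tau below 0.5 the loss penalizes over-prediction more heavily than under-prediction, which pulls the fit toward lower quantiles.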
MASH: MEDIATION ANALYSIS OF SURVIVAL OUTCOME AND HIGH-DIMENSIONAL OMICS MEDIATORS WITH APPLICATION TO COMPLEX DISEASES.
IF 1.3, Zone 4 (Mathematics), Q2 STATISTICS & PROBABILITY. Pub Date: 2024-06-01. Epub Date: 2024-04-05. DOI: 10.1214/23-aoas1838
Sunyi Chi, Christopher R Flowers, Ziyi Li, Xuelin Huang, Peng Wei

Environmental exposures such as cigarette smoking influence health outcomes through intermediate molecular phenotypes, such as the methylome, transcriptome, and metabolome. Mediation analysis is a useful tool for investigating the role of potentially high-dimensional intermediate phenotypes in the relationship between environmental exposures and health outcomes. However, little work has been done on mediation analysis when the mediators are high-dimensional and the outcome is a survival endpoint, and none of it has provided a robust measure of total mediation effect. To this end, we propose an estimation procedure for Mediation Analysis of Survival outcome and High-dimensional omics mediators (MASH) based on sure independence screening for putative mediator variable selection and a second-moment-based measure of total mediation effect for survival data analogous to the $R^2$ measure in a linear model. Extensive simulations showed good performance of MASH in estimating the total mediation effect and identifying true mediators. By applying MASH to the metabolomics data of 1919 subjects in the Framingham Heart Study, we identified five metabolites as mediators of the effect of cigarette smoking on coronary heart disease risk (total mediation effect, 51.1%) and two metabolites as mediators between smoking and risk of cancer (total mediation effect, 50.7%). Application of MASH to a diffuse large B-cell lymphoma genomics data set identified copy-number variations for eight genes as mediators between the baseline International Prognostic Index score and overall survival.

Annals of Applied Statistics, vol. 18, no. 2, pp. 1360-1377. Published 2024-06-01. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11426188/pdf/
Citations: 0
VARIANCE AS A PREDICTOR OF HEALTH OUTCOMES: SUBJECT-LEVEL TRAJECTORIES AND VARIABILITY OF SEX HORMONES TO PREDICT BODY FAT CHANGES IN PERI- AND POSTMENOPAUSAL WOMEN.
IF 1.4 Zone 4 (Mathematics) Q2 STATISTICS & PROBABILITY Pub Date : 2024-06-01 Epub Date: 2024-04-05 DOI: 10.1214/23-aoas1852
Irena Chen, Zhenke Wu, Siobán D Harlow, Carrie A Karvonen-Gutierrez, Michelle M Hood, Michael R Elliott

Longitudinal biomarker data and cross-sectional outcomes are routinely collected in modern epidemiology studies, often with the goal of informing tailored early intervention decisions. For example, hormones, such as estradiol (E2) and follicle-stimulating hormone (FSH), may predict changes in women's health during the midlife. Most existing methods focus on constructing predictors from mean marker trajectories. However, subject-level biomarker variability may also provide critical information about disease risks and health outcomes. Current literature does not provide statistical models to investigate such relationships with valid uncertainty quantification. In this paper, we develop a fully Bayesian joint model that estimates subject-level means, variances, and covariances of multiple longitudinal biomarkers and uses these as predictors to evaluate their respective associations with a cross-sectional health outcome. Simulations demonstrate excellent recovery of true model parameters. The proposed method provides less biased and more efficient estimates, relative to alternative approaches that either ignore subject-level differences in variances or perform two-stage estimation where estimated marker variances are treated as observed. Empowered by the model, analyses of women's health data reveal, for the first time, that larger variability of E2 was associated with slower increases in waist circumference across the menopausal transition.
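
To make "subject-level means and variances as predictors" concrete, the sketch below computes the first stage of the naive two-stage alternative that the abstract contrasts against: one sample mean and one sample variance per subject. It is not the authors' Bayesian joint model, which estimates these quantities together with the outcome association. The subject identifiers and measurements are invented for illustration.

```python
import statistics


def subject_summaries(trajectories):
    """Map each subject id to (mean, sample variance) of its repeated
    measurements. Subjects with fewer than two measurements are skipped,
    since a sample variance is undefined for them."""
    return {
        sid: (statistics.mean(values), statistics.variance(values))
        for sid, values in trajectories.items()
        if len(values) >= 2
    }


# Hypothetical repeated biomarker measurements for three subjects.
trajectories = {
    "s1": [10.0, 12.0, 11.0, 13.0],
    "s2": [10.0, 30.0, 5.0, 25.0],
    "s3": [20.0, 20.5, 19.5, 20.0],
}

summaries = subject_summaries(trajectories)
for sid, (mean, var) in summaries.items():
    print(f"{sid}: mean={mean:.2f}, variance={var:.2f}")
```

In a two-stage analysis these per-subject summaries would be plugged into an outcome regression as if they were observed without error. The abstract reports that this shortcut gives more biased and less efficient estimates than joint estimation.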

Annals of Applied Statistics, vol. 18, no. 2, pp. 1642-1667. Published 2024-06-01. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12309625/pdf/
Citations: 0
A BAYESIAN HIERARCHICAL SMALL AREA POPULATION MODEL ACCOUNTING FOR DATA SOURCE SPECIFIC METHODOLOGIES FROM AMERICAN COMMUNITY SURVEY, POPULATION ESTIMATES PROGRAM, AND DECENNIAL CENSUS DATA.
IF 1.3 Zone 4 (Mathematics) Q2 STATISTICS & PROBABILITY Pub Date : 2024-06-01 Epub Date: 2024-04-05 DOI: 10.1214/23-aoas1849
Emily N Peterson, Rachel C Nethery, Tullia Padellini, Jarvis T Chen, Brent A Coull, Frédéric B Piel, Jon Wakefield, Marta Blangiardo, Lance A Waller

Small area population counts are necessary for many epidemiological studies, yet their quality and accuracy are often not assessed. In the United States, small area population counts are published by the United States Census Bureau (USCB) in the form of the decennial census counts, intercensal population projections (PEP), and American Community Survey (ACS) estimates. Although there are significant relationships between these three data sources, there are important contrasts in data collection, data availability, and processing methodologies such that each set of reported population counts may be subject to different sources and magnitudes of error. Additionally, these data sources do not report identical small area population counts due to post-survey adjustments specific to each data source. Consequently, in public health studies, small area disease/mortality rates may differ depending on which data source is used for denominator data. To accurately estimate annual small area population counts and their associated uncertainties, we present a Bayesian population (BPop) model, which fuses information from all three USCB sources, accounting for data source specific methodologies and associated errors. We produce comprehensive small area race-stratified estimates of the true population, and associated uncertainties, given the observed trends in all three USCB population estimates. The main features of our framework are: (1) a single model integrating multiple data sources, (2) accounting for data source specific data generating mechanisms and specifically accounting for data source specific errors, and (3) prediction of population counts for years without USCB reported data. We focus our study on the Black and White only populations for 159 counties of Georgia and produce estimates for years 2006-2023. We compare BPop population estimates to decennial census counts, PEP annual counts, and ACS multi-year estimates. Additionally, we illustrate and explain the different types of data source specific errors. Lastly, we compare model performance using simulations and validation exercises. Our Bayesian population model can be extended to other applications at smaller spatial granularity and for demographic subpopulations defined further by race, age, and sex, and/or for other geographical regions.
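
The abstract's point that a small area rate depends on which source supplies the denominator is simple arithmetic, shown below with invented numbers. The death count and the three population figures are hypothetical and are not taken from the paper or from any USCB product.

```python
def rate_per_100k(events, population):
    """Crude rate per 100,000 population."""
    return 100_000 * events / population


# Hypothetical: one small area, one year, the same 12 deaths,
# and three different published population counts.
deaths = 12
denominators = {
    "decennial census": 4_000,
    "PEP": 4_200,
    "ACS": 3_800,
}

rates = {
    source: rate_per_100k(deaths, population)
    for source, population in denominators.items()
}
for source, rate in rates.items():
    print(f"{source}: {rate:.1f} per 100,000")
```

With these made-up inputs the same event count yields rates of roughly 286 to 316 per 100,000, a spread driven entirely by the choice of denominator.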

Annals of Applied Statistics, vol. 18, no. 2, pp. 1565-1595. Published 2024-06-01. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11423836/pdf/
Citations: 0