Pub Date: 2024-06-01 | Epub Date: 2024-04-05 | DOI: 10.1214/23-aoas1838
Sunyi Chi, Christopher R Flowers, Ziyi Li, Xuelin Huang, Peng Wei
Environmental exposures such as cigarette smoking influence health outcomes through intermediate molecular phenotypes, such as the methylome, transcriptome, and metabolome. Mediation analysis is a useful tool for investigating the role of potentially high-dimensional intermediate phenotypes in the relationship between environmental exposures and health outcomes. However, little work has been done on mediation analysis when the mediators are high-dimensional and the outcome is a survival endpoint, and none of it has provided a robust measure of total mediation effect. To this end, we propose an estimation procedure for Mediation Analysis of Survival outcome and High-dimensional omics mediators (MASH) based on sure independence screening for putative mediator variable selection and a second-moment-based measure of total mediation effect for survival data analogous to the R² measure in a linear model. Extensive simulations showed good performance of MASH in estimating the total mediation effect and identifying true mediators. By applying MASH to the metabolomics data of 1919 subjects in the Framingham Heart Study, we identified five metabolites as mediators of the effect of cigarette smoking on coronary heart disease risk (total mediation effect, 51.1%) and two metabolites as mediators between smoking and risk of cancer (total mediation effect, 50.7%). Application of MASH to a diffuse large B-cell lymphoma genomics data set identified copy-number variations for eight genes as mediators between the baseline International Prognostic Index score and overall survival.
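As a rough illustration of the two ingredients named above, sure independence screening and a second-moment total mediation effect, here is a minimal numpy sketch in the linear-model analogue (MASH itself works with survival outcomes). The simulated data, the screening cutoff d = n/log(n), and the particular R² decomposition are illustrative assumptions, not the paper's exact estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-model analogue: exposure X, p candidate mediators M (the first
# k are true mediators), continuous outcome Y. All values simulated.
n, p, k = 500, 2000, 5
X = rng.normal(size=n)
M = 0.5 * X[:, None] * (np.arange(p) < k) + rng.normal(size=(n, p))
Y = X + M[:, :k].sum(axis=1) + rng.normal(size=n)

# Sure independence screening: rank mediators by marginal correlation with
# the outcome and keep the top d = n / log(n) (a common default cutoff).
d = int(n / np.log(n))
marginal = np.abs(np.corrcoef(M.T, Y)[-1, :-1])
keep = np.argsort(marginal)[::-1][:d]

def r2(y, Z):
    """R-squared from an OLS fit of y on Z (intercept included)."""
    Z1 = np.column_stack([np.ones(len(y)), Z])
    resid = y - Z1 @ np.linalg.lstsq(Z1, y, rcond=None)[0]
    return 1 - resid.var() / y.var()

# One second-moment (R-squared-type) total mediation measure: the shared
# variance R2(Y~X) + R2(Y~M) - R2(Y~X,M), rescaled by R2(Y~X,M).
r2_x = r2(Y, X[:, None])
r2_m = r2(Y, M[:, keep])
r2_xm = r2(Y, np.column_stack([X, M[:, keep]]))
print("estimated total mediation effect:", round((r2_x + r2_m - r2_xm) / r2_xm, 3))
```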
{"title":"MASH: MEDIATION ANALYSIS OF SURVIVAL OUTCOME AND HIGH-DIMENSIONAL OMICS MEDIATORS WITH APPLICATION TO COMPLEX DISEASES.","authors":"Sunyi Chi, Christopher R Flowers, Ziyi Li, Xuelin Huang, Peng Wei","doi":"10.1214/23-aoas1838","DOIUrl":"https://doi.org/10.1214/23-aoas1838","url":null,"abstract":"<p><p>Environmental exposures such as cigarette smoking influence health outcomes through intermediate molecular phenotypes, such as the methylome, transcriptome, and metabolome. Mediation analysis is a useful tool for investigating the role of potentially high-dimensional intermediate phenotypes in the relationship between environmental exposures and health outcomes. However, little work has been done on mediation analysis when the mediators are high-dimensional and the outcome is a survival endpoint, and none of it has provided a robust measure of total mediation effect. To this end, we propose an estimation procedure for Mediation Analysis of Survival outcome and High-dimensional omics mediators (MASH) based on sure independence screening for putative mediator variable selection and a second-moment-based measure of total mediation effect for survival data analogous to the <math> <mrow><msup><mi>R</mi> <mn>2</mn></msup> </mrow> </math> measure in a linear model. Extensive simulations showed good performance of MASH in estimating the total mediation effect and identifying true mediators. By applying MASH to the metabolomics data of 1919 subjects in the Framingham Heart Study, we identified five metabolites as mediators of the effect of cigarette smoking on coronary heart disease risk (total mediation effect, 51.1%) and two metabolites as mediators between smoking and risk of cancer (total mediation effect, 50.7%). Application of MASH to a diffuse large B-cell lymphoma genomics data set identified copy-number variations for eight genes as mediators between the baseline International Prognostic Index score and overall survival.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"18 2","pages":"1360-1377"},"PeriodicalIF":1.3,"publicationDate":"2024-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11426188/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142331821","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-06-01 | Epub Date: 2024-04-05 | DOI: 10.1214/23-aoas1849
Emily N Peterson, Rachel C Nethery, Tullia Padellini, Jarvis T Chen, Brent A Coull, Frédéric B Piel, Jon Wakefield, Marta Blangiardo, Lance A Waller
Small area population counts are necessary for many epidemiological studies, yet their quality and accuracy are often not assessed. In the United States, small area population counts are published by the United States Census Bureau (USCB) in the form of the decennial census counts, intercensal population projections (PEP), and American Community Survey (ACS) estimates. Although there are significant relationships between these three data sources, there are important contrasts in data collection, data availability, and processing methodologies such that each set of reported population counts may be subject to different sources and magnitudes of error. Additionally, these data sources do not report identical small area population counts due to post-survey adjustments specific to each data source. Consequently, in public health studies, small area disease/mortality rates may differ depending on which data source is used for denominator data. To accurately estimate annual small area population counts and their associated uncertainties, we present a Bayesian population (BPop) model, which fuses information from all three USCB sources, accounting for data source specific methodologies and associated errors. We produce comprehensive small area race-stratified estimates of the true population, and associated uncertainties, given the observed trends in all three USCB population estimates. The main features of our framework are: (1) a single model integrating multiple data sources, (2) accounting for data source specific data generating mechanisms and specifically accounting for data source specific errors, and (3) prediction of population counts for years without USCB reported data. We focus our study on the Black and White only populations for 159 counties of Georgia and produce estimates for years 2006-2023. We compare BPop population estimates to decennial census counts, PEP annual counts, and ACS multi-year estimates. Additionally, we illustrate and explain the different types of data source specific errors. Lastly, we compare model performance using simulations and validation exercises. Our Bayesian population model can be extended to other applications at smaller spatial granularity and for demographic subpopulations defined further by race, age, and sex, and/or for other geographical regions.
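The fusion idea can be sketched with a toy state-space model: a latent log-population follows a random walk over years, and each USCB source observes it with its own error variance and availability pattern. This is a deliberately simplified stand-in for BPop, which additionally models race stratification, source-specific biases, and multi-year ACS averaging; all variances, availability patterns, and data below are invented.

```python
import numpy as np

rng = np.random.default_rng(1)

# Latent log-population theta_t follows a random walk; each source observes
# it with source-specific error, and availability differs by source.
years = np.arange(2006, 2024)
tau2 = 0.0004                                        # innovation variance (assumed)
sigma2 = {"census": 1e-5, "pep": 1e-4, "acs": 1e-3}  # source error variances (assumed)
theta = np.log(50000) + np.cumsum(rng.normal(0, np.sqrt(tau2), len(years)))
obs = {s: theta + rng.normal(0, np.sqrt(v), len(years)) for s, v in sigma2.items()}
avail = {"census": np.isin(years, [2010, 2020]),     # decennial counts only
         "pep": np.ones(len(years), bool),           # annual intercensal estimates
         "acs": years >= 2009}                       # ACS coverage (assumed)

# Kalman filter: predict, then fold in whichever sources report that year.
m, P = np.log(50000), 1.0                            # vague prior for 2006
for t, year in enumerate(years):
    P += tau2                                        # random-walk prediction step
    for s, v in sigma2.items():                      # sequential measurement updates
        if avail[s][t]:
            K = P / (P + v)                          # Kalman gain
            m += K * (obs[s][t] - m)
            P *= 1 - K
    print(f"{year}: estimate {np.exp(m):,.0f}, log-scale sd {np.sqrt(P):.4f}")
```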
{"title":"A BAYESIAN HIERARCHICAL SMALL AREA POPULATION MODEL ACCOUNTING FOR DATA SOURCE SPECIFIC METHODOLOGIES FROM AMERICAN COMMUNITY SURVEY, POPULATION ESTIMATES PROGRAM, AND DECENNIAL CENSUS DATA.","authors":"Emily N Peterson, Rachel C Nethery, Tullia Padellini, Jarvis T Chen, Brent A Coull, Frédéric B Piel, Jon Wakefield, Marta Blangiardo, Lance A Waller","doi":"10.1214/23-aoas1849","DOIUrl":"https://doi.org/10.1214/23-aoas1849","url":null,"abstract":"<p><p>Small area population counts are necessary for many epidemiological studies, yet their quality and accuracy are often not assessed. In the United States, small area population counts are published by the United States Census Bureau (USCB) in the form of the decennial census counts, intercensal population projections (PEP), and American Community Survey (ACS) estimates. Although there are significant relationships between these three data sources, there are important contrasts in data collection, data availability, and processing methodologies such that each set of reported population counts may be subject to different sources and magnitudes of error. Additionally, these data sources do not report identical small area population counts due to post-survey adjustments specific to each data source. Consequently, in public health studies, small area disease/mortality rates may differ depending on which data source is used for denominator data. To accurately estimate annual small area population counts <i>and their</i> associated uncertainties, we present a Bayesian population (BPop) model, which fuses information from all three USCB sources, accounting for data source specific methodologies and associated errors. We produce comprehensive small area race-stratified estimates of the true population, and associated uncertainties, given the observed trends in all three USCB population estimates. The main features of our framework are: (1) a single model integrating multiple data sources, (2) accounting for data source specific data generating mechanisms and specifically accounting for data source specific errors, and (3) prediction of population counts for years without USCB reported data. We focus our study on the Black and White only populations for 159 counties of Georgia and produce estimates for years 2006-2023. We compare BPop population estimates to decennial census counts, PEP annual counts, and ACS multi-year estimates. Additionally, we illustrate and explain the different types of data source specific errors. Lastly, we compare model performance using simulations and validation exercises. Our Bayesian population model can be extended to other applications at smaller spatial granularity and for demographic subpopulations defined further by race, age, and sex, and/or for other geographical regions.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"18 2","pages":"1565-1595"},"PeriodicalIF":1.3,"publicationDate":"2024-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11423836/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142331820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-04-05 | eCollection Date: 2024-06-01 | DOI: 10.1214/23-AOAS1856
Ashish Patel, Francis J DiTraglia, Verena Zuber, Stephen Burgess
Mendelian randomization (MR) is a widely-used method to estimate the causal relationship between a risk factor and disease. A fundamental part of any MR analysis is to choose appropriate genetic variants as instrumental variables. Genome-wide association studies often reveal that hundreds of genetic variants may be robustly associated with a risk factor, but in some situations investigators may have greater confidence in the instrument validity of only a smaller subset of variants. Nevertheless, the use of additional instruments may be optimal from the perspective of mean squared error even if they are slightly invalid; a small bias in estimation may be a price worth paying for a larger reduction in variance. For this purpose, we consider a method for "focused" instrument selection whereby genetic variants are selected to minimise the estimated asymptotic mean squared error of causal effect estimates. In a setting of many weak and locally invalid instruments, we propose a novel strategy to construct confidence intervals for post-selection focused estimators that guards against the worst case loss in asymptotic coverage. In empirical applications to: (i) validate lipid drug targets; and (ii) investigate vitamin D effects on a wide range of outcomes, our findings suggest that the optimal selection of instruments does not involve only a small number of biologically-justified instruments, but also many potentially invalid instruments.
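A stylized sketch of the focused-selection logic, on simulated two-sample summary statistics: starting from a trusted core of instruments, candidates are added in order of estimated invalidity, and the subset minimizing a plug-in estimate of MSE (squared bias relative to the core estimate plus IVW variance) is kept. This greedy criterion is a loose stand-in for the paper's asymptotic MSE minimization and omits its post-selection confidence intervals; all names and effect sizes are invented.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated two-sample summary statistics: variant-exposure effects gamma,
# variant-outcome effects beta, a trusted core of 10 valid instruments, and
# 50 extra candidates carrying possible small pleiotropic effects alpha_j.
theta = 0.3                                   # true causal effect (simulation truth)
J = 60
core = np.arange(10)
gamma = rng.uniform(0.05, 0.2, J)
alpha = np.concatenate([np.zeros(10), rng.normal(0, 0.02, J - 10)])
se_b = np.full(J, 0.01)
beta = theta * gamma + alpha + rng.normal(0, se_b)

def ivw(idx):
    """Inverse-variance-weighted estimate and its variance on subset idx."""
    w = gamma[idx] ** 2 / se_b[idx] ** 2
    return np.sum(w * beta[idx] / gamma[idx]) / np.sum(w), 1.0 / np.sum(w)

theta_core, _ = ivw(core)                     # reference estimate from trusted set
alpha_hat = beta - theta_core * gamma         # plug-in invalidity estimates

# Greedy focused selection: grow the set in order of |alpha_hat| and keep
# the subset minimizing plug-in MSE = (bias vs. core estimate)^2 + variance.
order = list(core) + sorted(range(10, J), key=lambda j: abs(alpha_hat[j]))
best = None
for size in range(len(core), J + 1):
    est, var = ivw(np.array(order[:size]))
    mse = (est - theta_core) ** 2 + var
    if best is None or mse < best[0]:
        best = (mse, size, est)
print(f"kept {best[1]} of {J} instruments; estimate {best[2]:.3f}, core-only {theta_core:.3f}")
```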
{"title":"Selecting Invalid Instruments to Improve Mendelian Randomization with Two-Sample Summary Data.","authors":"Ashish Patel, Francis J DiTraglia, Verena Zuber, Stephen Burgess","doi":"10.1214/23-AOAS1856","DOIUrl":"10.1214/23-AOAS1856","url":null,"abstract":"<p><p>Mendelian randomization (MR) is a widely-used method to estimate the causal relationship between a risk factor and disease. A fundamental part of any MR analysis is to choose appropriate genetic variants as instrumental variables. Genome-wide association studies often reveal that hundreds of genetic variants may be robustly associated with a risk factor, but in some situations investigators may have greater confidence in the instrument validity of only a smaller subset of variants. Nevertheless, the use of additional instruments may be optimal from the perspective of mean squared error even if they are slightly invalid; a small bias in estimation may be a price worth paying for a larger reduction in variance. For this purpose, we consider a method for \"focused\" instrument selection whereby genetic variants are selected to minimise the estimated asymptotic mean squared error of causal effect estimates. In a setting of many weak and locally invalid instruments, we propose a novel strategy to construct confidence intervals for post-selection focused estimators that guards against the worst case loss in asymptotic coverage. In empirical applications to: (i) validate lipid drug targets; and (ii) investigate vitamin D effects on a wide range of outcomes, our findings suggest that the optimal selection of instruments does not involve only a small number of biologically-justified instruments, but also many potentially invalid instruments.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"18 2","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7615940/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140913351","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-03-01 | Epub Date: 2024-01-31 | DOI: 10.1214/23-aoas1792
Xinyuan Chen, Michael O Harhay, Guangyu Tong, Fan Li
Assessing heterogeneity in the effects of treatments has become increasingly popular in the field of causal inference and carries important implications for clinical decision-making. While extensive literature exists for studying treatment effect heterogeneity when outcomes are fully observed, there has been limited development in tools for estimating heterogeneous causal effects when patient-centered outcomes are truncated by a terminal event, such as death. Because mortality occurs during study follow-up, the outcomes of interest are unobservable, undefined, or not fully observed for many participants; in such cases, principal stratification is an appealing framework for drawing valid causal conclusions. Motivated by the Acute Respiratory Distress Syndrome Network (ARDSNetwork) ARDS respiratory management (ARMA) trial, we developed a flexible Bayesian machine learning approach to estimate the average causal effect and heterogeneous causal effects among the always-survivors stratum when clinical outcomes are subject to truncation. We adopted Bayesian additive regression trees (BART) to flexibly specify separate mean models for the potential outcomes and latent stratum membership. In the analysis of the ARMA trial, we found that the low tidal volume treatment had an overall benefit for participants sustaining acute lung injuries on the outcome of time to returning home but substantial heterogeneity in treatment effects among the always-survivors, driven most strongly by biologic sex and the alveolar-arterial oxygen gradient at baseline (a physiologic measure of lung function and degree of hypoxemia). These findings illustrate how the proposed methodology could guide the prognostic enrichment of future trials in the field.
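The survivor-stratum estimand can be illustrated without BART in a toy setting with a single binary covariate, assuming monotonicity (treatment never harms survival) and principal ignorability; the weighting identity used for E[Y(1) | always-survivor] is standard under those assumptions, but the data, covariate, and effect sizes below are simulated purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated toy data: binary covariate x, randomized treatment z, potential
# survival statuses satisfying monotonicity (s1 >= s0), and an outcome y
# observed only for survivors (truncation by death).
n = 20000
x = rng.integers(0, 2, n)
z = rng.integers(0, 2, n)
p_s0 = np.where(x == 1, 0.5, 0.8)             # P(survive | control, x)
p_s1 = p_s0 + 0.15                            # monotonicity: treatment helps
s0 = rng.random(n) < p_s0
s1 = s0 | (rng.random(n) < (p_s1 - p_s0) / (1 - p_s0))
s = np.where(z == 1, s1, s0)
y = np.where(s, 2.0 * z + x + rng.normal(0, 1, n), np.nan)

# Under monotonicity, control-arm survivors are all always-survivors, and
# under principal ignorability E[Y(1) | always-survivor] is a weighted mean
# over treated survivors with weights P(S=1 | Z=0, x) / P(S=1 | Z=1, x).
ps0_hat = np.array([s[(z == 0) & (x == v)].mean() for v in (0, 1)])
ps1_hat = np.array([s[(z == 1) & (x == v)].mean() for v in (0, 1)])
w = (ps0_hat / ps1_hat)[x]
treated_surv = (z == 1) & s
mu1 = np.average(y[treated_surv], weights=w[treated_surv])
mu0 = y[(z == 0) & s].mean()
print(f"estimated survivor average causal effect: {mu1 - mu0:.3f} (truth 2.0)")
```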
{"title":"A BAYESIAN MACHINE LEARNING APPROACH FOR ESTIMATING HETEROGENEOUS SURVIVOR CAUSAL EFFECTS: APPLICATIONS TO A CRITICAL CARE TRIAL.","authors":"Xinyuan Chen, Michael O Harhay, Guangyu Tong, Fan Li","doi":"10.1214/23-aoas1792","DOIUrl":"10.1214/23-aoas1792","url":null,"abstract":"<p><p>Assessing heterogeneity in the effects of treatments has become increasingly popular in the field of causal inference and carries important implications for clinical decision-making. While extensive literature exists for studying treatment effect heterogeneity when outcomes are fully observed, there has been limited development in tools for estimating heterogeneous causal effects when patient-centered outcomes are truncated by a terminal event, such as death. Due to mortality occurring during study follow-up, the outcomes of interest are unobservable, undefined, or not fully observed for many participants in which case principal stratification is an appealing framework to draw valid causal conclusions. Motivated by the Acute Respiratory Distress Syndrome Network (ARDSNetwork) ARDS respiratory management (ARMA) trial, we developed a flexible Bayesian machine learning approach to estimate the average causal effect and heterogeneous causal effects among the always-survivors stratum when clinical outcomes are subject to truncation. We adopted Bayesian additive regression trees (BART) to flexibly specify separate mean models for the potential outcomes and latent stratum membership. In the analysis of the ARMA trial, we found that the low tidal volume treatment had an overall benefit for participants sustaining acute lung injuries on the outcome of time to returning home but substantial heterogeneity in treatment effects among the always-survivors, driven most strongly by biologic sex and the alveolar-arterial oxygen gradient at baseline (a physiologic measure of lung function and degree of hypoxemia). These findings illustrate how the proposed methodology could guide the prognostic enrichment of future trials in the field.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"18 1","pages":"350-374"},"PeriodicalIF":1.8,"publicationDate":"2024-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10919396/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140061169","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-03-01 | Epub Date: 2024-01-31 | DOI: 10.1214/23-aoas1817
Alan J Aw, Jeffrey P Spence, Yun S Song
In scientific studies involving analyses of multivariate data, basic but important questions often arise for the researcher: Is the sample exchangeable, meaning that the joint distribution of the sample is invariant to the ordering of the units? Are the features independent of one another, or perhaps the features can be grouped so that the groups are mutually independent? In statistical genomics, these considerations are fundamental to downstream tasks such as demographic inference and the construction of polygenic risk scores. We propose a non-parametric approach, which we call the V test, to address these two questions, namely, a test of sample exchangeability given dependency structure of features, and a test of feature independence given sample exchangeability. Our test is conceptually simple, yet fast and flexible. It controls the Type I error across realistic scenarios, and handles data of arbitrary dimensions by leveraging large-sample asymptotics. Through extensive simulations and a comparison against unsupervised tests of stratification based on random matrix theory, we find that our test compares favorably in various scenarios of interest. We apply the test to data from the 1000 Genomes Project, demonstrating how it can be employed to assess exchangeability of the genetic sample, or find optimal linkage disequilibrium (LD) splits for downstream analysis. For exchangeability assessment, we find that removing rare variants can substantially increase the p-value of the test statistic. For optimal LD splitting, the V test reports different optimal splits than previous approaches not relying on hypothesis testing. Software for our methods is available in R (CRAN: flintyR) and Python (PyPI: flintyPy).
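In the simplest regime of independent binary features, the flavor of such a test can be sketched with a statistic based on the variance of pairwise Hamming distances, calibrated by permuting each feature column independently (which enforces exchangeability while preserving feature marginals). This permutation version is a hypothetical simplification; the V test itself accommodates dependent features and uses large-sample asymptotics rather than permutation.

```python
import numpy as np

rng = np.random.default_rng(4)

def pairwise_hamming_var(X):
    """Variance of pairwise Hamming distances between the rows of X."""
    n = X.shape[0]
    d = [(X[i] != X[j]).sum() for i in range(n) for j in range(i + 1, n)]
    return np.var(d)

def exchangeability_pvalue(X, n_perm=300):
    """One-sided permutation p-value; valid when features are independent."""
    observed = pairwise_hamming_var(X)
    null = np.empty(n_perm)
    for b in range(n_perm):
        # Permute each column independently to enforce exchangeability.
        Xp = np.column_stack([rng.permutation(col) for col in X.T])
        null[b] = pairwise_hamming_var(Xp)
    return (1 + np.sum(null >= observed)) / (1 + n_perm)

# Exchangeable sample: i.i.d. rows, so a large p-value is expected.
X_exch = (rng.random((40, 300)) < 0.3).astype(int)
# Stratified sample: two subpopulations with different allele frequencies,
# which inflates the variance of pairwise distances.
freq = np.r_[np.full(20, 0.15), np.full(20, 0.45)]
X_strat = (rng.random((40, 300)) < freq[:, None]).astype(int)

print("exchangeable p-value:", exchangeability_pvalue(X_exch))
print("stratified p-value:  ", exchangeability_pvalue(X_strat))
```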
{"title":"A SIMPLE AND FLEXIBLE TEST OF SAMPLE EXCHANGEABILITY WITH APPLICATIONS TO STATISTICAL GENOMICS.","authors":"Alan J Aw, Jeffrey P Spence, Yun S Song","doi":"10.1214/23-aoas1817","DOIUrl":"10.1214/23-aoas1817","url":null,"abstract":"<p><p>In scientific studies involving analyses of multivariate data, basic but important questions often arise for the researcher: Is the sample exchangeable, meaning that the joint distribution of the sample is invariant to the ordering of the units? Are the features independent of one another, or perhaps the features can be grouped so that the groups are mutually independent? In statistical genomics, these considerations are fundamental to downstream tasks such as demographic inference and the construction of polygenic risk scores. We propose a non-parametric approach, which we call the V test, to address these two questions, namely, a test of sample exchangeability given dependency structure of features, and a test of feature independence given sample exchangeability. Our test is conceptually simple, yet fast and flexible. It controls the Type I error across realistic scenarios, and handles data of arbitrary dimensions by leveraging large-sample asymptotics. Through extensive simulations and a comparison against unsupervised tests of stratification based on random matrix theory, we find that our test compares favorably in various scenarios of interest. We apply the test to data from the 1000 Genomes Project, demonstrating how it can be employed to assess exchangeability of the genetic sample, or find optimal linkage disequilibrium (LD) splits for downstream analysis. For exchangeability assessment, we find that removing rare variants can substantially increase the <math><mi>p</mi></math>-value of the test statistic. For optimal LD splitting, the V test reports different optimal splits than previous approaches not relying on hypothesis testing. Software for our methods is available in R (CRAN: flintyR) and Python (PyPI: flintyPy).</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"18 1","pages":"858-881"},"PeriodicalIF":1.8,"publicationDate":"2024-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11115382/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141089297","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-03-01 | Epub Date: 2024-01-31 | DOI: 10.1214/23-aoas1782
Yiwen Zhang, Ran Dai, Ying Huang, Ross Prentice, Cheng Zheng
Systematic measurement error in self-reported data creates important challenges in association studies between dietary intakes and chronic disease risks, especially when multiple dietary components are studied jointly. The joint regression calibration method has been developed for measurement error correction when objectively measured biomarkers are available for all dietary components of interest. Unfortunately, objectively measured biomarkers are only available for very few dietary components, which limits the application of the joint regression calibration method. Recently, for single dietary components, controlled feeding studies have been performed to develop new biomarkers for many more dietary components. However, it is unclear whether the biomarkers separately developed for single dietary components are valid for joint calibration. In this paper, we show that biomarkers developed for single dietary components cannot be used for joint regression calibration. We propose new methods to utilize controlled feeding studies to develop valid biomarkers for joint regression calibration, allowing the associations of multiple dietary components with the disease of interest to be estimated simultaneously. Asymptotic distribution theory for the proposed estimators is derived. Extensive simulations are performed to study the finite sample performance of the proposed estimators. We apply our methods to examine the joint effects of sodium and potassium intakes on cardiovascular disease incidence using the Women's Health Initiative cohort data. We identify positive associations between sodium intake and cardiovascular diseases as well as negative associations between potassium intake and cardiovascular disease.
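Classical joint regression calibration, the building block being extended here, can be sketched as follows: in a calibration substudy where biomarkers are observed, regress each biomarker on both self-reports, then substitute the calibrated predictions into the disease model. The toy data, error variances, and effect sizes below are invented; the sketch shows the attenuation of naive estimates and its approximate correction.

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated cohort: two correlated true exposures (say sodium, potassium),
# error-prone self-reports for everyone, biomarkers only in a calibration
# substudy, and a binary disease outcome driven by the true exposures.
n, n_cal = 5000, 800
true = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], n)
self_report = 0.6 * true + rng.normal(0, 0.8, (n, 2))     # systematic error
disease = rng.random(n) < 1 / (1 + np.exp(2 - 0.4 * true[:, 0] + 0.3 * true[:, 1]))
cal = np.arange(n_cal)
biomarker = true[cal] + rng.normal(0, 0.3, (n_cal, 2))    # modest random error

# Joint calibration: regress each biomarker on BOTH self-reports in the
# substudy, then predict calibrated exposures for the whole cohort.
Z = np.column_stack([np.ones(n), self_report])
coef = np.linalg.lstsq(Z[cal], biomarker, rcond=None)[0]
x_hat = Z @ coef

def logistic_fit(X, y, iters=25):
    """Logistic regression slopes via Newton-Raphson (intercept included)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    b = np.zeros(X1.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-X1 @ b))
        b += np.linalg.solve(X1.T * (p * (1 - p)) @ X1, X1.T @ (y - p))
    return b[1:]

print("truth:      [ 0.4 -0.3 ]")
print("naive:     ", logistic_fit(self_report, disease).round(2))
print("calibrated:", logistic_fit(x_hat, disease).round(2))
```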
{"title":"USING SIMULTANEOUS REGRESSION CALIBRATION TO STUDY THE EFFECT OF MULTIPLE ERROR-PRONE EXPOSURES ON DISEASE RISK UTILIZING BIOMARKERS DEVELOPED FROM A CONTROLLED FEEDING STUDY.","authors":"Yiwen Zhang, Ran Dai, Ying Huang, Ross Prentice, Cheng Zheng","doi":"10.1214/23-aoas1782","DOIUrl":"10.1214/23-aoas1782","url":null,"abstract":"<p><p>Systematic measurement error in self-reported data creates important challenges in association studies between dietary intakes and chronic disease risks, especially when multiple dietary components are studied jointly. The joint regression calibration method has been developed for measurement error correction when objectively measured biomarkers are available for all dietary components of interest. Unfortunately, objectively measured biomarkers are only available for very few dietary components, which limits the application of the joint regression calibration method. Recently, for single dietary components, controlled feeding studies have been performed to develop new biomarkers for many more dietary components. However, it is unclear whether the biomarkers separately developed for single dietary components are valid for joint calibration. In this paper, we show that biomarkers developed for single dietary components cannot be used for joint regression calibration. We propose new methods to utilize controlled feeding studies to develop valid biomarkers for joint regression calibration to estimate the association between multiple dietary components simultaneously with the disease of interest. Asymptotic distribution theory for the proposed estimators is derived. Extensive simulations are performed to study the finite sample performance of the proposed estimators. We apply our methods to examine the joint effects of sodium and potassium intakes on cardiovascular disease incidence using the Women's Health Initiative cohort data. We identify positive associations between sodium intake and cardiovascular diseases as well as negative associations between potassium intake and cardiovascular disease.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"18 1","pages":"125-143"},"PeriodicalIF":1.3,"publicationDate":"2024-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10836829/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139681864","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-03-01 | Epub Date: 2024-01-31 | DOI: 10.1214/23-aoas1797
Zikai Lin, Yajuan Si, Jian Kang
Image-on-scalar regression has been a popular approach to modeling the association between brain activities and scalar characteristics in neuroimaging research. The associations could be heterogeneous across individuals in the population, as indicated by recent large-scale neuroimaging studies, for example, the Adolescent Brain Cognitive Development (ABCD) Study. The ABCD data can inform our understanding of heterogeneous associations and how to leverage the heterogeneity and tailor interventions to increase the number of youths who benefit. It is of great interest to identify subgroups of individuals from the population such that: (1) within each subgroup the brain activities have homogeneous associations with the clinical measures; (2) across subgroups the associations are heterogeneous, and (3) the group allocation depends on individual characteristics. Existing image-on-scalar regression methods and clustering methods cannot directly achieve this goal. We propose a latent subgroup image-on-scalar regression model (LASIR) to analyze large-scale, multisite neuroimaging data with diverse sociodemographics. LASIR introduces the latent subgroup for each individual and group-specific, spatially varying effects, with an efficient stochastic expectation maximization algorithm for inferences. We demonstrate that LASIR outperforms existing alternatives for subgroup identification of brain activation patterns with functional magnetic resonance imaging data via comprehensive simulations and applications to the ABCD study. We have released our reproducible codes for public use with the software package available on GitHub.
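A scalar toy version of the latent-subgroup mechanics, fitted by stochastic EM, is sketched below; LASIR's actual effects are spatially varying image coefficients, so this simplification (two groups, one scalar covariate, simulated data) only illustrates the alternation between sampling latent labels and refitting group-specific models.

```python
import numpy as np

rng = np.random.default_rng(6)

# Each subject belongs to one of two latent groups whose outcome-covariate
# slopes differ; membership depends on a subject characteristic w.
n = 3000
w = rng.normal(size=n)
g_true = (rng.random(n) < 1 / (1 + np.exp(-1.5 * w))).astype(int)
x = rng.normal(size=n)
slopes = np.array([0.2, 1.0])                  # group-specific effects (truth)
y = slopes[g_true] * x + rng.normal(0, 0.5, n)

def fit_membership(wv, g, iters=10):
    """Logistic regression of sampled labels g on w via Newton steps."""
    W1 = np.column_stack([np.ones(len(g)), wv])
    b = np.zeros(2)
    for _ in range(iters):
        p = 1 / (1 + np.exp(-W1 @ b))
        b += np.linalg.solve(W1.T * (p * (1 - p)) @ W1, W1.T @ (g - p))
    return 1 / (1 + np.exp(-W1 @ b))

beta, sigma, pi = np.array([0.0, 0.5]), 1.0, np.full(n, 0.5)
for it in range(40):
    # stochastic E-step: sample each label from its current posterior
    l1 = pi * np.exp(-(y - beta[1] * x) ** 2 / (2 * sigma**2))
    l0 = (1 - pi) * np.exp(-(y - beta[0] * x) ** 2 / (2 * sigma**2))
    g = (rng.random(n) < l1 / (l0 + l1)).astype(int)
    # M-step: group-specific slopes, noise scale, and membership model
    for k in (0, 1):
        beta[k] = x[g == k] @ y[g == k] / (x[g == k] @ x[g == k])
    sigma = np.sqrt(np.mean((y - beta[g] * x) ** 2))
    pi = fit_membership(w, g)

print("estimated group slopes:", beta.round(2), "(truth [0.2 1.0])")
```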
{"title":"LATENT SUBGROUP IDENTIFICATION IN IMAGE-ON-SCALAR REGRESSION.","authors":"Zikai Lin, Yajuan Si, Jian Kang","doi":"10.1214/23-aoas1797","DOIUrl":"10.1214/23-aoas1797","url":null,"abstract":"<p><p>Image-on-scalar regression has been a popular approach to modeling the association between brain activities and scalar characteristics in neuroimaging research. The associations could be heterogeneous across individuals in the population, as indicated by recent large-scale neuroimaging studies, for example, the Adolescent Brain Cognitive Development (ABCD) Study. The ABCD data can inform our understanding of heterogeneous associations and how to leverage the heterogeneity and tailor interventions to increase the number of youths who benefit. It is of great interest to identify subgroups of individuals from the population such that: (1) within each subgroup the brain activities have homogeneous associations with the clinical measures; (2) across subgroups the associations are heterogeneous, and (3) the group allocation depends on individual characteristics. Existing image-on-scalar regression methods and clustering methods cannot directly achieve this goal. We propose a latent subgroup image-on-scalar regression model (LASIR) to analyze large-scale, multisite neuroimaging data with diverse sociode-mographics. LASIR introduces the latent subgroup for each individual and group-specific, spatially varying effects, with an efficient stochastic expectation maximization algorithm for inferences. We demonstrate that LASIR outperforms existing alternatives for subgroup identification of brain activation patterns with functional magnetic resonance imaging data via comprehensive simulations and applications to the ABCD study. We have released our reproducible codes for public use with the software package available on Github.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"18 1","pages":"468-486"},"PeriodicalIF":1.8,"publicationDate":"2024-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11156244/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141285216","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-03-01 | Epub Date: 2024-01-31 | DOI: 10.1214/23-aoas1791
Zeda Li, Yu Ryan Yue, Scott A Bruce
We propose a novel analysis of power (ANOPOW) model for analyzing replicated nonstationary time series commonly encountered in experimental studies. Based on a locally stationary ANOPOW Cramér spectral representation, the proposed model can be used to compare the second-order time-varying frequency patterns among different groups of time series and to estimate group effects as functions of both time and frequency. Formulated in a Bayesian framework, independent two-dimensional second-order random walk (RW2D) priors are assumed on each of the time-varying functional effects for flexible and adaptive smoothing. A piecewise stationary approximation of the nonstationary time series is used to obtain localized estimates of time-varying spectra. Posterior distributions of the time-varying functional group effects are then obtained via integrated nested Laplace approximations (INLA) at a low computational cost. The large-sample distribution of local periodograms can be appropriately utilized to improve estimation accuracy since INLA allows modeling of data with various types of distributions. The usefulness of the proposed model is illustrated through two real data applications: analyses of seismic signals and pupil diameter time series in children with attention deficit hyperactivity disorder. Simulation studies, Supplementary Materials (Li, Yue and Bruce, 2023a), and R code (Li, Yue and Bruce, 2023b) for this article are also available.
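The piecewise-stationary first stage can be sketched directly: split each replicated series into time blocks, take local periodograms, and compare group-mean log-spectra as functions of time and frequency. The RW2D smoothing and INLA inference are omitted here, and the toy data (a group whose oscillation amplitude grows over time) are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

# Two groups of replicated series; group B's oscillation at frequency 0.1
# grows in amplitude over time, so its local spectra change across blocks.
n_rep, T, n_blocks = 20, 1024, 8
block = T // n_blocks
t = np.arange(T)

def make_group(max_amp):
    amp = max_amp * t / T                        # linearly growing amplitude
    return np.array([amp * np.sin(2 * np.pi * 0.1 * t) + rng.normal(0, 1, T)
                     for _ in range(n_rep)])

groups = {"A": make_group(0.5), "B": make_group(2.0)}

def local_log_periodograms(X):
    """Log-periodograms per (replicate, time block, frequency)."""
    segments = X.reshape(X.shape[0], n_blocks, block)
    pgram = np.abs(np.fft.rfft(segments, axis=-1)) ** 2 / block
    return np.log(pgram + 1e-12)

freqs = np.fft.rfftfreq(block)
f_idx = np.argmin(np.abs(freqs - 0.1))           # frequency bin nearest 0.1
for name, X in groups.items():
    mean_spec = local_log_periodograms(X).mean(axis=0)   # blocks x freqs
    print(name, "mean log-power near f=0.1, by time block:",
          mean_spec[:, f_idx].round(1))
```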
{"title":"ANOPOW FOR REPLICATED NONSTATIONARY TIME SERIES IN EXPERIMENTS.","authors":"Zeda Li, Yu Ryan Yue, Scott A Bruce","doi":"10.1214/23-aoas1791","DOIUrl":"10.1214/23-aoas1791","url":null,"abstract":"<p><p>We propose a novel analysis of power (ANOPOW) model for analyzing replicated nonstationary time series commonly encountered in experimental studies. Based on a locally stationary ANOPOW Cramér spectral representation, the proposed model can be used to compare the second-order time-varying frequency patterns among different groups of time series and to estimate group effects as functions of both time and frequency. Formulated in a Bayesian framework, independent two-dimensional second-order random walk (RW2D) priors are assumed on each of the time-varying functional effects for flexible and adaptive smoothing. A piecewise stationary approximation of the nonstationary time series is used to obtain localized estimates of time-varying spectra. Posterior distributions of the time-varying functional group effects are then obtained via integrated nested Laplace approximations (INLA) at a low computational cost. The large-sample distribution of local periodograms can be appropriately utilized to improve estimation accuracy since INLA allows modeling of data with various types of distributions. The usefulness of the proposed model is illustrated through two real data applications: analyses of seismic signals and pupil diameter time series in children with attention deficit hyperactivity disorder. Simulation studies, Supplementary Materials (Li, Yue and Bruce, 2023a), and R code (Li, Yue and Bruce, 2023b) for this article are also available.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"18 1","pages":"328-349"},"PeriodicalIF":1.8,"publicationDate":"2024-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10906746/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140023131","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-03-01 | Epub Date: 2024-01-31 | DOI: 10.1214/23-aoas1813
J Brandon Carter, Christopher R Browning, Bethany Boettner, Nicolo Pinchak, Catherine A Calder
Collective efficacy, the capacity of communities to exert social control toward the realization of their shared goals, is a foundational concept in the urban sociology and neighborhood effects literature. Traditionally, empirical studies of collective efficacy use large sample surveys to estimate collective efficacy of different neighborhoods within an urban setting. Such studies have demonstrated an association between collective efficacy and local variation in community violence, educational achievement, and health. Unlike traditional collective efficacy measurement strategies, the Adolescent Health and Development in Context (AHDC) Study implemented a new approach, obtaining spatially-referenced, place-based ratings of collective efficacy from a representative sample of individuals residing in Columbus, OH. In this paper we introduce a novel nonstationary spatial model for interpolation of the AHDC collective efficacy ratings across the study area, which leverages administrative data on land use. Our constructive model specification strategy involves dimension expansion of a latent spatial process and the use of a filter defined by the land-use partition of the study region to connect the latent multivariate spatial process to the observed ordinal ratings of collective efficacy. Careful consideration is given to the issues of parameter identifiability, computational efficiency of an MCMC algorithm for model fitting, and fine-scale spatial prediction of collective efficacy.
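A generative sketch of the land-use filtering construction, under assumed covariances and cutpoints: a K-dimensional latent spatial process carries one component per land-use class, the filter selects at each rated location the component of that parcel's class, and ordinal ratings arise by thresholding. The paper's actual model, MCMC fitting, and identifiability constraints are more involved; everything below (classes, kernels, cutpoints) is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)

# K latent components (e.g., residential/commercial/park -- assumed labels)
# over n_loc rated locations with known land-use classes.
n_loc, K = 200, 3
coords = rng.uniform(0, 10, (n_loc, 2))
land_use = rng.integers(0, K, n_loc)

# Multivariate Gaussian process via a separable covariance: exponential
# decay in space, a K x K cross-covariance across land-use components.
D = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
spatial = np.exp(-D / 2.0)
cross = 0.3 + 0.7 * np.eye(K)                # components correlated across classes
cov = np.kron(cross, spatial)
Z = rng.multivariate_normal(np.zeros(K * n_loc), cov).reshape(K, n_loc)

# Land-use filter: at each location, keep the component of its own class.
latent = Z[land_use, np.arange(n_loc)]

# Ordinal collective-efficacy ratings on a 1..4 scale via fixed cutpoints.
cuts = np.array([-1.0, 0.0, 1.0])
rating = 1 + np.searchsorted(cuts, latent + rng.normal(0, 0.5, n_loc))
print("rating counts (1..4) by land-use class:")
for k in range(K):
    print(k, np.bincount(rating[land_use == k], minlength=5)[1:])
```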
{"title":"LAND-USE FILTERING FOR NONSTATIONARY SPATIAL PREDICTION OF COLLECTIVE EFFICACY IN AN URBAN ENVIRONMENT.","authors":"J Brandon Carter, Christopher R Browning, Bethany Boettner, Nicolo Pinchak, Catherine A Calder","doi":"10.1214/23-aoas1813","DOIUrl":"10.1214/23-aoas1813","url":null,"abstract":"<p><p>Collective efficacy-the capacity of communities to exert social control toward the realization of their shared goals-is a foundational concept in the urban sociology and neighborhood effects literature. Traditionally, empirical studies of collective efficacy use large sample surveys to estimate collective efficacy of different neighborhoods within an urban setting. Such studies have demonstrated an association between collective efficacy and local variation in community violence, educational achievement, and health. Unlike traditional collective efficacy measurement strategies, the Adolescent Health and Development in Context (AHDC) Study implemented a new approach, obtaining spatially-referenced, place-based ratings of collective efficacy from a representative sample of individuals residing in Columbus, OH. In this paper we introduce a novel nonstationary spatial model for interpolation of the AHDC collective efficacy ratings across the study area, which leverages administrative data on land use. Our constructive model specification strategy involves dimension expansion of a latent spatial process and the use of a filter defined by the land-use partition of the study region to connect the latent multivariate spatial process to the observed ordinal ratings of collective efficacy. Careful consideration is given to the issues of parameter identifiability, computational efficiency of an MCMC algorithm for model fitting, and fine-scale spatial prediction of collective efficacy.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"18 1","pages":"794-818"},"PeriodicalIF":1.8,"publicationDate":"2024-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11146085/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141238803","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-03-01 | Epub Date: 2024-01-31 | DOI: 10.1214/23-aoas1809
Nicholas Hartman, Joseph M Messana, Jian Kang, Abhijit S Naik, Tempie H Shearon, Kevin He
Risk-adjusted quality measures are used to evaluate healthcare providers with respect to national norms while controlling for factors beyond their control. Existing healthcare provider profiling approaches typically assume that the between-provider variation in these measures is entirely due to meaningful differences in quality of care. However, in practice, much of the between-provider variation will be due to trivial fluctuations in healthcare quality, or unobservable confounding risk factors. If these additional sources of variation are not accounted for, conventional methods will disproportionately identify larger providers as outliers, even though their departures from the national norms may not be "extreme" or clinically meaningful. Motivated by efforts to evaluate the quality of care provided by transplant centers, we develop a composite evaluation score based on a novel individualized empirical null method, which robustly accounts for overdispersion due to unobserved risk factors, models the marginal variance of standardized scores as a function of the effective sample size, and only requires the use of publicly-available center-level statistics. The evaluations of United States kidney transplant centers based on the proposed composite score are substantially different from those based on conventional methods. Simulations show that the proposed empirical null approach more accurately classifies centers in terms of quality of care, compared to existing methods.
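A simplified stand-in for the individualized empirical null can be sketched by modeling the marginal variance of standardized scores as Var(Z_i) = 1 + τ²·n_i, estimating τ² by method of moments on a trimmed (presumed-null) core, and re-standardizing; the variance form, trimming fraction, and simulated data below are assumptions, not the paper's exact estimator.

```python
import numpy as np

rng = np.random.default_rng(9)

# Standardized center scores are overdispersed, with overdispersion that
# grows with effective sample size n_i (unobserved confounding); 5% of
# centers are truly poor performers.
n_centers = 500
n_eff = rng.uniform(5, 400, n_centers)
tau2_true = 0.004
z = rng.normal(0, np.sqrt(1 + tau2_true * n_eff))
outliers = rng.random(n_centers) < 0.05
z[outliers] += 4.0

# Method of moments on the central 90% (trim plausible outliers first):
# E[z^2] = 1 + tau2 * n, so regress z^2 - 1 on n through the origin.
core = np.abs(z) < np.quantile(np.abs(z), 0.90)
tau2_hat = max(0.0, (z[core] ** 2 - 1) @ n_eff[core] / (n_eff[core] ** 2).sum())

z_adj = z / np.sqrt(1 + tau2_hat * n_eff)       # individualized null sd
flag_naive = np.abs(z) > 1.96
flag_adj = np.abs(z_adj) > 1.96
print(f"tau2_hat = {tau2_hat:.4f} (truth {tau2_true})")
print("flagged under naive N(0,1) null:", flag_naive.sum(),
      "- of which large centers (n_eff > 200):", (flag_naive & (n_eff > 200)).sum())
print("flagged under individualized null:", flag_adj.sum(),
      "- truly poor performers flagged:", (flag_adj & outliers).sum())
```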
{"title":"COMPOSITE SCORES FOR TRANSPLANT CENTER EVALUATION: A NEW INDIVIDUALIZED EMPIRICAL NULL METHOD.","authors":"Nicholas Hartman, Joseph M Messana, Jian Kang, Abhijit S Naik, Tempie H Shearon, Kevin He","doi":"10.1214/23-aoas1809","DOIUrl":"10.1214/23-aoas1809","url":null,"abstract":"<p><p>Risk-adjusted quality measures are used to evaluate healthcare providers with respect to national norms while controlling for factors beyond their control. Existing healthcare provider profiling approaches typically assume that the between-provider variation in these measures is entirely due to meaningful differences in quality of care. However, in practice, much of the between-provider variation will be due to trivial fluctuations in healthcare quality, or unobservable confounding risk factors. If these additional sources of variation are not accounted for, conventional methods will disproportionately identify larger providers as outliers, even though their departures from the national norms may not be \"extreme\" or clinically meaningful. Motivated by efforts to evaluate the quality of care provided by transplant centers, we develop a composite evaluation score based on a novel individualized empirical null method, which robustly accounts for overdispersion due to unobserved risk factors, models the marginal variance of standardized scores as a function of the effective sample size, and only requires the use of publicly-available center-level statistics. The evaluations of United States kidney transplant centers based on the proposed composite score are substantially different from those based on conventional methods. Simulations show that the proposed empirical null approach more accurately classifies centers in terms of quality of care, compared to existing methods.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"18 1","pages":"729-748"},"PeriodicalIF":1.3,"publicationDate":"2024-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11395314/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142300086","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}