Pub Date: 2024-03-01 | Epub Date: 2024-01-31 | DOI: 10.1214/23-aoas1791
Zeda Li, Yu Ryan Yue, Scott A Bruce
We propose a novel analysis of power (ANOPOW) model for analyzing replicated nonstationary time series commonly encountered in experimental studies. Based on a locally stationary ANOPOW Cramér spectral representation, the proposed model can be used to compare the second-order time-varying frequency patterns among different groups of time series and to estimate group effects as functions of both time and frequency. Formulated in a Bayesian framework, independent two-dimensional second-order random walk (RW2D) priors are assumed on each of the time-varying functional effects for flexible and adaptive smoothing. A piecewise stationary approximation of the nonstationary time series is used to obtain localized estimates of time-varying spectra. Posterior distributions of the time-varying functional group effects are then obtained via integrated nested Laplace approximations (INLA) at a low computational cost. The large-sample distribution of local periodograms can be appropriately utilized to improve estimation accuracy since INLA allows modeling of data with various types of distributions. The usefulness of the proposed model is illustrated through two real data applications: analyses of seismic signals and pupil diameter time series in children with attention deficit hyperactivity disorder. Simulation studies, Supplementary Materials (Li, Yue and Bruce, 2023a), and R code (Li, Yue and Bruce, 2023b) for this article are also available.
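The piecewise stationary approximation mentioned in the abstract can be illustrated with a minimal sketch: split a series into non-overlapping blocks and compute one periodogram per block. This is not the authors' R code; the equal-length blocking, the function names `periodogram` and `local_periodograms`, and the toy series are assumptions made for illustration. A standard large-sample result is that periodogram ordinates are approximately exponentially distributed around the true spectrum, which is the kind of non-Gaussian likelihood the abstract says INLA can accommodate.

```python
import cmath
import math


def periodogram(segment):
    """Periodogram of one segment at Fourier frequencies k/n, k = 1..n//2."""
    n = len(segment)
    mean = sum(segment) / n
    centered = [x - mean for x in segment]
    values = []
    for k in range(1, n // 2 + 1):
        # Discrete Fourier transform of the centered segment at frequency k/n.
        dft = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(centered))
        values.append(abs(dft) ** 2 / n)
    return values


def local_periodograms(series, n_blocks):
    """Split the series into equal-length, non-overlapping blocks (an assumed
    blocking scheme) and return one periodogram per block."""
    block_len = len(series) // n_blocks
    return [periodogram(series[b * block_len:(b + 1) * block_len])
            for b in range(n_blocks)]


# Toy nonstationary series: the frequency jumps from 0.1 to 0.3 cycles per sample.
series = ([math.cos(2 * math.pi * 0.1 * t) for t in range(100)]
          + [math.cos(2 * math.pi * 0.3 * t) for t in range(100)])
blocks = local_periodograms(series, n_blocks=2)

# Frequency (k/n) of the largest ordinate in each block.
peak_freqs = [(max(range(len(p)), key=p.__getitem__) + 1) / 100 for p in blocks]
```

In this toy case the peak frequency differs between the two blocks, which is the time-varying spectral behavior a single whole-series periodogram would blur together.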
ANOPOW for Replicated Nonstationary Time Series in Experiments. Annals of Applied Statistics 18(1): 328-349. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10906746/pdf/
Pub Date: 2024-03-01 | Epub Date: 2024-01-31 | DOI: 10.1214/23-aoas1813
J Brandon Carter, Christopher R Browning, Bethany Boettner, Nicolo Pinchak, Catherine A Calder
Collective efficacy, the capacity of communities to exert social control toward the realization of their shared goals, is a foundational concept in the urban sociology and neighborhood effects literature. Traditionally, empirical studies of collective efficacy use large-sample surveys to estimate the collective efficacy of different neighborhoods within an urban setting. Such studies have demonstrated an association between collective efficacy and local variation in community violence, educational achievement, and health. Unlike traditional collective efficacy measurement strategies, the Adolescent Health and Development in Context (AHDC) Study implemented a new approach, obtaining spatially referenced, place-based ratings of collective efficacy from a representative sample of individuals residing in Columbus, OH. In this paper we introduce a novel nonstationary spatial model for interpolation of the AHDC collective efficacy ratings across the study area, which leverages administrative data on land use. Our constructive model specification strategy involves dimension expansion of a latent spatial process and the use of a filter defined by the land-use partition of the study region to connect the latent multivariate spatial process to the observed ordinal ratings of collective efficacy. Careful consideration is given to the issues of parameter identifiability, computational efficiency of an MCMC algorithm for model fitting, and fine-scale spatial prediction of collective efficacy.
Land-Use Filtering for Nonstationary Spatial Prediction of Collective Efficacy in an Urban Environment. Annals of Applied Statistics 18(1): 794-818. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11146085/pdf/
Pub Date: 2024-03-01 | Epub Date: 2024-01-31 | DOI: 10.1214/23-aoas1809
Nicholas Hartman, Joseph M Messana, Jian Kang, Abhijit S Naik, Tempie H Shearon, Kevin He
Risk-adjusted quality measures are used to evaluate healthcare providers with respect to national norms while controlling for factors beyond their control. Existing healthcare provider profiling approaches typically assume that the between-provider variation in these measures is entirely due to meaningful differences in quality of care. However, in practice, much of the between-provider variation will be due to trivial fluctuations in healthcare quality, or unobservable confounding risk factors. If these additional sources of variation are not accounted for, conventional methods will disproportionately identify larger providers as outliers, even though their departures from the national norms may not be "extreme" or clinically meaningful. Motivated by efforts to evaluate the quality of care provided by transplant centers, we develop a composite evaluation score based on a novel individualized empirical null method, which robustly accounts for overdispersion due to unobserved risk factors, models the marginal variance of standardized scores as a function of the effective sample size, and only requires the use of publicly-available center-level statistics. The evaluations of United States kidney transplant centers based on the proposed composite score are substantially different from those based on conventional methods. Simulations show that the proposed empirical null approach more accurately classifies centers in terms of quality of care, compared to existing methods.
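To make the overdispersion problem concrete, here is a minimal sketch of a classical robust empirical null: the null center and spread are estimated from the bulk of the standardized scores (median and rescaled MAD) instead of being fixed at 0 and 1. This is a simplified stand-in, not the individualized method of the abstract, which additionally models the variance as a function of effective sample size. The function names, the cutoff of 1.96, and the z-scores are hypothetical.

```python
import statistics


def empirical_null(z_scores):
    """Estimate the null center and spread robustly from the bulk of the scores."""
    center = statistics.median(z_scores)
    mad = statistics.median([abs(z - center) for z in z_scores])
    # 1.4826 rescales the MAD to a standard deviation under normality.
    return center, 1.4826 * mad


def flag_outliers(z_scores, cutoff=1.96):
    """Flag scores that are extreme relative to the estimated (not theoretical) null."""
    center, spread = empirical_null(z_scores)
    return [abs(z - center) / spread > cutoff for z in z_scores]


# Hypothetical overdispersed center-level z-scores with one clearly extreme center.
z = [-3.1, -2.2, -1.5, -0.8, -0.2, 0.3, 0.9, 1.6, 2.4, 9.0]

flags_empirical = flag_outliers(z)
# Conventional flagging against the theoretical N(0, 1) null, for comparison.
flags_theoretical = [abs(v) > 1.96 for v in z]
```

With these made-up scores the theoretical null flags several centers, while the empirical null, whose spread absorbs the overdispersion, flags only the most extreme one.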
Composite Scores for Transplant Center Evaluation: A New Individualized Empirical Null Method. Annals of Applied Statistics 18(1): 729-748. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11395314/pdf/
Pub Date: 2024-03-01 | Epub Date: 2024-01-31 | DOI: 10.1214/23-aoas1798
Gang Xu, Amei Amei, Weimiao Wu, Yunqing Liu, Linchuan Shen, Edwin C Oh, Zuoheng Wang
Many genetic studies contain rich information on longitudinal phenotypes that require powerful analytical tools for optimal analysis. Genetic analysis of longitudinal data that incorporates temporal variation is important for understanding the genetic architecture and biological variation of complex diseases. Most of the existing methods assume that the contribution of genetic variants is constant over time and fail to capture the dynamic pattern of disease progression. However, the relative influence of genetic variants on complex traits fluctuates over time. In this study, we propose a retrospective varying coefficient mixed model association test, RVMMAT, to detect time-varying genetic effects on longitudinal binary traits. We model dynamic genetic effects using smoothing splines, estimate model parameters by maximizing a double penalized quasi-likelihood function, design a joint test using a Cauchy combination method, and evaluate statistical significance via a retrospective approach to achieve robustness to model misspecification. Through simulations we illustrated that the retrospective varying-coefficient test was robust to model misspecification under different ascertainment schemes and gained power over association methods that assume a constant genetic effect. We applied RVMMAT to a genome-wide association analysis of a longitudinal measure of hypertension in the Multi-Ethnic Study of Atherosclerosis. Pathway analysis identified two important pathways related to G-protein signaling and DNA damage. Our results demonstrated that RVMMAT could detect biologically relevant loci and pathways in a genome scan and provided insight into the genetic architecture of hypertension.
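The Cauchy combination step named in the abstract has a simple closed form, sketched below in its generic version: each p-value is mapped to a standard Cauchy variate, the variates are averaged with non-negative weights summing to one, and the combined p-value is the upper-tail Cauchy probability of that average. Which component tests RVMMAT combines and how it weights them is not stated here, so the equal weights and the function name `cauchy_combination` are assumptions.

```python
import math


def cauchy_combination(p_values, weights=None):
    """Combine p-values with the Cauchy combination test.

    Weights are assumed non-negative and summing to one; equal weights by default.
    """
    if weights is None:
        weights = [1.0 / len(p_values)] * len(p_values)
    # Map each p-value to a standard Cauchy variate; small p gives a large value.
    statistic = sum(w * math.tan((0.5 - p) * math.pi)
                    for w, p in zip(weights, p_values))
    # Upper-tail probability of a standard Cauchy distribution at the statistic.
    return 0.5 - math.atan(statistic) / math.pi


# Hypothetical p-values from two component tests of the same variant.
combined = cauchy_combination([0.01, 0.8])
```

A practical appeal of this combination is that the Cauchy tail approximation remains accurate when the component tests are correlated, so no explicit estimate of their dependence is needed.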
Retrospective Varying Coefficient Association Analysis of Longitudinal Binary Traits: Application to the Identification of Genetic Loci Associated with Hypertension. Annals of Applied Statistics 18(1): 487-505. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10994004/pdf/
Pub Date: 2023-12-01 | Epub Date: 2023-10-30 | DOI: 10.1214/23-aoas1758
Jingjing Zou, Tuo Lin, Chongzhi Di, John Bellettiere, Marta M Jankowska, Sheri J Hartman, Dorothy D Sears, Andrea Z LaCroix, Cheryl L Rock, Loki Natarajan
Physical activity (PA) is significantly associated with many health outcomes. The wide usage of wearable accelerometer-based activity trackers in recent years has provided a unique opportunity for in-depth research on PA and its relations with health outcomes and interventions. Past analysis of activity tracker data relies heavily on aggregating minute-level PA records into day-level summary statistics, which discards important information on temporal and diurnal PA patterns. In this paper we propose a novel functional data analysis approach based on Riemann manifolds for modeling PA and its longitudinal changes. We model smoothed minute-level PA of a day as one-dimensional Riemann manifolds and longitudinal changes in PA across visits as deformations between manifolds. The variability in changes of PA among a cohort of subjects is characterized via variability in the deformation. Functional principal component analysis is further adopted to model the deformations, and PC scores are used as a proxy in modeling the relation between changes in PA and health outcomes and/or interventions. We conduct comprehensive analyses on data from two clinical trials: Reach for Health (RfH) and Metabolism, Exercise and Nutrition at UCSD (MENU), focusing on the effect of interventions on longitudinal changes in PA patterns and how different modes of changes in PA influence weight loss, respectively. The proposed approach reveals unique modes of changes, including overall enhanced PA, boosted morning PA, and shifts of active hours specific to each study cohort. The results bring new insights into the study of longitudinal changes in PA and health and have the potential to facilitate the design of effective health interventions and guidelines.
A Riemann Manifold Model Framework for Longitudinal Changes in Physical Activity Patterns. Annals of Applied Statistics 17(4): 3216-3240. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11149895/pdf/
Pub Date: 2023-12-01 | Epub Date: 2023-10-30 | DOI: 10.1214/23-aoas1745
Siqi Xiang, Wan Zhang, Siyao Liu, Katherine A Hoadley, Charles M Perou, Kai Zhang, J S Marron
In The Cancer Genome Atlas (TCGA) data set, there are many interesting nonlinear dependencies between pairs of genes that reveal important relationships and subtypes of cancer. Such genomic data analysis requires a rapid, powerful and interpretable detection process, especially in a high-dimensional environment. We study the nonlinear patterns among the expression of pairs of genes from TCGA using a powerful tool called Binary Expansion Testing. We find many nonlinear patterns, some of which are driven by known cancer subtypes, some of which are novel.
Pairwise Nonlinear Dependence Analysis of Genomic Data. Annals of Applied Statistics 17(4): 2924-2943. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10688600/pdf/
Pub Date: 2023-12-01 | Epub Date: 2023-10-30 | DOI: 10.1214/23-AOAS1747
Sai Li, Tianxi Cai, Rui Duan
The limited representation of minorities and disadvantaged populations in large-scale clinical and genomics research poses a significant barrier to translating precision medicine research into practice. Prediction models are likely to underperform in underrepresented populations due to heterogeneity across populations, thereby exacerbating known health disparities. To address this issue, we propose FETA, a two-way data integration method that leverages a federated transfer learning approach to integrate heterogeneous data from diverse populations and multiple healthcare institutions, with a focus on a target population of interest having limited sample sizes. We show that FETA achieves performance comparable to the pooled analysis, where individual-level data is shared across institutions, with only a small number of communications across participating sites. Our theoretical analysis and simulation study demonstrate how FETA's estimation accuracy is influenced by communication budgets, privacy restrictions, and heterogeneity across populations. We apply FETA to multisite data from the electronic Medical Records and Genomics (eMERGE) Network to construct genetic risk prediction models for extreme obesity. Compared to models trained using target data only, source data only, and all data without accounting for population-level differences, FETA shows superior predictive performance. FETA has the potential to improve estimation and prediction accuracy in underrepresented populations and reduce the gap in model performance across populations.
Targeting Underrepresented Populations in Precision Medicine: A Federated Transfer Learning Approach. Annals of Applied Statistics 17(4): 2970-2992. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11417462/pdf/
Pub Date: 2023-12-01 | Epub Date: 2023-10-30 | DOI: 10.1214/23-aoas1744
Walter Dempsey
Coronavirus case-count data has influenced government policies and drives most epidemiological forecasts. Limited testing is cited as the key driver behind minimal information on the COVID-19 pandemic. While expanded testing is laudable, measurement error and selection bias are the two greatest problems limiting our understanding of the COVID-19 pandemic; neither can be fully addressed by increased testing capacity. In this paper, we demonstrate their impact on estimation of point prevalence and the effective reproduction number. We show that estimates based on the millions of molecular tests in the US have the same mean square error as a small simple random sample. To address this, a procedure is presented that combines case-count data and random samples over time to estimate selection propensities based on key covariate information. We then combine these selection propensities with epidemiological forecast models to construct a doubly robust estimation method that accounts for both measurement error and selection bias. This method is then applied to estimate Indiana's active infection prevalence using case-count, hospitalization, and death data with demographic information, a statewide random molecular sample collected from April 25-29, and Delphi's COVID-19 Trends and Impact Survey. We end with a series of recommendations based on the proposed methodology.
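The measurement-error half of the problem can be illustrated with the standard Rogan-Gladen correction, which adjusts an apparent (test-positive) prevalence for imperfect test sensitivity and specificity. This is a textbook adjustment shown for orientation only; it is not the doubly robust estimator of the abstract and does nothing about selection bias. The function name and the numeric inputs are hypothetical.

```python
def rogan_gladen(apparent_prevalence, sensitivity, specificity):
    """Correct an apparent prevalence for imperfect sensitivity and specificity.

    Solves apparent = sens * true + (1 - spec) * (1 - true) for the true prevalence.
    """
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("test must beat chance: sensitivity + specificity > 1")
    corrected = (apparent_prevalence + specificity - 1.0) / denom
    # Clip to the valid range, since sampling noise can push the estimate outside [0, 1].
    return min(1.0, max(0.0, corrected))


# Hypothetical inputs: 5% of a random sample tests positive,
# with an assumed test sensitivity of 0.80 and specificity of 0.99.
estimate = rogan_gladen(0.05, sensitivity=0.80, specificity=0.99)
```

Even a perfectly corrected estimate of this kind is only as good as the sample it comes from, which is why the abstract pairs measurement-error correction with estimated selection propensities.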
{"title":"ADDRESSING SELECTION BIAS AND MEASUREMENT ERROR IN COVID-19 CASE COUNT DATA USING AUXILIARY INFORMATION.","authors":"Walter Dempsey","doi":"10.1214/23-aoas1744","DOIUrl":"https://doi.org/10.1214/23-aoas1744","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"17 4","pages":"2903-2923"},"PeriodicalIF":1.3,"publicationDate":"2023-12-01","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11210953/pdf/","platform":"Semanticscholar","paperid":"141472276","RegionNum":4,"RegionCategory":"Mathematics","PMCID":"OA","Total":0}
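The abstract's claim that a very large self-selected testing sample can be less informative than a small random sample can be illustrated with a toy simulation. The sketch below is not the paper's procedure: the population size, prevalence, and testing propensities are hypothetical values chosen for illustration, and the propensities are assumed known rather than estimated from covariates. It compares a naive positivity rate, an inverse-propensity-weighted (Hajek) estimate, and a small simple random sample.

```python
import random


def simulate(n_pop=100_000, prevalence=0.05, p_test_infected=0.5,
             p_test_healthy=0.1, n_srs=500, seed=0):
    """Toy comparison of prevalence estimators (all parameters hypothetical)."""
    rng = random.Random(seed)

    # Infection status for every member of the population.
    infected = [rng.random() < prevalence for _ in range(n_pop)]
    true_prev = sum(infected) / n_pop

    # Self-selected testing: infected people are tested more often.
    tested = [y for y in infected
              if rng.random() < (p_test_infected if y else p_test_healthy)]

    # Naive estimate: share of positives among those tested.
    naive = sum(tested) / len(tested)

    # Hajek (inverse-propensity-weighted) estimate with known propensities.
    weights = [1 / p_test_infected if y else 1 / p_test_healthy for y in tested]
    ipw = sum(w for w, y in zip(weights, tested) if y) / sum(weights)

    # Small simple random sample from the whole population.
    srs = rng.sample(infected, n_srs)
    srs_est = sum(srs) / n_srs

    return {"true": true_prev, "naive": naive, "ipw": ipw,
            "srs": srs_est, "n_tested": len(tested)}


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value}")
```

Under these assumed propensities the naive estimate is dominated by selection bias, which does not shrink as the number of tests grows, while the random sample of a few hundred people is unbiased with only modest sampling error. The weighted estimate recovers the prevalence here only because the propensities are taken as known.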
Pub Date: 2023-12-01 Epub Date: 2023-10-30 DOI: 10.1214/23-aoas1746
Yue Wang, Ali Shojaie, Timothy Randolph, Parker Knight, Jing Ma
Motivated by emerging applications in ecology, microbiology, and neuroscience, this paper studies high-dimensional regression with two-way structured data. To estimate the high-dimensional coefficient vector, we propose the generalized matrix decomposition regression (GMDR) to efficiently leverage auxiliary information on row and column structures. GMDR extends principal component regression (PCR) to two-way structured data, but unlike PCR, GMDR selects the components that are most predictive of the outcome, leading to more accurate prediction. For inference on regression coefficients of individual variables, we propose the generalized matrix decomposition inference (GMDI), a general high-dimensional inferential framework for a large family of estimators that includes the proposed GMDR estimator. GMDI provides more flexibility for incorporating relevant auxiliary row and column structures. As a result, GMDI does not require the true regression coefficients to be sparse, but constrains the coordinate system representing the regression coefficients according to the column structure. GMDI also allows dependent and heteroscedastic observations. We study the theoretical properties of GMDI in terms of both the type-I error rate and power and demonstrate the effectiveness of GMDR and GMDI in simulation studies and an application to human microbiome data.
{"title":"GENERALIZED MATRIX DECOMPOSITION REGRESSION: ESTIMATION AND INFERENCE FOR TWO-WAY STRUCTURED DATA.","authors":"Yue Wang, Ali Shojaie, Timothy Randolph, Parker Knight, Jing Ma","doi":"10.1214/23-aoas1746","DOIUrl":"10.1214/23-aoas1746","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"17 4","pages":"2944-2969"},"PeriodicalIF":1.3,"publicationDate":"2023-12-01","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10751029/pdf/","platform":"Semanticscholar","paperid":"139040863","RegionNum":4,"RegionCategory":"Mathematics","PMCID":"OA","Total":0}
Pub Date: 2023-12-01 Epub Date: 2023-10-30 DOI: 10.1214/23-aoas1762
Bomin Kim, Xiaoyue Niu, David Hunter, Xun Cao
Motivated by a study of United Nations voting behaviors, we introduce a regression model for a series of networks that are correlated over time. Our model is a dynamic extension of the additive and multiplicative effects network model (AMEN) of Hoff (2021). In addition to incorporating a temporal structure, the model accommodates two types of missing data and thus allows the size of the network to vary over time. We demonstrate via simulations the necessity of various components of the model. We apply the model to the United Nations General Assembly voting data from 1983 to 2014 (Voeten, 2013) to answer research questions regarding international voting behaviors. In addition to finding important factors that could explain the voting behaviors, the model-estimated additive effects, multiplicative effects, and their movements reveal meaningful foreign policy positions and alliances of various countries.
{"title":"A DYNAMIC ADDITIVE AND MULTIPLICATIVE EFFECTS NETWORK MODEL WITH APPLICATION TO THE UNITED NATIONS VOTING BEHAVIORS.","authors":"Bomin Kim, Xiaoyue Niu, David Hunter, Xun Cao","doi":"10.1214/23-aoas1762","DOIUrl":"10.1214/23-aoas1762","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"17 4","pages":"3283-3299"},"PeriodicalIF":1.8,"publicationDate":"2023-12-01","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10798233/pdf/","platform":"Semanticscholar","paperid":"139514175","RegionNum":4,"RegionCategory":"Mathematics","PMCID":"OA","Total":0}
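For orientation, the static additive and multiplicative effects model that the abstract builds on is commonly written, for a continuous or latent dyadic response, in the form below. This is a sketch of the standard static formulation only; the symbols are the conventional ones and are not taken from the abstract, and the paper's dynamic extension and its treatment of missing data are not reproduced here.

```latex
% Static AME model for the directed relation from node i to node j
% (conventional notation, assumed for illustration).
\[
  y_{ij} \;=\; \beta^{\top} x_{ij} \;+\; a_i \;+\; b_j \;+\; u_i^{\top} v_j \;+\; \varepsilon_{ij},
\]
% x_{ij}          : dyad-level covariates with regression coefficients \beta
% a_i, b_j        : additive sender and receiver effects
% u_i, v_j        : low-dimensional latent factors (multiplicative effects)
% \varepsilon_{ij}: dyadic error, typically allowed to correlate with \varepsilon_{ji}
```

In this form the additive terms capture how active each node is overall as a sender or receiver, while the multiplicative term captures pairwise affinities; a dynamic extension of the kind the abstract describes would let such effects evolve over time.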