Annals of Applied Statistics: Latest Articles

DYNAMIC CLASSIFICATION OF LATENT DISEASE PROGRESSION WITH AUXILIARY SURROGATE LABELS.
IF 1.4; Zone 4 (Mathematics); Q2 Statistics & Probability; Pub Date: 2026-03-01; Epub Date: 2026-03-20; DOI: 10.1214/26-aoas2150
Zexi Cai, Donglin Zeng, Karen S Marder, Lawrence S Honig, Yuanjia Wang

Disease progression prediction based on patients' evolving health information is challenging when true disease states are unknown due to diagnostic capabilities or high costs. For example, the absence of gold-standard neurological diagnoses hinders distinguishing Alzheimer's disease (AD) from related conditions such as AD-related dementias (ADRDs), including Lewy body dementia (LBD). Combining temporally dependent surrogate labels and health markers may improve disease prediction. However, existing literature models informative surrogate labels and observed variables that reflect the underlying states using purely generative approaches, often posing unrealistic assumptions on the outcomes and suffering from misspecification thereof. We propose integrating the conventional hidden Markov model as a generative model with a time-varying discriminative classification model to simultaneously handle potentially misspecified surrogate labels and incorporate important markers of disease progression. We develop an adaptive forward-backward algorithm with subjective labels for estimation, and utilize the modified posterior and Viterbi algorithms to predict the progression of future states or new patients based on objective markers only. Importantly, the adaptation eliminates the need to model the marginal distribution of longitudinal markers, a requirement in traditional algorithms. Asymptotic properties are established, and significant improvements in finite samples are demonstrated via simulation studies. Analysis of the neuropathological dataset of the National Alzheimer's Coordinating Center (NACC) shows much improved accuracy in distinguishing LBD from AD.
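
For readers unfamiliar with the forward-backward recursion that the abstract adapts, the following is a minimal sketch of the standard (non-adaptive) algorithm for a discrete-state hidden Markov model. It is not the authors' adaptive procedure; the function name `forward_backward` and the arguments `init`, `trans`, and `emis` are illustrative assumptions.

```python
def forward_backward(init, trans, emis):
    """Posterior state probabilities for a discrete-state HMM (standard version).

    init[k]     : P(S_1 = k), hypothetical initial state distribution
    trans[j][k] : P(S_{t+1} = k | S_t = j)
    emis[t][k]  : likelihood of the observation at time t given state k
    """
    T, K = len(emis), len(init)

    # Forward pass; each step is normalized to avoid numerical underflow.
    alpha = []
    for t in range(T):
        if t == 0:
            raw = [init[k] * emis[0][k] for k in range(K)]
        else:
            raw = [
                sum(alpha[t - 1][j] * trans[j][k] for j in range(K)) * emis[t][k]
                for k in range(K)
            ]
        total = sum(raw)
        alpha.append([a / total for a in raw])

    # Backward pass, also normalized at each step.
    beta = [[1.0] * K for _ in range(T)]
    for t in range(T - 2, -1, -1):
        raw = [
            sum(trans[j][k] * emis[t + 1][k] * beta[t + 1][k] for k in range(K))
            for j in range(K)
        ]
        total = sum(raw)
        beta[t] = [b / total for b in raw]

    # Combine: P(S_t = k | all observations) is proportional to alpha * beta.
    posterior = []
    for t in range(T):
        raw = [alpha[t][k] * beta[t][k] for k in range(K)]
        total = sum(raw)
        posterior.append([g / total for g in raw])
    return posterior
```

Calling `forward_backward(init, trans, emis)` returns one probability vector per visit. In the paper's setting the emission term would presumably be replaced by a discriminative classifier output, but that detail is not given in the abstract.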

Annals of Applied Statistics, vol. 20, no. 1, pp. 641-662. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13004507/pdf/
Citations: 0
SEMIPARAMETRIC ANALYSIS OF INTERVAL-CENSORED DATA SUBJECT TO INACCURATE DIAGNOSES WITH A TERMINAL EVENT.
IF 1.4; Zone 4 (Mathematics); Q2 Statistics & Probability; Pub Date: 2026-03-01; Epub Date: 2026-03-20; DOI: 10.1214/25-aoas2134
Yuhao Deng, Donglin Zeng, Yuanjia Wang

Interval-censoring frequently occurs in studies of chronic diseases where disease status is inferred from intermittently collected biomarkers. Although many methods have been developed to analyze such data, they typically assume perfect disease diagnosis, which often does not hold in practice due to the inherent imperfect clinical diagnosis of cognitive functions or measurement errors of biomarkers such as cerebrospinal fluid. In this work, we introduce a semiparametric modeling framework using the Cox proportional hazards model to address interval-censored data in the presence of inaccurate disease diagnosis. Our model incorporates sensitivity and specificity of the diagnosis to account for uncertainty in whether the interval truly contains the disease onset. Furthermore, the framework accommodates scenarios involving a terminal event and when diagnosis is accurate, such as through postmortem analysis. We propose a nonparametric maximum likelihood estimation method for inference and develop an efficient EM algorithm to ensure computational feasibility. The regression coefficient estimators are shown to be asymptotically normal, achieving semiparametric efficiency bounds. We further validate our approach through extensive simulation studies and an application assessing Alzheimer's disease (AD) risk. We find that amyloid-beta is significantly associated with AD, but Tau is predictive of both AD and mortality.
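
The role of sensitivity and specificity in such a model can be illustrated with a small sketch. It shows only the basic misclassification arithmetic for a single visit, not the authors' semiparametric likelihood; the function names and the example values are hypothetical.

```python
def prob_positive(p_onset, sensitivity, specificity):
    """Probability that an imperfect test reads positive.

    p_onset is the (hypothetical) probability that onset has truly occurred.
    True positives contribute sensitivity * p_onset; false positives
    contribute (1 - specificity) * (1 - p_onset).
    """
    return sensitivity * p_onset + (1.0 - specificity) * (1.0 - p_onset)


def prob_onset_given_positive(p_onset, sensitivity, specificity):
    """Bayes' rule: probability of true onset given a positive diagnosis."""
    return sensitivity * p_onset / prob_positive(p_onset, sensitivity, specificity)
```

With a perfect test (sensitivity and specificity both equal to 1) the positive rate equals `p_onset` and a positive diagnosis is conclusive, which corresponds to the usual interval-censoring assumption the abstract relaxes.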

Annals of Applied Statistics, vol. 20, no. 1, pp. 623-640. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13004487/pdf/
Citations: 0
JOINT MODELING FOR LEARNING DECISION-MAKING DYNAMICS IN BEHAVIORAL EXPERIMENTS.
IF 1.4; Zone 4 (Mathematics); Q2 Statistics & Probability; Pub Date: 2025-12-01; Epub Date: 2025-12-05; DOI: 10.1214/25-aoas2112
Yuan Bian, Xingche Guo, Yuanjia Wang

Major depressive disorder (MDD), a leading cause of disability and mortality, is associated with reward-processing abnormalities and concentration issues. Motivated by the probabilistic reward task from the Establishing Moderators and Biosignatures of Antidepressant Response in Clinical Care (EMBARC) study, we propose a novel framework that integrates the reinforcement learning (RL) model and drift-diffusion model (DDM) to jointly analyze reward-based decision-making with response times. To account for emerging evidence suggesting that decision-making may alternate between multiple interleaved strategies, we model latent state switching using a hidden Markov model (HMM). In the engaged state, decisions follow an RL-DDM, simultaneously capturing reward processing, decision dynamics, and temporal structure. In contrast, in the lapsed state, decision-making is modeled using a simplified DDM, where specific parameters are fixed to approximate random guessing with equal probability. The proposed method is implemented using a computationally efficient generalized expectation-maximization (EM) algorithm with forward-backward procedures. Through extensive numerical studies, we demonstrate that our proposed method outperforms competing approaches across various reward-generating distributions, under both strategy-switching and non-switching scenarios, as well as in the presence of input perturbations. When applied to the EMBARC study, our framework reveals that MDD patients exhibit lower overall engagement than healthy controls and experience longer responses when they do engage. Additionally, we show that neuroimaging measures of brain activities are associated with decision-making characteristics in the engaged state but not in the lapsed state, providing evidence of brain-behavior association specific to the engaged state.

Annals of Applied Statistics, vol. 19, no. 4, pp. 3372-3393. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12814034/pdf/
Citations: 0
PERSONALIZED RISK PREDICTION FOR CANCER SURVIVORS: A GENERALIZED BAYESIAN SEMI-PARAMETRIC MODEL OF RECURRENT EVENTS WITH COMPETING OUTCOMES.
IF 1.4; Zone 4 (Mathematics); Q2 Statistics & Probability; Pub Date: 2025-12-01; Epub Date: 2025-12-05; DOI: 10.1214/25-AOAS2083
Nam Hoai Nguyen, Seung Jun Shin, Elissa Dodd-Eaton, Jing Ning, Wenyi Wang

Multiple primary cancers are increasingly frequent due to improved survival of cancer patients. Characteristics of the first primary cancer largely impact the risk of developing subsequent primary cancers. Hence, model-based risk characterization of cancer survivors that captures patient-specific variables is needed for healthcare policy making. We propose a Bayesian semi-parametric framework in which the occurrence processes of the competing cancer types follow independent non-homogeneous Poisson processes, adjusted for covariates including the type and age at diagnosis of the first primary. Applying this framework to a historically collected cohort with families presenting a highly enriched history of multiple primary tumors and diverse cancer types, we have derived a suite of age-to-onset penetrance curves for cancer survivors. This includes penetrance estimates for second primary lung cancer, which are potentially relevant to ongoing cancer screening decisions. Using Receiver Operating Characteristic (ROC) curves, we have validated the predictive performance of our models for second primary lung cancer, sarcoma, breast cancer, and all other cancers combined, with areas under the curves (AUCs) of 0.89, 0.91, 0.76 and 0.68, respectively. In conclusion, our framework provides covariate-adjusted quantitative risk assessment for cancer survivors, hence moving a step closer to personalized health management for this unique population.
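
As a rough illustration of how a non-homogeneous Poisson process yields an age-specific risk curve, the sketch below integrates a hypothetical intensity function and converts it to the probability of at least one event. It ignores covariates, competing cancer types, and the Bayesian estimation described in the abstract; the function names and the constant rate used in the usage note are assumptions.

```python
import math


def cumulative_intensity(rate, start, end, steps=1000):
    """Trapezoidal approximation of the integral of rate(age) over [start, end]."""
    h = (end - start) / steps
    total = 0.5 * (rate(start) + rate(end))
    for i in range(1, steps):
        total += rate(start + i * h)
    return total * h


def prob_event_by(rate, start, end):
    """P(at least one event in [start, end]) for a Poisson process with intensity rate."""
    return 1.0 - math.exp(-cumulative_intensity(rate, start, end))
```

For example, `prob_event_by(lambda age: 0.01, 0, 50)` evaluates the risk over 50 years under a constant, purely illustrative rate of 0.01 events per year.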

Annals of Applied Statistics, vol. 19, no. 4, pp. 3091-3112. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12955820/pdf/
Citations: 0
MULTI-OBJECT DATA INTEGRATION IN THE STUDY OF PRIMARY PROGRESSIVE APHASIA.
IF 1.4; Zone 4 (Mathematics); Q2 Statistics & Probability; Pub Date: 2025-12-01; Epub Date: 2025-12-05; DOI: 10.1214/25-aoas2071
Rene Gutierrez, Aaron Scheffler, Rajarshi Guhaniyogi, Maria Luisa Gorno-Tempini, Maria Luisa Mandelli, Giovanni Battistella

This article focuses on a multi-modal imaging data application where structural/anatomical information from gray matter (GM) and brain connectivity information in the form of a brain connectome network from functional magnetic resonance imaging (fMRI) are available for a number of subjects with different degrees of primary progressive aphasia (PPA), a neurodegenerative disorder (ND) measured through a speech rate measure on motor speech loss. The clinical/scientific goal in this study becomes the identification of brain regions of interest significantly related to the speech rate measure to gain insight into ND patterns. Viewing the brain connectome network and GM images as objects, we develop an integrated object response regression framework of network and GM images on the speech rate measure. A novel integrated prior formulation is proposed on network and structural image coefficients in order to exploit network information of the brain connectome while leveraging the interconnections among the two objects. The principled Bayesian framework allows the characterization of uncertainty in ascertaining a region being actively related to the speech rate measure. Our framework yields new insights into the relationship of brain regions associated with PPA, offering a deeper understanding of neuro-degenerative patterns of PPA. The supplementary file adds details about posterior computation and additional empirical results.

Annals of Applied Statistics, vol. 19, no. 4, pp. 3282-3303. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12707422/pdf/
Citations: 0
SUPERVISED LEARNING OF OUTCOME-RELEVANT ITEMS FROM A QUESTIONNAIRE VIA MIXED INTEGER OPTIMIZATION.
IF 1.4; Zone 4 (Mathematics); Q2 Statistics & Probability; Pub Date: 2025-12-01; Epub Date: 2025-12-05; DOI: 10.1214/25-AOAS2093
Leyao Zhang, Wen Wang, Mengtong Hu, Alan P Baptist, Peng Wang, Peter X K Song

Questionnaires are among the oldest and most widely used instruments in practice to measure variables relevant to traits of interest that cannot be easily measured by physical devices, for example, depression. In many clinical settings, the scope of an existing questionnaire is often unfit to apply to a new study population, whose underlying characteristics are different from those of the original population used for the questionnaire's development and/or validation. Motivated by a cohort study of elderly asthma patients, we aim to examine associations between clinical outcomes and quality of life (QoL) measured by a QoL questionnaire. To increase comparability, we consider a supervised learning method to identify a subset of questions whose summary score is strongly associated with a specific clinical outcome under investigation. The resultant set of selected items gives an optimal summary metric of the questionnaire, which improves both statistical power and clinical interpretation. Our item extraction procedure is built upon the best subset algorithm implemented via mixed integer programming, which enjoys both a theoretical guarantee of selection consistency and flexibility in handling nonresponse missing data. Moreover, estimation uncertainty is analyzed by means of noise perturbation. Our methodology is first evaluated by extensive simulation studies with comparisons to existing methods and then applied to derive tailored QoL scores adapted to two clinical outcomes, a lung function measure (FEV1) and the asthma control test (ACT), among elderly people with persistent asthma.
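
The idea of choosing the item subset whose summary score best tracks an outcome can be sketched with brute-force enumeration. This is a stand-in for the mixed integer programming formulation, which is not given in the abstract, and is feasible only for a small number of items; the function names, the use of absolute Pearson correlation as the criterion, and the data layout are all assumptions.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def best_item_subset(responses, outcome, k):
    """Exhaustively search all k-item subsets of a questionnaire.

    responses[i][j] is respondent i's answer to item j (hypothetical layout).
    Returns the subset whose sum score has the largest absolute correlation
    with the outcome, together with that correlation.
    """
    n_items = len(responses[0])
    best_subset, best_r = None, -1.0
    for subset in combinations(range(n_items), k):
        score = [sum(row[j] for j in subset) for row in responses]
        r = abs(pearson(score, outcome))
        if r > best_r:
            best_subset, best_r = subset, r
    return best_subset, best_r
```

Enumeration grows combinatorially with the number of items, which is one reason an optimization-based formulation such as the one described in the abstract is attractive for realistic questionnaires.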

Annals of Applied Statistics, vol. 19, no. 4, pp. 3157-3178. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12869357/pdf/
Citations: 0
TREE-REGULARIZED BAYESIAN LATENT CLASS ANALYSIS FOR IMPROVING WEAKLY SEPARATED DIETARY PATTERN SUBTYPING IN SMALL-SIZED SUBPOPULATIONS.
IF 1.4; Zone 4 (Mathematics); Q2 Statistics & Probability; Pub Date: 2025-12-01; Epub Date: 2025-12-05; DOI: 10.1214/25-aoas2067
Mengbing Li, Briana Stephenson, Zhenke Wu

Dietary patterns synthesize multiple related diet components, which can be used by nutrition researchers to examine diet-disease relationships. Latent class models (LCMs) have been used to derive dietary patterns from dietary intake assessment, where each class profile represents the probabilities of exposure to a set of diet components. However, LCM-derived dietary patterns can exhibit strong similarities, or weak separation, resulting in numerical and inferential instabilities that challenge scientific interpretation. This issue is exacerbated in small-sized subpopulations. To address these issues, we provide a simple solution that empowers LCMs to improve dietary pattern estimation. We develop a tree-regularized Bayesian LCM that shares statistical strength between dietary patterns to make better estimates using limited data. This is achieved via a Dirichlet diffusion tree process that specifies a prior distribution for the unknown tree over classes. Dietary patterns that share proximity to one another in the tree are shrunk toward ancestral dietary patterns a priori, with the degree of shrinkage varying across prespecified food groups. Using dietary intake data from the Hispanic Community Health Study/Study of Latinos, we apply the proposed approach to a sample of 496 U.S. adults of South American ethnic background to identify and compare dietary patterns.

Annals of Applied Statistics, vol. 19, no. 4, pp. 3003-3022. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12867110/pdf/
Citations: 0
FAST VARIABLE SELECTION FOR DISTRIBUTIONAL REGRESSION WITH APPLICATION TO CONTINUOUS GLUCOSE MONITORING DATA.
IF 1.4 Zone 4 (Mathematics) Q2 STATISTICS & PROBABILITY Pub Date: 2025-09-01 Epub Date: 2025-08-28 DOI: 10.1214/25-aoas2038
Alexander Coulter, Rashmi N Aurora, Naresh M Punjabi, Irina Gaynanova

With the growing prevalence of diabetes and the associated public health burden, it is crucial to identify modifiable factors that could improve patients' glycemic control. In this work, we seek to examine associations between medication usage, concurrent comorbidities, and glycemic control, utilizing data from continuous glucose monitors (CGMs). CGMs provide high-frequency interstitial glucose measurements, but reducing data to simple statistical summaries is common in clinical studies, resulting in substantial information loss. Recent advances in the Fréchet regression framework make it possible to use more information by treating the full distributional representation of CGM data as the response, while sparsity regularization enables variable selection. However, the methodology does not scale to large datasets. Crucially, rigorous inference is not possible because the asymptotic behavior of the underlying estimates is unknown, while the application of resampling-based inference methods is computationally infeasible. We develop a new algorithm for sparse distributional regression by deriving a new explicit characterization of the gradient and Hessian of the underlying objective function, while also utilizing rotations on the sphere to perform feasible updates. The updated method can be more than 10,000-fold faster than the original approach, opening the door for applying sparse distributional regression to large-scale datasets and enabling previously unattainable resampling-based inference. We combine our algorithm with stability selection to perform variable selection inference on CGM data from patients with type 2 diabetes and obstructive sleep apnea. We find a significant association between sulfonylurea medication and glucose variability without evidence of association with glucose mean. We also find that overnight oxygen desaturation variability has a stronger association with glucose regulation than overall oxygen desaturation levels.
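
The information loss from simple summaries can be illustrated with a small sketch. It represents two simulated glucose traces by their empirical quantile functions and compares them with an approximate 2-Wasserstein distance, the metric commonly used when distributions are treated as regression responses. The simulated readings, the quantile grid `PROBS`, and both function names are assumptions made for illustration; this is not the sparse regression algorithm described in the abstract.

```python
import numpy as np

# Quantile levels at which each distribution is summarised (an assumed grid).
PROBS = np.linspace(0.01, 0.99, 99)

def quantile_representation(readings):
    """Empirical quantile function of a glucose trace, evaluated on PROBS."""
    return np.quantile(np.asarray(readings, dtype=float), PROBS)

def wasserstein2(q_a, q_b):
    """Approximate 2-Wasserstein distance: L2 distance between quantile functions."""
    return float(np.sqrt(np.mean((q_a - q_b) ** 2)))

# Two hypothetical patients with the same mean glucose but different variability.
rng = np.random.default_rng(0)
z = rng.standard_normal(2000)
z -= z.mean()                      # centre so both traces share the same mean
stable = 140.0 + 10.0 * z
variable = 140.0 + 40.0 * z

mean_gap = abs(stable.mean() - variable.mean())
distance = wasserstein2(quantile_representation(stable),
                        quantile_representation(variable))
print(f"difference in means: {mean_gap:.2e}, Wasserstein distance: {distance:.1f}")
```

A mean-only summary cannot tell these two patients apart, while the distributional representation separates them clearly.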

Annals of Applied Statistics, 19(3), pp. 2105-2128. Publication date: 2025-09-01. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12700301/pdf/
Citations: 0
MIXED MODELING APPROACH FOR CHARACTERIZING THE GENETIC EFFECTS IN A LONGITUDINAL PHENOTYPE.
IF 1.4 Zone 4 (Mathematics) Q2 STATISTICS & PROBABILITY Pub Date: 2025-09-01 Epub Date: 2025-08-28 DOI: 10.1214/25-aoas2033
Pei Zhang, Paul S Albert, Hyokyoung G Hong

Approaches for estimating genetic effects at the individual level often focus on analyzing phenotypes at a single time point, with less attention given to longitudinal phenotypes. This paper introduces a mixed modeling approach that includes both genetic and individual-specific random effects, and is designed to estimate genetic effects on both the baseline and slope for a longitudinal trajectory. The inclusion of genetic effects on both baseline and slope, combined with the crossed structure of genetic and individual-specific random effects, creates complex dependencies across repeated measurements for all subjects. These complexities necessitate the development of novel estimation procedures for parameter estimation and individual-specific predictions of genetic effects on both baseline and slope. We employ an Average Information Restricted Maximum Likelihood (AI-ReML) algorithm to estimate the variance components corresponding to genetic and individual-specific effects for the baseline levels and rates of change for a longitudinal phenotype. The algorithm is used to characterize the prostate-specific antigen (PSA) trajectories of participants who remained prostate cancer-free in the Prostate, Lung, Colorectal, and Ovarian (PLCO) Cancer Screening Trial. Understanding genetic and individual-specific variation in this population will provide insights for determining the role of genetics in cancer screening. Our results reveal significant genetic contributions to both the initial PSA levels and their progression over time, highlighting the role of these genetic factors in the variability of PSA across unaffected individuals. We show how genetic factors can be used to identify, among individuals who are prostate cancer-free, those prone to high baseline PSA values and increasing PSA trajectories. In turn, we can identify groups of individuals who have a high probability of falsely screening positive for prostate cancer using well-established cutoffs for early detection based on the level and rate of change in this biomarker. The results demonstrate the importance of incorporating genetic factors for monitoring PSA for more accurate prostate cancer detection.
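
The dependence structure described here can be sketched by writing out the marginal covariance of the stacked measurements. The sketch assumes a generic formulation that the abstract does not spell out: genetic effects on baseline and slope covary across individuals through a relatedness matrix `K`, individual-specific effects are independent across individuals, baseline and slope effects are mutually independent, and everyone is measured at the same visit times. The function name and all numeric values are hypothetical.

```python
import numpy as np

def marginal_covariance(K, times, var_g0, var_g1, var_b0, var_b1, var_e):
    """Covariance of all measurements, stacked individual by individual.

    K       : (n, n) genetic relatedness matrix (assumed known)
    times   : visit times shared by every individual
    var_g0/1: genetic variance for baseline / slope
    var_b0/1: individual-specific variance for baseline / slope
    var_e   : residual variance
    """
    K = np.asarray(K, dtype=float)
    t = np.asarray(times, dtype=float).reshape(-1, 1)
    n, m = K.shape[0], t.shape[0]
    eye_n = np.eye(n)
    Z0 = np.kron(eye_n, np.ones((m, 1)))  # maps each person to their baseline
    Z1 = np.kron(eye_n, t)                # maps each person to their slope * time
    return (Z0 @ (var_g0 * K + var_b0 * eye_n) @ Z0.T
            + Z1 @ (var_g1 * K + var_b1 * eye_n) @ Z1.T
            + var_e * np.eye(n * m))

# Two related individuals, three visits each (all values hypothetical).
K = np.array([[1.0, 0.5],
              [0.5, 1.0]])
V = marginal_covariance(K, times=[0.0, 1.0, 2.0],
                        var_g0=0.4, var_g1=0.1,
                        var_b0=0.3, var_b1=0.05, var_e=0.2)
print(V.shape)
```

Because `K` has nonzero off-diagonal entries, measurements from different individuals are correlated, and that correlation grows with time through the slope term. This is one way to see why the repeated measurements of all subjects have to be modelled jointly.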

Annals of Applied Statistics, 19(3), pp. 2070-2087. Publication date: 2025-09-01. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12395449/pdf/
Citations: 0
INTEGRATIVE ECOLOGICAL REGRESSION ANALYSIS OF U.S. COUNTY AND STATE LEVEL COVID-19 DEATH DATA FOR STUDYING HEALTH DISPARITY ASSOCIATIONS.
IF 1.4 Zone 4 (Mathematics) Q2 STATISTICS & PROBABILITY Pub Date: 2025-09-01 Epub Date: 2025-08-28 DOI: 10.1214/25-aoas2055
Daniel Li, Xihong Lin

It is of substantial interest to study health disparity associations with COVID-19 death rates. Although high-quality individual-level COVID-19 epidemiological data have been difficult to collect on a national scale, all United States (U.S.) counties have reported total COVID-19 death counts. A standard ecological analysis would then regress county total death counts on county-level covariates such as age, sex, and race percentages. However, such an analysis is limited by ecological bias and fallacy, in which estimated county-level associations differ from individual-level associations. Fortunately, state-level age-, sex-, and race-specific COVID-19 death counts are also available for all U.S. states, so this information can be integrated with county-level data for more informative ecological analyses. We propose an approximate log-linear random effects model to jointly model county-level total death counts and state-level age-, sex-, and race-specific death counts. We then develop a penalized composite log-likelihood method for parameter estimation and perform simulation studies to evaluate our proposed approach. Lastly, we analyze COVID-19 death data from the entire U.S., show how incorporating state-level counts can prevent ecological bias and fallacy, and illustrate the heterogeneity in health disparity associations across different U.S. states.
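
One source of ecological bias can be shown with a short calculation. If death rates are log-linear at the stratum level, the county total is a share-weighted sum of exponentials, which is not the same as a log-linear model in the county's stratum shares. The sketch below compares the two; the age-group shares, rate ratios, base rate, and function names are hypothetical, and the code is not the penalized composite likelihood method from the abstract.

```python
import numpy as np

def expected_county_deaths(population, shares, log_base_rate, log_rate_ratios):
    """County total obtained by summing stratum-level expected deaths."""
    stratum_rates = np.exp(log_base_rate + np.asarray(log_rate_ratios, dtype=float))
    return population * float(np.dot(shares, stratum_rates))

def naive_loglinear_deaths(population, shares, log_base_rate, log_rate_ratios):
    """County total implied by a model that is log-linear in the shares."""
    linear_predictor = log_base_rate + float(np.dot(shares, log_rate_ratios))
    return population * float(np.exp(linear_predictor))

# Hypothetical county: three age groups with sharply different death rates.
shares = np.array([0.6, 0.3, 0.1])
log_rate_ratios = np.array([0.0, 1.5, 3.0])
log_base_rate = np.log(1e-4)

aggregated = expected_county_deaths(100_000, shares, log_base_rate, log_rate_ratios)
naive = naive_loglinear_deaths(100_000, shares, log_base_rate, log_rate_ratios)
print(f"aggregated: {aggregated:.1f}, naive log-linear: {naive:.1f}")
```

By Jensen's inequality the aggregated total is never smaller than the naive one, so coefficients fitted to county totals under the naive form do not recover the stratum-level rate ratios. Stratum-specific counts, such as the state-level counts mentioned above, supply the information needed to pin those ratios down.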

Annals of Applied Statistics, 19(3), pp. 2320-2338. Publication date: 2025-09-01. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12900166/pdf/
Citations: 0