Annals of Applied Statistics: Latest Articles

MULTI-OBJECT DATA INTEGRATION IN THE STUDY OF PRIMARY PROGRESSIVE APHASIA.
IF 1.4, Zone 4 (Mathematics), Q2 STATISTICS & PROBABILITY. Pub Date: 2025-12-01. Epub Date: 2025-12-05. DOI: 10.1214/25-aoas2071
Rene Gutierrez, Aaron Scheffler, Rajarshi Guhaniyogi, Maria Luisa Gorno-Tempini, Maria Luisa Mandelli, Giovanni Battistella

This article focuses on a multi-modal imaging data application in which structural/anatomical information from gray matter (GM) and brain connectivity information, in the form of a brain connectome network from functional magnetic resonance imaging (fMRI), are available for a number of subjects with different degrees of primary progressive aphasia (PPA), a neurodegenerative disorder (ND), with severity measured through a speech rate measure of motor speech loss. The clinical/scientific goal of this study is to identify brain regions of interest significantly related to the speech rate measure in order to gain insight into ND patterns. Viewing the brain connectome network and GM images as objects, we develop an integrated object response regression framework of network and GM images on the speech rate measure. A novel integrated prior formulation is proposed on network and structural image coefficients in order to exploit network information of the brain connectome while leveraging the interconnections between the two objects. The principled Bayesian framework allows the characterization of uncertainty in determining whether a region is actively related to the speech rate measure. Our framework yields new insights into the relationship of brain regions associated with PPA, offering a deeper understanding of neurodegenerative patterns of PPA. The supplementary file adds details about posterior computation and additional empirical results.

Annals of Applied Statistics, vol. 19, no. 4, pp. 3282-3303. Publication date: 2025-12-01. Type: Journal Article. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12707422/pdf/
Citations: 0
FAST VARIABLE SELECTION FOR DISTRIBUTIONAL REGRESSION WITH APPLICATION TO CONTINUOUS GLUCOSE MONITORING DATA.
IF 1.4, Zone 4 (Mathematics), Q2 STATISTICS & PROBABILITY. Pub Date: 2025-09-01. Epub Date: 2025-08-28. DOI: 10.1214/25-aoas2038
Alexander Coulter, Rashmi N Aurora, Naresh M Punjabi, Irina Gaynanova

With the growing prevalence of diabetes and the associated public health burden, it is crucial to identify modifiable factors that could improve patients' glycemic control. In this work, we seek to examine associations between medication usage, concurrent comorbidities, and glycemic control, utilizing data from continuous glucose monitors (CGMs). CGMs provide high-frequency interstitial glucose measurements, but reducing data to simple statistical summaries is common in clinical studies, resulting in substantial information loss. Recent advancements in the Fréchet regression framework make it possible to use more information by treating the full distributional representation of CGM data as the response, while sparsity regularization enables variable selection. However, the methodology does not scale to large datasets. Crucially, rigorous inference is not possible because the asymptotic behavior of the underlying estimates is unknown, while the application of resampling-based inference methods is computationally infeasible. We develop a new algorithm for sparse distributional regression by deriving a new explicit characterization of the gradient and Hessian of the underlying objective function, while also utilizing rotations on the sphere to perform feasible updates. The updated method is up to 10,000+ times faster than the original approach, opening the door for applying sparse distributional regression to large-scale datasets and enabling previously unattainable resampling-based inference. We combine our algorithm with stability selection to perform variable selection inference on CGM data from patients with type 2 diabetes and obstructive sleep apnea. We find a significant association between sulfonylurea medication and glucose variability without evidence of association with glucose mean. We also find that overnight oxygen desaturation variability has a stronger association with glucose regulation than overall oxygen desaturation levels.
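
As a rough illustration of what treating the full distribution as the response can look like, the sketch below summarizes each subject's CGM readings by an empirical quantile function and compares two subjects with the 2-Wasserstein distance, a metric commonly paired with Fréchet regression on distributions. The function names, the grid size, and the simulated glucose values are assumptions made for illustration and are not taken from the paper; the paper's sparse regression algorithm itself is not reproduced here.

```python
import numpy as np

def quantile_representation(readings, grid_size=100):
    """Summarize one subject's glucose readings as an empirical
    quantile function evaluated on a fixed probability grid."""
    probs = (np.arange(grid_size) + 0.5) / grid_size  # midpoints in (0, 1)
    return np.quantile(np.asarray(readings, dtype=float), probs)

def wasserstein2(q_a, q_b):
    """Approximate 2-Wasserstein distance between two one-dimensional
    distributions from their quantile functions on the same grid."""
    return float(np.sqrt(np.mean((q_a - q_b) ** 2)))

# Hypothetical data: two subjects with similar mean glucose (mg/dL)
# but very different variability.
rng = np.random.default_rng(0)
subject_a = rng.normal(120.0, 15.0, size=2000)
subject_b = rng.normal(120.0, 40.0, size=2000)

q_a = quantile_representation(subject_a)
q_b = quantile_representation(subject_b)

print("difference in means:", np.mean(subject_a) - np.mean(subject_b))
print("2-Wasserstein distance:", wasserstein2(q_a, q_b))
```

In this toy setup the two subjects are almost indistinguishable by mean glucose, yet their distributions are far apart, which is the kind of information a single summary statistic discards.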

Annals of Applied Statistics, vol. 19, no. 3, pp. 2105-2128. Publication date: 2025-09-01. Type: Journal Article. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12700301/pdf/
Citations: 0
MIXED MODELING APPROACH FOR CHARACTERIZING THE GENETIC EFFECTS IN A LONGITUDINAL PHENOTYPE.
IF 1.4, Zone 4 (Mathematics), Q2 STATISTICS & PROBABILITY. Pub Date: 2025-09-01. Epub Date: 2025-08-28. DOI: 10.1214/25-aoas2033
Pei Zhang, Paul S Albert, Hyokyoung G Hong

Approaches for estimating genetic effects at the individual level often focus on analyzing phenotypes at a single time point, with less attention given to longitudinal phenotypes. This paper introduces a mixed modeling approach that includes both genetic and individual-specific random effects, and is designed to estimate genetic effects on both the baseline and slope for a longitudinal trajectory. The inclusion of genetic effects on both baseline and slope, combined with the crossed structure of genetic and individual-specific random effects, creates complex dependencies across repeated measurements for all subjects. These complexities necessitate the development of novel estimation procedures for parameter estimation and individual-specific predictions of genetic effects on both baseline and slope. We employ an Average Information Restricted Maximum Likelihood (AI-ReML) algorithm to estimate the variance components corresponding to genetic and individual-specific effects for the baseline levels and rates of change for a longitudinal phenotype. The algorithm is used to characterize the prostate-specific antigen (PSA) trajectories for participants who remained prostate cancer-free in the Prostate, Lung, Colorectal, and Ovarian (PLCO) Cancer Screening Trial. Understanding genetic and individual-specific variation in this population will provide insights for determining the role of genetics in cancer screening. Our results reveal significant genetic contributions to both the initial PSA levels and their progression over time, highlighting the role of these genetic factors in the variability of PSA across unaffected individuals. We show how genetic factors can be used to identify individuals prone to large baseline PSA values and increasing PSA trajectories among individuals who are prostate cancer-free. In turn, we can identify groups of individuals who have a high probability of falsely screening positive for prostate cancer using well-established cutoffs for early detection based on the level and rate of change in this biomarker. The results demonstrate the importance of incorporating genetic factors for monitoring PSA for more accurate prostate cancer detection.

Annals of Applied Statistics, vol. 19, no. 3, pp. 2070-2087. Publication date: 2025-09-01. Type: Journal Article. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12395449/pdf/
Citations: 0
BAYESIAN LEARNING OF CLINICALLY MEANINGFUL SEPSIS PHENOTYPES IN NORTHERN TANZANIA.
IF 1.4, Zone 4 (Mathematics), Q2 STATISTICS & PROBABILITY. Pub Date: 2025-09-01. Epub Date: 2025-08-28. DOI: 10.1214/25-aoas2045
Alexander Dombowsky, David B Dunson, Deng B Madut, Matthew P Rubach, Amy H Herring

Sepsis is a life-threatening condition caused by a dysregulated host response to infection. Recently, researchers have hypothesized that sepsis consists of a heterogeneous spectrum of distinct subtypes, motivating several studies to identify clusters of sepsis patients that correspond to subtypes, with the long-term goal of using these clusters to design subtype-specific treatments. Therefore, clinicians rely on clusters having a concrete medical interpretation, usually corresponding to clinically meaningful regions of the sample space that have a concrete implication for practitioners. In this article, we propose Clustering Around Meaningful Regions (CLAMR), a Bayesian clustering approach that explicitly models the medical interpretation of each cluster center. CLAMR favors clusterings that can be summarized via meaningful feature values, leading to medically significant sepsis patient clusters. We also provide details on measuring the effect of each feature on the clustering using Bayesian hypothesis tests, so one can assess which features are relevant for cluster interpretation. Our focus is on clustering sepsis patients from Moshi, Tanzania, where patients are younger and the prevalence of HIV infection is higher than in previous sepsis subtyping cohorts.

Annals of Applied Statistics, vol. 19, no. 3, pp. 2193-2217. Publication date: 2025-09-01. Type: Journal Article. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12422288/pdf/
Citations: 0
BAYESIAN DIFFERENTIAL CAUSAL DIRECTED ACYCLIC GRAPHS FOR OBSERVATIONAL ZERO-INFLATED COUNTS WITH AN APPLICATION TO TWO-SAMPLE SINGLE-CELL DATA.
IF 1.4, Zone 4 (Mathematics), Q2 STATISTICS & PROBABILITY. Pub Date: 2025-09-01. Epub Date: 2025-08-28. DOI: 10.1214/25-aoas2042
Junsouk Choi, Robert S Chapkin, Yang Ni

Observational zero-inflated count data arise in a wide range of areas such as genomics. One of the common research questions is to identify causal relationships by learning the structure of a sparse directed acyclic graph (DAG). While structure learning of DAGs has been an active research area, existing methods do not adequately account for excessive zeros and therefore are not suitable for modeling zero-inflated count data. Moreover, it is often interesting to study differences in the causal networks for data collected from two experimental groups (control vs treatment). To explicitly account for zero-inflation and identify differential causal networks, we propose a novel Bayesian differential zero-inflated negative binomial DAG (DAG0) model. We prove that the causal relationships under the proposed DAG0 are fully identifiable from purely observational, cross-sectional data, using a general proof technique that is applicable beyond the proposed model. Bayesian inference based on parallel-tempered Markov chain Monte Carlo is developed to efficiently explore the multi-modal posterior landscape. We demonstrate the utility of the proposed DAG0 by comparing it with state-of-the-art alternative methods through extensive simulations. An application to a single-cell RNA-sequencing dataset generated under two experimental groups yields results that appear to be consistent with existing knowledge. A user-friendly R package that implements DAG0 is available at https://github.com/junsoukchoi/BayesDAG0.git.
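
To make the phrase "zero-inflated negative binomial" concrete, the sketch below evaluates the probability mass function of a standard ZINB distribution: a point mass at zero mixed with a negative binomial count. The parameterization (mean `mu`, dispersion `size`, extra-zero probability `pi_zero`) and the function name are assumptions chosen for illustration; this is not code from the BayesDAG0 package, and it does not cover the DAG structure or the two-group comparison.

```python
import math

def zinb_pmf(y, mu, size, pi_zero):
    """P(Y = y) for a zero-inflated negative binomial.

    mu      : mean of the negative binomial component (> 0)
    size    : dispersion parameter of the negative binomial (> 0)
    pi_zero : probability of a structural (extra) zero, in [0, 1)
    """
    # Negative binomial log-pmf in the mean/dispersion parameterization.
    log_nb = (
        math.lgamma(y + size) - math.lgamma(size) - math.lgamma(y + 1)
        + size * math.log(size / (size + mu))
        + y * math.log(mu / (size + mu))
    )
    nb = math.exp(log_nb)
    if y == 0:
        # A zero can come from the point mass or from the count component.
        return pi_zero + (1.0 - pi_zero) * nb
    return (1.0 - pi_zero) * nb

# Hypothetical parameter values: compare the zero probability with and
# without zero inflation.
print(zinb_pmf(0, mu=2.0, size=1.5, pi_zero=0.0))
print(zinb_pmf(0, mu=2.0, size=1.5, pi_zero=0.3))
```

The second call assigns noticeably more probability to zero than the plain negative binomial does, which is the excess-zero behavior that the abstract says existing DAG methods do not account for.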

Annals of Applied Statistics, vol. 19, no. 3, pp. 1908-1930. Publication date: 2025-09-01. Type: Journal Article. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12395422/pdf/
Citations: 0
AVERAGED PREDICTION MODELS (APM): IDENTIFYING CAUSAL EFFECTS IN CONTROLLED PRE-POST SETTINGS WITH APPLICATION TO GUN POLICY.
IF 1.4, Zone 4 (Mathematics), Q2 STATISTICS & PROBABILITY. Pub Date: 2025-09-01. Epub Date: 2025-08-28. DOI: 10.1214/25-aoas2011
Thomas Leavitt, Laura A Hatfield

To investigate causal impacts, many researchers use controlled pre-post designs that compare over-time differences between a population exposed to a policy change and an unexposed comparison group. However, researchers using these designs often disagree about the "correct" specification of the causal model, perhaps most notably in analyses to identify the effects of gun policies on crime. To help settle these model specification debates, we propose a general identification framework that unifies a variety of models researchers use in practice. In this framework, which nests "brand name" designs like Difference-in-Differences as special cases, we use models to predict untreated outcomes and then correct the treated group's predictions using the comparison group's observed prediction errors. Our point-identifying assumption is that treated and comparison groups would have equal prediction errors (in expectation) under no treatment. To choose among candidate models, we propose a data-driven procedure based on models' robustness to violations of this point-identifying assumption. Our selection procedure averages over candidate models, weighting by each model's posterior probability of being the most robust given its differential average prediction errors in the pre-period. This approach offers a way out of debates over the "correct" model by choosing on robustness instead and has the desirable property of being feasible in the "locked box" of pre-intervention data only. We apply our methodology to the gun policy debate, focusing specifically on Missouri's 2007 repeal of its permit-to-purchase law, and provide an R package (apm) for implementation.
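
The identification idea described above (predict untreated outcomes, then subtract the comparison group's prediction error from the treated group's) can be sketched in a few lines. Everything below is a simplified assumption for illustration: the function names, the two toy prediction models, and the made-up group means are hypothetical, and the model-averaging and robustness-weighting steps of the paper are not implemented. This is not the interface of the apm R package.

```python
import numpy as np

def corrected_effect(y_treated, y_comparison, predict):
    """Effect estimate for the treated group in the final period.

    y_treated, y_comparison : group mean outcomes over time; the last
                              entry is the single post-treatment period.
    predict                 : maps a pre-period series to a prediction of
                              the untreated outcome in the final period.
    """
    err_treated = y_treated[-1] - predict(y_treated[:-1])
    err_comparison = y_comparison[-1] - predict(y_comparison[:-1])
    # Under the equal-expected-prediction-error assumption, the comparison
    # group's error estimates the treated group's error absent treatment.
    return err_treated - err_comparison

def last_value(pre):
    """Predict the final outcome by the last pre-period outcome.
    With this model the estimator equals Difference-in-Differences."""
    return pre[-1]

def linear_trend(pre):
    """Predict the final outcome by extrapolating a fitted line."""
    t = np.arange(len(pre))
    slope, intercept = np.polyfit(t, pre, 1)
    return slope * len(pre) + intercept

# Hypothetical group means: three pre-periods, one post-period.
treated = np.array([10.0, 11.0, 12.0, 16.0])
comparison = np.array([5.0, 6.0, 7.0, 8.0])

print("last-value model:", corrected_effect(treated, comparison, last_value))
print("linear-trend model:", corrected_effect(treated, comparison, linear_trend))
```

With the last-value model the estimate reduces to the familiar Difference-in-Differences contrast, (16 - 12) - (8 - 7), which illustrates how a "brand name" design arises as a special case of the prediction-error correction.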

Annals of Applied Statistics, vol. 19, no. 3, pp. 1826-1846. Publication date: 2025-09-01. Type: Journal Article. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12633725/pdf/
Citations: 0
SURROGATE SELECTION OVERSAMPLES EXPANDED T CELL CLONOTYPES.
IF 1.4, Zone 4 (Mathematics), Q2 STATISTICS & PROBABILITY. Pub Date: 2025-09-01. Epub Date: 2025-08-28. DOI: 10.1214/25-aoas2032
Peng Yu, Yumin Lian, Elliot Xie, Cindy L Zuleger, Richard J Albertini, Mark R Albertini, Michael A Newton

Surrogate selection is an experimental design that, without sequencing any DNA, can restrict a sample of cells to those carrying certain genomic mutations. In immunological disease studies, this design may provide a relatively easy approach to enrich a lymphocyte sample with cells relevant to the disease response because the emergence of neutral mutations is associated with the proliferation history of clonal subpopulations. A statistical analysis of clonotype sizes provides a structured, quantitative perspective on this useful property of surrogate selection. Our model specification couples within-clonotype birth-death processes with an exchangeable model across clonotypes. Beyond enrichment questions about the surrogate selection design, our framework enables a study of sampling properties of elementary sample diversity statistics; it also points to new statistics that may usefully measure the burden of somatic genomic alterations associated with clonal expansion. We examine statistical properties of immunological samples governed by the coupled model specification, and we illustrate calculations in surrogate selection studies of melanoma and in single-cell genomic studies of T cell repertoires.
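
For readers unfamiliar with birth-death processes, the sketch below simulates the size of a single clonotype under a simple linear birth-death process, in which each cell divides or dies at a constant per-cell rate. The rates, time horizon, and function name are hypothetical choices for illustration; the abstract does not give the paper's exact within-clonotype specification, and the exchangeable model across clonotypes is not represented here.

```python
import random

def simulate_clone_size(birth_rate, death_rate, horizon, rng):
    """Simulate a linear birth-death process started from one cell and
    return the number of cells alive at time `horizon`."""
    size, t = 1, 0.0
    while size > 0:
        # Each cell independently divides or dies, so the waiting time to
        # the next event is exponential with rate size * (birth + death).
        t += rng.expovariate(size * (birth_rate + death_rate))
        if t > horizon:
            break
        if rng.random() < birth_rate / (birth_rate + death_rate):
            size += 1   # a cell divides
        else:
            size -= 1   # a cell dies
    return size

# Hypothetical rates: births slightly outpace deaths.
rng = random.Random(1)
sizes = [simulate_clone_size(1.0, 0.8, 5.0, rng) for _ in range(2000)]

extinct = sum(s == 0 for s in sizes) / len(sizes)
print("mean clone size:", sum(sizes) / len(sizes))
print("fraction extinct:", extinct)
print("largest clone:", max(sizes))
```

Even with identical rates for every clone, the simulated sizes are highly skewed: many clones die out while a few expand substantially, which gives a sense of why clonotype size distributions are a natural object for statistical modeling.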

Citations: 0
CONTRASTIVE LINEAR REGRESSION.
IF 1.4 Zone 4 Mathematics Q2 STATISTICS & PROBABILITY Pub Date: 2025-09-01 Epub Date: 2025-08-28 DOI: 10.1214/24-aoas1977
Boyang Zhang, Sarah Nyquist, Andrew Jones, Barbara E Engelhardt, Didong Li

Contrastive dimension reduction methods have been developed for case-control study data to identify variation that is enriched in the foreground (case) data X relative to the background (control) data Y. Here we develop contrastive regression for the setting where there is a response variable r associated with each foreground observation. This situation occurs frequently when, for example, the unaffected controls do not have a disease grade or intervention dosage, but the affected cases do, as in autism severity, solid tumor stages, polyp sizes, or warfarin dosages. Our contrastive regression model captures shared low-dimensional variation between the predictors in the case and control groups and then explains the case-specific response variables through the variance that remains in the predictors after shared variation is removed. We show that, in one single-cell RNA sequencing dataset on cellular differentiation in chronic rhinosinusitis with and without nasal polyps and in another single-nucleus RNA sequencing dataset on autism severity in postmortem brain samples from donors with and without autism, our contrastive linear regression performs feature ranking and identifies biologically informative predictors associated with the response that cannot be identified using other approaches.
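A rough way to picture "remove shared variation, then regress" is the two-step sketch below. It is a deliberate simplification and not the model proposed in the paper: the shared subspace is approximated by the top `k` principal directions of the background data, the foreground predictors are projected onto the orthogonal complement, and the response is fit by ordinary least squares. The function name `contrastive_ls`, the choice of `k`, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def contrastive_ls(X, Y, r, k):
    """Two-step sketch: strip background-dominated directions from X,
    then regress the response r on what remains.

    X : (n, p) foreground (case) predictors
    Y : (m, p) background (control) predictors
    r : (n,)   response observed only for foreground samples
    k : number of background principal directions treated as shared
    """
    # Top-k principal directions of the centered background data.
    Yc = Y - Y.mean(axis=0)
    _, _, Vt = np.linalg.svd(Yc, full_matrices=False)
    V = Vt[:k].T                      # (p, k), orthonormal columns

    # Project the centered foreground onto the complement of span(V).
    Xc = X - X.mean(axis=0)
    X_res = Xc - Xc @ V @ V.T

    # Least-squares fit of the centered response on the residual predictors.
    beta, *_ = np.linalg.lstsq(X_res, r - r.mean(), rcond=None)
    return beta, X_res, V

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))   # synthetic foreground data
Y = rng.normal(size=(80, 6))    # synthetic background data
r = rng.normal(size=100)        # synthetic response
beta, X_res, V = contrastive_ls(X, Y, r, k=2)
print(beta.shape)
```

Ranking predictors by the magnitude of the entries of `beta` would give a crude feature ranking; the paper's joint model is more involved than this sequential version.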

Citations: 0
NETWORK-BASED MODELING OF EMOTIONAL EXPRESSIONS FOR MULTIPLE CANCERS VIA A LINGUISTIC ANALYSIS OF AN ONLINE HEALTH COMMUNITY.
IF 1.4 Zone 4 Mathematics Q2 STATISTICS & PROBABILITY Pub Date: 2025-09-01 Epub Date: 2025-08-28 DOI: 10.1214/25-aoas2047
Xinyan Fan, Mengque Liu, Shuangge Ma

The diagnosis and treatment of cancer can evoke a variety of adverse emotions. Online health communities (OHCs) provide a safe platform for cancer patients and those close to them to express emotions without fear of judgement or stigma. In the literature, linguistic analysis of OHCs is usually limited to a single disease and based on methods with various technical limitations. In this article, we analyze posts from September 2003 to September 2022 on eight cancers that are publicly available at the American Cancer Society's Cancer Survivors Network (CSN). We propose a novel network analysis technique based on low-rank matrices. The proposed approach decomposes the emotional expression semantic networks into an across-cancer time-independent component (which describes the "baseline" that is shared by multiple cancers), a cancer-specific time-independent component (which describes cancer-specific properties), and an across-cancer time-dependent component (which accommodates temporal effects on multiple cancer communities). For the second and third components, respectively, we consider a novel clustering structure and a change point structure. A penalization approach is developed, and its theoretical and computational properties are carefully established. The analysis of the CSN data leads to sensible networks and deeper insights into emotions for cancer overall and for specific cancer types.
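The three-component structure described above can be pictured with a plain additive decomposition. The sketch below is only an illustration under strong assumptions: it uses simple averaging over cancers and time points rather than the penalized low-rank estimation, clustering structure, or change point structure of the paper. The array layout `(C, T, p, p)` and the function name `additive_decompose` are hypothetical.

```python
import numpy as np

def additive_decompose(A):
    """Split a stack of networks A[c, t] (shape (C, T, p, p)) into
    a shared baseline, cancer-specific parts, and time-specific parts."""
    baseline = A.mean(axis=(0, 1))             # shared by all cancers and times
    cancer = A.mean(axis=1) - baseline         # (C, p, p), time-independent
    time = A.mean(axis=0) - baseline           # (T, p, p), shared across cancers
    residual = A - baseline - cancer[:, None] - time[None, :]
    return baseline, cancer, time, residual

# Synthetic example: build networks that are exactly additive.
rng = np.random.default_rng(0)
C, T, p = 3, 4, 5
B = rng.normal(size=(p, p))
S = rng.normal(size=(C, p, p))
S -= S.mean(axis=0)                            # cancer effects sum to zero
D = rng.normal(size=(T, p, p))
D -= D.mean(axis=0)                            # time effects sum to zero
A = B + S[:, None] + D[None, :]

baseline, cancer, time, residual = additive_decompose(A)
print(np.abs(residual).max())
```

With exactly additive inputs and zero-sum effects, averaging recovers each component; real data would need the regularization the abstract describes.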

Citations: 0
A DEEP NEURAL NETWORK TWO-PART MODEL AND FEATURE IMPORTANCE TEST FOR SEMI-CONTINUOUS DATA.
IF 1.3 Zone 4 Mathematics Q2 STATISTICS & PROBABILITY Pub Date: 2025-06-01 Epub Date: 2025-05-28 DOI: 10.1214/25-aoas2013
Baiming Zou, Xinlei Mi, Shiyu Wan, Di Wu, James G Xenakis, Jianhua Hu, Fei Zou

Semi-continuous data frequently arise in clinical practice. For example, while many surgical patients still suffer from varying degrees of acute postoperative pain (POP) at some point after surgery (i.e., POP score > 0), others experience none (i.e., POP score = 0), indicating that two distinct data processes are at play. Existing parametric or semi-parametric two-part modeling methods for this type of semi-continuous data can fail to appropriately model the two underlying data processes, as such methods rely heavily on (generalized) linear additive assumptions. However, many factors may interact to jointly influence the experience of POP non-additively and non-linearly. Motivated by this challenge and inspired by the flexibility of deep neural networks (DNN) to accurately approximate complex functions universally, we derive a DNN-based two-part model by adapting the conventional DNN methods with two additional components: a bootstrapping procedure along with a filtering algorithm to boost the stability of the conventional DNN, an approach we denote as sDNN. To improve the interpretability and transparency of sDNN, we further derive a feature importance testing procedure to identify important features associated with the outcome measurements of the two data processes, denoting this approach fsDNN. We show not only that fsDNN offers a statistical inference procedure for each feature under complex association but also that using the identified features can further improve the predictive performance of sDNN. The proposed sDNN- and fsDNN-based two-part models are applied to the analysis of real data from a POP study, where they clearly demonstrate advantages over the existing parametric and semi-parametric two-part models. Further, we conduct extensive numerical studies and draw comparisons with other machine learning methods to demonstrate that sDNN and fsDNN consistently outperform the existing two-part models and frequently used machine learning methods regardless of the data complexity. An R package implementing the proposed methods has been developed; it is available in the Supplementary Material (Zou et al., 2025) and is also deposited on GitHub (https://github.com/BZou-lab/fsDNN).
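The two-part idea can be made concrete with a very small example. The sketch below is not sDNN or fsDNN: it has no covariates and no neural network, and it only shows how a two-part model splits a semi-continuous outcome into (1) the probability of a positive value and (2) the mean of the positive values, with a bootstrap average added as a stand-in for the stabilizing resampling step. The function names and the toy scores are hypothetical.

```python
import random

def two_part_fit(y):
    """Intercept-only two-part fit: P(Y > 0) and E[Y | Y > 0]."""
    positives = [v for v in y if v > 0]
    p_pos = len(positives) / len(y)
    mean_pos = sum(positives) / len(positives) if positives else 0.0
    return p_pos, mean_pos

def bootstrap_two_part(y, n_boot, rng):
    """Average the combined estimate P(Y > 0) * E[Y | Y > 0] over
    bootstrap resamples of the data."""
    estimates = []
    for _ in range(n_boot):
        sample = [rng.choice(y) for _ in y]   # resample with replacement
        p_pos, mean_pos = two_part_fit(sample)
        estimates.append(p_pos * mean_pos)
    return sum(estimates) / n_boot

# Toy pain scores: zeros are patients with no postoperative pain.
pop_scores = [0, 0, 0, 2, 3, 0, 5, 1, 0, 4]
p_pos, mean_pos = two_part_fit(pop_scores)
overall = p_pos * mean_pos                    # E[Y] = P(Y > 0) * E[Y | Y > 0]
boot = bootstrap_two_part(pop_scores, 500, random.Random(1))
print(p_pos, mean_pos, overall)
```

In a full two-part model, each part would be a function of patient features (for instance, a classifier for the zero/non-zero part and a regressor for the positive part); the product structure of the overall mean stays the same.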

Citations: 0
Journal
Annals of Applied Statistics