Pub Date: 2025-06-01 | Epub Date: 2025-05-28 | DOI: 10.1214/24-aoas1955
Di Wang, Wen Ye, Randall Sung, Hui Jiang, Jeremy M G Taylor, Lisa Ly, Kevin He
Prediction of time-to-event data often suffers from rare event rates, small sample sizes, high dimensionality, and low signal-to-noise ratios. Incorporating published prediction models from external large-scale studies is expected to improve prognosis prediction from internal individual-level data. However, existing integration approaches typically assume that the underlying distributions of the external and internal data sources are similar, which is often invalid. To address challenges including heterogeneity, data sharing, and privacy constraints, we propose a failure time integration procedure that uses a discrete-hazard-based Kullback-Leibler discriminatory information measure to quantify the discrepancy between the external models and the internal dataset. The asymptotic properties and simulation results show the advantage of the proposed method over those based solely on internal data. We apply the proposed method to improve prediction performance on a kidney transplant dataset from a local hospital by integrating this small dataset with a published survival model obtained from the national transplant registry.
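As a hedged illustration of the kind of discrepancy measure described above (a generic sketch, not the authors' exact procedure), the Kullback-Leibler divergence between the Bernoulli event distributions implied by external and internal discrete hazards can be computed as:

```python
import numpy as np

def discrete_hazard_kl(h_ext, h_int):
    # KL divergence between the Bernoulli "event at time t" distributions
    # implied by internal vs. external discrete hazards, summed over the
    # time grid. Illustrative only; hazards must lie strictly in (0, 1).
    h_ext, h_int = np.asarray(h_ext, float), np.asarray(h_int, float)
    return float(np.sum(h_int * np.log(h_int / h_ext)
                        + (1 - h_int) * np.log((1 - h_int) / (1 - h_ext))))
```

A large value would signal that the external model disagrees with the internal data and, in an integration procedure of this flavor, should receive less weight.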
"KULLBACK-LEIBLER-BASED DISCRETE FAILURE TIME MODELS FOR INTEGRATION OF PUBLISHED PREDICTION MODELS WITH NEW TIME-TO-EVENT DATASET." Annals of Applied Statistics 19(2): 1167-1189. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12797872/pdf/
Pub Date: 2025-06-01 | Epub Date: 2025-05-28 | DOI: 10.1214/25-aoas2013
Baiming Zou, Xinlei Mi, Shiyu Wan, Di Wu, James G Xenakis, Jianhua Hu, Fei Zou
Semi-continuous data frequently arise in clinical practice. For example, while many surgical patients still suffer from varying degrees of acute postoperative pain (POP) sometime after surgery (i.e., POP score > 0), others experience none (i.e., POP score = 0), indicating that two distinct data processes are at play. Existing parametric or semi-parametric two-part modeling methods for this type of semi-continuous data can fail to appropriately model the two underlying data processes because such methods rely heavily on (generalized) linear additive assumptions. However, many factors may interact to influence the experience of POP jointly, non-additively, and non-linearly. Motivated by this challenge, and inspired by the flexibility of deep neural networks (DNNs) as universal approximators of complex functions, we derive a DNN-based two-part model by augmenting conventional DNN methods with two additional components: a bootstrapping procedure and a filtering algorithm that together boost the stability of the conventional DNN, an approach we denote sDNN. To improve the interpretability and transparency of sDNN, we further derive a feature importance testing procedure to identify important features associated with the outcome measurements of the two data processes, denoting this approach fsDNN. We show that fsDNN not only offers a statistical inference procedure for each feature under complex association but also that using the identified features can further improve the predictive performance of sDNN. The proposed sDNN- and fsDNN-based two-part models are applied to real data from a POP study, where they clearly demonstrate advantages over the existing parametric and semi-parametric two-part models. Further, we conduct extensive numerical studies and comparisons with other machine learning methods to demonstrate that sDNN and fsDNN consistently outperform the existing two-part models and frequently used machine learning methods regardless of data complexity. An R package implementing the proposed methods is available in the Supplementary Material (Zou et al., 2025) and is also deposited on GitHub (https://github.com/BZou-lab/fsDNN).
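The two-part structure described above rests on the decomposition E[Y] = P(Y > 0) · E[Y | Y > 0]. A minimal simulated sketch of that identity (synthetic data with made-up coefficients; not the sDNN/fsDNN implementation, which replaces the parametric parts with deep networks):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
p_pos = 1 / (1 + np.exp(-(0.5 + x)))                  # part 1: P(POP score > 0)
positive = rng.random(n) < p_pos
y = np.where(positive,
             np.exp(1.0 + 0.5 * x + rng.normal(scale=0.3, size=n)),
             0.0)                                     # part 2: magnitude when positive

# The defining identity of a two-part model: overall mean factors into
# an occurrence probability times a conditional intensity.
p_hat = (y > 0).mean()
mu_hat = y[y > 0].mean()
```

Here `p_hat * mu_hat` reproduces `y.mean()` exactly; two-part methods model each factor separately so the point mass at zero does not distort the continuous part.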
"A DEEP NEURAL NETWORK TWO-PART MODEL AND FEATURE IMPORTANCE TEST FOR SEMI-CONTINUOUS DATA." Annals of Applied Statistics 19(2): 1314-1331. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12263096/pdf/
Pub Date: 2025-06-01 | Epub Date: 2025-05-28 | DOI: 10.1214/24-aoas2007
Xin Liu, Patrick M Schnell
Electronic medical records (EMR) data contain rich information that can facilitate health-related studies but are collected primarily for purposes other than research. For recurrent events, EMR data often do not record event times or counts but only contain intermittently assessed and censored observations (i.e., upper and/or lower bounds for counts in a time interval) at uncontrolled times. This can result in non-contiguous or overlapping assessment intervals with censored event counts. Existing methods for analyzing intermittently assessed recurrent events assume disjoint assessment intervals with known counts (interval count data), owing to a focus on prospective studies with controlled assessment times. We propose a Bayesian data augmentation method to analyze these complicated assessments of recurrent events in EMR data. Within a Gibbs sampler, event times are imputed by generating sets of event times from non-homogeneous Poisson processes and rejecting proposed sets that are incompatible with the constraints imposed by the assessment data. Based on the independent-increments property of Poisson processes, we implement three techniques to speed up this rejection-sampling imputation for large EMR datasets: independent sampling by partitioning, truncated generation, and sequential sampling. In a simulation study, we show that our method accurately estimates the parameters of log-linear Poisson process intensities. Although the proposed method applies generally to EMR data on recurrent events, our study is specifically motivated by identifying risk factors for falls due to cancer treatment and its supportive medications. We used the proposed method to analyze an EMR dataset comprising 5501 patients treated for breast cancer. Our analysis provides evidence supporting associations between certain risk factors (including classes of medications) and risk of falls.
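The rejection-sampling imputation step can be sketched as follows. This is a simplified homogeneous-rate version with hypothetical interval bounds; the paper uses non-homogeneous intensities and adds the partitioning, truncation, and sequential-sampling speedups:

```python
import numpy as np

def impute_event_times(rate, T, intervals, rng):
    # Propose a full set of event times from a homogeneous Poisson process
    # on [0, T]; reject proposals whose counts violate any censored
    # interval constraint. intervals: (start, end, lower, upper) bounds.
    while True:
        n = rng.poisson(rate * T)
        times = np.sort(rng.uniform(0.0, T, size=n))
        ok = all(lo <= int(((times >= a) & (times < b)).sum()) <= hi
                 for a, b, lo, hi in intervals)
        if ok:
            return times
```

Accepted draws are exact samples from the process conditional on the interval constraints, which is what makes them usable as imputations inside a Gibbs sampler; the cost is a potentially high rejection rate, motivating the speedups described above.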
"BAYESIAN DATA AUGMENTATION FOR RECURRENT EVENTS UNDER INTERMITTENT ASSESSMENT IN OVERLAPPING INTERVALS WITH APPLICATIONS TO EMR DATA." Annals of Applied Statistics 19(2): 1332-1361. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12393837/pdf/
Pub Date: 2025-03-01 | Epub Date: 2025-03-17 | DOI: 10.1214/24-aoas1970
Haotian Zou, Luo Xiao, Donglin Zeng, Sheng Luo
Alzheimer's Disease (AD) is a common neurodegenerative disorder impairing multiple domains. Recent AD studies, for example the Alzheimer's Disease Neuroimaging Initiative (ADNI) study, collect multimodal data to better understand AD severity and progression. To facilitate precision medicine for high-risk individuals, it is essential to develop an AD predictive model that leverages multimodal data and provides accurate personalized predictions of dementia occurrence. In this article we propose a multivariate functional mixed model with longitudinal magnetic resonance imaging data (MFMM-LMRI) that jointly models longitudinal neurological scores, longitudinal voxelwise MRI data, and the survival outcome of dementia onset. We model the longitudinal MRI data using the joint and individual variation explained (JIVE) approach and investigate two functional forms linking the longitudinal and survival processes. We adopt Markov chain Monte Carlo (MCMC) to obtain posterior samples and establish a dynamic prediction framework that predicts longitudinal trajectories and the probability of dementia occurrence. A simulation study with various sample sizes and event rates supports the validity of the method. We apply MFMM-LMRI to the motivating ADNI study and conclude that additional ApoE-ϵ4 alleles and a higher latent disease profile are associated with a higher risk of dementia onset. We detect a significant association between the longitudinal MRI data and the survival outcome. The instantaneous model with longitudinal MRI data achieves the best fit and predictive performance.
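Dynamic prediction of this kind ultimately reduces to evaluating conditional survival probabilities given survival up to the landmark time. A generic sketch of that identity on a fitted hazard grid (not the MFMM-LMRI posterior machinery, which updates the hazard with each subject's longitudinal history):

```python
import numpy as np

def dynamic_event_prob(hazard, t_grid, t, delta):
    # P(event in (t, t+delta] | event-free at t), computed from a hazard
    # evaluated on a time grid via S(u) = exp(-∫ hazard).
    dt = np.diff(t_grid, prepend=t_grid[0])
    H = np.cumsum(hazard * dt)        # cumulative hazard on the grid
    S = np.exp(-H)                    # survival function on the grid
    S_t = np.interp(t, t_grid, S)
    S_td = np.interp(t + delta, t_grid, S)
    return 1.0 - S_td / S_t
```

With a constant hazard λ this reduces to 1 − exp(−λδ), which gives a quick sanity check on the grid approximation.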
"DYNAMIC PREDICTION WITH MULTIVARIATE LONGITUDINAL OUTCOMES AND LONGITUDINAL MAGNETIC RESONANCE IMAGING DATA." Annals of Applied Statistics 19(1): 505-528. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12206078/pdf/
Pub Date: 2025-03-01 | Epub Date: 2025-03-17 | DOI: 10.1214/24-aoas1988
Glenn Palmer, Amy H Herring, David B Dunson
Developmental epidemiology commonly focuses on assessing the association between multiple early-life exposures and childhood health. Statistical analyses of data from such studies focus on inferring the contributions of individual exposures while also characterizing time-varying and interacting effects. Such inferences are made more challenging by correlations among exposures, nonlinearity, and the curse of dimensionality. Motivated by studying the effects of prenatal bisphenol A (BPA) and phthalate exposures on glucose metabolism in adolescence using data from the ELEMENT study, we propose a low-rank longitudinal factor regression (LowFR) model for tractable inference on flexible longitudinal exposure effects. LowFR handles highly correlated exposures using a Bayesian dynamic factor model, which is fit jointly with a health outcome via a novel factor regression approach. The model collapses to simpler, intuitive submodels when appropriate, while expanding to allow considerable flexibility in time-varying and interaction effects when supported by the data. After demonstrating LowFR's effectiveness in simulations, we use it to analyze the ELEMENT data and find that diethyl and dibutyl phthalate metabolite levels in trimesters 1 and 2 are associated with altered glucose metabolism in adolescence.
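The core factor-regression idea — compress correlated exposures to a few latent factors, then regress the outcome on them — can be sketched on synthetic data with an SVD standing in for the paper's Bayesian dynamic factor model (all dimensions and coefficients here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 200, 30, 2
F = rng.normal(size=(n, k))                       # latent factors
Lam = rng.normal(size=(p, k))                     # loadings
X = F @ Lam.T + 0.1 * rng.normal(size=(n, p))     # highly correlated exposures
y = F @ np.array([1.0, -0.5]) + 0.1 * rng.normal(size=n)

# Estimate factors by truncated SVD of the centered exposures,
# then regress the outcome on the estimated factor scores.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
F_hat = U[:, :k] * s[:k]
design = np.column_stack([np.ones(n), F_hat])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
resid = y - design @ beta
r2 = 1 - resid.var() / y.var()
```

Because the factors absorb the correlation among exposures, the outcome regression is low-dimensional and stable even when p is large relative to the signal.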
"LOW-RANK LONGITUDINAL FACTOR REGRESSION WITH APPLICATION TO CHEMICAL MIXTURES." Annals of Applied Statistics 19(1): 769-797. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12013532/pdf/
Pub Date: 2025-03-01 | Epub Date: 2025-03-17 | DOI: 10.1214/24-aoas1948
Shounak Chattopadhyay, Stephanie M Engel, David Dunson
There is abundant interest in assessing the joint effects of multiple exposures on human health. This is often referred to as the mixtures problem in environmental epidemiology and toxicology. Classically, studies have examined the adverse health effects of different chemicals one at a time, but there is concern that certain chemicals may act together to amplify each other's effects. Such amplification is referred to as synergistic interaction, while chemicals that inhibit each other's effects have antagonistic interactions. Current approaches for assessing the health effects of chemical mixtures do not explicitly consider synergy or antagonism in the modeling, instead focusing on either parametric or unconstrained nonparametric dose-response surface modeling. The parametric case can be too inflexible, while nonparametric methods face a curse of dimensionality that leads to overly wiggly and uninterpretable surface estimates. We propose a Bayesian approach that decomposes the response surface into additive main effects and pairwise interaction effects and then detects synergistic and antagonistic interactions. Variable selection decisions for each interaction component are also provided. This Synergistic Antagonistic Interaction Detection (SAID) framework is evaluated relative to existing approaches using simulation experiments and an application to data from NHANES.
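In the simplest linear case, the decomposition into main effects plus a pairwise interaction — with synergy or antagonism read off the interaction's sign — looks like the following. This is an illustrative stand-in on simulated data, not the Bayesian nonparametric SAID surface:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = x1 + x2 + 0.5 * x1 * x2 + 0.1 * rng.normal(size=n)   # synergistic pair

# Fit main effects plus one pairwise interaction; a positive interaction
# coefficient indicates synergy, a negative one antagonism.
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
label = "synergistic" if coef[3] > 0 else "antagonistic"
```

SAID generalizes this by letting the main-effect and interaction components be flexible functions and by attaching variable-selection decisions to each interaction term.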
"INFERRING SYNERGISTIC AND ANTAGONISTIC INTERACTIONS IN MIXTURES OF EXPOSURES." Annals of Applied Statistics 19(1): 169-190. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12393835/pdf/
Pub Date: 2025-03-01 | Epub Date: 2025-03-17 | DOI: 10.1214/24-aoas1913
Falco J Bargagli-Stoffi, Costanza Tortú, Laura Forastiere
The bulk of causal inference studies rule out the presence of interference between units. However, in many real-world scenarios, units are interconnected by social, physical, or virtual ties, and the effect of the treatment can spill over from one unit to other connected individuals in the network. In this paper, we develop a machine learning method that uses tree-based algorithms and a Horvitz-Thompson estimator to assess the heterogeneity of treatment and spillover effects with respect to individual, neighborhood, and network characteristics in the context of clustered networks and interference within clusters. The proposed network causal tree (NCT) algorithm has several advantages. First, it allows the investigation of the heterogeneity of the treatment effect, avoiding potential bias due to the presence of interference. Second, understanding the heterogeneity of both treatment and spillover effects can guide policymakers in scaling up interventions, designing targeting strategies, and increasing cost-effectiveness. We investigate the performance of our NCT method using a Monte Carlo simulation study and illustrate its application to assess the heterogeneous effects of information sessions on the uptake of a new weather insurance policy in rural China.
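The Horvitz-Thompson building block — an inverse-probability-weighted contrast under known treatment probabilities — can be sketched as follows (hypothetical inputs; NCT applies such contrasts within tree leaves and across joint treatment/spillover exposure levels):

```python
import numpy as np

def horvitz_thompson(y, treated, pi):
    # Inverse-probability-weighted difference of mean potential outcomes,
    # given each unit's known probability pi of receiving treatment.
    # Unbiased under the design because each observed outcome is
    # reweighted by the inverse of its assignment probability.
    y, treated, pi = (np.asarray(v, float) for v in (y, treated, pi))
    n = len(y)
    return (y * treated / pi).sum() / n - (y * (1 - treated) / (1 - pi)).sum() / n
```

With two units, outcomes (3, 1), assignment (treated, control), and pi = 0.5 for both, the estimator returns 3 − 1 = 2.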
"HETEROGENEOUS TREATMENT AND SPILLOVER EFFECTS UNDER CLUSTERED NETWORK INTERFERENCE." Annals of Applied Statistics 19(1): 28-55. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12245184/pdf/
Pub Date: 2024-12-01 | Epub Date: 2024-10-31 | DOI: 10.1214/24-aoas1904
Nathan B Wikle, Corwin M Zigler
Causal inference with spatial environmental data is often challenging due to the presence of interference: outcomes for observational units depend on some combination of local and nonlocal treatment. This is especially relevant when estimating the effect of power plant emissions controls on population health, as pollution exposure is dictated by (i) the location of point-source emissions and (ii) the transport of pollutants across space via dynamic physical-chemical processes. In this work we estimate the effectiveness of air quality interventions at coal-fired power plants in reducing two adverse health outcomes in Texas in 2016: pediatric asthma ED visits and Medicare all-cause mortality. We develop methods for causal inference with interference when the underlying network structure is not known with certainty and instead must be estimated from ancillary data. Notably, uncertainty in the interference structure is propagated to the resulting causal effect estimates. We offer a Bayesian, spatial mechanistic model for the interference mapping, which we combine with a flexible nonparametric outcome model to marginalize causal effect estimates over uncertainty in the structure of interference. Our analysis finds some evidence that emissions controls at upwind power plants reduce asthma ED visits and all-cause mortality; however, accounting for uncertainty in the interference renders the results largely inconclusive.
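The uncertainty-propagation step — marginalizing effect estimates over posterior draws of the interference network — can be sketched generically as follows. Here `effect_fn` and the network draws are placeholders, not the paper's mechanistic pollution-transport model:

```python
import numpy as np

def marginalize_over_interference(effect_fn, network_draws):
    # Evaluate the causal effect under each posterior draw of the
    # interference structure and pool the draws, so uncertainty about
    # the network widens the reported interval rather than being ignored.
    draws = np.array([effect_fn(G) for G in network_draws])
    return draws.mean(), np.percentile(draws, [2.5, 97.5])
```

Intervals computed this way can straddle zero even when any single fixed network yields a clearly signed effect, which is exactly the "largely inconclusive" phenomenon the abstract reports.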
{"title":"CAUSAL HEALTH IMPACTS OF POWER PLANT EMISSION CONTROLS UNDER MODELED AND UNCERTAIN PHYSICAL PROCESS INTERFERENCE.","authors":"Nathan B Wikle, Corwin M Zigler","doi":"10.1214/24-aoas1904","DOIUrl":"10.1214/24-aoas1904","url":null,"abstract":"<p><p>Causal inference with spatial environmental data is often challenging due to the presence of interference: outcomes for observational units depend on some combination of local and nonlocal treatment. This is especially relevant when estimating the effect of power plant emissions controls on population health, as pollution exposure is dictated by: (i) the location of point-source emissions as well as (ii) the transport of pollutants across space via dynamic physical-chemical processes. In this work we estimate the effectiveness of air quality interventions at coal-fired power plants in reducing two adverse health outcomes in Texas in 2016: pediatric asthma ED visits and Medicare all-cause mortality. We develop methods for causal inference with interference when the underlying network structure is not known with certainty and instead must be estimated from ancillary data. Notably, uncertainty in the interference structure is propagated to the resulting causal effect estimates. We offer a Bayesian, spatial mechanistic model for the interference mapping, which we combine with a flexible nonparametric outcome model to marginalize estimates of causal effects over uncertainty in the structure of interference. 
Our analysis finds some evidence that emissions controls at upwind power plants reduce asthma ED visits and all-cause mortality; however, accounting for uncertainty in the interference renders the results largely inconclusive.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"18 4","pages":"2753-2774"},"PeriodicalIF":1.3,"publicationDate":"2024-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11619076/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142787678","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Mathematics","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
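The marginalization step the abstract describes — computing a causal contrast under each posterior draw of the interference structure and then averaging — can be sketched with a toy Monte Carlo computation. Everything here (the interference matrix `T`, the linear outcome model, the effect sizes) is an illustrative assumption, not the authors' actual Bayesian mechanistic model:

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, n_plants, n_draws = 50, 10, 200

# Hypothetical posterior draws of an interference matrix T: T[i, j] is the
# exposure of unit i to plant j's emissions, uncertain because the
# pollution-transport process is itself modeled from ancillary data.
T_draws = rng.uniform(0.0, 0.2, size=(n_draws, n_units, n_plants))

# Toy linear outcome model: exposure to controlled plants lowers the rate.
def expected_outcome(T, z, baseline=10.0, effect=-3.0):
    return baseline + effect * (T @ z)  # z[j] = 1 if plant j has controls

# Causal contrast (all plants controlled vs. none), computed per posterior
# draw and then averaged: the effect estimate is marginalized over
# uncertainty in the interference structure, and the spread of `effects`
# carries that uncertainty into the interval.
z1, z0 = np.ones(n_plants), np.zeros(n_plants)
effects = np.array([
    (expected_outcome(T, z1) - expected_outcome(T, z0)).mean()
    for T in T_draws
])

lo, hi = np.quantile(effects, [0.025, 0.975])
print(f"marginal effect: {effects.mean():.2f}, 95% interval: ({lo:.2f}, {hi:.2f})")
```

The key design point mirrored here is that the contrast is evaluated separately under each draw of `T` before averaging, rather than plugging in a single point estimate of the interference structure.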
Pub Date : 2024-12-01Epub Date: 2024-10-31DOI: 10.1214/24-aoas1935
Nicholas Rios, Lingzhou Xue, Xiang Zhan
It is quite common to encounter compositional data in a regression framework. When both responses and predictors are compositional, most existing models rely on a family of log-ratio-based transformations to move the analysis from the simplex to the reals. This often makes the interpretation of the model more complex. A transformation-free regression model was recently developed, but it only allows for a single compositional predictor. However, many datasets include multiple compositional predictors of interest. Motivated by an application to hydrothermal liquefaction (HTL) data, a novel extension of this transformation-free regression model is provided that allows for two (or more) compositional predictors to be used via a latent variable mixture. A modified expectation-maximization algorithm is proposed to estimate model parameters, which are shown to have natural interpretations. Conformal inference is used to obtain prediction limits on the compositional response. The resulting methodology is applied to the HTL dataset. Extensions to multiple predictors are discussed.
{"title":"A LATENT VARIABLE MIXTURE MODEL FOR COMPOSITION-ON-COMPOSITION REGRESSION WITH APPLICATION TO CHEMICAL RECYCLING.","authors":"Nicholas Rios, Lingzhou Xue, Xiang Zhan","doi":"10.1214/24-aoas1935","DOIUrl":"10.1214/24-aoas1935","url":null,"abstract":"<p><p>It is quite common to encounter compositional data in a regression framework. When both responses and predictors are compositional, most existing models rely on a family of log-ratio-based transformations to move the analysis from the simplex to the reals. This often makes the interpretation of the model more complex. A transformation-free regression model was recently developed, but it only allows for a single compositional predictor. However, many datasets include multiple compositional predictors of interest. Motivated by an application to hydrothermal liquefaction (HTL) data, a novel extension of this transformation-free regression model is provided that allows for two (or more) compositional predictors to be used via a latent variable mixture. A modified expectation-maximization algorithm is proposed to estimate model parameters, which are shown to have natural interpretations. Conformal inference is used to obtain prediction limits on the compositional response. The resulting methodology is applied to the HTL dataset. 
Extensions to multiple predictors are discussed.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"18 4","pages":"3253-3273"},"PeriodicalIF":1.4,"publicationDate":"2024-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12448131/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145114836","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Mathematics","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
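As a rough sketch of how a mixture of two transformation-free compositional regressions can keep predictions on the simplex without any log-ratio transformation: if each coefficient matrix has rows that are themselves compositions, the prediction is a convex combination of compositions and so remains a composition. The mixing weight `pi` and all matrices below are illustrative assumptions, not the fitted HTL model:

```python
import numpy as np

rng = np.random.default_rng(1)

d1, d2, dy = 4, 3, 5  # dimensions of the two predictor compositions and the response

# Coefficient matrices whose rows are compositions (each row sums to 1), so
# x @ B is a weighted average of rows of B and stays on the simplex.
B1 = rng.dirichlet(np.ones(dy), size=d1)  # shape (d1, dy)
B2 = rng.dirichlet(np.ones(dy), size=d2)  # shape (d2, dy)

pi = 0.6  # latent mixing weight between the two compositional predictors

def predict(x1, x2):
    """Latent-mixture of two transformation-free regressions:
    E[Y | x1, x2] = pi * (x1 @ B1) + (1 - pi) * (x2 @ B2)."""
    return pi * x1 @ B1 + (1 - pi) * x2 @ B2

x1 = rng.dirichlet(np.ones(d1))  # first compositional predictor
x2 = rng.dirichlet(np.ones(d2))  # second compositional predictor
y_hat = predict(x1, x2)

print(y_hat, y_hat.sum())  # predicted composition; components sum to 1
```

Because `x1`, `x2`, and the rows of `B1`, `B2` all lie on their simplices, `y_hat` is nonnegative and sums to one by construction — no back-transformation step is needed.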
Pub Date : 2024-12-01Epub Date: 2024-10-31DOI: 10.1214/24-AOAS1938
Jill Hasler, Yanyuan Ma, Yizheng Wei, Ravi Parikh, Jinbo Chen
When using electronic health records (EHRs) for clinical and translational research, additional data is often available from external sources to enrich the information extracted from EHRs. For example, academic biobanks have more granular data available, and patient-reported data is often collected through small-scale surveys. It is common that the external data is available only for a small subset of patients who have EHR information. We propose efficient and robust methods for building and evaluating models for predicting the risk of binary outcomes using such integrated EHR data. Our method builds on an idea from the two-phase design literature: modeling the availability of a patient's external data as a function of an EHR-based preliminary predictive score leads to effective utilization of the EHR data. Through both theoretical and simulation studies, we show that our method has high efficiency for estimating log-odds ratio parameters, the area under the ROC curve, and other measures of predictive accuracy. We apply our method to develop a model for predicting the short-term mortality risk of oncology patients, where the data were extracted from the University of Pennsylvania hospital system EHR and combined with survey-based patient-reported outcome data.
{"title":"A SEMIPARAMETRIC METHOD FOR RISK PREDICTION USING INTEGRATED ELECTRONIC HEALTH RECORD DATA.","authors":"Jill Hasler, Yanyuan Ma, Yizheng Wei, Ravi Parikh, Jinbo Chen","doi":"10.1214/24-AOAS1938","DOIUrl":"10.1214/24-AOAS1938","url":null,"abstract":"<p><p>When using electronic health records (EHRs) for clinical and translational research, additional data is often available from external sources to enrich the information extracted from EHRs. For example, academic biobanks have more granular data available, and patient-reported data is often collected through small-scale surveys. It is common that the external data is available only for a small subset of patients who have EHR information. We propose efficient and robust methods for building and evaluating models for predicting the risk of binary outcomes using such integrated EHR data. Our method builds on an idea from the two-phase design literature: modeling the availability of a patient's external data as a function of an EHR-based preliminary predictive score leads to effective utilization of the EHR data. Through both theoretical and simulation studies, we show that our method has high efficiency for estimating log-odds ratio parameters, the area under the ROC curve, and other measures of predictive accuracy. 
We apply our method to develop a model for predicting the short-term mortality risk of oncology patients, where the data were extracted from the University of Pennsylvania hospital system EHR and combined with survey-based patient-reported outcome data.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"18 4","pages":"3318-3337"},"PeriodicalIF":1.3,"publicationDate":"2024-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11934126/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143711932","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Mathematics","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
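A minimal sketch of the two-phase idea on simulated data: fit a preliminary EHR-only risk score, let the availability of external data depend on that score, then fit the full model on the subset with external data using inverse-probability weights estimated from the score. IPW here is a simple stand-in for the authors' semiparametric estimator, and all variable names and effect sizes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)

def fit_logistic(X, y, w=None, n_iter=25):
    """Weighted logistic regression via Newton-Raphson (intercept in column 0)."""
    n, p = X.shape
    w = np.ones(n) if w is None else w
    beta = np.zeros(p)
    for _ in range(n_iter):
        mu = 1 / (1 + np.exp(-X @ beta))
        grad = X.T @ (w * (y - mu))                    # weighted score
        H = (X * (w * mu * (1 - mu))[:, None]).T @ X   # weighted information
        beta += np.linalg.solve(H, grad)
    return beta

n = 4000
ehr = rng.normal(size=(n, 2))  # EHR covariates, available for all patients
ext = rng.normal(size=n)       # external covariate, observed for a subset only
logit = -2 + ehr @ np.array([0.8, -0.5]) + 0.7 * ext
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))  # binary outcome

# Phase 1: preliminary EHR-only predictive score.
X1 = np.column_stack([np.ones(n), ehr])
score = X1 @ fit_logistic(X1, y.astype(float))

# Phase 2: external data availability depends on the preliminary score,
# as in a two-phase sampling design.
sel = rng.binomial(1, 1 / (1 + np.exp(-(1 + score)))).astype(bool)

# Estimate selection probabilities from the score, then fit the full model
# on the selected subset with inverse-probability weights.
S = np.column_stack([np.ones(n), score])
p_hat = 1 / (1 + np.exp(-(S @ fit_logistic(S, sel.astype(float)))))
X_full = np.column_stack([np.ones(n), ehr, ext])
beta = fit_logistic(X_full[sel], y[sel].astype(float), w=1 / p_hat[sel])

print("estimated log-odds ratios:", beta.round(2))  # true values: -2, 0.8, -0.5, 0.7
```

The point of weighting by the estimated selection probabilities is that the subset with external data is not a random sample — its composition depends on the preliminary score — so an unweighted fit on that subset would be biased for the population log-odds ratios.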