Pub Date : 2026-01-20DOI: 10.1093/biostatistics/kxaf051
Bonnie B Smith, Abhirup Datta, Brian Caffo
A number of domains in biomedical research use data with a large number of predictors all representing the same type of measurement. Often, an important summary is the within-person distribution of these predictors. Here we focus on settings where the mean relationship between outcome and predictors is fully captured by this distribution and, more generally, on problems where the goal is to learn a mapping that is invariant under permutations of the input vector. We compare unstructured neural networks, which do not explicitly incorporate the permutation invariance property, versus networks that we call ordered predictors neural networks. We show in simulations that the unstructured deep learning approach can yield higher prediction errors, compared to the approach that explicitly leverages the invariance to simplify the learning task. Additionally, in the context of neural Bayes estimation, in which neural networks are used to construct point estimators, we show that ordered predictors neural networks can yield substantially more precise estimators. We therefore recommend that, when permutation invariance is known or suspected to hold, investigators use a learning or statistical modeling approach that can leverage the invariance, rather than an unstructured deep learning approach.
{"title":"Shortcomings of deep learning for distributional predictors: a note.","authors":"Bonnie B Smith, Abhirup Datta, Brian Caffo","doi":"10.1093/biostatistics/kxaf051","DOIUrl":"10.1093/biostatistics/kxaf051","url":null,"abstract":"<p><p>A number of domains in biomedical research use data with a large number of predictors all representing the same type of measurement. Often, an important summary is the within-person distribution of these predictors. Here we focus on settings where the mean relationship between outcome and predictors is fully captured by this distribution and, more generally, on problems where the goal is to learn a mapping that is invariant under permutations of the input vector. We compare unstructured neural networks, which do not explicitly incorporate the permutation invariance property, versus networks that we call ordered predictors neural networks. We show in simulations that the unstructured deep learning approach can yield higher prediction errors, compared to the approach that explicitly leverages the invariance to simplify the learning task. Additionally, in the context of neural Bayes estimation, in which neural networks are used to construct point estimators, we show that ordered predictors neural networks can yield substantially more precise estimators. We therefore recommend that, when permutation invariance is known or suspected to hold, investigators use a learning or statistical modeling approach that can leverage the invariance, rather than an unstructured deep learning approach.</p>","PeriodicalId":55357,"journal":{"name":"Biostatistics","volume":"27 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2026-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12815889/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146004720","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2026-01-20DOI: 10.1093/biostatistics/kxaf049
Alvin Sheng, Brian J Reich, Ana-Maria Staicu, Santhoshi N Krishnan, Arvind Rao, Timothy L Frankel
Recent advances in multiplex imaging have enabled researchers to locate different types of cells within a tissue sample. This is especially relevant for tumor immunology, as clinical regimes corresponding to different stages of disease or responses to treatment may manifest as different spatial arrangements of tumor and immune cells. Spatial point pattern modeling can be used to partition multiplex tissue images according to these regimes. To this end, we propose a two-stage approach: first, local intensities and pair correlation functions are estimated from the spatial point pattern of cells within each image, and the pair correlation functions are reduced in dimension via spectral decomposition of the covariance function. Second, the estimates are clustered in a Bayesian hierarchical model with spatially-dependent cluster labels. The clusters correspond to regimes of interest that are present across subjects; the cluster labels segment the spatial point patterns according to those regimes. Through Markov Chain Monte Carlo sampling, we jointly estimate and quantify uncertainty in the cluster assignment and spatial characteristics of each cluster. Simulations demonstrate the performance of the method, and it is applied to a set of multiplex immunofluorescence images of diseased pancreatic tissue.
{"title":"A two-stage approach for segmenting spatial point patterns applied to multiplex imaging.","authors":"Alvin Sheng, Brian J Reich, Ana-Maria Staicu, Santhoshi N Krishnan, Arvind Rao, Timothy L Frankel","doi":"10.1093/biostatistics/kxaf049","DOIUrl":"10.1093/biostatistics/kxaf049","url":null,"abstract":"<p><p>Recent advances in multiplex imaging have enabled researchers to locate different types of cells within a tissue sample. This is especially relevant for tumor immunology, as clinical regimes corresponding to different stages of disease or responses to treatment may manifest as different spatial arrangements of tumor and immune cells. Spatial point pattern modeling can be used to partition multiplex tissue images according to these regimes. To this end, we propose a two-stage approach: first, local intensities and pair correlation functions are estimated from the spatial point pattern of cells within each image, and the pair correlation functions are reduced in dimension via spectral decomposition of the covariance function. Second, the estimates are clustered in a Bayesian hierarchical model with spatially-dependent cluster labels. The clusters correspond to regimes of interest that are present across subjects; the cluster labels segment the spatial point patterns according to those regimes. Through Markov Chain Monte Carlo sampling, we jointly estimate and quantify uncertainty in the cluster assignment and spatial characteristics of each cluster. Simulations demonstrate the performance of the method, and it is applied to a set of multiplex immunofluorescence images of diseased pancreatic tissue.</p>","PeriodicalId":55357,"journal":{"name":"Biostatistics","volume":"27 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2026-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12815891/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146004726","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2026-01-20DOI: 10.1093/biostatistics/kxaf052
Jessie K Edwards, Stephen R Cole, Paul N Zivich, Benjamin Ackerman, Sonia Napravnik, Heather Henderson, Timothy Lash, Bonnie E Shook-Sa
Mortality risk estimated from studies that ascertain date of death through linkage to vital statistics registries may be subject to outcome measurement error. As a result, some deaths among study participants may not be captured, some study participants who are alive may be falsely categorized as deceased, and some deaths may be recorded at incorrect times, leading to bias in estimates of mortality risk and survival. Here, we illustrate an extension of the Rogan-Gladen estimator to account for outcome measurement error in risk and survival functions in settings with right censoring. As a motivating application, we consider and account for outcome measurement error that could be induced by incomplete and/or incorrect linkage to death registries when estimating mortality risk among people entering care for HIV in the University of North Carolina Center for AIDS Research HIV Clinical Cohort between 2001 and 2022. A series of simulation studies demonstrates that the approach performed well even when participants selected into the validation study were at higher mortality risk than the main study. The proposed approach may be parameterized using internal or external validation data or used as a form of quantitative bias analysis.
{"title":"Risk functions with outcome measurement error.","authors":"Jessie K Edwards, Stephen R Cole, Paul N Zivich, Benjamin Ackerman, Sonia Napravnik, Heather Henderson, Timothy Lash, Bonnie E Shook-Sa","doi":"10.1093/biostatistics/kxaf052","DOIUrl":"10.1093/biostatistics/kxaf052","url":null,"abstract":"<p><p>Mortality risk estimated from studies that ascertain date of death through linkage to vital statistics registries may be subject to outcome measurement error. As a result, some deaths among study participants may not be captured, some study participants who are alive may be falsely categorized as deceased, and some deaths may be recorded at incorrect times, leading to bias in estimates of mortality risk and survival. Here, we illustrate an extension of the Rogan-Gladen estimator to account for outcome measurement error in risk and survival functions in settings with right censoring. As a motivating application, we consider and account for outcome measurement error that could be induced by incomplete and/or incorrect linkage to death registries when estimating mortality risk among people entering care for HIV in the University of North Carolina Center for AIDS Research HIV Clinical Cohort between 2001 and 2022. A series of simulation studies demonstrates that the approach performed well even when participants selected into the validation study were at higher mortality risk than the main study. The proposed approach may be parameterized using internal or external validation data or used as a form of quantitative bias analysis.</p>","PeriodicalId":55357,"journal":{"name":"Biostatistics","volume":"27 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2026-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12815892/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146004702","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-31DOI: 10.1093/biostatistics/kxae044
Anastasiia Holovchak, Helen McIlleron, Paolo Denti, Michael Schomaker
Missing data in multiple variables is a common issue. We investigate the applicability of the framework of graphical models for handling missing data to a complex longitudinal pharmacological study of children with HIV treated with an efavirenz-based regimen as part of the CHAPAS-3 trial. Specifically, we examine whether the causal effects of interest, defined through static interventions on multiple continuous variables, can be recovered (estimated consistently) from the available data only. So far, no general algorithms are available to decide on recoverability, and decisions have to be made on a case-by-case basis. We emphasize the sensitivity of recoverability to even the smallest changes in the graph structure, and present recoverability results for three plausible missingness-directed acyclic graphs (m-DAGs) in the CHAPAS-3 study, informed by clinical knowledge. Furthermore, we propose the concept of a "closed missingness mechanism": if missing data are generated based on this mechanism, an available case analysis is admissible for consistent estimation of any statistical or causal estimand, even if data are missing not at random. Both simulations and theoretical considerations demonstrate how, in the assumed MNAR setting of our study, a complete or available case analysis can be superior to multiple imputation, and estimation results vary depending on the assumed missingness DAG. Our analyses demonstrate an innovative application of missingness DAGs to complex longitudinal real-world data, while highlighting the sensitivity of the results with respect to the assumed causal model.
多个变量的缺失数据是一个常见问题。我们研究了处理缺失数据的图形模型框架在一项复杂的纵向药理学研究中的适用性,该研究是 CHAPAS-3 试验的一部分,研究对象是接受以依非韦伦为基础的方案治疗的 HIV 感染儿童。具体来说,我们研究了通过对多个连续变量的静态干预所确定的相关因果效应是否可以仅从现有数据中恢复(一致估计)。到目前为止,还没有可用来决定可恢复性的通用算法,必须根据具体情况做出决定。我们强调了可恢复性对图结构中最小变化的敏感性,并介绍了 CHAPAS-3 研究中三个可信的缺失指向无环图(m-DAG)的可恢复性结果,这些结果是以临床知识为基础的。此外,我们还提出了 "封闭缺失机制 "的概念:如果缺失数据是基于这种机制产生的,那么即使数据不是随机缺失,也可以通过可用的病例分析对任何统计或因果估计进行一致的估计。模拟和理论考虑都表明,在我们研究的假定 MNAR 设置中,完整或可用案例分析如何优于多重估算,估算结果因假定的缺失 DAG 而异。我们的分析展示了缺失 DAG 在复杂的纵向真实世界数据中的创新应用,同时强调了结果对假定因果模型的敏感性。
{"title":"Recoverability of causal effects under presence of missing data: a longitudinal case study.","authors":"Anastasiia Holovchak, Helen McIlleron, Paolo Denti, Michael Schomaker","doi":"10.1093/biostatistics/kxae044","DOIUrl":"10.1093/biostatistics/kxae044","url":null,"abstract":"<p><p>Missing data in multiple variables is a common issue. We investigate the applicability of the framework of graphical models for handling missing data to a complex longitudinal pharmacological study of children with HIV treated with an efavirenz-based regimen as part of the CHAPAS-3 trial. Specifically, we examine whether the causal effects of interest, defined through static interventions on multiple continuous variables, can be recovered (estimated consistently) from the available data only. So far, no general algorithms are available to decide on recoverability, and decisions have to be made on a case-by-case basis. We emphasize the sensitivity of recoverability to even the smallest changes in the graph structure, and present recoverability results for three plausible missingness-directed acyclic graphs (m-DAGs) in the CHAPAS-3 study, informed by clinical knowledge. Furthermore, we propose the concept of a \"closed missingness mechanism\": if missing data are generated based on this mechanism, an available case analysis is admissible for consistent estimation of any statistical or causal estimand, even if data are missing not at random. Both simulations and theoretical considerations demonstrate how, in the assumed MNAR setting of our study, a complete or available case analysis can be superior to multiple imputation, and estimation results vary depending on the assumed missingness DAG. Our analyses demonstrate an innovative application of missingness DAGs to complex longitudinal real-world data, while highlighting the sensitivity of the results with respect to the assumed causal model.</p>","PeriodicalId":55357,"journal":{"name":"Biostatistics","volume":" ","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7617375/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142649856","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Joint modeling of longitudinal and time-to-event data, particularly through shared parameter models (SPMs), is a common approach for handling longitudinal marker data with an informative terminal event. A critical but often neglected assumption in this context is that the visiting/observation process is noninformative, depending solely on past marker values and visit times. When this assumption fails, the visiting process becomes informative, resulting potentially to biased SPM estimates. Existing methods generally rely on a conditional independence assumption, positing that the marker model, visiting process, and time-to-event model are independent given shared or correlated random effects. Moreover, they are typically built on an intensity-based visiting process using calendar time. This study introduces a unified approach for jointly modeling a normally distributed marker, the visiting process, and time-to-event data in the form of competing risks. Our model conditions on the history of observed marker values, prior visit times, the marker's random effects, and possibly a frailty term independent of the random effects. While our approach aligns with the shared-parameter framework, it does not presume conditional independence between the processes. Additionally, the visiting process can be defined on either a gap time scale, via proportional hazard models, or a calendar time scale, via proportional intensity models. Through extensive simulation studies, we assess the performance of our proposed methodology. We demonstrate that disregarding an informative visiting process can yield significantly biased marker estimates. However, misspecification of the visiting process can also lead to biased estimates. The gap time formulation exhibits greater robustness compared to the intensity-based model when the visiting process is misspecified. In general, enriching the visiting process with prior visit history enhances performance. We further apply our methodology to real longitudinal data from HIV, where visit frequency varies substantially among individuals.
{"title":"Shared parameter modeling of longitudinal data allowing for possibly informative visiting process and terminal event.","authors":"Christos Thomadakis, Loukia Meligkotsidou, Nikos Pantazis, Giota Touloumi","doi":"10.1093/biostatistics/kxae041","DOIUrl":"10.1093/biostatistics/kxae041","url":null,"abstract":"<p><p>Joint modeling of longitudinal and time-to-event data, particularly through shared parameter models (SPMs), is a common approach for handling longitudinal marker data with an informative terminal event. A critical but often neglected assumption in this context is that the visiting/observation process is noninformative, depending solely on past marker values and visit times. When this assumption fails, the visiting process becomes informative, resulting potentially to biased SPM estimates. Existing methods generally rely on a conditional independence assumption, positing that the marker model, visiting process, and time-to-event model are independent given shared or correlated random effects. Moreover, they are typically built on an intensity-based visiting process using calendar time. This study introduces a unified approach for jointly modeling a normally distributed marker, the visiting process, and time-to-event data in the form of competing risks. Our model conditions on the history of observed marker values, prior visit times, the marker's random effects, and possibly a frailty term independent of the random effects. While our approach aligns with the shared-parameter framework, it does not presume conditional independence between the processes. Additionally, the visiting process can be defined on either a gap time scale, via proportional hazard models, or a calendar time scale, via proportional intensity models. Through extensive simulation studies, we assess the performance of our proposed methodology. We demonstrate that disregarding an informative visiting process can yield significantly biased marker estimates. However, misspecification of the visiting process can also lead to biased estimates. The gap time formulation exhibits greater robustness compared to the intensity-based model when the visiting process is misspecified. In general, enriching the visiting process with prior visit history enhances performance. We further apply our methodology to real longitudinal data from HIV, where visit frequency varies substantially among individuals.</p>","PeriodicalId":55357,"journal":{"name":"Biostatistics","volume":" ","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11911807/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142513409","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Previous studies have identified attenuated pre-speech activity and speech sound suppression in individuals with Schizophrenia, with similar patterns observed in basic tasks entailing button-pressing to perceive a tone. However, it remains unclear whether these patterns are uniform across individuals or vary from person to person. Motivated by electroencephalographic (EEG) data from a Schizophrenia study, we develop a generalized functional linear mixed model (GFLMM) for repeated measurements by incorporating subject-specific functional random effects associated with multiple functional predictors. To assess the significance of these functional effects, we employ two different multivariate functional principal component analysis methods, which transform the GFLMM into a conventional generalized linear mixed model, thereby facilitating its implementation with standard software. Furthermore, we introduce a cutting-edge testing approach utilizing working responses to detect both subject-specific and predictor-specific functional random effects. Monte Carlo simulation studies demonstrate the effectiveness of our proposed testing method. Application of the proposed methods to the Schizophrenia data reveals significant subject-specific effects of human brain activity in the frontal zone (Fz) and the central zone (Cz), providing valuable insights into the potential variations among individuals, from healthy controls to those diagnosed with Schizophrenia.
{"title":"Unveiling Schizophrenia: a study with generalized functional linear mixed model via the investigation of functional random effects.","authors":"Rongxiang Rui, Wei Xiong, Jianxin Pan, Maozai Tian","doi":"10.1093/biostatistics/kxae049","DOIUrl":"10.1093/biostatistics/kxae049","url":null,"abstract":"<p><p>Previous studies have identified attenuated pre-speech activity and speech sound suppression in individuals with Schizophrenia, with similar patterns observed in basic tasks entailing button-pressing to perceive a tone. However, it remains unclear whether these patterns are uniform across individuals or vary from person to person. Motivated by electroencephalographic (EEG) data from a Schizophrenia study, we develop a generalized functional linear mixed model (GFLMM) for repeated measurements by incorporating subject-specific functional random effects associated with multiple functional predictors. To assess the significance of these functional effects, we employ two different multivariate functional principal component analysis methods, which transform the GFLMM into a conventional generalized linear mixed model, thereby facilitating its implementation with standard software. Furthermore, we introduce a cutting-edge testing approach utilizing working responses to detect both subject-specific and predictor-specific functional random effects. Monte Carlo simulation studies demonstrate the effectiveness of our proposed testing method. Application of the proposed methods to the Schizophrenia data reveals significant subject-specific effects of human brain activity in the frontal zone (Fz) and the central zone (Cz), providing valuable insights into the potential variations among individuals, from healthy controls to those diagnosed with Schizophrenia.</p>","PeriodicalId":55357,"journal":{"name":"Biostatistics","volume":"26 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142933522","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-31DOI: 10.1093/biostatistics/kxae028
Rong Li, Shaodong Xu, Yang Li, Zuojian Tang, Di Feng, James Cai, Shuangge Ma
Cancer is molecularly heterogeneous, with seemingly similar patients having different molecular landscapes and accordingly different clinical behaviors. In recent studies, gene expression networks have been shown as more effective/informative for cancer heterogeneity analysis than some simpler measures. Gene interconnections can be classified as "direct" and "indirect," where the latter can be caused by shared genomic regulators (such as transcription factors, microRNAs, and other regulatory molecules) and other mechanisms. It has been suggested that incorporating the regulators of gene expressions in network analysis and focusing on the direct interconnections can lead to a deeper understanding of the more essential gene interconnections. Such analysis can be seriously challenged by the large number of parameters (jointly caused by network analysis, incorporation of regulators, and heterogeneity) and often weak signals. To effectively tackle this problem, we propose incorporating prior information contained in the published literature. A key challenge is that such prior information can be partial or even wrong. We develop a two-step procedure that can flexibly accommodate different levels of prior information quality. Simulation demonstrates the effectiveness of the proposed approach and its superiority over relevant competitors. In the analysis of a breast cancer dataset, findings different from the alternatives are made, and the identified sample subgroups have important clinical differences.
{"title":"Incorporating prior information in gene expression network-based cancer heterogeneity analysis.","authors":"Rong Li, Shaodong Xu, Yang Li, Zuojian Tang, Di Feng, James Cai, Shuangge Ma","doi":"10.1093/biostatistics/kxae028","DOIUrl":"10.1093/biostatistics/kxae028","url":null,"abstract":"<p><p>Cancer is molecularly heterogeneous, with seemingly similar patients having different molecular landscapes and accordingly different clinical behaviors. In recent studies, gene expression networks have been shown as more effective/informative for cancer heterogeneity analysis than some simpler measures. Gene interconnections can be classified as \"direct\" and \"indirect,\" where the latter can be caused by shared genomic regulators (such as transcription factors, microRNAs, and other regulatory molecules) and other mechanisms. It has been suggested that incorporating the regulators of gene expressions in network analysis and focusing on the direct interconnections can lead to a deeper understanding of the more essential gene interconnections. Such analysis can be seriously challenged by the large number of parameters (jointly caused by network analysis, incorporation of regulators, and heterogeneity) and often weak signals. To effectively tackle this problem, we propose incorporating prior information contained in the published literature. A key challenge is that such prior information can be partial or even wrong. We develop a two-step procedure that can flexibly accommodate different levels of prior information quality. Simulation demonstrates the effectiveness of the proposed approach and its superiority over relevant competitors. In the analysis of a breast cancer dataset, findings different from the alternatives are made, and the identified sample subgroups have important clinical differences.</p>","PeriodicalId":55357,"journal":{"name":"Biostatistics","volume":" ","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12550826/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141794124","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-31DOI: 10.1093/biostatistics/kxae019
Thai-Son Tang, Zhihui Liu, Ali Hosni, John Kim, Olli Saarela
The goal of radiation therapy for cancer is to deliver prescribed radiation dose to the tumor while minimizing dose to the surrounding healthy tissues. To evaluate treatment plans, the dose distribution to healthy organs is commonly summarized as dose-volume histograms (DVHs). Normal tissue complication probability (NTCP) modeling has centered around making patient-level risk predictions with features extracted from the DVHs, but few have considered adapting a causal framework to evaluate the safety of alternative treatment plans. We propose causal estimands for NTCP based on deterministic and stochastic interventions, as well as propose estimators based on marginal structural models that impose bivariable monotonicity between dose, volume, and toxicity risk. The properties of these estimators are studied through simulations, and their use is illustrated in the context of radiotherapy treatment of anal canal cancer patients.
{"title":"A marginal structural model for normal tissue complication probability.","authors":"Thai-Son Tang, Zhihui Liu, Ali Hosni, John Kim, Olli Saarela","doi":"10.1093/biostatistics/kxae019","DOIUrl":"10.1093/biostatistics/kxae019","url":null,"abstract":"<p><p>The goal of radiation therapy for cancer is to deliver prescribed radiation dose to the tumor while minimizing dose to the surrounding healthy tissues. To evaluate treatment plans, the dose distribution to healthy organs is commonly summarized as dose-volume histograms (DVHs). Normal tissue complication probability (NTCP) modeling has centered around making patient-level risk predictions with features extracted from the DVHs, but few have considered adapting a causal framework to evaluate the safety of alternative treatment plans. We propose causal estimands for NTCP based on deterministic and stochastic interventions, as well as propose estimators based on marginal structural models that impose bivariable monotonicity between dose, volume, and toxicity risk. The properties of these estimators are studied through simulations, and their use is illustrated in the context of radiotherapy treatment of anal canal cancer patients.</p>","PeriodicalId":55357,"journal":{"name":"Biostatistics","volume":" ","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11823140/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141565187","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-31DOI: 10.1093/biostatistics/kxae051
Corwin Zigler, Vera Liu, Fabrizia Mealli, Laura Forastiere
Evaluating air quality interventions is confronted with the challenge of interference since interventions at a particular pollution source likely impact air quality and health at distant locations, and air quality and health at any given location are likely impacted by interventions at many sources. The structure of interference in this context is dictated by complex atmospheric processes governing how pollution emitted from a particular source is transformed and transported across space and can be cast with a bipartite structure reflecting the two distinct types of units: (i) interventional units on which treatments are applied or withheld to change pollution emissions; and (ii) outcome units on which outcomes of primary interest are measured. We propose new estimands for bipartite causal inference with interference that construe two components of treatment: a "key-associated" (or "individual") treatment and an "upwind" (or "neighborhood") treatment. Estimation is carried out using a covariate adjustment approach based on a joint propensity score. A reduced-complexity atmospheric model characterizes the structure of the interference network by modeling the movement of air parcels through time and space. The new methods are deployed to evaluate the effectiveness of installing flue-gas desulfurization scrubbers on 472 coal-burning power plants (the interventional units) in reducing Medicare hospitalizations among 21,577,552 Medicare beneficiaries residing across 25,553 ZIP codes in the United States (the outcome units).
{"title":"Bipartite interference and air pollution transport: estimating health effects of power plant interventions.","authors":"Corwin Zigler, Vera Liu, Fabrizia Mealli, Laura Forastiere","doi":"10.1093/biostatistics/kxae051","DOIUrl":"10.1093/biostatistics/kxae051","url":null,"abstract":"<p><p>Evaluating air quality interventions is confronted with the challenge of interference since interventions at a particular pollution source likely impact air quality and health at distant locations, and air quality and health at any given location are likely impacted by interventions at many sources. The structure of interference in this context is dictated by complex atmospheric processes governing how pollution emitted from a particular source is transformed and transported across space and can be cast with a bipartite structure reflecting the two distinct types of units: (i) interventional units on which treatments are applied or withheld to change pollution emissions; and (ii) outcome units on which outcomes of primary interest are measured. We propose new estimands for bipartite causal inference with interference that construe two components of treatment: a \"key-associated\" (or \"individual\") treatment and an \"upwind\" (or \"neighborhood\") treatment. Estimation is carried out using a covariate adjustment approach based on a joint propensity score. A reduced-complexity atmospheric model characterizes the structure of the interference network by modeling the movement of air parcels through time and space. The new methods are deployed to evaluate the effectiveness of installing flue-gas desulfurization scrubbers on 472 coal-burning power plants (the interventional units) in reducing Medicare hospitalizations among 21,577,552 Medicare beneficiaries residing across 25,553 ZIP codes in the United States (the outcome units).</p>","PeriodicalId":55357,"journal":{"name":"Biostatistics","volume":"26 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11823286/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143048850","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-31DOI: 10.1093/biostatistics/kxaf007
Abigail Loe, Susan Murray, Zhenke Wu
Recurrent events are common in clinical, healthcare, social, and behavioral studies, yet methods for dynamic risk prediction of these events are limited. To overcome some long-standing challenges in analyzing censored recurrent event data, a recent regression analysis framework constructs a censored longitudinal dataset consisting of times to the first recurrent event in multiple pre-specified follow-up windows of length $ tau $(XMT models). Traditional regression models struggle with nonlinear and multiway interactions, with success depending on the skill of the statistical programmer. With a staggering number of potential predictors being generated from genetic, -omic, and electronic health records sources, machine learning approaches such as the random forest regression are growing in popularity, as they can nonparametrically incorporate information from many predictors with nonlinear and multiway interactions involved in prediction. In this article, we (i) develop a random forest approach for dynamically predicting probabilities of remaining event-free during a subsequent $ tau $-duration follow-up period from a reconstructed censored longitudinal data set, (ii) modify the XMT regression approach to predict these same probabilities, subject to the limitations that traditional regression models typically have, and (iii) demonstrate how to incorporate patient-specific history of recurrent events for prediction in settings where this information may be partially missing. We show the increased ability of our random forest algorithm for predicting the probability of remaining event-free over a $ tau $-duration follow-up window when compared to our modified XMT method for prediction in settings where association between predictors and recurrent event outcomes is complex in nature. We also show the importance of incorporating past recurrent event history in prediction algorithms when event times are correlated within a subject. The proposed random forest algorithm is demonstrated using recurrent exacerbation data from the trial of Azithromycin for the Prevention of Exacerbations of Chronic Obstructive Pulmonary Disease.
复发事件在临床、医疗保健、社会和行为研究中很常见,但这些事件的动态风险预测方法有限。为了克服一些长期存在的问题,最近的回归分析框架构建了一个经过审查的纵向数据集,该数据集由多个预先指定的长度为$ tau $的后续窗口(XMT模型)中的第一个重复事件的时间组成。传统的回归模型与非线性和多方向的相互作用作斗争,其成功取决于统计程序员的技能。随着从遗传、基因组学和电子健康记录来源生成的潜在预测因子数量惊人,随机森林回归等机器学习方法越来越受欢迎,因为它们可以将来自许多预测因子的信息与预测中涉及的非线性和多向交互非参数化地结合起来。在本文中,我们(i)开发了一种随机森林方法,用于从重建的经审查的纵向数据集动态预测后续$ tau $持续时间随访期间剩余无事件的概率,(ii)修改XMT回归方法来预测这些相同的概率,但受传统回归模型通常具有的局限性的限制。(iii)演示如何在可能部分缺少这些信息的情况下,将患者特定的复发事件历史纳入预测。与改进的XMT方法相比,我们的随机森林算法在预测因子和复发事件结果之间的关联本质上是复杂的情况下,预测在$ tau $持续时间的随访窗口内剩余事件无概率的能力有所提高。我们还展示了当事件时间在主题内相关时,在预测算法中纳入过去循环事件历史的重要性。该随机森林算法使用阿奇霉素预防慢性阻塞性肺疾病加重试验的复发性加重数据进行了验证。
{"title":"Random forest for dynamic risk prediction of recurrent events: a pseudo-observation approach.","authors":"Abigail Loe, Susan Murray, Zhenke Wu","doi":"10.1093/biostatistics/kxaf007","DOIUrl":"10.1093/biostatistics/kxaf007","url":null,"abstract":"<p><p>Recurrent events are common in clinical, healthcare, social, and behavioral studies, yet methods for dynamic risk prediction of these events are limited. To overcome some long-standing challenges in analyzing censored recurrent event data, a recent regression analysis framework constructs a censored longitudinal dataset consisting of times to the first recurrent event in multiple pre-specified follow-up windows of length $ tau $(XMT models). Traditional regression models struggle with nonlinear and multiway interactions, with success depending on the skill of the statistical programmer. With a staggering number of potential predictors being generated from genetic, -omic, and electronic health records sources, machine learning approaches such as the random forest regression are growing in popularity, as they can nonparametrically incorporate information from many predictors with nonlinear and multiway interactions involved in prediction. In this article, we (i) develop a random forest approach for dynamically predicting probabilities of remaining event-free during a subsequent $ tau $-duration follow-up period from a reconstructed censored longitudinal data set, (ii) modify the XMT regression approach to predict these same probabilities, subject to the limitations that traditional regression models typically have, and (iii) demonstrate how to incorporate patient-specific history of recurrent events for prediction in settings where this information may be partially missing. We show the increased ability of our random forest algorithm for predicting the probability of remaining event-free over a $ tau $-duration follow-up window when compared to our modified XMT method for prediction in settings where association between predictors and recurrent event outcomes is complex in nature. We also show the importance of incorporating past recurrent event history in prediction algorithms when event times are correlated within a subject. The proposed random forest algorithm is demonstrated using recurrent exacerbation data from the trial of Azithromycin for the Prevention of Exacerbations of Chronic Obstructive Pulmonary Disease.</p>","PeriodicalId":55357,"journal":{"name":"Biostatistics","volume":"26 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143626883","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}