Pub Date: 2025-12-29. DOI: 10.1007/s12561-025-09510-8. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12834560/pdf/
Multilevel Multivariate Functional Principal Component Analysis of Evoked and Induced Event-Related Spectral Perturbations
Mingfei Dong, Donatello Telesca, Abigail Dickinson, Catherine Sugar, Sara J Webb, Shafali Jeste, April R Levin, Frederick Shic, Adam Naples, Susan Faja, Geraldine Dawson, James C McPartland, Damla Şentürk
Event-related spectral perturbations (ERSPs) capture dynamic changes in electroencephalography (EEG) power across frequency and trial time. Although they are obtained at the trial level, they are commonly averaged across trials and analyzed at the subject level to enhance the signal-to-noise ratio. While evoked activity is stimulus-locked, representing the brain's predictable response to stimuli, induced signals that are not strictly locked to stimulus presentation are thought to be generated by higher-order processes, such as attention and integration. Motivated by joint modeling of multilevel (trials nested in subjects) and multivariate (evoked and induced) ERSP data from a visual-evoked potentials (VEP) task, we propose a multilevel multivariate functional principal component analysis (FPCA) for high-dimensional functional outcomes observed as functions of time and frequency. The proposed estimation procedure applies multilevel univariate FPCA decompositions along each variate of the multivariate outcome using fast covariance estimation and incorporates the dependency across outcome variates at each level of the data. Hence, the proposed approach to multilevel multivariate FPCA can efficiently scale up to higher-dimensional functional outcomes and an increasing number of variates in the multivariate functional outcome vector. Extensive simulations show the efficacy of the proposed approach, while applications to VEP data lead to new insights into autism-specific neural activity patterns. The autistic group shows significantly lower evoked and higher induced gamma power compared to the neurotypical group. In addition, while subject-level variation is dominated by variation in the stimulus-locked evoked signal in neurotypical development, it is dominated by induced power in autism.
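As a rough illustration of the multilevel half of this machinery, the numpy-only sketch below separates between-subject and within-subject covariances of simulated trial-level curves and extracts level-specific eigenfunctions. It is a toy version under simplifying assumptions: a single variate (not the paper's bivariate evoked/induced outcome), curves over a 1-D grid rather than time-frequency surfaces, and a plain grid eigendecomposition rather than fast covariance estimation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subj, n_trials, n_grid = 20, 30, 50
t = np.linspace(0, 1, n_grid)

# Simulated trial-level curves: a subject effect along phi1, trial-level
# deviations along phi2, plus white noise
phi1 = np.sqrt(2) * np.sin(2 * np.pi * t)   # between-subject direction
phi2 = np.sqrt(2) * np.cos(2 * np.pi * t)   # within-subject direction
Y = np.zeros((n_subj, n_trials, n_grid))
for i in range(n_subj):
    subj = rng.normal(0, 1.5) * phi1
    for j in range(n_trials):
        Y[i, j] = subj + rng.normal(0, 0.8) * phi2 + rng.normal(0, 0.1, n_grid)

# Between-subject covariance from subject means; within-subject covariance
# from trial-level deviations around each subject's mean
subj_means = Y.mean(axis=1)
B = np.cov(subj_means.T)
resid = (Y - subj_means[:, None, :]).reshape(-1, n_grid)
W = np.cov(resid.T)

# Eigendecomposition at each level recovers level-specific eigenfunctions
_, evecs_B = np.linalg.eigh(B)
_, evecs_W = np.linalg.eigh(W)
pc1_between = evecs_B[:, -1]   # should align with phi1 (up to sign)
pc1_within = evecs_W[:, -1]    # should align with phi2 (up to sign)
```

In the paper's setting there are additionally two outcome variates whose cross-dependence is modeled at each level of the data; none of that is attempted here.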
Pub Date: 2025-09-18. DOI: 10.1007/s12561-025-09505-5. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12523994/pdf/
Weighted Brier Score: an Overall Summary Measure for Risk Prediction Models with Clinical Utility Consideration
Kehao Zhu, Yingye Zheng, Kwun Chuen Gary Chan
As advancements in novel biomarker-based algorithms and models accelerate their use in disease risk prediction, it is crucial to evaluate these models within the context of their intended clinical application. Prediction models output the absolute risk of disease; subsequently, patient counseling and shared decision-making are based on the estimated individual risk and a cost-benefit assessment. The overall impact of the application is referred to as clinical utility, which has recently received significant attention, along with growing interest in incorporating it into model assessment. The classic Brier score is a popular measure of prediction accuracy; however, it is insufficient for effectively assessing clinical utility. To address this limitation, we propose a class of weighted Brier scores that aligns with the decision-theoretic framework of clinical utility. Additionally, we decompose the weighted Brier score into discrimination and calibration components, and we link the weighted Brier score to the H measure, which has been proposed as an alternative to the area under the receiver operating characteristic curve. This theoretical link to the H measure further supports our weighting method and underscores the essential elements of discrimination and calibration in risk prediction evaluation. The practical use of the weighted Brier score as an overall summary is demonstrated using data from a prostate cancer study.
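The abstract does not spell out the paper's weighting scheme, so the snippet below only illustrates the general idea with a hypothetical cost weighting: squared-error terms for events and non-events receive different weights, reflecting an assumed asymmetry between the harms of missed events and false alarms. With equal weights it reduces to the classic Brier score.

```python
import numpy as np

def weighted_brier(y, p, w1=1.0, w0=1.0):
    """Cost-weighted Brier score: squared-error terms for events (y = 1)
    weighted by w1, non-events by w0, normalised by the total weight.
    Hypothetical weighting for illustration; the paper derives its weights
    from a decision-theoretic clinical-utility framework."""
    y = np.asarray(y, float)
    p = np.asarray(p, float)
    w = np.where(y == 1, w1, w0)
    return float(np.sum(w * (p - y) ** 2) / np.sum(w))

y = np.array([1, 0, 1, 0, 0])
p = np.array([0.9, 0.2, 0.6, 0.1, 0.4])
b_plain = weighted_brier(y, p)            # equals the classic Brier score
b_fn_averse = weighted_brier(y, p, w1=3)  # penalise missed events more
```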
Pub Date: 2025-07-07. DOI: 10.1007/s12561-025-09499-0. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12366773/pdf/
Accounting for Competing Risks in the Assessment of Prognostic Biomarkers' Discriminative Accuracy
Xinran Huang, Xinyang Jiang, Ruosha Li, Jing Ning
The discriminative performance of biomarkers often changes over time and exhibits heterogeneity across subgroups defined by patient characteristics. Assessing how this performance varies with these factors is crucial for a comprehensive evaluation of biomarkers and for identifying areas for improvement in sub-populations with poor performance. Additionally, the presence of competing risks complicates the assessment of discriminative performance. Ignoring competing risks can lead to misleading conclusions, as the biomarker's performance for the event of interest, such as disease onset, may be confounded by its performance for competing events, such as death. To address these challenges, we develop a regression model to assess the impact of covariates on the discriminative performance of biomarkers, characterized by the covariate-specific time-dependent area under the curve (AUC) for a specific cause. We construct a pseudo partial-likelihood for estimation and inference and establish the asymptotic properties of the proposed estimators. Through simulation studies, we demonstrate the finite sample performance of these estimators, and we apply the proposed method to data from the African American Study of Kidney Disease and Hypertension (AASK).
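To make the target quantity concrete, here is a bare-bones empirical cumulative/dynamic AUC at a horizon t for the cause of interest, under the simplifying assumptions of no censoring and no covariates. Subjects failing from the competing cause by t are neither cases nor controls here; the paper's pseudo partial-likelihood regression of the covariate-specific AUC is not implemented.

```python
import numpy as np

def cause_specific_auc(time, cause, marker, t):
    """Cumulative/dynamic AUC at horizon t for cause 1, assuming no censoring.
    Cases: cause-1 events by t. Controls: still event-free at t.
    Competing (cause-2) events by t are excluded from both sets.
    Simplified sketch only, not the paper's regression estimator."""
    time, cause, marker = map(np.asarray, (time, cause, marker))
    cases = (time <= t) & (cause == 1)
    controls = time > t
    num, den = 0.0, 0
    for mi in marker[cases]:
        for mj in marker[controls]:
            num += 1.0 if mi > mj else (0.5 if mi == mj else 0.0)
            den += 1
    return num / den if den else np.nan

# Hypothetical toy data: cause 0 = event-free at study end
time = np.array([2, 5, 8, 3, 9, 10])
cause = np.array([1, 2, 0, 1, 0, 0])
marker = np.array([0.9, 0.4, 0.2, 0.8, 0.3, 0.1])
auc_t = cause_specific_auc(time, cause, marker, t=6)
```

In this toy data both cause-1 cases carry higher marker values than every control, so the AUC at t = 6 is 1.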
Pub Date: 2025-07-07. DOI: 10.1007/s12561-025-09496-3. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12830051/pdf/
Robust Privacy-Preserving Models for Cluster-Level Confounding: Recognizing Disparities in Access to Transplantation
Nicholas Hartman, Kevin He
In health services applications where patients are clustered within common institutions or geographic regions, it is often of interest to estimate the treatment effects of medical providers after adjusting for confounding risk factors that are related to patients' choices of provider but beyond the providers' control. While most existing risk-adjustment methods are only capable of controlling for patient-level confounding risk factors (e.g., age or comorbidities), there are often important cluster-level confounding variables (e.g., regional or community-level risk factors) that should be accounted for in provider evaluations. These adjustments for cluster-level confounding factors are further complicated by the limited availability of protected patient health data, the inevitable influence of unobservable confounding factors, and the presence of outlying cluster units. To address these issues, we propose a privacy-preserving model and a novel Pseudo-Bayesian inference method to robustly assess the providers' treatment effects with adjustments for observed cluster-level confounders and corrections for overdispersion from unobserved cluster-level confounding factors. We derive theoretical connections between our proposed estimation method and the Correlated Random Effects model, uncovering several advantages in terms of estimation stability, computational efficiency, and privacy preservation. Motivated by efforts to improve equity in transplant care, we apply these methods to evaluate transplant centers while adjusting for observed geographic disparities in donor organ availability and correcting for overdispersion from unobservable confounding factors, such as the complex impact of the COVID-19 pandemic.
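The overdispersion correction the abstract alludes to can be illustrated in miniature. The sketch below is emphatically not the paper's pseudo-Bayesian estimator: it is the bare-bones method-of-moments inflation factor often used in provider profiling, applied to simulated observed-vs-expected counts where unobserved cluster effects make naive Poisson z-scores overdispersed.

```python
import numpy as np

rng = np.random.default_rng(1)
n_prov = 40
expected = rng.uniform(20, 60, n_prov)   # risk-adjusted expected event counts

# Simulated observed counts with extra cluster-level variation
# (unobserved confounding acting at the provider level)
theta = rng.normal(0, 0.15, n_prov)
observed = rng.poisson(expected * np.exp(theta))

# Naive z-scores assume Poisson variation only
z = (observed - expected) / np.sqrt(expected)

# Method-of-moments overdispersion factor: excess of var(z) over 1
phi = max(1.0, float(np.mean(z ** 2)))
z_corrected = z / np.sqrt(phi)

# Fraction of providers flagged at the nominal 5% level, before and after
flag_naive = float(np.mean(np.abs(z) > 1.96))
flag_corrected = float(np.mean(np.abs(z_corrected) > 1.96))
```

Dividing by the inflation factor shrinks the z-scores, so the corrected analysis flags no more providers than the naive one, which is the qualitative behavior the paper's correction targets.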
Pub Date: 2025-07-05. DOI: 10.1007/s12561-025-09497-2. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12716410/pdf/
Central Posterior Envelopes for Bayesian Longitudinal Functional Principal Component Analysis
Joanna Boland, Qi Qian, Donatello Telesca, Shafali Jeste, Abigail Dickinson, Damla Şentürk
Longitudinally observed functional data are commonly encountered in biomedical studies. Under the weak separability assumption of the high-dimensional covariance, the recently proposed Bayesian longitudinal functional principal component analysis (B-LFPCA) achieves the decomposition of the multidimensional signal into highly interpretable lower-dimensional summaries, including eigenfunctions that capture directions of variation in the data along the longitudinal and functional dimensions. B-LFPCA provides uncertainty quantification of the estimated functional decomposition components through simultaneous parametric credible bands formed using the posterior sample. However, these traditional summaries are inherently based on point-wise summaries of the estimated functional components and do not take into account the functional nature of the estimated quantities. We introduce central posterior envelopes (CPEs) for uncertainty quantification of the low-dimensional B-LFPCA decomposition components based on functional depth ordering of the posterior estimates. The proposed CPEs are fully data-driven visualization tools, displaying the most-central regions of the posterior sample at specified α-level percentile contours. Modified band depth and modified volume depth are utilized to order the posterior samples of functional decomposition components, including the mean function and the marginal longitudinal and functional eigenfunctions. The proposed CPEs are applied to analyze longitudinally observed event-related potentials (ERPs) recorded during an implicit learning paradigm, leading to novel insights on longitudinal learning trends across a group of autistic children and their neurotypical peers. Finally, the effectiveness of the proposed CPEs is demonstrated through extensive simulations that explore different scenarios of increased variability in the longitudinal functional data.
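The depth-ordering idea is easy to sketch for one-dimensional curves. The code below computes the modified band depth (with bands from pairs of curves) for a set of posterior draws of a curve and forms a 50% central envelope from the deepest half; a simplified stand-in for the paper's CPEs, which also use modified volume depth for surface-valued components.

```python
import numpy as np

def modified_band_depth(curves):
    """Modified band depth with bands from pairs (j = 2): for each curve,
    the average (over curve pairs) proportion of grid points at which it
    lies inside the band spanned by the pair."""
    n, m = curves.shape
    depth = np.zeros(n)
    for i in range(n):
        inside = 0.0
        for a in range(n):
            for b in range(a + 1, n):
                lo = np.minimum(curves[a], curves[b])
                hi = np.maximum(curves[a], curves[b])
                inside += np.mean((curves[i] >= lo) & (curves[i] <= hi))
        depth[i] = inside / (n * (n - 1) / 2)
    return depth

def central_envelope(curves, level=0.5):
    """Pointwise min/max over the `level` proportion of deepest curves --
    a sketch of a central posterior envelope from posterior draws."""
    depth = modified_band_depth(curves)
    k = max(1, int(np.ceil(level * len(curves))))
    central = curves[np.argsort(depth)[::-1][:k]]
    return central.min(axis=0), central.max(axis=0)

rng = np.random.default_rng(2)
t = np.linspace(0, 1, 30)
draws = np.sin(2 * np.pi * t) + rng.normal(0, 0.2, (40, 30))  # mock posterior
lo, hi = central_envelope(draws, level=0.5)
```

Unlike pointwise credible bands, the envelope is built from whole curves ordered by depth, so it respects the functional nature of each posterior draw.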
Pub Date: 2025-06-12. DOI: 10.1007/s12561-025-09493-6. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12356228/pdf/
Bias and Efficiency Comparison between Multiple Imputation and Available-Case Analysis for Missing Data in Longitudinal Models
Panpan Zhang, Sharon X Xie
In this paper, we compare the performance of available-case analysis (ACA) and several multiple imputation (MI) approaches for handling missing data problems in longitudinal analysis in terms of estimation bias and relative efficiency. When the missingness of covariates depends on observed responses, ACA produces estimation bias, but it is preferred when there are only missing values in longitudinal responses. Multilevel MI methods are not always a solution for longitudinal data analysis. Single-level MI methods, like fully conditional specification (FCS), provide unbiased estimates under a variety of missing data scenarios and yield efficiency gains in certain scenarios. Throughout, the missing data mechanism is assumed to be missing at random (MAR). We carry out a systematic synthetic data analysis where missing data exist in longitudinal outcomes and/or covariates under different kinds of missing data generation procedures. The analysis model is a linear mixed-effects model. For each of the missing data scenarios, we give our recommendation (between ACA and a specific MI method) based on theoretical justifications and extensive simulations. In addition, a longitudinal neurodegenerative disease dataset is used as a real case study.
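The key bias mechanism the paper highlights, that available-case analysis is biased when covariate missingness depends on observed responses, can be demonstrated in a few lines. The simulation below is a deliberately simple cross-sectional analogue (ordinary least squares rather than a linear mixed-effects model, and no imputation step): the covariate is more likely to be missing when the response is large, and the available-case slope is attenuated relative to the truth.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000
x = rng.normal(0, 1, n)
y = 1.0 + 2.0 * x + rng.normal(0, 1, n)   # true slope = 2

# Covariate missingness depends on the OBSERVED RESPONSE (MAR given y):
# larger y makes x more likely to be missing
p_miss = 1 / (1 + np.exp(-(y - 1)))
observed = rng.uniform(size=n) > p_miss

def ols_slope(x, y):
    xc = x - x.mean()
    yc = y - y.mean()
    return float(xc @ yc / (xc @ xc))

slope_full = ols_slope(x, y)                     # oracle: all data
slope_aca = ols_slope(x[observed], y[observed])  # available-case analysis
```

Because selection acts on the response, the retained sample has reduced response variance and the available-case slope is pulled well below 2, while the full-data estimate stays near the truth.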
Pub Date: 2025-04-01 (Epub 2023-10-28). DOI: 10.1007/s12561-023-09394-6. Statistics in Biosciences 17(1): 132-150. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11957463/pdf/
Covariate-Balancing-Aware Interpretable Deep Learning Models for Treatment Effect Estimation
Kan Chen, Qishuo Yin, Qi Long
Estimating treatment effects is of great importance for many biomedical applications with observational data. In particular, many biomedical researchers prefer interpretable estimates of treatment effects. In this paper, we first provide a theoretical analysis and derive an upper bound for the bias of average treatment effect (ATE) estimation under the strong ignorability assumption. Derived by leveraging appealing properties of the weighted energy distance, our upper bound is tighter than what has been reported in the literature. Motivated by the theoretical analysis, we propose a novel objective function for estimating the ATE that uses the energy distance balancing score and hence does not require correct specification of the propensity score model. We also leverage recently developed neural additive models to improve the interpretability of deep learning models used for potential outcome prediction. We further enhance our proposed model with an energy distance balancing score weighted regularization. The superiority of our proposed model over current state-of-the-art methods is demonstrated in semi-synthetic experiments using two benchmark datasets, namely IHDP and ACIC, and is further examined through a study of the effect of smoking on blood cadmium levels using NHANES data.
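The balancing quantity at the heart of this approach is the energy distance between covariate samples. The numpy sketch below computes the plain (unweighted) two-sample energy distance; the paper works with a weighted version inside a neural training objective, which this does not reproduce.

```python
import numpy as np

def energy_distance(X, Y):
    """Two-sample energy distance between rows of X and Y:
    2*E||X - Y|| - E||X - X'|| - E||Y - Y'||.
    Zero in expectation iff the two distributions coincide."""
    def mean_cross(A, B):
        d = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))
        return d.mean()
    def mean_within(A):
        d = np.sqrt(((A[:, None, :] - A[None, :, :]) ** 2).sum(-1))
        n = len(A)
        return d.sum() / (n * (n - 1))   # exclude the zero diagonal
    return 2 * mean_cross(X, Y) - mean_within(X) - mean_within(Y)

rng = np.random.default_rng(4)
treated = rng.normal(0.5, 1, (200, 3))   # covariates shifted: imbalance
control = rng.normal(0.0, 1, (200, 3))
balanced = rng.normal(0.5, 1, (200, 3))  # same distribution as treated

e_imbalanced = energy_distance(treated, control)
e_balanced = energy_distance(treated, balanced)
```

Minimizing such a statistic over balancing weights pushes the weighted treated and control covariate distributions toward each other without ever fitting a propensity score model.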
Pub Date: 2025-04-01 (Epub 2024-06-14). DOI: 10.1007/s12561-024-09434-9. Statistics in Biosciences 17(1): 191-215. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12395559/pdf/
DeepBiome: A Phylogenetic Tree Informed Deep Neural Network for Microbiome Data Analysis
Jing Zhai, Youngwon Choi, Xingyi Yang, Yin Chen, Kenneth Knox, Homer L Twigg, Joong-Ho Won, Hua Zhou, Jin J Zhou
Evidence linking the microbiome to human health is rapidly growing. The microbiome profile has potential as a novel predictive biomarker for many diseases. However, tables of bacterial counts are typically sparse, and bacteria are classified within a hierarchy of taxonomic levels, ranging from species to phylum. Existing tools focus on identifying microbiome associations at either the community level or a specific, pre-defined taxonomic level. Incorporating the evolutionary relationships between bacteria can enhance data interpretation, allowing microbiome contributions to be aggregated for more accurate and interpretable results. We present DeepBiome, a phylogeny-informed neural network architecture that predicts phenotypes from microbiome counts and uncovers the microbiome-phenotype association network. It takes microbiome abundances as input and employs the phylogenetic taxonomy to guide the neural network's architecture. Leveraging phylogenetic information, DeepBiome is applicable to both regression and classification problems; it reduces the need for extensive tuning of the deep learning architecture, minimizes overfitting, and, crucially, enables the visualization of the path from microbiome counts to disease. Simulation studies and real-life data analyses have shown that DeepBiome is both highly accurate and efficient, offering deep insights into complex microbiome-phenotype associations even with small to moderate training sample sizes. In practice, the specific taxonomic level at which microbiome clusters tag the association remains unknown; therefore, the main advantage of the presented method over other analytical methods is that it offers an ecological and evolutionary understanding of host-microbe interactions, which is important for microbiome-based medicine. DeepBiome is implemented using the Python packages Keras and TensorFlow and is an open-source tool available at https://github.com/Young-won/DeepBiome.
Pub Date : 2025-03-05 DOI: 10.1007/s12561-025-09476-7
Jeremy Rubin, Fan Fan, Laura Barisoni, Andrew R Janowczyk, Jarcy Zee
Image features that characterize tubules from digitized kidney biopsies may offer insight into disease prognosis as novel biomarkers. For each subject, we can construct a matrix whose entries are a common set of image features (e.g., area, orientation, eccentricity) measured for each tubule in that subject's biopsy. Previous scalar-on-matrix regression approaches, which predict scalar outcomes from image feature matrices, cannot handle varying numbers of tubules across subjects. We propose the CLUstering Structured laSSO (CLUSSO), a novel scalar-on-matrix regression technique that allows for unbalanced numbers of tubules, to predict scalar outcomes from the image feature matrices. By classifying tubules into one of two clusters, CLUSSO averages and weights tubular feature values within-subject and within-cluster to create balanced feature matrices that can then be used with structured lasso regression. We develop large-sample theory, as the number of tubules grows, for the error bounds of the feature coefficient estimates. Simulation study results indicate that CLUSSO often achieves a lower false positive rate and a higher true positive rate for identifying the image features that truly affect outcomes, relative to a naive method that averages feature values across all tubules. Additionally, we find that CLUSSO has lower bias and predicts outcomes with accuracy competitive with the naive approach. Finally, we applied CLUSSO to tubular image features from kidney biopsies of glomerular disease subjects from the Nephrotic Syndrome Study Network (NEPTUNE) to predict kidney function, using subjects from the Cure Glomerulonephropathy (CureGN) study as an external validation set.
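The balancing step — mapping a variable number of tubules per subject to a fixed-size matrix via within-cluster averaging and weighting — can be sketched as follows. This is a hedged illustration, not the CLUSSO estimator: the function name `balance_features` and the weighting by each cluster's share of the subject's tubules are assumptions for illustration; the paper's exact weighting scheme may differ, and cluster labels would come from a separate clustering step.

```python
import numpy as np

def balance_features(tubule_features, cluster_labels, n_clusters=2):
    """Collapse one subject's (n_tubules x n_features) matrix into a
    fixed-size (n_clusters x n_features) matrix.

    Each row is the within-cluster mean of the subject's tubule
    features, weighted by that cluster's share of the subject's
    tubules (an assumed weighting for illustration). Empty clusters
    contribute a zero row.
    """
    n_tubules, n_features = tubule_features.shape
    out = np.zeros((n_clusters, n_features))
    for c in range(n_clusters):
        members = tubule_features[cluster_labels == c]
        if len(members) > 0:
            out[c] = members.mean(axis=0) * (len(members) / n_tubules)
    return out

# Subject A has 3 tubules, subject B has 5 -- unbalanced counts --
# yet both map to the same 2 x 4 balanced matrix shape, so a single
# structured lasso design matrix can be assembled across subjects.
rng = np.random.default_rng(1)
A = balance_features(rng.normal(size=(3, 4)), np.array([0, 1, 1]))
B = balance_features(rng.normal(size=(5, 4)), np.array([0, 0, 1, 1, 1]))
```

Once every subject is reduced to the same matrix shape, the rows can be vectorized into a common design matrix for the structured lasso fit.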
Novel Scalar-on-matrix Regression for Unbalanced Feature Matrices. Statistics in Biosciences.
Pub Date : 2025-01-14 DOI: 10.1007/s12561-025-09473-w
Paula R Langner, Elizabeth Juarez-Colunga, Lucas N Marzec, Gary K Grunwald, John D Rice
In studies with a recurrent event outcome, events may be captured as counts during subsequent intervals or follow-up times either by design or for ease of analysis. In many cases, recurrent events may be further coarsened such that only an indicator of one or more events in an interval is observed at the follow-up time, resulting in a loss of information relative to a record of all events. In this paper, we examine efficiency loss when coarsening longitudinally observed counts to binary indicators and aspects of the design which impact the ability to estimate a treatment effect of interest. The investigation was motivated by a study of patients with cardiac implantable electronic devices in which investigators aimed to examine the effect of a treatment on events detected by the devices over time. In order to study components of such a recurrent event process impacted by data coarsening, we derive the asymptotic relative efficiency (ARE) of a treatment effect estimator utilizing a coarsened binary outcome relative to an alternative estimator using the count outcome. We compare the efficiencies and consider conditions where the binary process maintains good efficiency in estimating a treatment effect. We present an application of the methods to a data set consisting of seizure counts in a sample of patients with epilepsy.
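The paper derives the ARE for a treatment-effect estimator; as a simpler hedged illustration of the same information loss, consider a single Poisson count with rate λ coarsened to the indicator of at least one event, which is Bernoulli with p = 1 − e^(−λ). For the log-rate parameter, the Fisher information of the count is λ and that of the indicator is λ²e^(−λ)/(1 − e^(−λ)), giving ARE = λe^(−λ)/(1 − e^(−λ)). This toy single-observation calculation (not the paper's treatment-effect ARE) shows the binary outcome is nearly fully efficient at low event rates and loses most of its information at high rates:

```python
import numpy as np

def are_binary_vs_count(lam):
    """ARE of the event indicator vs the Poisson count for the
    log-rate parameter (single-observation illustration).

    I_count  = lam
    I_binary = lam**2 * exp(-lam) / (1 - exp(-lam))
    ARE      = I_binary / I_count = lam * exp(-lam) / (1 - exp(-lam))
    """
    lam = np.asarray(lam, dtype=float)
    return lam * np.exp(-lam) / (1.0 - np.exp(-lam))

# Efficiency of the coarsened binary outcome across event rates:
rates = np.array([0.1, 0.5, 1.0, 2.0, 5.0])
are = are_binary_vs_count(rates)
```

The ratio is below 1 for every λ > 0 (since e^λ > 1 + λ), approaches 1 as λ → 0, and decays toward 0 as λ grows — consistent with the intuition that an indicator discards little when events are rare but much when multiple events per interval are common.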
Efficiency loss with binary pre-processing of continuous monitoring data. Statistics in Biosciences.