Pub Date: 2023-06-01; Epub Date: 2023-05-01; DOI: 10.1214/22-aoas1656
Tingting Yu, Lang Wu, Jin Qiu, Peter B Gilbert
In joint modelling of longitudinal and survival data, the longitudinal data may be complex: they may contain outliers and may be left censored. Motivated by an HIV vaccine study, we propose a robust method for joint models of longitudinal and survival data, in which outliers in the longitudinal data are addressed using a multivariate t-distribution for b-outliers and an M-estimator for e-outliers. We also propose a computationally efficient method for approximate likelihood inference. The proposed method is evaluated by simulation studies. Based on the proposed models and method, we analyze the HIV vaccine data and find a strong association between longitudinal biomarkers and the risk of HIV infection.
Title: Robust joint modelling of left-censored longitudinal data and survival data with application to HIV vaccine studies. Journal: Annals of Applied Statistics, vol. 17, no. 2, pp. 1017-1037. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10312337/pdf/
Pub Date: 2023-06-01; Epub Date: 2023-05-01; DOI: 10.1214/22-aoas1674
Yifei Sun, Sy Han Chiou, Colin O Wu, Meghan McGarry, Chiung-Yu Huang
With the availability of massive amounts of data from electronic health records and registry databases, incorporating time-varying patient information to improve risk prediction has attracted great attention. To exploit the growing amount of predictor information over time, we develop a unified framework for landmark prediction using survival tree ensembles, where an updated prediction can be performed when new information becomes available. Compared to conventional landmark prediction with fixed landmark times, our methods allow the landmark times to be subject-specific and triggered by an intermediate clinical event. Moreover, the nonparametric approach circumvents the thorny issue of model incompatibility at different landmark times. In our framework, both the longitudinal predictors and the event time outcome are subject to right censoring, and thus existing tree-based approaches cannot be directly applied. To tackle the analytical challenges, we propose a risk-set-based ensemble procedure by averaging martingale estimating equations from individual trees. Extensive simulation studies are conducted to evaluate the performance of our methods. The methods are applied to the Cystic Fibrosis Foundation Patient Registry (CFFPR) data to perform dynamic prediction of lung disease in cystic fibrosis patients and to identify important prognosis factors.
Title: DYNAMIC RISK PREDICTION TRIGGERED BY INTERMEDIATE EVENTS USING SURVIVAL TREE ENSEMBLES. Journal: Annals of Applied Statistics, vol. 17, no. 2, pp. 1375-1397. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10241448/pdf/nihms-1846847.pdf
Preoperative opioid use has been reported to be associated with higher preoperative opioid demand, worse postoperative outcomes, and increased postoperative healthcare utilization and expenditures. Understanding the risk of preoperative opioid use helps establish patient-centered pain management. In the field of machine learning, deep neural networks (DNNs) have emerged as a powerful means for risk assessment because of their superb prediction power; however, black-box algorithms may make the results less interpretable than those of statistical models. Bridging the gap between the statistical and machine learning fields, we propose a novel Interpretable Neural Network Regression (INNER), which combines the strengths of statistical and DNN models. We use the proposed INNER to conduct individualized risk assessment of preoperative opioid use. Intensive simulations and an analysis of 34,186 patients expecting surgery in the Analgesic Outcomes Study (AOS) show that the proposed INNER not only can accurately predict preoperative opioid use from preoperative characteristics, as a DNN does, but also can estimate the patient-specific odds of opioid use without pain and the odds ratio of opioid use for a unit increase in the reported overall body pain, leading to more straightforward interpretations of the tendency to use opioids than a DNN provides. Our results identify the patient characteristics that are strongly associated with opioid use and are largely consistent with previous findings, providing evidence that INNER is a useful tool for individualized risk assessment of preoperative opioid use.
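The two interpretable quantities named in this abstract can be read as the intercept and slope of a logistic regression whose coefficients vary by patient. The notation below is assumed for illustration and does not come from the abstract: Y is the opioid-use indicator, X the reported overall body pain, Z the remaining preoperative characteristics, and α(·), β(·) are network outputs. It is a sketch of the structure the abstract implies, not necessarily the authors' exact specification.

```latex
\begin{align}
  % Assumed model form: patient-specific intercept and slope
  \operatorname{logit}\,\Pr(Y = 1 \mid X = x, Z = z) &= \alpha(z) + \beta(z)\,x, \\
  % Odds of opioid use without pain (x = 0)
  \frac{\Pr(Y = 1 \mid X = 0, Z = z)}{\Pr(Y = 0 \mid X = 0, Z = z)} &= e^{\alpha(z)}, \\
  % Odds ratio for a one-unit increase in reported pain
  \frac{\operatorname{odds}(x + 1, z)}{\operatorname{odds}(x, z)} &= e^{\beta(z)}.
\end{align}
```

Under this reading, exponentiating the two network outputs gives the baseline odds and the per-unit odds ratio for each patient, which is what makes the fit interpretable in the usual logistic-regression sense.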
Title: INDIVIDUALIZED RISK ASSESSMENT OF PREOPERATIVE OPIOID USE BY INTERPRETABLE NEURAL NETWORK REGRESSION. Authors: Yuming Sun, Jian Kang, Chad Brummett, Yi Li. DOI: 10.1214/22-aoas1634. Pub Date: 2023-03-01. Journal: Annals of Applied Statistics, vol. 17, no. 1, pp. 434-453. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10065608/pdf/nihms-1836641.pdf
Pub Date: 2023-03-01; Epub Date: 2023-01-24; DOI: 10.1214/22-aoas1633
Tananun Songdechakraiwut, Moo K Chung
This paper proposes a novel topological learning framework that integrates networks of different sizes and topology through persistent homology. This challenging task is made possible through the introduction of a computationally efficient topological loss. The use of the proposed loss bypasses the intrinsic computational bottleneck associated with matching networks. We validate the method in extensive statistical simulations to assess its effectiveness when discriminating networks with different topology. The method is further demonstrated in a twin brain imaging study where we determine if brain networks are genetically heritable. The challenge here is due to the difficulty of overlaying the topologically different functional brain networks obtained from resting-state functional MRI onto the template structural brain network obtained through diffusion MRI.
Title: TOPOLOGICAL LEARNING FOR BRAIN NETWORKS. Journal: Annals of Applied Statistics, vol. 17, no. 1, pp. 403-433. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9997114/pdf/nihms-1868875.pdf
Sangwon Hyun, Mattias Rolf Cape, Francois Ribalet, Jacob Bien
The ocean is filled with microscopic algae, called phytoplankton, which together are responsible for as much photosynthesis as all plants on land combined. Our ability to predict their response to the warming ocean relies on understanding how the dynamics of phytoplankton populations are influenced by changes in environmental conditions. One powerful technique to study the dynamics of phytoplankton is flow cytometry, which measures the optical properties of thousands of individual cells per second. Today, oceanographers are able to collect flow cytometry data in real time onboard a moving ship, providing them with fine-scale resolution of the distribution of phytoplankton across thousands of kilometers. One of the current challenges is to understand how these small- and large-scale variations relate to environmental conditions, such as nutrient availability, temperature, light and ocean currents. In this paper we propose a novel sparse mixture of multivariate regressions model to estimate the time-varying phytoplankton subpopulations while simultaneously identifying the specific environmental covariates that are predictive of the observed changes to these subpopulations. We demonstrate the usefulness and interpretability of the approach using both synthetic data and real observations collected on an oceanographic cruise conducted in the northeast Pacific in the spring of 2017.
Title: MODELING CELL POPULATIONS MEASURED BY FLOW CYTOMETRY WITH COVARIATES USING SPARSE MIXTURE OF REGRESSIONS. DOI: 10.1214/22-aoas1631. Pub Date: 2023-03-01. Journal: Annals of Applied Statistics, vol. 17, no. 1, pp. 357-377. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10360992/pdf/nihms-1917146.pdf
Mingwei Tang, Gytis Dudas, Trevor Bedford, Vladimir N Minin
Phylodynamics is a set of population genetics tools that aim at reconstructing the demographic history of a population based on molecular sequences of individuals sampled from the population of interest. One important task in phylodynamics is to estimate changes in (effective) population size. When applied to infectious disease sequences, such estimation of population size trajectories can provide information about changes in the number of infections. To model changes in the number of infected individuals, current phylodynamic methods use non-parametric approaches (e.g., Bayesian curve-fitting based on change-point models or Gaussian process priors), parametric approaches (e.g., based on differential equations), and stochastic modeling in conjunction with likelihood-free Bayesian methods. The first class of methods yields results that are hard to interpret epidemiologically. The second class of methods provides estimates of important epidemiological parameters, such as infection and removal/recovery rates, but ignores variation in the dynamics of infectious disease spread. The third class of methods is the most advantageous statistically, but relies on computationally intensive particle filtering techniques that limit its applications. We propose a Bayesian model that combines phylodynamic inference and stochastic epidemic models, and achieves computational tractability by using a linear noise approximation (LNA) - a technique that allows us to approximate probability densities of stochastic epidemic model trajectories. LNA opens the door for using modern Markov chain Monte Carlo tools to approximate the joint posterior distribution of the disease transmission parameters and of high dimensional vectors describing unobserved changes in the stochastic epidemic model compartment sizes (e.g., numbers of infectious and susceptible individuals). In a simulation study, we show that our method can successfully recover parameters of stochastic epidemic models.
We apply our estimation technique to Ebola genealogies estimated using viral genetic data from the 2014 epidemic in Sierra Leone and Liberia.
Title: Fitting stochastic epidemic models to gene genealogies using linear noise approximation. DOI: 10.1214/21-aoas1583. Pub Date: 2023-03-01. Journal: Annals of Applied Statistics, vol. 17, no. 1, pp. 1-22. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10237588/pdf/nihms-1891709.pdf
Pub Date: 2023-03-01; Epub Date: 2023-01-24; DOI: 10.1214/22-aoas1618
Ben Sheng, Changcheng Li, Le Bao, Runze Li
Accurate HIV incidence estimation based on individual recent infection status (recent vs long-term infection) is important for monitoring the epidemic, targeting interventions to those at greatest risk of new infection, and evaluating existing programs of prevention and treatment. Since 2015, the Population-based HIV Impact Assessment (PHIA) individual-level surveys have been implemented in the most-affected countries in sub-Saharan Africa. PHIA is a nationally representative HIV-focused survey that combines household visits with key questions and cutting-edge technologies, such as biomarker tests for HIV antibody and HIV viral load, which offer the unique opportunity of distinguishing between recent infection and long-term infection and of providing relevant HIV information by age, gender, and location. In this article, we propose a semi-supervised logistic regression model for estimating individual-level HIV recency status. It incorporates information from multiple data sources - the PHIA survey, where the true HIV recency status is unknown, and the cohort studies provided in the literature, where the relationship between HIV recency status and the covariates is presented in the form of a contingency table. It also utilizes the national-level HIV incidence estimates from the epidemiology model. Applying the proposed model to Malawi PHIA data, we demonstrate that our approach is more accurate for individual-level estimation and more appropriate for estimating HIV recency rates at aggregated levels than the current practice - the binary classification tree (BCT).
Title: Probabilistic HIV recency classification-a logistic regression without labeled individual level training data. Journal: Annals of Applied Statistics, vol. 17, no. 1, pp. 108-129. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10577400/pdf/nihms-1886688.pdf
Pub Date: 2022-12-01; Epub Date: 2022-09-26; DOI: 10.1214/22-aoas1610
Alexandra Larsen, Shu Yang, Brian J Reich, Ana G Rappold
Wildland fire smoke contains hazardous levels of fine particulate matter (PM2.5), a pollutant shown to adversely affect health. Estimating fire-attributable PM2.5 concentrations is key to quantifying the impact on air quality and subsequent health burden. This is a challenging problem since only total PM2.5 is measured at monitoring stations and both fire-attributable PM2.5 and PM2.5 from all other sources are correlated in space and time. We propose a framework for estimating fire-contributed PM2.5 and PM2.5 from all other sources using a novel causal inference framework and bias-adjusted chemical model representations of PM2.5 under counterfactual scenarios. The chemical model representation of PM2.5 for this analysis is simulated using the Community Multiscale Air Quality Modeling System (CMAQ), run with and without fire emissions across the contiguous U.S. for the 2008-2012 wildfire seasons. The CMAQ output is calibrated with observations from monitoring sites for the same spatial domain and time period. We use a Bayesian model that accounts for spatial variation to estimate the effect of wildland fires on PM2.5 and state assumptions under which the estimate has a valid causal interpretation. Our results include estimates of the contributions of wildfire smoke to PM2.5 for the contiguous U.S. Additionally, we compute the health burden associated with the PM2.5 attributable to wildfire smoke.
Title: A SPATIAL CAUSAL ANALYSIS OF WILDLAND FIRE-CONTRIBUTED PM2.5 USING NUMERICAL MODEL OUTPUT. Journal: Annals of Applied Statistics, vol. 16, no. 4, pp. 2714-2731. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10181852/pdf/nihms-1846188.pdf
Pub Date: 2022-12-01; Epub Date: 2022-09-26; DOI: 10.1214/21-aoas1574
Gabriel Loewinger, Prasad Patil, Kenneth T Kishida, Giovanni Parmigiani
We propose the "study strap ensemble", which combines advantages of two common approaches to fitting prediction models when multiple training datasets ("studies") are available: pooling studies and fitting one model versus averaging predictions from multiple models each fit to individual studies. The study strap ensemble fits models to bootstrapped datasets, or "pseudo-studies." These are generated by resampling from multiple studies with a hierarchical resampling scheme that generalizes the randomized cluster bootstrap. The study strap is controlled by a tuning parameter that determines the proportion of observations to draw from each study. When the parameter is set to its lowest value, each pseudo-study is resampled from only a single study. When it is high, the study strap ignores the multi-study structure and generates pseudo-studies by merging the datasets and drawing observations like a standard bootstrap. We empirically show the optimal tuning value often lies in between, and prove that special cases of the study strap draw the merged dataset and the set of original studies as pseudo-studies. We extend the study strap approach with an ensemble weighting scheme that utilizes information in the distribution of the covariates of the test dataset. Our work is motivated by neuroscience experiments using real-time neurochemical sensing during awake behavior in humans. Current techniques to perform this kind of research require measurements from an electrode placed in the brain during awake neurosurgery and rely on prediction models to estimate neurotransmitter concentrations from the electrical measurements recorded by the electrode. These models are trained by combining multiple datasets that are collected in vitro under heterogeneous conditions in order to promote accuracy of the models when applied to data collected in the brain. 
A prevailing challenge is deciding how to combine studies or ensemble models trained on different studies to enhance model generalizability. Our methods produce marked improvements in simulations and in this application. All methods are available in the studyStrap CRAN package.
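The hierarchical resampling described above can be sketched roughly as follows. This is a minimal illustration, not the studyStrap package's implementation: the function name `study_strap`, the `bag_size` parameter, and the choice of the average study size as the pseudo-study size are all assumptions made for this example. The sketch first draws a "bag" of study labels with replacement, then draws observations from each study in proportion to how often that study appears in the bag.

```python
import random
from collections import Counter


def study_strap(studies, bag_size, rng):
    """Draw one pseudo-study (hypothetical sketch, not the package API).

    studies:  list of lists, one inner list of observations per study.
    bag_size: assumed tuning parameter; number of study labels drawn
              with replacement in the first resampling stage.
    rng:      a random.Random instance, for reproducibility.
    Returns a list of (study_index, observation) pairs.
    """
    k = len(studies)
    # Stage 1: resample study labels with replacement to form the bag.
    bag = Counter(rng.randrange(k) for _ in range(bag_size))
    # Assumed pseudo-study size: the average size of the original studies.
    n_target = round(sum(len(s) for s in studies) / k)
    pseudo = []
    for idx, count in bag.items():
        # Stage 2: the share drawn from study idx equals its frequency in the bag.
        n_draw = round(n_target * count / bag_size)
        pseudo.extend((idx, rng.choice(studies[idx])) for _ in range(n_draw))
    return pseudo


if __name__ == "__main__":
    rng = random.Random(1)
    toy = [[0.1, 0.2, 0.3, 0.4], [1.1, 1.2, 1.3, 1.4], [2.1, 2.2, 2.3, 2.4]]
    for b in (1, 3, 300):
        labels = Counter(i for i, _ in study_strap(toy, b, rng))
        print(b, dict(labels))
```

With `bag_size=1` every observation comes from a single study, matching the lowest setting described in the abstract. As `bag_size` grows, the per-study shares settle near equal proportions, so for studies of equal size the draw behaves much like a bootstrap of the merged data.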
Title: Hierarchical resampling for bagging in multistudy prediction with applications to human neurochemical sensing. Annals of Applied Statistics, 16(4), 2145-2165. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9586160/pdf/nihms-1800688.pdf
Pub Date: 2022-12-01; Epub Date: 2022-09-26; DOI: 10.1214/21-aoas1581
Sen Zhao, Ali Shojaie
Identifying differences in networks has become a canonical problem in many biological applications. Existing methods try to accomplish this goal by either directly comparing the estimated structures of two networks, or testing the null hypothesis that the covariance or inverse covariance matrices in two populations are identical. However, estimation approaches do not provide measures of uncertainty, e.g., p-values, whereas existing testing approaches could lead to misleading results, as we illustrate in this paper. To address these shortcomings, we propose a qualitative hypothesis testing framework, which tests whether the connectivity structures in the two networks are the same. Our framework is especially appropriate if the goal is to identify nodes or edges that are differentially connected. No existing approach could test such hypotheses and provide corresponding measures of uncertainty. Theoretically, we show that under appropriate conditions, our proposal correctly controls the type-I error rate in testing the qualitative hypothesis. Empirically, we demonstrate the performance of our proposal using simulation studies and applications in cancer genomics.
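The distinction between the two kinds of null hypothesis can be made concrete with a toy example. The sketch below is not the authors' test: it only contrasts, at the population level, "the inverse covariance matrices are identical" with "the connectivity structures are the same". It relies on the standard fact that in a Gaussian graphical model, nonzero off-diagonal entries of the inverse covariance matrix correspond to edges. The matrices and the helper `edge_set` are hypothetical.

```python
def edge_set(precision, tol=1e-12):
    """Undirected edges (i, j), i < j, where the off-diagonal entry is nonzero."""
    p = len(precision)
    return {
        (i, j)
        for i in range(p)
        for j in range(i + 1, p)
        if abs(precision[i][j]) > tol
    }


# Hypothetical 3-node networks A and B: same edges, different edge strengths.
omega_a = [[2.0, 0.5, 0.0],
           [0.5, 2.0, 0.3],
           [0.0, 0.3, 2.0]]
omega_b = [[2.0, 0.9, 0.0],
           [0.9, 2.0, 0.1],
           [0.0, 0.1, 2.0]]
# Hypothetical network C: the edge between nodes 1 and 2 is absent.
omega_c = [[2.0, 0.5, 0.0],
           [0.5, 2.0, 0.0],
           [0.0, 0.0, 2.0]]

# Equality of matrices fails for A vs. B, yet their structures agree.
matrices_equal = omega_a == omega_b
structures_equal = edge_set(omega_a) == edge_set(omega_b)
# Edges present in exactly one of A and C are the differentially connected ones.
differential_edges = edge_set(omega_a) ^ edge_set(omega_c)

print(matrices_equal, structures_equal, sorted(differential_edges))
```

Here a test of matrix equality would flag A and B as different even though no edge appears or disappears, whereas a structural comparison isolates the single edge that distinguishes A from C.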
Title: Network Differential Connectivity Analysis. Annals of Applied Statistics, 16(4), 2166-2182. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10569671/pdf/