Pub Date: 2024-09-01 | Epub Date: 2024-08-05 | DOI: 10.1214/23-aoas1865
Yujia Li, Peng Liu, Wenjia Wang, Wei Zong, Yusi Fang, Zhao Ren, Lu Tang, Juan C Celedón, Steffi Oesterreich, George C Tseng
With advances in high-throughput technology, molecular disease subtyping by high-dimensional omics data has been recognized as an effective approach for identifying subtypes of complex diseases with distinct disease mechanisms and prognoses. Conventional cluster analysis takes omics data as input and generates patient clusters with similar gene expression patterns. The omics data, however, usually contain multi-faceted cluster structures that can be defined by different sets of genes. If a gene set associated with irrelevant clinical variables (e.g., sex or age) dominates the clustering process, the resulting clusters may not capture clinically meaningful disease subtypes. This motivates the clustering framework developed in this paper, which is guided by a pre-specified disease outcome, such as a lung function measurement or survival. We propose two outcome-guided disease subtyping methods for omics data, using a generative model or a weighted joint likelihood. Both methods connect an outcome association model and a disease subtyping model through a latent variable of cluster labels. Compared to the generative model, the weighted joint likelihood contains a data-driven weight parameter that balances the likelihood contributions from outcome association and gene cluster separation, which improves generalizability in independent validation but requires heavier computing. Extensive simulations and two real applications in lung disease and triple-negative breast cancer demonstrate superior disease subtyping performance of the outcome-guided clustering methods in terms of disease subtyping accuracy, gene selection, and outcome association. Unlike existing clustering methods, the outcome-guided disease subtyping framework creates a new precision medicine paradigm that directly identifies patient subgroups with clinical association.
OUTCOME-GUIDED DISEASE SUBTYPING BY GENERATIVE MODEL AND WEIGHTED JOINT LIKELIHOOD IN TRANSCRIPTOMIC APPLICATIONS
Annals of Applied Statistics 18(3): 1947-1964. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12309773/pdf/
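The weighted joint likelihood idea can be illustrated with a toy objective. A minimal sketch, assuming Gaussian cluster and outcome terms; the function name, data shapes, and weight value are illustrative, not the paper's specification:

```python
import numpy as np

def weighted_joint_loglik(X, y, labels, w):
    """Toy weighted joint log-likelihood: w balances the outcome-association
    term against the gene-cluster-separation term (illustrative only)."""
    ll_cluster = ll_outcome = 0.0
    for k in np.unique(labels):
        idx = labels == k
        mu = X[idx].mean(axis=0)            # cluster-specific gene means
        ll_cluster += -0.5 * ((X[idx] - mu) ** 2).sum()
        nu = y[idx].mean()                  # cluster-specific outcome mean
        ll_outcome += -0.5 * ((y[idx] - nu) ** 2).sum()
    return w * ll_outcome + (1 - w) * ll_cluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(3, 1, (20, 5))])
y = np.concatenate([rng.normal(0, 1, 20), rng.normal(5, 1, 20)])
outcome_guided = np.repeat([0, 1], 20)      # separates both genes and outcome
shuffled = rng.permutation(outcome_guided)  # ignores both
ll_good = weighted_joint_loglik(X, y, outcome_guided, 0.5)
ll_bad = weighted_joint_loglik(X, y, shuffled, 0.5)
```

A labeling that separates both the omics features and the outcome scores higher than a shuffled one, which is the behavior the weight parameter trades off.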
Pub Date: 2024-09-01 | Epub Date: 2024-08-05 | DOI: 10.1214/23-aoas1871
Xiaoran Ma, Wensheng Guo, Mengyang Gu, Len Usvyat, Peter Kotanko, Yuedong Wang
Some patients with COVID-19 show changes in signs and symptoms, such as temperature and oxygen saturation, days before testing positive for SARS-CoV-2, while others remain asymptomatic. It is important to identify these subgroups and to understand what biological and clinical predictors are related to them. This information will provide insights into how the immune system may respond differently to infection and can further be used to identify infected individuals. We propose a flexible nonparametric mixed-effects mixture model that identifies risk factors and classifies patients with biological changes. We model the latent probability of biological changes using a logistic regression model and the trajectories within the latent groups using smoothing splines. We develop an EM algorithm to maximize the penalized likelihood for estimating all parameters and mean functions. We evaluate our methods by simulations and apply the proposed model to investigate changes in temperature in a cohort of COVID-19-infected hemodialysis patients.
A NONPARAMETRIC MIXED-EFFECTS MIXTURE MODEL FOR PATTERNS OF CLINICAL MEASUREMENTS ASSOCIATED WITH COVID-19
Annals of Applied Statistics 18(3): 2080-2095. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11460989/pdf/
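The E-step/M-step alternation described above can be sketched for a two-group trajectory mixture. Plain group mean curves stand in for the paper's penalized smoothing splines, and the bump shape, noise level, and sample sizes are invented:

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 30)
n = 40
z_true = rng.integers(0, 2, n)            # latent "biological change" indicator
bump = np.sin(np.pi * t)                  # pre-diagnosis temperature bump
Y = z_true[:, None] * bump + rng.normal(0, 0.3, (n, t.size))

# EM: alternate soft assignment (E-step) and mean-curve updates (M-step)
pi, m0, m1 = 0.5, np.zeros_like(t), np.full_like(t, 0.1)
for _ in range(50):
    ll0 = -0.5 * ((Y - m0) ** 2).sum(axis=1)
    ll1 = -0.5 * ((Y - m1) ** 2).sum(axis=1)
    # responsibility of the "change" group for each subject
    r = 1.0 / (1.0 + np.exp(np.log(1 - pi) + ll0 - np.log(pi) - ll1))
    pi = r.mean()
    m0 = (1 - r) @ Y / (1 - r).sum()
    m1 = r @ Y / r.sum()

pred = (r > 0.5).astype(int)
acc = max((pred == z_true).mean(), (pred != z_true).mean())  # label-switch safe
```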
Pub Date: 2024-08-05 | DOI: 10.1214/24-aoas1889
L U You, Falastin Salami, Carina Törn, Åke Lernmark, Roy Tamura
In studies of disease progression, subjects often move through one of several disease states of interest, and multistate models are an indispensable tool for analyzing data from such studies. The Environmental Determinants of Diabetes in the Young (TEDDY) study is an observational study that follows at-risk children from birth to the onset of type 1 diabetes (T1D), up through the age of 15. A joint model for simultaneous inference of multistate and multivariate nonparametric longitudinal data is proposed to analyze the data and answer the research questions raised in the study. The proposed method allows us to make statistical inferences, test hypotheses, and make predictions about future state occupation in the TEDDY study. The performance of the proposed method is evaluated by simulation studies, and the method is applied to the motivating example to demonstrate its capabilities.
JOINT MODELING OF MULTISTATE AND NONPARAMETRIC MULTIVARIATE LONGITUDINAL DATA
Annals of Applied Statistics, pp. 2444-2461.
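A discrete-time toy version of the multistate component conveys the core idea: estimate a transition matrix from observed state sequences, then predict future state occupation. The three states, chain length, and transition probabilities below are invented for illustration; the paper's model is far richer and is fit jointly with the longitudinal data:

```python
import numpy as np

rng = np.random.default_rng(2)
# toy 3-state chain (healthy -> at-risk -> T1D), with T1D absorbing
P_true = np.array([[0.90, 0.08, 0.02],
                   [0.00, 0.85, 0.15],
                   [0.00, 0.00, 1.00]])
n, T = 500, 10
states = np.zeros((n, T), dtype=int)
for i in range(n):
    for t in range(1, T):
        states[i, t] = rng.choice(3, p=P_true[states[i, t - 1]])

# MLE of the transition matrix: row-normalized transition counts
C = np.zeros((3, 3))
for prev, nxt in zip(states[:, :-1].ravel(), states[:, 1:].ravel()):
    C[prev, nxt] += 1
P_hat = C / C.sum(axis=1, keepdims=True)

# predicted state-occupation probabilities five visits ahead, from healthy
occ5 = np.linalg.matrix_power(P_hat, 5)[0]
```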
Pub Date: 2024-06-01 | Epub Date: 2024-04-05 | DOI: 10.1214/23-aoas1829
Penghui Huang, Manqi Cai, Xinghua Lu, Chris McKennan, Jiebiao Wang
Bulk transcriptomics in tissue samples reflects the average expression levels across different cell types and is highly influenced by cellular fractions. As such, it is critical to estimate cellular fractions both to deconfound differential expression analyses and to infer cell type-specific differential expression. Since experimentally counting cells is infeasible in most tissues and studies, in silico cellular deconvolution methods have been developed as an alternative. However, existing methods are designed for tissues consisting of clearly distinguishable cell types and have difficulties estimating highly correlated or rare cell types. To address this challenge, we propose hierarchical deconvolution (HiDecon), which uses single-cell RNA sequencing references and a hierarchical cell-type tree that models the similarities among cell types and their differentiation relationships to estimate cellular fractions in bulk data. By coordinating cell fractions across layers of the hierarchical tree, cellular fraction information is passed up and down the tree, which helps correct estimation biases by pooling information across related cell types. The flexible hierarchical tree structure also enables estimating rare cell fractions by splitting the tree to higher resolutions. Through simulations and real data applications with ground-truth measured cellular fractions, we demonstrate that HiDecon outperforms existing methods and accurately estimates cellular fractions.
Finally, we show the utility of HiDecon estimates in identifying associations between cellular fractions and Alzheimer's disease.
ACCURATE ESTIMATION OF RARE CELL-TYPE FRACTIONS FROM TISSUE OMICS DATA VIA HIERARCHICAL DECONVOLUTION
Annals of Applied Statistics 18(2): 1178-1194. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12530111/pdf/
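Why pooling correlated cell types into a parent node helps can be sketched with ordinary least-squares deconvolution. The signature matrix, fractions, and noise level below are invented, and this shows only the tree-aggregation idea, not HiDecon's coordination across layers:

```python
import numpy as np

rng = np.random.default_rng(3)
# hypothetical signature matrix: 200 genes x 4 cell types, with types
# 2 and 3 nearly collinear (e.g. two closely related subtypes)
S = rng.gamma(2.0, 1.0, (200, 4))
S[:, 3] = S[:, 2] + rng.normal(0, 0.05, 200)
f_true = np.array([0.5, 0.3, 0.15, 0.05])
bulk = S @ f_true + rng.normal(0, 0.1, 200)

# leaf-level least squares: individual estimates of the collinear pair
# are unstable, but their sum is well identified
f_hat = np.linalg.lstsq(S, bulk, rcond=None)[0]

# coarse level: average the collinear pair into one parent profile,
# whose coefficient estimates the pair's combined fraction
S_coarse = np.column_stack([S[:, 0], S[:, 1], 0.5 * (S[:, 2] + S[:, 3])])
g_hat = np.linalg.lstsq(S_coarse, bulk, rcond=None)[0]
```

The combined fraction of the correlated pair (true value 0.2) is recovered both by summing the leaf estimates and by the parent-node coefficient, even though the individual leaf estimates are unreliable.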
Pub Date: 2024-06-01 | Epub Date: 2024-04-05 | DOI: 10.1214/23-aoas1826
Zehang Richard Li, Zhenke Wu, Irena Chen, Samuel J Clark
Understanding cause-specific mortality rates is crucial for monitoring population health and designing public health interventions. Worldwide, two-thirds of deaths do not have a cause assigned. Verbal autopsy (VA) is a well-established tool for collecting information describing deaths outside of hospitals by surveying caregivers of a deceased person; it is routinely implemented in many low- and middle-income countries. Statistical algorithms that assign cause of death using VAs are typically vulnerable to the distribution shift between the data used to train the model and the target population. This presents a major challenge for analyzing VAs, as labeled data are usually unavailable in the target population. This article proposes a latent class model framework for VA data (LCVA) that jointly models VAs collected over multiple heterogeneous domains, assigns causes of death for out-of-domain observations, and estimates cause-specific mortality fractions for a new domain. We introduce a parsimonious representation of the joint distribution of the collected symptoms using nested latent class models and develop a computationally efficient algorithm for posterior inference. We demonstrate that LCVA outperforms existing methods in predictive performance and scalability. Supplementary Material and reproducible analysis code are available online. The R package LCVA implementing the method is available on GitHub (https://github.com/richardli/LCVA).
BAYESIAN NESTED LATENT CLASS MODELS FOR CAUSE-OF-DEATH ASSIGNMENT USING VERBAL AUTOPSIES ACROSS MULTIPLE DOMAINS
Annals of Applied Statistics 18(2): 1137-1159. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11484295/pdf/
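A minimal latent class model with conditionally independent binary symptoms illustrates the building block that LCVA nests; the class count, symptom probabilities, and sample size below are invented, and LCVA's nested structure and multi-domain modeling are not shown:

```python
import numpy as np

rng = np.random.default_rng(4)
# two hypothetical causes of death, 10 binary VA symptoms
theta_true = np.array([np.full(10, 0.8), np.full(10, 0.2)])
z = rng.integers(0, 2, 600)
X = (rng.random((600, 10)) < theta_true[z]).astype(float)

# EM for a two-class latent class model (conditional independence)
pi, theta = 0.5, np.array([np.full(10, 0.6), np.full(10, 0.4)])
for _ in range(100):
    # E-step: class responsibilities from symptom likelihoods
    ll = X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
    ll += np.log(np.array([pi, 1 - pi]))
    r = np.exp(ll - ll.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    # M-step: class prevalence and symptom probabilities
    pi = r[:, 0].mean()
    theta = np.clip((r.T @ X) / r.sum(axis=0)[:, None], 1e-6, 1 - 1e-6)

# recovery error, accounting for label switching
err = min(np.abs(theta - theta_true).max(),
          np.abs(theta - theta_true[::-1]).max())
```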
Pub Date: 2024-06-01 | Epub Date: 2024-04-05 | DOI: 10.1214/23-aoas1850
Bora Jin, Amy H Herring, David Dunson
In this paper we predict sea surface salinity (SSS) in the Arctic Ocean from satellite measurements. SSS is a crucial indicator of ongoing changes in the Arctic Ocean and can offer important insights into climate change. We focus in particular on areas of water mistakenly flagged as ice by satellite algorithms: to remove bias in the retrieval of salinity near sea ice, the algorithms use conservative ice masks, which result in considerable loss of data. We aim to produce realistic SSS values for such regions to obtain a more complete understanding of the SSS surface over the Arctic Ocean and to benefit future applications that may require SSS measurements near the edges of sea ice or coasts. We propose a class of scalable nonstationary processes that can handle large data from satellite products and the complex geometry of the Arctic Ocean. The barrier overlap-removal acyclic directed graph GP (BORA-GP) constructs sparse directed acyclic graphs (DAGs) with neighbors conforming to barriers and boundaries, enabling characterization of dependence in constrained domains. The BORA-GP models produce more sensible SSS values in regions without satellite measurements and, in simulation studies, show improved performance in various constrained domains compared to state-of-the-art alternatives. An R package is available at https://github.com/jinbora0720/boraGP.
SPATIAL PREDICTIONS ON PHYSICALLY CONSTRAINED DOMAINS: APPLICATIONS TO ARCTIC SEA SALINITY DATA
Annals of Applied Statistics 18(2): 1596-1617. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12391905/pdf/
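The "neighbors conforming to barriers" idea can be sketched with a planar segment-intersection test: a candidate neighbor is kept only if the straight path to it does not cross the barrier. This is a toy geometric check on invented points, not BORA-GP's actual DAG construction:

```python
import numpy as np

def crosses(p, q, a, b):
    """True if segment p-q properly intersects barrier segment a-b."""
    def ccw(u, v, w):
        return (w[1] - u[1]) * (v[0] - u[0]) > (v[1] - u[1]) * (w[0] - u[0])
    return ccw(p, a, b) != ccw(q, a, b) and ccw(p, q, a) != ccw(p, q, b)

def barrier_neighbors(x, pts, barrier, k):
    """Indices of the k nearest points whose straight path to x
    does not cross the barrier segment."""
    order = np.argsort(np.linalg.norm(pts - x, axis=1))
    keep = [i for i in order if not crosses(x, pts[i], *barrier)]
    return keep[:k]

# a vertical "wall" at x=0 between y=-1 and y=1 (like a land barrier)
barrier = (np.array([0.0, -1.0]), np.array([0.0, 1.0]))
pts = np.array([[-0.1, 0.0],   # nearest in distance, but behind the wall
                [0.1, 0.1],
                [0.2, -0.2],
                [0.1, 2.0]])
nb = barrier_neighbors(np.array([0.3, 0.0]), pts, barrier, 2)
```

The closest point in Euclidean distance (index 0) is excluded because the wall separates it from the prediction location, so dependence is not borrowed "through land".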
Pub Date: 2024-06-01 | Epub Date: 2024-04-05 | DOI: 10.1214/23-aoas1835
Zihuan Liu, Cheuk Yin Lee, Heping Zhang
Neuroimaging studies often involve predicting a scalar outcome from an array of images, collectively called a tensor. Magnetic resonance imaging (MRI) provides a unique opportunity to investigate the structures of the brain. To learn the association between MRI images and human intelligence, we formulate a scalar-on-image quantile regression framework. However, the high dimensionality of the tensor makes estimating the coefficients for all of its elements computationally challenging. To address this, we propose a low-rank coefficient array estimation algorithm based on tensor train (TT) decomposition, which we demonstrate can effectively reduce the dimensionality of the coefficient tensor to a feasible level while maintaining an adequate fit to the data. Our method is more stable and efficient than the commonly used method based on Canonical Polyadic (CP) rank approximation. We also propose a generalized Lasso penalty on the coefficient tensor to take advantage of the spatial structure of the tensor, further reduce its dimensionality, and improve the interpretability of the model. The consistency and asymptotic normality of the TT estimator are established under mild conditions on the covariates and random errors in quantile regression models. The rate of convergence is obtained with regularization under the total variation penalty. Extensive numerical studies, including both synthetic and real MRI imaging data, are conducted to examine the empirical performance of the proposed method and its competitors.
TENSOR QUANTILE REGRESSION WITH LOW-RANK TENSOR TRAIN ESTIMATION
Annals of Applied Statistics 18(2): 1294-1318. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11046526/pdf/
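The TT decomposition underlying the estimator can be sketched via sequential truncated SVDs (the generic TT-SVD construction). This is an illustration of the TT format only, not the authors' estimation algorithm, which couples the TT structure with a quantile loss and a generalized Lasso penalty:

```python
import numpy as np

def tt_svd(T, ranks):
    """Factor a d-way array into tensor-train cores by sequential
    truncated SVDs (generic TT-SVD sketch)."""
    cores, r_prev = [], 1
    dims = T.shape
    M = T.reshape(dims[0], -1)
    for k, n_k in enumerate(dims[:-1]):
        M = M.reshape(r_prev * n_k, -1)
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        r = min(ranks[k], len(s))
        cores.append(U[:, :r].reshape(r_prev, n_k, r))
        M = s[:r, None] * Vt[:r]     # carry the remainder forward
        r_prev = r
    cores.append(M.reshape(r_prev, dims[-1], 1))
    return cores

def tt_reconstruct(cores):
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=([out.ndim - 1], [0]))
    return out.reshape([c.shape[1] for c in cores])

rng = np.random.default_rng(7)
# build an exactly TT-rank-(2,2) tensor, then recover it
G = [rng.normal(size=(1, 4, 2)), rng.normal(size=(2, 5, 2)),
     rng.normal(size=(2, 6, 1))]
T = tt_reconstruct(G)
cores = tt_svd(T, ranks=[2, 2])
T_hat = tt_reconstruct(cores)
```

A 4x5x6 array with 120 entries is represented by cores holding 8 + 20 + 12 = 40 numbers, and an exactly low-TT-rank tensor is recovered without loss.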
Pub Date: 2024-06-01 | Epub Date: 2024-04-05 | DOI: 10.1214/23-aoas1838
Sunyi Chi, Christopher R Flowers, Ziyi Li, Xuelin Huang, Peng Wei
Environmental exposures such as cigarette smoking influence health outcomes through intermediate molecular phenotypes, such as the methylome, transcriptome, and metabolome. Mediation analysis is a useful tool for investigating the role of potentially high-dimensional intermediate phenotypes in the relationship between environmental exposures and health outcomes. However, little work has been done on mediation analysis when the mediators are high-dimensional and the outcome is a survival endpoint, and none of it has provided a robust measure of the total mediation effect. To this end, we propose an estimation procedure for Mediation Analysis of Survival outcome and High-dimensional omics mediators (MASH), based on sure independence screening for putative mediator variable selection and a second-moment-based measure of the total mediation effect for survival data, analogous to the R² measure in a linear model. Extensive simulations showed good performance of MASH in estimating the total mediation effect and identifying true mediators. By applying MASH to the metabolomics data of 1919 subjects in the Framingham Heart Study, we identified five metabolites as mediators of the effect of cigarette smoking on coronary heart disease risk (total mediation effect, 51.1%) and two metabolites as mediators between smoking and risk of cancer (total mediation effect, 50.7%). Application of MASH to a diffuse large B-cell lymphoma genomics data set identified copy-number variations for eight genes as mediators between the baseline International Prognostic Index score and overall survival.
MASH: MEDIATION ANALYSIS OF SURVIVAL OUTCOME AND HIGH-DIMENSIONAL OMICS MEDIATORS WITH APPLICATION TO COMPLEX DISEASES
Annals of Applied Statistics 18(2): 1360-1377. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11426188/pdf/
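Sure independence screening, the mediator-selection step named above, can be sketched as marginal association ranking on a continuous surrogate outcome. The exposure, effect sizes, and screening size below are invented, and the survival-specific machinery of MASH is not shown:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 300, 1000
smoking = rng.integers(0, 2, n).astype(float)   # hypothetical binary exposure
M = rng.normal(0, 1, (n, p))                    # high-dimensional mediators
M[:, :5] += 1.5 * smoking[:, None]              # first 5 are true mediators
outcome = M[:, :5].sum(axis=1) + rng.normal(0, 1, n)

# sure independence screening: keep the d mediators with the largest
# marginal (absolute) correlation with the outcome
score = np.abs(np.corrcoef(M.T, outcome)[-1, :-1])
keep = np.argsort(score)[::-1][:20]
```

With 1000 candidates and a modest sample, the top-20 screened set still retains all five true mediators, which is the "sure screening" property the procedure relies on.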
Pub Date : 2024-06-01Epub Date: 2024-04-05DOI: 10.1214/23-aoas1852
Irena Chen, Zhenke Wu, Siobán D Harlow, Carrie A Karvonen-Gutierrez, Michelle M Hood, Michael R Elliott
Longitudinal biomarker data and cross-sectional outcomes are routinely collected in modern epidemiology studies, often with the goal of informing tailored early intervention decisions. For example, hormones, such as estradiol (E2) and follicle-stimulating hormone (FSH), may predict changes in women's health during midlife. Most existing methods focus on constructing predictors from mean marker trajectories. However, subject-level biomarker variability may also provide critical information about disease risks and health outcomes. Current literature does not provide statistical models to investigate such relationships with valid uncertainty quantification. In this paper we develop a fully Bayesian joint model that estimates subject-level means, variances, and covariances of multiple longitudinal biomarkers and uses these as predictors to evaluate their respective associations with a cross-sectional health outcome. Simulations demonstrate excellent recovery of true model parameters. The proposed method provides less biased and more efficient estimates, relative to alternative approaches that either ignore subject-level differences in variances or perform two-stage estimation where estimated marker variances are treated as observed. Empowered by the model, analyses of women's health data reveal, for the first time, that larger variability of E2 was associated with slower increases in waist circumference across the menopausal transition.
{"title":"VARIANCE AS A PREDICTOR OF HEALTH OUTCOMES: SUBJECT-LEVEL TRAJECTORIES AND VARIABILITY OF SEX HORMONES TO PREDICT BODY FAT CHANGES IN PERI- AND POSTMENOPAUSAL WOMEN.","authors":"Irena Chen, Zhenke Wu, Siobán D Harlow, Carrie A Karvonen-Gutierrez, Michelle M Hood, Michael R Elliott","doi":"10.1214/23-aoas1852","DOIUrl":"10.1214/23-aoas1852","url":null,"abstract":"<p><p>Longitudinal biomarker data and cross-sectional outcomes are routinely collected in modern epidemiology studies, often with the goal of informing tailored early intervention decisions. For example, hormones, such as estradiol (E2) and follicle-stimulating hormone (FSH), may predict changes in womens' health during the midlife. Most existing methods focus on constructing predictors from mean marker trajectories. However, subject-level biomarker variability may also provide critical information about disease risks and health outcomes. Current literature does not provide statistical models to investigate such relationships with valid uncertainty quantification. In this paper we develop a fully Bayesian joint model that estimates subject-level means, variances, and covariances of multiple longitudinal biomarkers and uses these as predictors to evaluate their respective associations with a cross-sectional health outcome. Simulations demonstrate excellent recovery of true model parameters. The proposed method provides less biased and more efficient estimates, relative to alternative approaches that either ignore subject-level differences in variances or perform two-stage estimation where estimated marker variances are treated as observed. 
Empowered by the model, analyses of women's health data reveal, for the first time, that larger variability of E2 was associated with slower increases in waist circumference across the menopausal transition.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"18 2","pages":"1642-1667"},"PeriodicalIF":1.4,"publicationDate":"2024-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12309625/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144755011","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
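The two-stage alternative that the abstract above critiques (estimate each subject's marker mean and variance, then plug the estimates into an outcome regression as if observed) is easy to sketch. The joint Bayesian model improves on this by propagating the stage-1 uncertainty; the function name and simulation here are illustrative assumptions, not code from the paper:

```python
import numpy as np

def two_stage_variance_predictor(marker_series, outcome):
    """Two-stage plug-in sketch: (1) summarize each subject's longitudinal
    marker by its sample mean and sample SD; (2) regress the cross-sectional
    outcome on both summaries via OLS.
    Returns [intercept, mean effect, variability effect].
    The estimated SDs are treated as if observed, which is exactly the
    simplification that biases and de-efficiencies this approach relative
    to a joint model."""
    means = np.array([np.mean(m) for m in marker_series])
    sds = np.array([np.std(m, ddof=1) for m in marker_series])
    X = np.column_stack([np.ones(len(outcome)), means, sds])
    beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    return beta
```

Even in this crude form, simulating subjects whose outcome depends negatively on their true marker SD yields a clearly negative estimated variability effect, though attenuated toward zero by the measurement error in the plug-in SDs, which is the bias the joint model removes.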
Pub Date : 2024-06-01Epub Date: 2024-04-05DOI: 10.1214/23-aoas1849
Emily N Peterson, Rachel C Nethery, Tullia Padellini, Jarvis T Chen, Brent A Coull, Frédéric B Piel, Jon Wakefield, Marta Blangiardo, Lance A Waller
Small area population counts are necessary for many epidemiological studies, yet their quality and accuracy are often not assessed. In the United States, small area population counts are published by the United States Census Bureau (USCB) in the form of the decennial census counts, intercensal population projections (PEP), and American Community Survey (ACS) estimates. Although there are significant relationships between these three data sources, there are important contrasts in data collection, data availability, and processing methodologies, such that each set of reported population counts may be subject to different sources and magnitudes of error. Additionally, these data sources do not report identical small area population counts due to post-survey adjustments specific to each data source. Consequently, in public health studies, small area disease/mortality rates may differ depending on which data source is used for denominator data. To accurately estimate annual small area population counts and their associated uncertainties, we present a Bayesian population (BPop) model, which fuses information from all three USCB sources, accounting for data-source-specific methodologies and associated errors. We produce comprehensive small area race-stratified estimates of the true population, and associated uncertainties, given the observed trends in all three USCB population estimates. The main features of our framework are: (1) a single model integrating multiple data sources, (2) accounting for data-source-specific data-generating mechanisms and, specifically, data-source-specific errors, and (3) prediction of population counts for years without USCB-reported data. We focus our study on the Black and White only populations for 159 counties of Georgia and produce estimates for years 2006-2023. We compare BPop population estimates to decennial census counts, PEP annual counts, and ACS multi-year estimates. Additionally, we illustrate and explain the different types of data-source-specific errors. Lastly, we compare model performance using simulations and validation exercises. Our Bayesian population model can be extended to other applications at smaller spatial granularity and for demographic subpopulations defined further by race, age, and sex, and/or for other geographical regions.
{"title":"A BAYESIAN HIERARCHICAL SMALL AREA POPULATION MODEL ACCOUNTING FOR DATA SOURCE SPECIFIC METHODOLOGIES FROM AMERICAN COMMUNITY SURVEY, POPULATION ESTIMATES PROGRAM, AND DECENNIAL CENSUS DATA.","authors":"Emily N Peterson, Rachel C Nethery, Tullia Padellini, Jarvis T Chen, Brent A Coull, Frédéric B Piel, Jon Wakefield, Marta Blangiardo, Lance A Waller","doi":"10.1214/23-aoas1849","DOIUrl":"https://doi.org/10.1214/23-aoas1849","url":null,"abstract":"<p><p>Small area population counts are necessary for many epidemiological studies, yet their quality and accuracy are often not assessed. In the United States, small area population counts are published by the United States Census Bureau (USCB) in the form of the decennial census counts, intercensal population projections (PEP), and American Community Survey (ACS) estimates. Although there are significant relationships between these three data sources, there are important contrasts in data collection, data availability, and processing methodologies such that each set of reported population counts may be subject to different sources and magnitudes of error. Additionally, these data sources do not report identical small area population counts due to post-survey adjustments specific to each data source. Consequently, in public health studies, small area disease/mortality rates may differ depending on which data source is used for denominator data. To accurately estimate annual small area population counts <i>and their</i> associated uncertainties, we present a Bayesian population (BPop) model, which fuses information from all three USCB sources, accounting for data source specific methodologies and associated errors. We produce comprehensive small area race-stratified estimates of the true population, and associated uncertainties, given the observed trends in all three USCB population estimates. 
The main features of our framework are: (1) a single model integrating multiple data sources, (2) accounting for data source specific data generating mechanisms and specifically accounting for data source specific errors, and (3) prediction of population counts for years without USCB reported data. We focus our study on the Black and White only populations for 159 counties of Georgia and produce estimates for years 2006-2023. We compare BPop population estimates to decennial census counts, PEP annual counts, and ACS multi-year estimates. Additionally, we illustrate and explain the different types of data source specific errors. Lastly, we compare model performance using simulations and validation exercises. Our Bayesian population model can be extended to other applications at smaller spatial granularity and for demographic subpopulations defined further by race, age, and sex, and/or for other geographical regions.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"18 2","pages":"1565-1595"},"PeriodicalIF":1.3,"publicationDate":"2024-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11423836/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142331820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
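The core intuition behind fusing several data sources with source-specific error models, as in the BPop framework above, can be sketched with a one-area, one-year toy. This is a drastic simplification of the paper's hierarchical, race-stratified, time-indexed model: each source is assumed to report truth plus independent Normal error, and the function name is an illustrative assumption:

```python
import numpy as np

def fuse_population_counts(estimates, error_sds):
    """Precision-weighted fusion of several noisy counts of one population.
    Under independent Normal(0, sd_source) errors, the posterior mean of
    the true count (flat prior) is the inverse-variance-weighted average,
    so sources with smaller error variance pull the fused estimate harder,
    mirroring how a low-error source like the decennial census dominates a
    higher-error survey estimate."""
    w = 1.0 / np.asarray(error_sds, dtype=float) ** 2   # precisions
    est = np.asarray(estimates, dtype=float)
    fused = float((w * est).sum() / w.sum())
    fused_sd = float(1.0 / np.sqrt(w.sum()))            # fused uncertainty
    return fused, fused_sd
```

For example, fusing counts of 100, 110, and 90 with error SDs of 1, 10, and 10 returns a fused estimate essentially equal to the precise source's 100, with uncertainty below any single source's.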