With the growing cost of health care in the United States, the need to improve efficiency and efficacy has become increasingly urgent. There has been a keen interest in developing interventions to effectively coordinate the typically fragmented care of patients with many comorbidities. Evaluation of such interventions is often challenging given their long-term nature and their differential effectiveness among different patients. Furthermore, care coordination interventions are often highly resource-intensive. Hence there is a pressing need to identify which patients would benefit the most from a care coordination program. In this work we introduce a subgroup identification procedure for long-term interventions whose effects are expected to change smoothly over time. We allow differential effects of an intervention to vary over time and encourage these effects to be more similar for closer time points by utilizing a fused lasso penalty. Our approach allows for flexible modeling of temporally changing intervention effects while also borrowing strength in estimation over time. We utilize our approach to construct a personalized enrollment decision rule for a complex case management intervention in a large health system and demonstrate that the enrollment decision rule results in improvement in health outcomes and care costs. The proposed methodology could have broad usage for the analysis of different types of long-term interventions or treatments, including other interventions commonly implemented in health systems.
{"title":"FUSED COMPARATIVE INTERVENTION SCORING FOR HETEROGENEITY OF LONGITUDINAL INTERVENTION EFFECTS.","authors":"Jared D Huling, Menggang Yu, Maureen Smith","doi":"10.1214/18-aoas1216","DOIUrl":"https://doi.org/10.1214/18-aoas1216","url":null,"abstract":"<p><p>With the growing cost of health care in the United States, the need to improve efficiency and efficacy has become increasingly urgent. There has been a keen interest in developing interventions to effectively coordinate the typically fragmented care of patients with many comorbidities. Evaluation of such interventions is often challenging given their long-term nature and their differential effectiveness among different patients. Furthermore, care coordination interventions are often highly resource-intensive. Hence there is pressing need to identify which patients would benefit the most from a care coordination program. In this work we introduce a subgroup identification procedure for long-term interventions whose effects are expected to change smoothly over time. We allow differential effects of an intervention to vary over time and encourage these effects to be more similar for closer time points by utilizing a fused lasso penalty. Our approach allows for flexible modeling of temporally changing intervention effects while also borrowing strength in estimation over time. We utilize our approach to construct a personalized enrollment decision rule for a complex case management intervention in a large health system and demonstrate that the enrollment decision rule results in improvement in health outcomes and care costs. The proposed methodology could have broad usage for the analysis of different types of long-term interventions or treatments including other interventions commonly implemented in health systems.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"13 2","pages":"824-847"},"PeriodicalIF":1.8,"publicationDate":"2019-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1214/18-aoas1216","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9455781","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2019-03-01. Epub Date: 2019-04-10. DOI: 10.1214/18-AOAS1188
Zhiguang Huo, Chi Song, George Tseng
Due to the rapid development of high-throughput experimental techniques and fast-dropping prices, many transcriptomic datasets have been generated and accumulated in the public domain. Meta-analysis combining multiple transcriptomic studies can increase the statistical power to detect disease-related biomarkers. In this paper, we introduce a Bayesian latent hierarchical model to perform transcriptomic meta-analysis. This method is capable of detecting genes that are differentially expressed (DE) in only a subset of the combined studies, and the latent variables help quantify homogeneous and heterogeneous differential expression signals across studies. A tight clustering algorithm is applied to the detected biomarkers to capture differential meta-patterns that are informative for guiding further biological investigation. Simulations and three examples, including a microarray dataset from metabolism-related knockout mice, an RNA-seq dataset from HIV transgenic rats, and cross-platform datasets from human breast cancer, are used to demonstrate the performance of the proposed method.
{"title":"BAYESIAN LATENT HIERARCHICAL MODEL FOR TRANSCRIPTOMIC META-ANALYSIS TO DETECT BIOMARKERS WITH CLUSTERED META-PATTERNS OF DIFFERENTIAL EXPRESSION SIGNALS.","authors":"Zhiguang Huo, Chi Song, George Tseng","doi":"10.1214/18-AOAS1188","DOIUrl":"10.1214/18-AOAS1188","url":null,"abstract":"<p><p>Due to the rapid development of high-throughput experimental techniques and fast-dropping prices, many transcriptomic datasets have been generated and accumulated in the public domain. Meta-analysis combining multiple transcriptomic studies can increase the statistical power to detect disease-related biomarkers. In this paper, we introduce a Bayesian latent hierarchical model to perform transcriptomic meta-analysis. This method is capable of detecting genes that are differentially expressed (DE) in only a subset of the combined studies, and the latent variables help quantify homogeneous and heterogeneous differential expression signals across studies. A tight clustering algorithm is applied to detected biomarkers to capture differential meta-patterns that are informative to guide further biological investigation. Simulations and three examples, including a microarray dataset from metabolism-related knockout mice, an RNA-seq dataset from HIV transgenic rats, and cross-platform datasets from human breast cancer, are used to demonstrate the performance of the proposed method.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"13 1","pages":"340-366"},"PeriodicalIF":1.3,"publicationDate":"2019-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6472949/pdf/nihms-977410.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"37171811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2019-03-01. Epub Date: 2019-04-10. DOI: 10.1214/18-AOAS1182
Juhee Lee, Peter F Thall, Steven H Lin
We propose a Bayesian semiparametric joint regression model for a recurrent event process and survival time. Assuming independent latent subject frailties, we define marginal models for the recurrent event process intensity and survival distribution as functions of the subject's frailty and baseline covariates. A robust Bayesian model, called Joint-DP, is obtained by assuming a Dirichlet process for the frailty distribution. We present a simulation study that compares posterior estimates under the Joint-DP model to a Bayesian joint model with lognormal frailties, a frequentist joint model, and marginal models for either the recurrent event process or survival time. The simulations show that the Joint-DP model does a good job of correcting for treatment assignment bias, and has favorable estimation reliability and accuracy compared with the alternative models. The Joint-DP model is applied to analyze an observational dataset from esophageal cancer patients treated with chemo-radiation, including the times of recurrent effusions of fluid to the heart or lungs, survival time, prognostic covariates, and radiation therapy modality.
{"title":"Bayesian Semiparametric Joint Regression Analysis of Recurrent Adverse Events and Survival in Esophageal Cancer Patients.","authors":"Juhee Lee, Peter F Thall, Steven H Lin","doi":"10.1214/18-AOAS1182","DOIUrl":"10.1214/18-AOAS1182","url":null,"abstract":"<p><p>We propose a Bayesian semiparametric joint regression model for a recurrent event process and survival time. Assuming independent latent subject frailties, we define marginal models for the recurrent event process intensity and survival distribution as functions of the subject's frailty and baseline covariates. A robust Bayesian model, called Joint-DP, is obtained by assuming a Dirichlet process for the frailty distribution. We present a simulation study that compares posterior estimates under the Joint-DP model to a Bayesian joint model with lognormal frailties, a frequentist joint model, and marginal models for either the recurrent event process or survival time. The simulations show that the Joint-DP model does a good job of correcting for treatment assignment bias, and has favorable estimation reliability and accuracy compared with the alternative models. The Joint-DP model is applied to analyze an observational dataset from esophageal cancer patients treated with chemo-radiation, including the times of recurrent effusions of fluid to the heart or lungs, survival time, prognostic covariates, and radiation therapy modality.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"13 1","pages":"221-247"},"PeriodicalIF":1.3,"publicationDate":"2019-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6824476/pdf/nihms969597.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41219382","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2019-03-01. Epub Date: 2019-04-10. DOI: 10.1214/18-AOAS1187
Xiaoyue Niu, Peter D Hoff
Health exams determine a patient's health status by comparing the patient's measurement with a population reference range, a 95% interval derived from a homogeneous reference population. Similarly, most of the established relations among health problems are assumed to hold for the entire population. We use data from the 2009-2010 National Health and Nutrition Examination Survey (NHANES) on four major health problems in the U.S. and apply a joint mean and covariance model to study how the reference ranges and associations of those health outcomes could vary among subpopulations. We discuss guidelines for model selection and evaluation, using standard criteria such as AIC in conjunction with posterior predictive checks. The results from the proposed model can help identify subpopulations in which more data need to be collected to refine the reference range and to study the specific associations among those health problems.
{"title":"JOINT MEAN AND COVARIANCE MODELING OF MULTIPLE HEALTH OUTCOME MEASURES.","authors":"Xiaoyue Niu, Peter D Hoff","doi":"10.1214/18-AOAS1187","DOIUrl":"https://doi.org/10.1214/18-AOAS1187","url":null,"abstract":"<p><p>Health exams determine a patient's health status by comparing the patient's measurement with a population reference range, a 95% interval derived from a homogeneous reference population. Similarly, most of the established relation among health problems are assumed to hold for the entire population. We use data from the 2009-2010 National Health and Nutrition Examination Survey (NHANES) on four major health problems in the U.S. and apply a joint mean and covariance model to study how the reference ranges and associations of those health outcomes could vary among subpopulations. We discuss guidelines for model selection and evaluation, using standard criteria such as AIC in conjunction with posterior predictive checks. The results from the proposed model can help identify subpopulations in which more data need to be collected to refine the reference range and to study the specific associations among those health problems.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"13 1","pages":"321-339"},"PeriodicalIF":1.8,"publicationDate":"2019-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1214/18-AOAS1187","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41219384","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We propose a new approach for estimating causal effects when the exposure is measured with error and confounding adjustment is performed via a generalized propensity score (GPS). Using validation data, we propose a regression calibration (RC)-based adjustment for a continuous error-prone exposure combined with GPS to adjust for confounding (RC-GPS). The outcome analysis is conducted after transforming the corrected continuous exposure into a categorical exposure. We consider confounding adjustment in the context of GPS subclassification, inverse probability treatment weighting (IPTW) and matching. In simulations with varying degrees of exposure error and confounding bias, RC-GPS eliminates bias from exposure error and confounding compared to standard approaches that rely on the error-prone exposure. We applied RC-GPS to a rich data platform to estimate the causal effect of long-term exposure to fine particles (PM2.5) on mortality in New England for the period from 2000 to 2012. The main study consists of 2202 zip codes covered by 217,660 1 km × 1 km grid cells with yearly mortality rates, yearly PM2.5 averages estimated from a spatio-temporal model (error-prone exposure) and several potential confounders. The internal validation study includes a subset of 83 1 km × 1 km grid cells within 75 zip codes from the main study with error-free yearly PM2.5 exposures obtained from monitor stations. Under assumptions of noninterference and weak unconfoundedness, using matching we found that exposure to moderate levels of PM2.5 (8 < PM2.5 ≤ 10 μg/m3) causes a 2.8% (95% CI: 0.6%, 3.6%) increase in all-cause mortality compared to low exposure (PM2.5 ≤ 8 μg/m3).
{"title":"CAUSAL INFERENCE IN THE CONTEXT OF AN ERROR PRONE EXPOSURE: AIR POLLUTION AND MORTALITY.","authors":"Xiao Wu, Danielle Braun, Marianthi-Anna Kioumourtzoglou, Christine Choirat, Qian Di, Francesca Dominici","doi":"10.1214/18-AOAS1206","DOIUrl":"https://doi.org/10.1214/18-AOAS1206","url":null,"abstract":"<p><p>We propose a new approach for estimating causal effects when the exposure is measured with error and confounding adjustment is performed via a generalized propensity score (GPS). Using validation data, we propose a regression calibration (RC)-based adjustment for a continuous error-prone exposure combined with GPS to adjust for confounding (RC-GPS). The outcome analysis is conducted after transforming the corrected continuous exposure into a categorical exposure. We consider confounding adjustment in the context of GPS subclassification, inverse probability treatment weighting (IPTW) and matching. In simulations with varying degrees of exposure error and confounding bias, RC-GPS eliminates bias from exposure error and confounding compared to standard approaches that rely on the error-prone exposure. We applied RC-GPS to a rich data platform to estimate the causal effect of long-term exposure to fine particles (PM<sub>2.5</sub>) on mortality in New England for the period from 2000 to 2012. The main study consists of 2202 zip codes covered by 217,660 1 km × 1 km grid cells with yearly mortality rates, yearly PM<sub>2.5</sub> averages estimated from a spatio-temporal model (error-prone exposure) and several potential confounders. The internal validation study includes a subset of 83 1 km × 1 km grid cells within 75 zip codes from the main study with error-free yearly PM<sub>2.5</sub> exposures obtained from monitor stations. Under assumptions of noninterference and weak unconfoundedness, using matching we found that exposure to moderate levels of PM<sub>2.5</sub> (8 < PM<sub>2.5</sub> ≤ 10 <i>μ</i>g/m<sup>3</sup>) causes a 2.8% (95% CI: 0.6%, 3.6%) increase in all-cause mortality compared to low exposure (PM<sub>2.5</sub> ≤ 8 <i>μ</i>g/m<sup>3</sup>).</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"13 1","pages":"520-547"},"PeriodicalIF":1.8,"publicationDate":"2019-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1214/18-AOAS1206","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41219383","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2019-03-01. Epub Date: 2019-04-10. DOI: 10.1214/18-AOAS1185
Eugene Katsevich, Chiara Sabatti
We tackle the problem of selecting from among a large number of variables those that are "important" for an outcome. We consider situations where groups of variables are also of interest. For example, each variable might be a genetic polymorphism, and we might want to study how a trait depends on variability in genes, segments of DNA that typically contain multiple such polymorphisms. In this context, to discover that a variable is relevant for the outcome implies discovering that the larger entity it represents is also important. To guarantee meaningful results with high chance of replicability, we suggest controlling the rate of false discoveries for findings at the level of individual variables and at the level of groups. Building on the knockoff construction of Barber and Candès [Ann. Statist. 43 (2015) 2055-2085] and the multilayer testing framework of Barber and Ramdas [J. Roy. Statist. Soc. Ser. B 79 (2017) 1247-1268], we introduce the multilayer knockoff filter (MKF). We prove that MKF simultaneously controls the FDR at each resolution and use simulations to show that it incurs little power loss compared to methods that provide guarantees only for the discoveries of individual variables. We apply MKF to analyze a genetic dataset and find that it successfully reduces the number of false gene discoveries without a significant reduction in power.
{"title":"MULTILAYER KNOCKOFF FILTER: CONTROLLED VARIABLE SELECTION AT MULTIPLE RESOLUTIONS.","authors":"Eugene Katsevich, Chiara Sabatti","doi":"10.1214/18-AOAS1185","DOIUrl":"https://doi.org/10.1214/18-AOAS1185","url":null,"abstract":"<p><p>We tackle the problem of selecting from among a large number of variables those that are \"important\" for an outcome. We consider situations where groups of variables are also of interest. For example, each variable might be a genetic polymorphism, and we might want to study how a trait depends on variability in genes, segments of DNA that typically contain multiple such polymorphisms. In this context, to discover that a variable is relevant for the outcome implies discovering that the larger entity it represents is also important. To guarantee meaningful results with high chance of replicability, we suggest controlling the rate of false discoveries for findings at the level of individual variables and at the level of groups. Building on the knockoff construction of Barber and Candès [<i>Ann. Statist.</i> <b>43</b> (2015) 2055-2085] and the multilayer testing framework of Barber and Ramdas [<i>J. Roy. Statist. Soc. Ser. B</i> <b>79</b> (2017) 1247-1268], we introduce the multilayer knockoff filter (MKF). We prove that MKF simultaneously controls the FDR at each resolution and use simulations to show that it incurs little power loss compared to methods that provide guarantees only for the discoveries of individual variables. We apply MKF to analyze a genetic dataset and find that it successfully reduces the number of false gene discoveries without a significant reduction in power.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"13 1","pages":"1-33"},"PeriodicalIF":1.8,"publicationDate":"2019-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1214/18-AOAS1185","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41219385","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2019-03-01. Epub Date: 2019-04-10. DOI: 10.1214/18-aoas1199
Jonggyu Baek, Bin Zhu, Peter X K Song
Early infancy, from birth to 3 years, is critical for the cognitive, emotional and social development of infants. During this period, an infant's developmental tempo and outcomes are potentially impacted by in utero exposure to endocrine disrupting compounds (EDCs), such as bisphenol A (BPA) and phthalates. We investigate effects of ten ubiquitous EDCs on the infant growth dynamics of body mass index (BMI) in a birth cohort study. Modeling growth acceleration is proposed to understand the "force of growth" through a class of semiparametric stochastic velocity models. The great flexibility of such a dynamic model enables us to capture subject-specific dynamics of growth trajectories and to assess effects of the EDCs on potential delay of growth. We adopted a Bayesian method with the Ornstein-Uhlenbeck process as the prior for the growth rate function, in which the World Health Organization's global infant growth curves were integrated into our analysis. We found that BPA and most of the phthalates, for exposures during the first trimester of pregnancy, were inversely associated with BMI growth acceleration, resulting in delayed achievement of the infant BMI peak. Such early growth deficiency has been reported to have a profound impact on health outcomes in puberty (e.g., timing of sexual maturation) and adulthood.
{"title":"BAYESIAN ANALYSIS OF INFANT'S GROWTH DYNAMICS WITH <i>IN UTERO</i> EXPOSURE TO ENVIRONMENTAL TOXICANTS.","authors":"Jonggyu Baek, Bin Zhu, Peter X K Song","doi":"10.1214/18-aoas1199","DOIUrl":"10.1214/18-aoas1199","url":null,"abstract":"<p><p>Early infancy from at-birth to 3 years is critical for cognitive, emotional and social development of infants. During this period, infant's developmental tempo and outcomes are potentially impacted by <i>in utero</i> exposure to endocrine disrupting compounds (EDCs), such as bisphenol A (BPA) and phthalates. We investigate effects of ten ubiquitous EDCs on the infant growth dynamics of body mass index (BMI) in a birth cohort study.Modeling growth acceleration is proposed to understand the \"force of growth\" through a class of semiparametric stochastic velocity models. The great flexibility of such a dynamic model enables us to capture subject-specific dynamics of growth trajectories and to assess effects of the EDCs on potential delay of growth. We adopted a Bayesian method with the Ornstein-Uhlenbeck process as the prior for the growth rate function, in which the World Health Organization global infant's growth curves were integrated into our analysis. We found that BPA and most of phthalates exposed during the first trimester of pregnancy were inversely associated with BMI growth acceleration, resulting in a delayed achievement of infant BMI peak. Such early growth deficiency has been reported as a profound impact on health outcomes in puberty (e.g., timing of sexual maturation) and adulthood.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"13 1","pages":"297-320"},"PeriodicalIF":1.8,"publicationDate":"2019-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10617987/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71428742","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2018-12-01. Epub Date: 2018-11-13. DOI: 10.1214/18-AOAS1162
Sean Jewell, Daniela Witten
In recent years new technologies in neuroscience have made it possible to measure the activities of large numbers of neurons simultaneously in behaving animals. For each neuron a fluorescence trace is measured; this can be seen as a first-order approximation of the neuron's activity over time. Determining the exact time at which a neuron spikes on the basis of its fluorescence trace is an important open problem in the field of computational neuroscience. Recently, a convex optimization problem involving an ℓ1 penalty was proposed for this task. In this paper we slightly modify that recent proposal by replacing the ℓ1 penalty with an ℓ0 penalty. In stark contrast to the conventional wisdom that ℓ0 optimization problems are computationally intractable, we show that the resulting optimization problem can be efficiently solved for the global optimum using an extremely simple and efficient dynamic programming algorithm. Our R-language implementation of the proposed algorithm runs in a few minutes on fluorescence traces of 100,000 timesteps. Furthermore, our proposal leads to substantial improvements over the previous ℓ1 proposal, in simulations as well as on two calcium imaging datasets. R-language software for our proposal is available on CRAN in the package LZeroSpikeInference. Instructions for running this software in python can be found at https://github.com/jewellsean/LZeroSpikeInference.
{"title":"EXACT SPIKE TRAIN INFERENCE VIA ℓ<sub>0</sub> OPTIMIZATION.","authors":"Sean Jewell, Daniela Witten","doi":"10.1214/18-AOAS1162","DOIUrl":"10.1214/18-AOAS1162","url":null,"abstract":"<p><p>In recent years new technologies in neuroscience have made it possible to measure the activities of large numbers of neurons simultaneously in behaving animals. For each neuron a <i>fluorescence trace</i> is measured; this can be seen as a first-order approximation of the neuron's activity over time. Determining the exact time at which a neuron spikes on the basis of its fluorescence trace is an important open problem in the field of computational neuroscience. Recently, a convex optimization problem involving an ℓ<sub>1</sub> penalty was proposed for this task. In this paper we slightly modify that recent proposal by replacing the ℓ<sub>1</sub> penalty with an ℓ<sub>0</sub> penalty. In stark contrast to the conventional wisdom that ℓ<sub>0</sub> optimization problems are computationally intractable, we show that the resulting optimization problem can be efficiently solved for the global optimum using an extremely simple and efficient dynamic programming algorithm. Our R-language implementation of the proposed algorithm runs in a few minutes on fluorescence traces of 100,000 timesteps. Furthermore, our proposal leads to substantial improvements over the previous ℓ<sub>1</sub> proposal, in simulations as well as on two calcium imaging datasets. R-language software for our proposal is available on CRAN in the package LZeroSpikeInference. Instructions for running this software in python can be found at https://github.com/jewellsean/LZeroSpikeInference.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"12 4","pages":"2457-2482"},"PeriodicalIF":1.8,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6322847/pdf/nihms-997321.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"36849823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2018-12-01. Epub Date: 2018-11-13. DOI: 10.1214/18-AOAS1156
Heping Zhang, Dungang Liu, Jiwei Zhao, Xuan Bi
We propose a novel multivariate model for analyzing hybrid traits and identifying genetic factors for comorbid conditions. Comorbidity is a common phenomenon in mental health in which an individual suffers from multiple disorders simultaneously. For example, in the Study of Addiction: Genetics and Environment (SAGE), alcohol and nicotine addiction were recorded through multiple assessments that we refer to as hybrid traits. Statistical inference for studying the genetic basis of hybrid traits has not been well-developed. Recent rank-based methods have been utilized for conducting association analyses of hybrid traits but do not inform the strength or direction of effects. To overcome this limitation, a parametric modeling framework is imperative. Although such parametric frameworks have been proposed in theory, they are neither well-developed nor extensively used in practice due to their reliance on complicated likelihood functions that have high computational complexity. Many existing parametric frameworks tend to instead use pseudo-likelihoods to reduce computational burdens. Here, we develop a model fitting algorithm for the full likelihood. Our extensive simulation studies demonstrate that inference based on the full likelihood can control the type-I error rate, gain power, and improve effect size estimation when compared with several existing methods for hybrid models. These advantages remain even if the distribution of the latent variables is misspecified. After analyzing the SAGE data, we identify three genetic variants (rs7672861, rs958331, rs879330) that are significantly associated with the comorbidity of alcohol and nicotine addiction at the chromosome-wide level. Moreover, our approach has greater power in this analysis than several existing methods for hybrid traits. Although the analysis of the SAGE data motivated us to develop the model, it can be broadly applied to analyze any hybrid responses.
{"title":"Modeling Hybrid Traits for Comorbidity and Genetic Studies of Alcohol and Nicotine Co-Dependence.","authors":"Heping Zhang, Dungang Liu, Jiwei Zhao, Xuan Bi","doi":"10.1214/18-AOAS1156","DOIUrl":"10.1214/18-AOAS1156","url":null,"abstract":"<p><p>We propose a novel multivariate model for analyzing hybrid traits and identifying genetic factors for comorbid conditions. Comorbidity is a common phenomenon in mental health in which an individual suffers from multiple disorders simultaneously. For example, in the Study of Addiction: Genetics and Environment (SAGE), alcohol and nicotine addiction were recorded through multiple assessments that we refer to as hybrid traits. Statistical inference for studying the genetic basis of hybrid traits has not been well-developed. Recent rank-based methods have been utilized for conducting association analyses of hybrid traits but do not inform the strength or direction of effects. To overcome this limitation, a parametric modeling framework is imperative. Although such parametric frameworks have been proposed in theory, they are neither well-developed nor extensively used in practice due to their reliance on complicated likelihood functions that have high computational complexity. Many existing parametric frameworks tend to instead use pseudo-likelihoods to reduce computational burdens. Here, we develop a model fitting algorithm for the full likelihood. Our extensive simulation studies demonstrate that inference based on the full likelihood can control the type-I error rate, and gains power and improves the effect size estimation when compared with several existing methods for hybrid models. These advantages remain even if the distribution of the latent variables is misspecified. After analyzing the SAGE data, we identify three genetic variants (rs7672861, rs958331, rs879330) that are significantly associated with the comorbidity of alcohol and nicotine addiction at the chromosome-wide level. Moreover, our approach has greater power in this analysis than several existing methods for hybrid traits.Although the analysis of the SAGE data motivated us to develop the model, it can be broadly applied to analyze any hybrid responses.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"12 4","pages":"2359-2378"},"PeriodicalIF":1.8,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6338437/pdf/nihms-997314.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"36883672","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2018-12-01. Epub Date: 2018-11-13. DOI: 10.1214/18-AOAS1151
Maryclare Griffin, Krista J Gile, Karen I Fredricksen-Goldsen, Mark S Handcock, Elena A Erosheva
Respondent-driven sampling (RDS) is a method for sampling from a target population by leveraging social connections. RDS is invaluable to the study of hard-to-reach populations. However, RDS is costly and can be infeasible. RDS is infeasible when RDS point estimators have small effective sample sizes (large design effects) or when RDS interval estimators have large confidence intervals relative to estimates obtained in previous studies, or poor coverage. As a result, researchers need tools to assess whether or not estimation of certain characteristics of interest for specific populations is feasible in advance. In this paper, we develop a simulation-based framework for using pilot data (in the form of a convenience sample of aggregated, egocentric data and estimates of subpopulation sizes within the target population) to assess whether or not RDS is feasible for estimating characteristics of a target population. In doing so, we assume that more is known about egos than alters in the pilot data, which is often the case with aggregated, egocentric data in practice. We build on existing methods for estimating the structure of social networks from aggregated, egocentric sample data and estimates of subpopulation sizes within the target population. We apply this framework to assess the feasibility of estimating the proportion male, proportion bisexual, proportion depressed and proportion infected with HIV/AIDS within three spatially distinct target populations of older lesbian, gay and bisexual adults using pilot data from the Caring and Aging with Pride Study and the Gallup Daily Tracking Survey. We conclude that using an RDS sample of 300 subjects is infeasible for estimating the proportion male, but feasible for estimating the proportion bisexual, proportion depressed and proportion infected with HIV/AIDS in all three target populations.
{"title":"A SIMULATION-BASED FRAMEWORK FOR ASSESSING THE FEASIBILITY OF RESPONDENT-DRIVEN SAMPLING FOR ESTIMATING CHARACTERISTICS IN POPULATIONS OF LESBIAN, GAY AND BISEXUAL OLDER ADULTS.","authors":"Maryclare Griffin, Krista J Gile, Karen I Fredricksen-Goldsen, Mark S Handcock, Elena A Erosheva","doi":"10.1214/18-AOAS1151","DOIUrl":"10.1214/18-AOAS1151","url":null,"abstract":"<p><p>Respondent-driven sampling (RDS) is a method for sampling from a target population by leveraging social connections. RDS is invaluable to the study of hard-to-reach populations. However, RDS is costly and can be infeasible. RDS is infeasible when RDS point estimators have small effective sample sizes (large design effects) or when RDS interval estimators have large confidence intervals relative to estimates obtained in previous studies or poor coverage. As a result, researchers need tools to assess whether or not estimation of certain characteristics of interest for specific populations is feasible in advance. In this paper, we develop a simulation-based framework for using pilot data-in the form of a convenience sample of aggregated, egocentric data and estimates of subpopulation sizes within the target population-to assess whether or not RDS is feasible for estimating characteristics of a target population. in doing so, we assume that more is known about egos than alters in the pilot data, which is often the case with aggregated, egocentric data in practice. We build on existing methods for estimating the structure of social networks from aggregated, egocentric sample data and estimates of subpopulation sizes within the target population. We apply this framework to assess the feasibility of estimating the proportion male, proportion bisexual, proportion depressed and proportion infected with HIV/AIDS within three spatially distinct target populations of older lesbian, gay and bisexual adults using pilot data from the caring and Aging with Pride Study and the Gallup Daily Tracking Survey. We conclude that using an RDS sample of 300 subjects is infeasible for estimating the proportion male, but feasible for estimating the proportion bisexual, proportion depressed and proportion infected with HIV/AIDS in all three target populations.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"12 4","pages":"2252-2278"},"PeriodicalIF":1.8,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6800244/pdf/nihms-1052724.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41219381","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}