Pub Date: 2018-12-01 | Epub Date: 2018-11-13 | DOI: 10.1214/18-AOAS1162
Sean Jewell, Daniela Witten
In recent years, new technologies in neuroscience have made it possible to measure the activities of large numbers of neurons simultaneously in behaving animals. For each neuron a fluorescence trace is measured; this can be seen as a first-order approximation of the neuron's activity over time. Determining the exact times at which a neuron spikes on the basis of its fluorescence trace is an important open problem in computational neuroscience. Recently, a convex optimization problem involving an ℓ1 penalty was proposed for this task. In this paper we slightly modify that recent proposal by replacing the ℓ1 penalty with an ℓ0 penalty. In stark contrast to the conventional wisdom that ℓ0 optimization problems are computationally intractable, we show that the resulting optimization problem can be solved for the global optimum with a simple and efficient dynamic programming algorithm. Our R implementation of the proposed algorithm runs in a few minutes on fluorescence traces of 100,000 timesteps. Furthermore, our proposal leads to substantial improvements over the previous ℓ1 proposal, in simulations as well as on two calcium imaging datasets. R software for our proposal is available on CRAN in the package LZeroSpikeInference. Instructions for running this software in Python can be found at https://github.com/jewellsean/LZeroSpikeInference.
Exact Spike Train Inference via ℓ0 Optimization. Annals of Applied Statistics 12(4): 2457-2482. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6322847/pdf/nihms-997321.pdf
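The dynamic program behind this ℓ0 approach can be illustrated with a naive sketch: partition the trace into segments within which the calcium estimate decays geometrically, and charge a penalty lam for each new segment (spike). This is a simplified O(T²) illustration of the general idea under our own assumptions, not the far more efficient algorithm in the LZeroSpikeInference package; the function names and decay model here are ours.

```python
import numpy as np

def l0_spike_dp(y, gamma=0.95, lam=1.0):
    """Naive O(T^2) dynamic program for l0 spike deconvolution.

    Partitions the trace into segments; within a segment the fitted
    calcium decays as alpha * gamma**(t - start). Each new segment
    (i.e., each spike) costs lam.
    """
    T = len(y)
    F = np.full(T + 1, np.inf)   # F[t] = best cost of fitting y[:t]
    F[0] = -lam                  # so the first segment is not penalized
    prev = np.zeros(T + 1, dtype=int)

    def seg_cost(a, b):
        # best single-decay fit to y[a:b], alpha solved in closed form
        t = np.arange(b - a)
        g = gamma ** t
        alpha = np.dot(y[a:b], g) / np.dot(g, g)
        resid = y[a:b] - alpha * g
        return 0.5 * np.dot(resid, resid)

    for t in range(1, T + 1):
        for s in range(t):
            cost = F[s] + lam + seg_cost(s, t)
            if cost < F[t]:
                F[t], prev[t] = cost, s

    # backtrack: segment boundaries after time 0 are the inferred spikes
    spikes, t = [], T
    while t > 0:
        s = prev[t]
        if s > 0:
            spikes.append(int(s))
        t = s
    return sorted(spikes)
```

On a noiseless toy trace with decay 0.5 and a single spike at index 5, the sketch recovers exactly that changepoint.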
Pub Date: 2018-12-01 | Epub Date: 2018-11-13 | DOI: 10.1214/18-AOAS1156
Heping Zhang, Dungang Liu, Jiwei Zhao, Xuan Bi
We propose a novel multivariate model for analyzing hybrid traits and identifying genetic factors for comorbid conditions. Comorbidity is a common phenomenon in mental health in which an individual suffers from multiple disorders simultaneously. For example, in the Study of Addiction: Genetics and Environment (SAGE), alcohol and nicotine addiction were recorded through multiple assessments that we refer to as hybrid traits. Statistical inference for studying the genetic basis of hybrid traits has not been well developed. Recent rank-based methods have been utilized for conducting association analyses of hybrid traits but do not inform the strength or direction of effects. To overcome this limitation, a parametric modeling framework is imperative. Although such parametric frameworks have been proposed in theory, they are neither well developed nor extensively used in practice, because they rely on complicated likelihood functions with high computational complexity. Many existing parametric frameworks instead use pseudo-likelihoods to reduce the computational burden. Here, we develop a model-fitting algorithm for the full likelihood. Our extensive simulation studies demonstrate that inference based on the full likelihood controls the type-I error rate, gains power, and improves effect-size estimation compared with several existing methods for hybrid models. These advantages remain even if the distribution of the latent variables is misspecified. After analyzing the SAGE data, we identify three genetic variants (rs7672861, rs958331, rs879330) that are significantly associated with the comorbidity of alcohol and nicotine addiction at the chromosome-wide level. Moreover, our approach has greater power in this analysis than several existing methods for hybrid traits. Although the analysis of the SAGE data motivated us to develop the model, it can be applied broadly to analyze any hybrid responses.
Modeling Hybrid Traits for Comorbidity and Genetic Studies of Alcohol and Nicotine Co-Dependence. Annals of Applied Statistics 12(4): 2359-2378. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6338437/pdf/nihms-997314.pdf
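A toy calculation shows why full-likelihood inference for latent-variable trait models is computationally heavier than pseudo-likelihood alternatives: even a single pair of correlated binary traits requires a bivariate normal orthant probability. The sketch below uses a simplified multivariate-probit assumption of our own (zero thresholds, one latent correlation); it is not the SAGE model from the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def probit_pair_loglik(y1, y2, rho, t1=0.0, t2=0.0):
    """Log-likelihood of one observation of two binary traits under a
    latent bivariate normal with correlation rho and thresholds t1, t2.
    Each cell probability is an orthant probability of the latent pair."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    p00 = multivariate_normal(mean=[0.0, 0.0], cov=cov).cdf([t1, t2])
    p01 = norm.cdf(t1) - p00      # trait 1 absent, trait 2 present
    p10 = norm.cdf(t2) - p00      # trait 1 present, trait 2 absent
    p11 = 1.0 - p00 - p01 - p10   # both traits present
    probs = {(0, 0): p00, (0, 1): p01, (1, 0): p10, (1, 1): p11}
    return np.log(probs[(y1, y2)])
```

With rho = 0 and zero thresholds the four cells are each 1/4, so the log-likelihood of any observed pair is log(0.25); with nonzero rho the concordant cells gain mass, which is exactly the dependence a pseudo-likelihood approximates away.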
Pub Date: 2018-12-01 | Epub Date: 2018-11-13 | DOI: 10.1214/18-AOAS1151
Maryclare Griffin, Krista J Gile, Karen I Fredricksen-Goldsen, Mark S Handcock, Elena A Erosheva
Respondent-driven sampling (RDS) is a method for sampling from a target population by leveraging social connections, and it is invaluable for studying hard-to-reach populations. However, RDS is costly and can be infeasible: RDS is infeasible when RDS point estimators have small effective sample sizes (large design effects), or when RDS interval estimators have poor coverage or are wide relative to estimates obtained in previous studies. As a result, researchers need tools to assess in advance whether estimation of certain characteristics of interest in a specific population is feasible. In this paper, we develop a simulation-based framework that uses pilot data, in the form of a convenience sample of aggregated, egocentric data together with estimates of subpopulation sizes within the target population, to assess whether RDS is feasible for estimating characteristics of a target population. In doing so, we assume that more is known about egos than alters in the pilot data, which is often the case with aggregated, egocentric data in practice. We build on existing methods for estimating the structure of social networks from aggregated, egocentric sample data and estimates of subpopulation sizes within the target population. We apply this framework to assess the feasibility of estimating the proportion male, proportion bisexual, proportion depressed, and proportion infected with HIV/AIDS within three spatially distinct target populations of older lesbian, gay, and bisexual adults, using pilot data from the Caring and Aging with Pride Study and the Gallup Daily Tracking Survey. We conclude that an RDS sample of 300 subjects is infeasible for estimating the proportion male, but feasible for estimating the proportion bisexual, proportion depressed, and proportion infected with HIV/AIDS in all three target populations.
A Simulation-Based Framework for Assessing the Feasibility of Respondent-Driven Sampling for Estimating Characteristics in Populations of Lesbian, Gay and Bisexual Older Adults. Annals of Applied Statistics 12(4): 2252-2278. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6800244/pdf/nihms-1052724.pdf
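The feasibility criteria above (large design effect, small effective sample size) reduce to a short computation once simulated RDS estimates are in hand. A minimal sketch with hypothetical numbers; the function name and the simple-random-sampling baseline p(1-p)/n are our simplifying assumptions, not the paper's full framework.

```python
import numpy as np

def design_effect(estimates, p_true, n):
    """Design effect and effective sample size of a proportion estimator,
    relative to simple random sampling (SRS) of size n.

    `estimates` are replicate proportion estimates from simulated RDS runs.
    """
    var_rds = np.var(estimates, ddof=1)   # variance across simulated runs
    var_srs = p_true * (1 - p_true) / n   # SRS benchmark variance
    de = var_rds / var_srs
    return de, n / de                     # design effect, effective n
```

A design effect of 8 means an RDS sample of 300 carries roughly the information of a simple random sample of 37, which is why some characteristics are infeasible to estimate at that sample size.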
Pub Date: 2018-12-01 | Epub Date: 2018-11-13 | DOI: 10.1214/18-AOAS1159
Ashley Petersen, Noah Simon, Daniela Witten
In the past few years, new technologies in neuroscience have made it possible to simultaneously image activity in large populations of neurons at cellular resolution in behaving animals. In mid-2016, a huge repository of this so-called "calcium imaging" data was made publicly available. The availability of this large-scale data resource opens the door to a host of scientific questions, for which new statistical methods must be developed. In this paper we consider the first step in the analysis of calcium imaging data: identifying the neurons in a calcium imaging video. We propose a dictionary learning approach for this task. First, we perform image segmentation to develop a dictionary containing a huge number of candidate neurons. Next, we refine the dictionary using clustering. Finally, we apply the dictionary to select neurons and estimate their corresponding activity over time, using a sparse group lasso optimization problem. We assess performance on simulated calcium imaging data and apply our proposal to three calcium imaging datasets. Our proposed approach is implemented in the R package scalpel, which is available on CRAN.
SCALPEL: Extracting Neurons from Calcium Imaging Data. Annals of Applied Statistics 12(4): 2430-2456.
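The sparse group lasso used in the final step combines an elementwise ℓ1 penalty with a group ℓ2 penalty, and its proximal operator has a simple closed form: soft-threshold each entry, then shrink the group as a whole. A generic sketch of that operator, not the scalpel package's internals:

```python
import numpy as np

def sparse_group_lasso_prox(v, lam1, lam2):
    """Proximal operator of lam1*||x||_1 + lam2*||x||_2 at v:
    elementwise soft-thresholding followed by group shrinkage."""
    u = np.sign(v) * np.maximum(np.abs(v) - lam1, 0.0)  # l1 step
    norm = np.linalg.norm(u)
    if norm == 0.0:
        return np.zeros_like(u)
    return np.maximum(0.0, 1.0 - lam2 / norm) * u        # group l2 step
```

Entries below lam1 are zeroed individually, and a group whose surviving norm falls below lam2 is zeroed entirely, which is what lets the method discard whole candidate neurons while keeping sparse activity for the rest.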
Pub Date: 2018-12-01 | Epub Date: 2018-11-13 | DOI: 10.1214/18-AOAS1144
Jonathon J O'Brien, Harsha P Gunawardena, Joao A Paulo, Xian Chen, Joseph G Ibrahim, Steven P Gygi, Bahjat F Qaqish
An idealized version of a label-free discovery mass spectrometry proteomics experiment would provide absolute abundance measurements for a whole proteome, across varying conditions. Unfortunately, this ideal is not realized. Measurements are made on peptides, requiring an inferential step to obtain protein-level estimates. The inference is complicated by experimental factors that necessitate relative abundance estimation and result in widespread non-ignorable missing data. Relative abundance on the log scale takes the form of parameter contrasts. In a complete-case analysis, contrast estimates may be biased by missing data, and a substantial amount of useful information will often go unused. To avoid problems with missing data, many analysts have turned to single-imputation solutions. Unfortunately, these methods often create further difficulties by hiding inestimable contrasts, preventing the recovery of interblock information, and failing to account for imputation uncertainty. To mitigate many of the problems caused by missing values, we propose the use of a Bayesian selection model. Our model is tested on simulated data, real data with simulated missing values, and a ground-truth dilution experiment in which all of the true relative changes are known. The analysis suggests that our model, compared with various imputation strategies and complete-case analyses, can increase accuracy and provide substantial improvements to interval coverage.
The Effects of Nonignorable Missing Data on Label-Free Mass Spectrometry Proteomics Experiments. Annals of Applied Statistics 12(4): 2075-2095.
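The danger of treating this missingness as ignorable shows up in a small simulation: if the chance a peptide is observed rises with its abundance, the complete cases overestimate the mean. The numbers below are entirely synthetic and chosen only for intuition; they do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
true_log_abundance = rng.normal(20.0, 2.0, size=100_000)

# Non-ignorable mechanism: detection probability increases with abundance
p_detect = 1.0 / (1.0 + np.exp(-(true_log_abundance - 20.0)))
observed = true_log_abundance[rng.random(100_000) < p_detect]

# Complete-case mean exceeds the true mean: low-abundance values are missing
bias = observed.mean() - true_log_abundance.mean()
```

Because missingness depends on the (unobserved) value itself, no amount of data makes the complete-case estimate converge to the truth, which is the motivation for modeling the selection mechanism explicitly.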
Pub Date: 2018-09-01 | Epub Date: 2018-09-11 | DOI: 10.1214/16-aoas915
Alexander M Franks, Florian Markowetz, Edoardo M Airoldi
Improving current models and hypotheses of cellular pathways is one of the major challenges of systems biology and functional genomics. There is a need for methods that build on established expert knowledge and reconcile it with the results of new high-throughput studies. Moreover, the available sources of data are heterogeneous, and the data need to be integrated in different ways depending on which part of the pathway they are most informative for. In this paper, we introduce a compartment-specific strategy to integrate edge, node, and path data for refining a given network hypothesis. To carry out inference, we use a local-move Gibbs sampler for updating the pathway hypothesis from a compendium of heterogeneous data sources, and a new network regression idea for integrating protein attributes. We demonstrate the utility of this approach in a case study of the pheromone response MAPK pathway in the yeast S. cerevisiae.
Refining Cellular Pathway Models Using an Ensemble of Heterogeneous Data Sources. Annals of Applied Statistics 12(3): 1361-1384. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9733905/pdf/nihms-1823482.pdf
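A "local move" in this setting typically means changing one element of the network hypothesis at a time, for example toggling a single edge, and accepting or rejecting the move according to how the data score changes. The sketch below is a generic Metropolis-style sampler with a user-supplied log-score, purely to illustrate the local-move idea; the paper's sampler is a Gibbs sampler over a richer, compartment-aware pathway representation.

```python
import numpy as np

def local_move_sampler(log_score, n_nodes, n_iter, seed=0):
    """Metropolis sampler over undirected graphs using single-edge
    toggles as local moves. `log_score(A)` scores an adjacency matrix."""
    rng = np.random.default_rng(seed)
    A = np.zeros((n_nodes, n_nodes), dtype=int)
    current = log_score(A)
    for _ in range(n_iter):
        i, j = rng.choice(n_nodes, size=2, replace=False)
        A[i, j] ^= 1
        A[j, i] ^= 1                      # propose toggling edge (i, j)
        proposed = log_score(A)
        if np.log(rng.random()) < proposed - current:
            current = proposed            # accept the move
        else:
            A[i, j] ^= 1
            A[j, i] ^= 1                  # reject: undo the toggle
    return A
```

With a log-score that strongly rewards edges, the chain quickly fills in the graph; in the paper's application the score instead measures agreement between the pathway hypothesis and the heterogeneous data compendium.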
Pub Date: 2018-06-01 | Epub Date: 2018-07-28 | DOI: 10.1214/17-AOAS1083
Kelly Bodwin, Kai Zhang, Andrew Nobel
Given data obtained under two sampling conditions, it is often of interest to identify variables that behave differently in one condition than in the other. We introduce a method for differential analysis of second-order behavior called Differential Correlation Mining (DCM). The DCM method identifies differentially correlated sets of variables, with the property that the average pairwise correlation between variables in a set is higher under one sample condition than the other. DCM is based on an iterative search procedure that adaptively updates the size and elements of a candidate variable set. Updates are performed via hypothesis testing of individual variables, based on the asymptotic distribution of their average differential correlation. We investigate the performance of DCM by applying it to simulated data as well as to recent experimental datasets in genomics and brain imaging.
A Testing-Based Approach to the Discovery of Differentially Correlated Variable Sets. Annals of Applied Statistics 12(2): 1180-1203.
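The statistic driving the DCM updates, the average pairwise correlation of a variable set compared across two conditions, is easy to state directly. A minimal sketch of that statistic alone; the paper's iterative search and hypothesis-testing machinery are omitted, and the function name is ours.

```python
import numpy as np

def avg_differential_correlation(X1, X2, idx):
    """Difference in average pairwise correlation of the variables `idx`
    between condition 1 (rows of X1) and condition 2 (rows of X2)."""
    def avg_corr(X):
        R = np.corrcoef(X[:, idx], rowvar=False)
        # mean of the upper-triangular (off-diagonal) correlations
        return R[np.triu_indices(len(idx), k=1)].mean()
    return avg_corr(X1) - avg_corr(X2)
```

On toy data where two variables are perfectly correlated in condition 1 and uncorrelated in condition 2, the statistic equals 1; DCM grows or shrinks the candidate set by testing each variable's contribution to this quantity.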
Pub Date: 2018-06-01 | Epub Date: 2018-07-28 | DOI: 10.1214/18-AOAS1190
Giuseppe Vinci, Valérie Ventura, Matthew A Smith, Robert E Kass
A major challenge in contemporary neuroscience is to analyze data from large numbers of neurons recorded simultaneously across many experimental replications (trials), where the data are counts of neural firing events. One of the basic problems is to characterize the dependence structure among such multivariate counts. Methods for estimating high-dimensional covariation based on ℓ1-regularization are most appropriate when there are a small number of relatively large partial correlations, but in neural data there are often large numbers of relatively small partial correlations. Furthermore, the variation across trials is often confounded by Poisson-like variation within trials. To overcome these problems, we introduce a comprehensive methodology that embeds a Gaussian graphical model in a hierarchical structure: the counts are assumed Poisson, conditionally on latent variables that follow a Gaussian graphical model, and the graphical model parameters, in turn, are assumed to depend on physiologically motivated covariates, which can greatly improve correct detection of interactions (nonzero partial correlations). We develop a Bayesian approach to fitting this covariate-adjusted generalized graphical model and demonstrate its success in simulation studies. We then apply it to data from an experiment on visual attention, in which we assess functional interactions between neurons recorded in two brain areas.
Adjusted Regularization in Latent Graphical Models: Application to Multiple-Neuron Spike Count Data. Annals of Applied Statistics 12(2): 1068-1095. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6879176/pdf/nihms-1014977.pdf
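The hierarchy described above, Poisson counts driven by latent Gaussian variables, also explains why raw correlations of spike counts understate latent dependence: the Poisson noise dilutes the correlation. A small simulation under our own parameter choices (latent correlation 0.8, unit-variance latents), not the paper's fitting procedure:

```python
import numpy as np

rng = np.random.default_rng(1)
n, rho = 5000, 0.8
cov = np.array([[1.0, rho], [rho, 1.0]])

Z = rng.multivariate_normal([0.0, 0.0], cov, size=n)  # latent Gaussian pair
Y = rng.poisson(np.exp(Z))                            # observed spike counts

# Pearson correlation of the counts is attenuated relative to rho = 0.8
observed_rho = np.corrcoef(Y[:, 0], Y[:, 1])[0, 1]
```

Modeling the latent layer explicitly, as the paper does, recovers the dependence structure that the raw count correlations underestimate.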
Pub Date: 2018-06-01 | Epub Date: 2018-07-28 | DOI: 10.1214/18-aoas1175
Jonathan J Azose, Adrian E Raftery
The United Nations is the major organization producing and regularly updating probabilistic population projections for all countries. International migration is a critical component of such projections, and between-country correlations are important for forecasts of regional aggregates. However, in the data we consider there are 200 countries and only 12 data points, each one corresponding to a five-year time period. Thus a 200 × 200 correlation matrix must be estimated on the basis of 12 data points. Using Pearson correlations produces many spurious correlations. We propose a maximum a posteriori estimator for the correlation matrix with an interpretable informative prior distribution. The prior serves to regularize the correlation matrix, shrinking a priori untrustworthy elements towards zero. Our estimated correlation structure improves projections of net migration for regional aggregates, producing narrower projections of migration for Africa as a whole and wider projections for Europe. A simulation study confirms that our estimator outperforms both the Pearson correlation matrix and a simple shrinkage estimator when estimating a sparse correlation matrix.
{"title":"Estimating Large Correlation Matrices for International Migration.","authors":"Jonathan J Azose, Adrian E Raftery","doi":"10.1214/18-aoas1175","DOIUrl":"10.1214/18-aoas1175","url":null,"abstract":"<p><p>The United Nations is the major organization producing and regularly updating probabilistic population projections for all countries. International migration is a critical component of such projections, and between-country correlations are important for forecasts of regional aggregates. However, in the data we consider there are 200 countries and only 12 data points, each one corresponding to a five-year time period. Thus a 200 × 200 correlation matrix must be estimated on the basis of 12 data points. Using Pearson correlations produces many spurious correlations. We propose a maximum <i>a posteriori</i> estimator for the correlation matrix with an interpretable informative prior distribution. The prior serves to regularize the correlation matrix, shrinking <i>a priori</i> untrustworthy elements towards zero. Our estimated correlation structure improves projections of net migration for regional aggregates, producing narrower projections of migration for Africa as a whole and wider projections for Europe. 
A simulation study confirms that our estimator outperforms both the Pearson correlation matrix and a simple shrinkage estimator when estimating a sparse correlation matrix.</p>","PeriodicalId":50772,"journal":{"name":"Annals of Applied Statistics","volume":"12 2","pages":"940-970"},"PeriodicalIF":1.3,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7164801/pdf/nihms-1029425.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"37851577","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
KERNEL-PENALIZED REGRESSION FOR ANALYSIS OF MICROBIOME DATA
Timothy W Randolph, Sen Zhao, Wade Copeland, Meredith Hullar, Ali Shojaie
Pub Date: 2018-03-01; Epub Date: 2018-03-09; DOI: 10.1214/17-AOAS1102
Annals of Applied Statistics, 12(1), 540–566.

The analysis of human microbiome data is often based on dimension-reduced graphical displays and clusterings derived from vectors of microbial abundances in each sample. Common to these ordination methods is the use of biologically motivated definitions of similarity. Principal coordinate analysis, in particular, is often performed using ecologically defined distances, allowing analyses to incorporate context-dependent, non-Euclidean structure. In this paper, we go beyond dimension-reduced ordination methods and describe a framework of high-dimensional regression models that extends these distance-based methods. In particular, we use kernel-based methods to show how to incorporate a variety of extrinsic information, such as phylogeny, into penalized regression models that estimate taxon-specific associations with a phenotype or clinical outcome. Further, we show how this regression framework can be used to address the compositional nature of multivariate predictors composed of relative abundances, that is, vectors whose entries sum to a constant. We illustrate this approach with several simulations using data from two recent studies on gut and vaginal microbiomes. We conclude with an application to our own data, where we also incorporate a significance test for the estimated coefficients that represent associations between microbial abundance and percent fat.
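The idea of folding extrinsic structure into a penalized regression can be sketched with a graph-Laplacian (generalized ridge) penalty that encourages related taxa to receive similar coefficients. This is a simplified stand-in for the paper's kernel-penalized framework, not the authors' implementation; the chain-graph "phylogeny", the penalty weights, and the simulated data are ours:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 10

# Toy "phylogeny": a chain graph linking adjacent taxa, so neighbors on the
# chain are encouraged to share similar coefficients.
A = np.zeros((p, p))
for j in range(p - 1):
    A[j, j + 1] = A[j + 1, j] = 1.0
L = np.diag(A.sum(axis=1)) - A  # graph Laplacian

X = rng.standard_normal((n, p))
beta_true = np.linspace(1.0, 2.0, p)  # smooth over the chain
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# Generalized ridge: minimize ||y - X b||^2 + lam * b' (L + eps*I) b,
# which has the closed form b = (X'X + lam*(L + eps*I))^{-1} X'y.
lam, eps = 5.0, 1e-2
Q = L + eps * np.eye(p)
beta_hat = np.linalg.solve(X.T @ X + lam * Q, X.T @ y)
print(np.round(beta_hat, 2))
```

Replacing the chain Laplacian with a kernel derived from a real phylogeny, or from an ecologically defined distance, recovers the flavor of the paper's approach.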