
Annals of Applied Statistics: Latest Articles

SCALPEL: EXTRACTING NEURONS FROM CALCIUM IMAGING DATA.
IF 1.8; Zone 4 (Mathematics); Q2 STATISTICS & PROBABILITY; Pub Date: 2018-12-01; Epub Date: 2018-11-13; DOI: 10.1214/18-AOAS1159
Ashley Petersen, Noah Simon, Daniela Witten
In the past few years, new technologies in the field of neuroscience have made it possible to simultaneously image activity in large populations of neurons at cellular resolution in behaving animals. In mid-2016, a huge repository of this so-called "calcium imaging" data was made publicly available. The availability of this large-scale data resource opens the door to a host of scientific questions for which new statistical methods must be developed. In this paper we consider the first step in the analysis of calcium imaging data, namely, identifying the neurons in a calcium imaging video. We propose a dictionary learning approach for this task. First, we perform image segmentation to develop a dictionary containing a huge number of candidate neurons. Next, we refine the dictionary using clustering. Finally, we apply the dictionary to select neurons and estimate their corresponding activity over time, using a sparse group lasso optimization problem. We assess performance on simulated calcium imaging data and apply our proposal to three calcium imaging data sets. Our proposed approach is implemented in the R package scalpel, which is available on CRAN.
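The final step of the pipeline described above relies on a sparse group lasso penalty. As a rough illustration of how such a penalty selects whole groups while also zeroing individual entries, the following sketch implements the standard proximal operator for one group of coefficients. It is not taken from the scalpel package; the function names and the interpretation of a "group" as one candidate neuron's activity over time are assumptions made for illustration.

```python
import math


def soft_threshold(values, lam):
    """Shrink each entry toward zero by lam (the lasso part of the penalty)."""
    return [math.copysign(max(abs(x) - lam, 0.0), x) for x in values]


def prox_sparse_group_lasso(group, lam_l1, lam_group):
    """Proximal operator of lam_l1*||b||_1 + lam_group*||b||_2 for one group.

    Hypothetical helper: first soft-threshold entrywise, then shrink the
    whole group; the group is set to zero if its norm is at most lam_group.
    """
    shrunk = soft_threshold(group, lam_l1)
    norm = math.sqrt(sum(x * x for x in shrunk))
    if norm <= lam_group:
        return [0.0] * len(group)  # the entire group is dropped
    scale = 1.0 - lam_group / norm
    return [scale * x for x in shrunk]


# A group with one strong entry survives; a weak group is removed entirely.
kept = prox_sparse_group_lasso([3.0, -1.0, 0.5], lam_l1=1.0, lam_group=1.0)
dropped = prox_sparse_group_lasso([0.5, -0.5], lam_l1=0.2, lam_group=1.0)
```

Dropping a whole group at once is what lets this kind of penalty discard an entire dictionary element rather than only individual time points.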
Annals of Applied Statistics, vol. 12, no. 4, pp. 2430-2456.
Citations: 34
The effects of nonignorable missing data on label-free mass spectrometry proteomics experiments.
IF 1.8; Zone 4 (Mathematics); Q2 STATISTICS & PROBABILITY; Pub Date: 2018-12-01; Epub Date: 2018-11-13; DOI: 10.1214/18-AOAS1144
Jonathon J O'Brien, Harsha P Gunawardena, Joao A Paulo, Xian Chen, Joseph G Ibrahim, Steven P Gygi, Bahjat F Qaqish

An idealized version of a label-free discovery mass spectrometry proteomics experiment would provide absolute abundance measurements for a whole proteome, across varying conditions. Unfortunately, this ideal is not realized. Measurements are made on peptides, requiring an inferential step to obtain protein-level estimates. The inference is complicated by experimental factors that necessitate relative abundance estimation and result in widespread non-ignorable missing data. Relative abundance on the log scale takes the form of parameter contrasts. In a complete-case analysis, contrast estimates may be biased by missing data and a substantial amount of useful information will often go unused. To avoid problems with missing data, many analysts have turned to single imputation solutions. Unfortunately, these methods often create further difficulties by hiding inestimable contrasts, preventing the recovery of interblock information and failing to account for imputation uncertainty. To mitigate many of the problems caused by missing values, we propose the use of a Bayesian selection model. Our model is tested on simulated data, real data with simulated missing values, and on a ground truth dilution experiment where all of the true relative changes are known. The analysis suggests that our model, compared with various imputation strategies and complete-case analyses, can increase accuracy and provide substantial improvements to interval coverage.
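To make the complete-case bias mentioned above concrete, here is a small simulation sketch. It is not the authors' Bayesian selection model; the logistic missingness mechanism, the `slope` parameter, and the function name are assumptions chosen only to show how non-ignorable missingness can shift a naive average.

```python
import math
import random


def simulate_complete_case_bias(n=20000, slope=2.0, seed=0):
    """Compare the full-data mean with the mean of the observed values only.

    Assumed mechanism: a log-intensity x is observed with probability
    1 / (1 + exp(-slope * x)), so low values go missing more often.
    """
    rng = random.Random(seed)
    values = [rng.gauss(0.0, 1.0) for _ in range(n)]
    observed = [
        x for x in values
        if rng.random() < 1.0 / (1.0 + math.exp(-slope * x))
    ]
    full_mean = sum(values) / n
    complete_case_mean = sum(observed) / len(observed)
    return full_mean, complete_case_mean


full_mean, complete_case_mean = simulate_complete_case_bias()
print(f"full mean: {full_mean:.3f}, complete-case mean: {complete_case_mean:.3f}")
```

Because the chance of observing a value depends on the value itself, the complete-case mean sits well above the true mean of zero in this setup.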

Annals of Applied Statistics, vol. 12, no. 4, pp. 2075-2095.
Citations: 31
REFINING CELLULAR PATHWAY MODELS USING AN ENSEMBLE OF HETEROGENEOUS DATA SOURCES.
IF 1.8; Zone 4 (Mathematics); Q2 STATISTICS & PROBABILITY; Pub Date: 2018-09-01; Epub Date: 2018-09-11; DOI: 10.1214/16-aoas915
Alexander M Franks, Florian Markowetz, Edoardo M Airoldi

Improving current models and hypotheses of cellular pathways is one of the major challenges of systems biology and functional genomics. There is a need for methods to build on established expert knowledge and reconcile it with results of new high-throughput studies. Moreover, the available sources of data are heterogeneous, and the data need to be integrated in different ways depending on which part of the pathway they are most informative for. In this paper, we introduce a compartment-specific strategy to integrate edge, node and path data for refining a given network hypothesis. To carry out inference, we use a local-move Gibbs sampler for updating the pathway hypothesis from a compendium of heterogeneous data sources, and a new network regression idea for integrating protein attributes. We demonstrate the utility of this approach in a case study of the pheromone response MAPK pathway in the yeast S. cerevisiae.

Annals of Applied Statistics, vol. 12, no. 3, pp. 1361-1384. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9733905/pdf/nihms-1823482.pdf
Citations: 0
A TESTING BASED APPROACH TO THE DISCOVERY OF DIFFERENTIALLY CORRELATED VARIABLE SETS.
IF 1.8; Zone 4 (Mathematics); Q2 STATISTICS & PROBABILITY; Pub Date: 2018-06-01; Epub Date: 2018-07-28; DOI: 10.1214/17-AOAS1083
Kelly Bodwin, Kai Zhang, Andrew Nobel

Given data obtained under two sampling conditions, it is often of interest to identify variables that behave differently in one condition than in the other. We introduce a method for differential analysis of second-order behavior called Differential Correlation Mining (DCM). The DCM method identifies differentially correlated sets of variables, with the property that the average pairwise correlation between variables in a set is higher under one sample condition than the other. DCM is based on an iterative search procedure that adaptively updates the size and elements of a candidate variable set. Updates are performed via hypothesis testing of individual variables, based on the asymptotic distribution of their average differential correlation. We investigate the performance of DCM by applying it to simulated data as well as to recent experimental datasets in genomics and brain imaging.
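The quantity at the heart of the description above, the difference in average pairwise correlation of a variable set between two conditions, can be sketched in a few lines. This is only the descriptive statistic, not the DCM search or its hypothesis tests; the function names and the toy data are illustrative assumptions.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mean_pairwise_correlation(columns):
    """Average correlation over all pairs of variables in the set."""
    pairs = list(combinations(columns, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


def differential_correlation(condition_1, condition_2):
    """Average pairwise correlation under condition 1 minus condition 2."""
    return (mean_pairwise_correlation(condition_1)
            - mean_pairwise_correlation(condition_2))


# Toy data: three variables measured on four samples in each condition.
cond_1 = [[1, 2, 3, 4], [2, 4, 6, 8], [1, 3, 5, 7]]   # all move together
cond_2 = [[1, 2, 3, 4], [4, 3, 2, 1], [1, 2, 3, 4]]   # one variable reversed
score = differential_correlation(cond_1, cond_2)
```

A large positive score marks a set whose variables are more tightly correlated under the first condition than under the second.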

Annals of Applied Statistics, vol. 12, no. 2, pp. 1180-1203.
Citations: 9
ADJUSTED REGULARIZATION IN LATENT GRAPHICAL MODELS: APPLICATION TO MULTIPLE-NEURON SPIKE COUNT DATA.
IF 1.3; Zone 4 (Mathematics); Q2 STATISTICS & PROBABILITY; Pub Date: 2018-06-01; Epub Date: 2018-07-28; DOI: 10.1214/18-AOAS1190
Giuseppe Vinci, Valérie Ventura, Matthew A Smith, Robert E Kass

A major challenge in contemporary neuroscience is to analyze data from large numbers of neurons recorded simultaneously across many experimental replications (trials), where the data are counts of neural firing events, and one of the basic problems is to characterize the dependence structure among such multivariate counts. Methods of estimating high-dimensional covariation based on ℓ₁-regularization are most appropriate when there are a small number of relatively large partial correlations, but in neural data there are often large numbers of relatively small partial correlations. Furthermore, the variation across trials is often confounded by Poisson-like variation within trials. To overcome these problems we introduce a comprehensive methodology that imbeds a Gaussian graphical model into a hierarchical structure: the counts are assumed Poisson, conditionally on latent variables that follow a Gaussian graphical model, and the graphical model parameters, in turn, are assumed to depend on physiologically-motivated covariates, which can greatly improve correct detection of interactions (non-zero partial correlations). We develop a Bayesian approach to fitting this covariate-adjusted generalized graphical model and we demonstrate its success in simulation studies. We then apply it to data from an experiment on visual attention, where we assess functional interactions between neurons recorded from two brain areas.
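The hierarchical structure described above (Poisson counts conditional on latent Gaussian variables) can be illustrated with a two-neuron simulation. The sketch below is not the authors' model or fitting procedure; the log link, the parameter values, and all function names are assumptions. It shows only why Poisson variation within trials masks the dependence between the latent variables.

```python
import math
import random


def poisson_sample(rate, rng):
    """Draw one Poisson variate by Knuth's multiplication method."""
    limit, k, p = math.exp(-rate), 0, 1.0
    while True:
        k += 1
        p *= rng.random()
        if p <= limit:
            return k - 1


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate(n_trials=5000, rho=0.8, sigma=0.5, mu=1.0, seed=0):
    """Return (latent correlation, spike-count correlation) for two neurons."""
    rng = random.Random(seed)
    z1, z2, y1, y2 = [], [], [], []
    for _ in range(n_trials):
        g1, g2 = rng.gauss(0, 1), rng.gauss(0, 1)
        a = sigma * g1
        b = sigma * (rho * g1 + math.sqrt(1 - rho ** 2) * g2)
        z1.append(a)
        z2.append(b)
        # Counts are Poisson given the latent log-rates mu + z.
        y1.append(poisson_sample(math.exp(mu + a), rng))
        y2.append(poisson_sample(math.exp(mu + b), rng))
    return pearson(z1, z2), pearson(y1, y2)


latent_corr, count_corr = simulate()
print(f"latent: {latent_corr:.2f}, counts: {count_corr:.2f}")
```

In this setup the correlation between the counts comes out far weaker than the correlation between the latent variables, which is the confounding the abstract refers to.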

Annals of Applied Statistics, vol. 12, no. 2, pp. 1068-1095. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6879176/pdf/nihms-1014977.pdf
Citations: 0
Estimating Large Correlation Matrices for International Migration.
IF 1.3; Zone 4 (Mathematics); Q2 STATISTICS & PROBABILITY; Pub Date: 2018-06-01; Epub Date: 2018-07-28; DOI: 10.1214/18-aoas1175
Jonathan J Azose, Adrian E Raftery

The United Nations is the major organization producing and regularly updating probabilistic population projections for all countries. International migration is a critical component of such projections, and between-country correlations are important for forecasts of regional aggregates. However, in the data we consider there are 200 countries and only 12 data points, each one corresponding to a five-year time period. Thus a 200 × 200 correlation matrix must be estimated on the basis of 12 data points. Using Pearson correlations produces many spurious correlations. We propose a maximum a posteriori estimator for the correlation matrix with an interpretable informative prior distribution. The prior serves to regularize the correlation matrix, shrinking a priori untrustworthy elements towards zero. Our estimated correlation structure improves projections of net migration for regional aggregates, producing narrower projections of migration for Africa as a whole and wider projections for Europe. A simulation study confirms that our estimator outperforms both the Pearson correlation matrix and a simple shrinkage estimator when estimating a sparse correlation matrix.
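The claim that Pearson correlations from only 12 data points produce many spurious correlations is easy to check by simulation. The sketch below generates mutually independent series, so every true correlation is zero, and counts how many sample correlations still look large. The number of series (50 rather than 200, to keep the run short), the 0.5 threshold, and the function names are illustrative assumptions, and no part of the authors' estimator is reproduced.

```python
import math
import random
from itertools import combinations


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def count_spurious(n_series=50, n_points=12, threshold=0.5, seed=1):
    """Count pairs of independent series whose |correlation| exceeds threshold."""
    rng = random.Random(seed)
    series = [[rng.gauss(0, 1) for _ in range(n_points)]
              for _ in range(n_series)]
    pairs = list(combinations(series, 2))
    large = sum(1 for a, b in pairs if abs(pearson(a, b)) > threshold)
    return large, len(pairs)


large, total = count_spurious()
print(f"{large} of {total} independent pairs have |r| > 0.5")
```

With so few time points, a noticeable fraction of pairs clears the threshold by chance alone, which is why some form of regularization toward zero is attractive here.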

Annals of Applied Statistics, vol. 12, no. 2, pp. 940-970. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7164801/pdf/nihms-1029425.pdf
Citations: 0
KERNEL-PENALIZED REGRESSION FOR ANALYSIS OF MICROBIOME DATA.
IF 1.8; Zone 4 (Mathematics); Q2 STATISTICS & PROBABILITY; Pub Date: 2018-03-01; Epub Date: 2018-03-09; DOI: 10.1214/17-AOAS1102
Timothy W Randolph, Sen Zhao, Wade Copeland, Meredith Hullar, Ali Shojaie

The analysis of human microbiome data is often based on dimension-reduced graphical displays and clusterings derived from vectors of microbial abundances in each sample. Common to these ordination methods is the use of biologically motivated definitions of similarity. Principal coordinate analysis, in particular, is often performed using ecologically defined distances, allowing analyses to incorporate context-dependent, non-Euclidean structure. In this paper, we go beyond dimension-reduced ordination methods and describe a framework of high-dimensional regression models that extends these distance-based methods. In particular, we use kernel-based methods to show how to incorporate a variety of extrinsic information, such as phylogeny, into penalized regression models that estimate taxon-specific associations with a phenotype or clinical outcome. Further, we show how this regression framework can be used to address the compositional nature of multivariate predictors comprised of relative abundances; that is, vectors whose entries sum to a constant. We illustrate this approach with several simulations using data from two recent studies on gut and vaginal microbiomes. We conclude with an application to our own data, where we also incorporate a significance test for the estimated coefficients that represent associations between microbial abundance and percent fat.
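One classical bridge between distance-based ordination and kernel methods is Gower's double-centering, which turns a matrix of pairwise distances into the similarity (kernel) matrix used by principal coordinate analysis. The sketch below shows that construction only. Whether the paper builds its kernels in exactly this way is not stated in the abstract, so treat the function name and its use here as assumptions.

```python
def gower_kernel(dist):
    """Double-center squared distances: K = -1/2 * J * D^2 * J.

    `dist` is a symmetric matrix (list of lists) of pairwise distances
    between samples. The result is positive semidefinite only when the
    distances can be embedded in Euclidean space.
    """
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    return [
        [-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
         for j in range(n)]
        for i in range(n)
    ]


# Three samples located at 0, 1 and 3 on a line (distances are Euclidean).
distances = [[0, 1, 3],
             [1, 0, 2],
             [3, 2, 0]]
kernel = gower_kernel(distances)
```

For Euclidean distances the result equals the Gram matrix of the mean-centered sample coordinates, so each row of the kernel sums to zero.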

Annals of Applied Statistics, vol. 12, no. 1, pp. 540-566.
Citations: 38
A MULTI-RESOLUTION MODEL FOR NON-GAUSSIAN RANDOM FIELDS ON A SPHERE WITH APPLICATION TO IONOSPHERIC ELECTROSTATIC POTENTIALS.
IF 1.8; Zone 4 (Mathematics); Q2 STATISTICS & PROBABILITY; Pub Date: 2018-03-01; Epub Date: 2018-03-09; DOI: 10.1214/17-AOAS1104
Minjie Fan, Debashis Paul, Thomas C M Lee, Tomoko Matsuo

Gaussian random fields have been one of the most popular tools for analyzing spatial data. However, many geophysical and environmental processes often display non-Gaussian characteristics. In this paper, we propose a new class of spatial models for non-Gaussian random fields on a sphere based on a multi-resolution analysis. Using a special wavelet frame, named spherical needlets, as building blocks, the proposed model is constructed in the form of a sparse random effects model. The spatial localization of needlets, together with carefully chosen random coefficients, ensures that the model is non-Gaussian and isotropic. The model can also be expanded to include a spatially varying variance profile. The special formulation of the model enables us to develop efficient estimation and prediction procedures, in which an adaptive MCMC algorithm is used. We investigate the accuracy of parameter estimation of the proposed model, and compare its predictive performance with that of two Gaussian models by extensive numerical experiments. Practical utility of the proposed model is demonstrated through an application of the methodology to a data set of high-latitude ionospheric electrostatic potentials, generated from the LFM-MIX model of the magnetosphere-ionosphere system.

Annals of Applied Statistics, vol. 12, no. 1, pp. 459-489.
Citations: 7
POWERFUL TEST BASED ON CONDITIONAL EFFECTS FOR GENOME-WIDE SCREENING.
IF 1.8; Zone 4 (Mathematics); Q2 STATISTICS & PROBABILITY; Pub Date: 2018-03-01; Epub Date: 2018-03-09; DOI: 10.1214/17-AOAS1103
Yaowu Liu, Jun Xie

This paper considers testing procedures for screening large genome-wide data, where we examine the effects of hundreds of thousands of genetic variants, e.g., single nucleotide polymorphisms (SNPs), on a quantitative phenotype. We screen the whole genome by SNP sets and propose a new test that is based on conditional effects from multiple SNPs. The test statistic is developed for weak genetic effects and incorporates correlations among genetic variables, which may be very high due to linkage disequilibrium. The limiting null distribution of the test statistic and the power of the test are derived. Under appropriate conditions, the test is shown to be more powerful than the minimum p-value method, which is based on marginal SNP effects and is the most commonly used method in genome-wide screening. The proposed test is also compared with other existing methods, including the Higher Criticism (HC) test and the sequence kernel association test (SKAT), through simulations and analysis of a real genome data set. For typical genome-wide data, where effects of individual SNPs are weak and correlations among SNPs are high, the proposed test is more advantageous and clearly outperforms the other methods in the literature.

Annals of Applied Statistics, vol. 12, no. 1, pp. 567-585. Publication date: 2018-03-01. DOI: 10.1214/17-AOAS1103. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5931742/pdf/nihms910242.pdf
Citations: 0
MSIQ: JOINT MODELING OF MULTIPLE RNA-SEQ SAMPLES FOR ACCURATE ISOFORM QUANTIFICATION.
IF 1.8 Zone 4 Mathematics Q2 STATISTICS & PROBABILITY Pub Date: 2018-03-01 Epub Date: 2018-03-09 DOI: 10.1214/17-AOAS1100
Wei Vivian Li, Anqi Zhao, Shihua Zhang, Jingyi Jessica Li

Next-generation RNA sequencing (RNA-seq) technology has been widely used to assess full-length RNA isoform abundance in a high-throughput manner. RNA-seq data offer insight into gene expression levels and transcriptome structures, enabling us to better understand the regulation of gene expression and fundamental biological processes. Accurate isoform quantification from RNA-seq data is challenging due to the information loss in sequencing experiments. A recent accumulation of multiple RNA-seq data sets from the same tissue or cell type provides new opportunities to improve the accuracy of isoform quantification. However, existing statistical or computational methods for multiple RNA-seq samples either pool the samples into one sample or assign equal weights to the samples when estimating isoform abundance. These methods ignore the possible heterogeneity in the quality of different samples and could result in biased and unrobust estimates. In this article, we develop a method, which we call "joint modeling of multiple RNA-seq samples for accurate isoform quantification" (MSIQ), for more accurate and robust isoform quantification by integrating multiple RNA-seq samples under a Bayesian framework. Our method aims to (1) identify a consistent group of samples with homogeneous quality and (2) improve isoform quantification accuracy by jointly modeling multiple RNA-seq samples by allowing for higher weights on the consistent group. We show that MSIQ provides a consistent estimator of isoform abundance, and we demonstrate the accuracy and effectiveness of MSIQ compared with alternative methods through simulation studies on D. melanogaster genes. We justify MSIQ's advantages over existing approaches via application studies on real RNA-seq data from human embryonic stem cells, brain tissues, and the HepG2 immortalized cell line. We also perform a comprehensive analysis of how the isoform quantification accuracy would be affected by RNA-seq sample heterogeneity and different experimental protocols.

Annals of Applied Statistics, vol. 12, no. 1, pp. 510-539. Publication date: 2018-03-01. DOI: 10.1214/17-AOAS1100
Citations: 4